
Final Project on Homomorphic-deconvolution


(Use of cepstral analysis to deconvolve pitch information from vocal
tract information in speech production)

- by Rahul Jaiswal (101630556)

Contents:

Aim
Motivation
Theory
MATLAB Program
Plots & Results
Applications


There are many algorithms for detecting the pitch of a speech signal. The method used here is the cepstral method, which is more reliable than several older, more laborious approaches.
The main aim of this project is to understand the motivation behind cepstral analysis of speech, to understand the basic cepstral analysis approach used to separate vocal tract information from source (excitation) information, to understand the concept of liftering, and to develop a pitch determination method.

Homomorphic-deconvolution is an algorithm designed to estimate the pitch (fundamental frequency) and the vocal tract information of a digital recording of speech, or of a musical note or tone. Cepstral analysis is the method used to achieve this. Any signal coming out of a system depends both on the input excitation and on the response of the system; from a signal processing point of view, the output of a system can be treated as the convolution of the input excitation with the system response. At times we need each of these components separately for study and processing, and the process of separating the two components is termed deconvolution.

Human speech production is closely related to the above concept. When a speech signal is produced, it passes through two stages: excitation (the "source") and signal shaping (the "filter"). The objective of cepstral analysis is to separate the speech into its source and filter components, without any prior knowledge about either, so that the information can be used in various speech processing applications.
There are two types of sounds, voiced and unvoiced. We will concentrate mainly on the voiced segments, which include the vowels. Voiced sounds are produced by exciting the time-varying vocal tract system with a periodic impulse sequence, while unvoiced sounds are produced by exciting it with a random noise sequence. The resulting speech can be considered as the convolution of the excitation sequence with the vocal tract filter characteristics. Let x(n) be the excitation sequence and h(n) the vocal tract filter sequence; then the speech sequence y(n) can be expressed as follows:

y(n) = x(n) * h(n)                                    (1)

or, equivalently, in the frequency domain,

Y(w) = X(w) . H(w)                                    (2)

As we can see, this is very similar to the basic time-domain convolution we are familiar with, but in this case we do not have any knowledge of either X(w) or H(w). This is the reason we use cepstral analysis: the cepstrum converts convolution in the time domain into addition in the quefrency domain, so we obtain a linear combination of the contributions of x(n) and h(n) in the quefrency domain.

Basic Principles of Cepstral Analysis:

The main idea of Homomorphic-deconvolution is to convert the product Y(w) = X(w).H(w) into a sum by applying a logarithm. The cepstrum is defined as the inverse Fourier transform of the logarithm of the magnitude of the Fourier transform of the input signal, which takes the signal back to a time-like domain called the quefrency domain. After taking the logarithm, the relation can be written in the transform domain as

log|Y(w)| = log|X(w)| + log|H(w)|                     (3)

In the quefrency domain, the vocal tract components are represented by the slowly varying components concentrated in the lower quefrency region, and the excitation (pitch) components are represented by the rapidly varying components in the higher quefrency region.
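As a quick numerical illustration of this property (a minimal MATLAB sketch using arbitrary made-up sequences, not part of the project code), the log-magnitude spectrum of a convolution equals the sum of the individual log-magnitude spectra:

% Numerical check of the homomorphic property used above.
% x and h are arbitrary illustrative sequences (excitation and filter).
N = 64;
x = randn(N,1);
h = randn(N,1);
y = ifft(fft(x,N).*fft(h,N));                      % circular convolution of x and h
lhs = log(abs(fft(y,N)));                          % log|Y(w)|
rhs = log(abs(fft(x,N))) + log(abs(fft(h,N)));     % log|X(w)| + log|H(w)|
max(abs(lhs - rhs))                                % agrees up to round-off error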

The figure below gives a pictorial view of the steps to convert a speech signal to cepstral
domain representation.

y(n) → Windowing → y_w(n) → DFT → Y_W(w) → Log| · | → Log|Y_W(w)| → IDFT → c(n)

In the above figure, y(n) is the speech signal and y_w(n) is the windowed frame, obtained by multiplying the speech signal y(n) by a Hamming window w(n). We then take the N-point DFT of the windowed frame to obtain Y_W(w). Log|Y_W(w)| is the log-magnitude spectrum, obtained by taking the logarithm of |Y_W(w)|. Finally, we take the IDFT of Log|Y_W(w)| to obtain the cepstrum c(n) of the speech signal y(n). The cepstrum thus obtained contains the vocal tract information linearly combined with the pitch information, and the two can be separated using the liftering technique, which is discussed next.
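A minimal MATLAB sketch of these four steps (assuming y is a voiced speech frame; the complete program used in this project is given later):

% Cepstrum computation, following the block diagram above.
y_w  = y(:).*hamming(length(y));       % windowing with a Hamming window
Y_W  = fft(y_w);                       % DFT of the windowed frame
logY = log(abs(Y_W) + eps);            % log-magnitude spectrum (eps guards against log(0))
c    = real(ifft(logY));               % IDFT -> cepstrum c(n)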


Basic Principles of Liftering Technique:


A lifter is a filter that operates in the quefrency domain. A low-time lifter is analogous to a low-pass filter in the frequency domain: it is implemented by multiplying the cepstrum by a window in the quefrency domain, and when the result is converted back to the frequency domain it yields a smoother spectrum. We will use low-time liftering to obtain a smooth formant (spectral envelope) function and then estimate the formant frequencies from that function. A high-time lifter is used to obtain the pitch information. Let us discuss both of these in detail.

Pitch Estimation (High Time Liftering):


The pitch information typically appears in the cepstrum as periodic peaks occurring beyond roughly 12-20 samples (quefrencies). Making use of this assumption, we apply a high-time liftering window to separate the pitch component from the cepstrum. The high-time lifter used here is simply the complement of the low-time lifter described later.
We represent the high-time lifter as H[n]:

H[n] = 0 for n <= L0
H[n] = 1 for L0 < n < 5000    (if the number of samples is 5000)


The figure shows the high-time liftering window.

We multiply this high-time lifter H[n] with the cepstrum c(n) and then look for the highest peak in the result. The liftered cepstrum contains many periodic peaks, but the sample index (quefrency) at which the highest peak occurs gives the pitch period of the speech signal. Since the sampling frequency is known, the pitch frequency can be calculated from this sample value.
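A minimal sketch of this pitch estimate (assuming c is the cepstrum of the frame, fs the sampling frequency, and L0 the lifter cutoff in samples; the complete liftering function used in this project is given later):

% Cepstral pitch estimation with a high-time lifter.
H = zeros(size(c));
H(L0+1:end) = 1;                       % high-time lifter: zero out the low quefrencies
[~, n_peak] = max(c.*H);               % quefrency of the strongest remaining peak
f0 = fs/n_peak;                        % pitch estimate in Hz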

Formant estimation (Low Time Liftering):


Low-time liftering is applied to the cepstrum of the speech signal to obtain the formant estimate. Formants are defined as "the spectral peaks of the sound spectrum of the voice". A formant is often measured as an amplitude peak in the frequency spectrum of the sound, for example using a spectrogram.
The low-time liftering window used in this project for extracting the vocal tract characteristics is:

L[n] = 1 for n <= L0
L[n] = 0 for L0 < n < 5000

L0 is typically between 12 and 20 samples.

The characteristics of the window are drawn below in the time domain:

We have taken L0 = 15 samples, so L[n] = 1 for the first 15 samples and 0 for the remaining samples (up to 5000).

Cl[n] = c(n) . L[n]

The vocal tract characteristics are obtained by multiplying the cepstrum c(n) by this window L[n] and then taking the DFT. The DFT of the low-time liftered cepstrum gives the log-magnitude of the vocal tract spectrum, where H(w) is the frequency response of the vocal tract:

log|H(w)| = DFT[ Cl[n] ]

All the details of the vocal tract, such as the formant frequencies, can be read off from this vocal tract spectrum. Each local peak of the spectrum represents a formant frequency, and these frequencies are different for different vowels.
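A minimal sketch of this formant (vocal tract) spectrum computation (assuming c is the cepstrum, fs the sampling frequency, L0 = 15, and the 10000-point FFT length used in the liftering function given later):

% Vocal tract spectrum from the low-time liftered cepstrum.
L = zeros(size(c));
L(1:L0) = 1;                           % low-time lifter
cl = c.*L;                             % liftered cepstrum, Cl[n]
NFFT = 10000;
logH = real(fft(cl,NFFT));             % log-magnitude vocal tract spectrum, log|H(w)|
f = (0:NFFT/2-1)*fs/NFFT;              % frequency axis for the first half (0 .. fs/2)
plot(f, logH(1:NFFT/2));               % peaks of this curve are the formant frequencies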


Main Function:
% *************************************************************************
% This script separates the formant frequency (vocal tract information)
% from the pitch information in voiced human speech using cepstral
% analysis. It asks the user for a .wav file and a liftering window
% length, computes the cepstrum of a short frame of the recording, and
% calls the liftering() function to extract the pitch and the formant
% spectrum.
% *************************************************************************
clear all;
clc;
close all;
%% Loading the file.
[FileName,PathName] = uigetfile('*.wav','Select the WAV file to process');
ln = input('Please enter the liftering window length. The typical value is between 10-40: ');
FilePath = strcat(PathName, FileName);
[y,fs] = audioread(FilePath);
[PATHSTR,NAME,EXT] = fileparts(FilePath);
%% Checking the validity of the file.
if (strcmp(EXT,'.wav') || strcmp(EXT,'.mp3') || strcmp(EXT,'.au'))
    % Dual-channel to single-channel audio conversion.
    y = y(:,1);
    % Taking a small frame of audio data.
    y = y(2000:5000);
    y = double(y);
    N = length(y);
    t = (0:N-1)/fs;                     % time in seconds
    figure;
    subplot(2,2,1);
    plot(t,y);                          % speech signal
    legend('Speech Signal');
    xlabel('Time (s)');
    ylabel('Amplitude');
    y_th = y./(1.1*abs(max(y)));        % normalized & framed signal
    y_th = y_th(1:N);
    subplot(2,2,2);
    plot(t,y_th);
    legend('Normalized Signal');
    xlabel('Time (s)');
    ylabel('Amplitude');
    w = hamming(N);                     % Hamming window
    y_w = y_th.*w;                      % windowed frame
    subplot(2,2,3);
    plot(t,y_w);
    legend('Windowed Signal');
    xlabel('Time (s)');
    ylabel('Amplitude');
    y_fft = fft(y_w,N);                 % N-point DFT of the windowed frame
    k = 0:N-1;
    subplot(2,2,4);
    plot(k,abs(y_fft));
    legend('DFT');
    xlabel('DFT bin');
    ylabel('Magnitude');
    c = ifft(log(abs(y_fft)));          % real cepstrum of the frame
    figure, plot(c);
    axis([0,N/2,-1,1]);                 % the cepstrum is symmetric, so only half of it is shown
    legend('Cepstrum');
    xlabel('Quefrency');
    ylabel('Amplitude');
    y_c = c(1:floor(length(c)/2));      % keep only the first half of the cepstrum
    % Perform low-time and high-time liftering.
    [y_formant,y_Pitch,p_frequency,Magnitude_F,formant_ceps,formant] = liftering(y_c,fs,N,ln);
    t1 = 1:length(y_c);
    figure, plot(t1,y_formant);
    axis([0,N/2,-1,1]);
    legend('Low Time Liftered Cepstrum');
    xlabel('Quefrency');
    ylabel('Amplitude');
    % Cepstrum multiplied by the high-time lifter, used for the pitch estimate.
    figure, plot(t1,y_Pitch);
    axis([0,N/2,-1,1]);
    legend('High Time Liftered Cepstrum');
    xlabel('Quefrency');
    ylabel('Amplitude');
    %% Formant estimation plot.
    figure, plot(formant_ceps);
    hold on;
    plot(formant,Magnitude_F,'ko');     % mark the detected formant peaks
    hold off;
    legend('Formant Spectrum');
    xlabel('Frequency bin');
    ylabel('Log magnitude');
else
    error('The file is invalid. Please upload only wav, mp3 or au file formats');
end


Liftering Function:
function [y_formant,y_Pitch,p_frequency,Magnitude_F,formant_ceps,formant] = liftering(c,fs,N,ln)
%% Low-time lifter of length ln quefrencies, used for the vocal tract (formant) estimate.
t = 1;                                 % counter for the formant peaks found below
L = zeros(length(c),1);
L(1:ln) = 1;                           % low-time liftering window
y_formant = real(c.*L);                % cepstrum x low-time lifter -> vocal tract estimate
%% High-time lifter, used for the pitch estimate.
H = zeros(length(c),1);
H(ln+1:end) = 1;                       % high-time liftering window (complement of L)
y_Pitch = real(c.*H);                  % cepstrum x high-time lifter -> pitch estimate
[y_Pitchvalue, y_Pitchlocation] = max(y_Pitch);  % strongest peak in y_Pitch gives the pitch period
p_period = y_Pitchlocation;            % pitch period in samples
p_frequency = (1/p_period)*fs;         % pitch frequency in Hz
%% Formant estimation.
yy_formant = y_formant(1:ln);          % first ln cepstral coefficients (not used further)
formant_ceps = fft(y_formant,10000);   % log-magnitude vocal tract spectrum
formant_ceps = formant_ceps(1:5000);   % keep the first half (0 .. fs/2)
formant_ceps = real(formant_ceps);
% Simple local-maximum search: every local peak of the spectrum is a formant candidate.
for i = 2:length(formant_ceps)-1
    if (formant_ceps(i-1) < formant_ceps(i)) && (formant_ceps(i+1) < formant_ceps(i))
        Magnitude_F(t) = formant_ceps(i);   % peak amplitude
        formant(t) = i;                     % peak location (FFT bin index)
        t = t+1;
    end
end
end
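The liftering function returns the formant peak locations (formant) as bin indices of the 10000-point FFT rather than in Hz. As a usage note (a minimal sketch, assuming that FFT length), the reported formant frequencies such as F1 and F2 can be obtained as follows:

% Converting the outputs of liftering() to Hz (minimal usage sketch).
[y_formant,y_Pitch,p_frequency,Magnitude_F,formant_ceps,formant] = liftering(y_c,fs,N,ln);
formant_Hz = (formant - 1)*fs/10000;   % bin k (1-based) corresponds to (k-1)*fs/10000 Hz
% p_frequency already gives the pitch in Hz; formant_Hz lists the candidate
% formant frequencies, from which F1, F2, ... are read off as the dominant peaks.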

I ran the program for different vowels of the English language. Below I have attached, for each vowel, the pitch estimation spectra and the formant (vocal tract) estimation spectra along with the corresponding fundamental frequencies.


Vowel: A_Front
Symbol: a

Figure 1: Original signal, normalized signal, windowed signal and the DFT of the signal.
Figure 2: Cepstrum c(n) of the original signal.
Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.
Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.
Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.
From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case
Pitch = 135.5932 Hz, F1 = 850 Hz, F2 = 1600 Hz.

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Vowel: ae
Symbol:

Figure 1: Original signal, normalized signal, windowed signal and the DFT of the signal.
Figure 2: Cepstrum c(n) of the original signal.
Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.
Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.
Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.
From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case
Pitch = 129.0323 Hz, F1 = 500 Hz, F2 = 1400 Hz.

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Vowel: BackwardSchwa
Symbol:

Figure 1: Original signal, normalized signal, windowed signal and the DFT of the signal.
Figure 2: Cepstrum c(n) of the original signal.
Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.
Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.
Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.
From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case
Pitch = 140.3509 Hz, F1 = 250 Hz, F2 = 1400 Hz.

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Vowel: BackwardsEpsilon
Symbol:
Figure 1: Original signal, normalized signal, windowed signal and the DFT of the signal.
Figure 2: Cepstrum c(n) of the original signal.
Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.
Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.
Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.
From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case
Pitch = 139.1304 Hz, F1 = 350 Hz, F2 = 1250 Hz.

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Vowel: Barred I
Symbol:
Figure 1: Original signal, normalized signal, windowed signal and the DFT of the signal.
Figure 2: Cepstrum c(n) of the original signal.
Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.
Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.
Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.
From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case
Pitch = 140.3509 Hz, F1 = 1500 Hz.

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Vowel: BarredO
Symbol:
Figure 1: Original signal, normalized signal, windowed signal and the DFT of the signal.
Figure 2: Cepstrum c(n) of the original signal.
Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.
Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.
Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.
From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case
Pitch = 146.789 Hz, F1 = 350 Hz, F2 = 1200 Hz.

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Vowel: BarredU
Symbol:
Figure 1: Original signal, normalized signal, windowed signal and the DFT of the signal.
Figure 2: Cepstrum c(n) of the original signal.
Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.
Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.
Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.
From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case
Pitch = 148.1481 Hz, F1 = 1350 Hz.

Figure 1


Vowel: CapitalOE
Symbol:

Figure 1: Original signal, normalized signal, windowed signal and the DFT of the signal.
Figure 2: Cepstrum c(n) of the original signal.
Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.
Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.
Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.
From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case
Pitch = 139.1304 Hz, F1 = 370 Hz, F2 = 1900 Hz.

Figure 1


Vowel: Capital U
Figure 1: Original signal, normalized signal, windowed signal and the DFT of the signal.
Figure 2: Cepstrum c(n) of the original signal.
Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.
Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.
Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.
From Figures 2 and 5 we can determine the pitch and the formant frequencies; for this case
Pitch = 148.1481 Hz, F1 = 400 Hz, F2 = 1400 Hz.

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5


Vowel: Capital Y
Pitch = 145.4545 Hz, F1 = 1350 Hz

Figure 1

Figure 5


Vowel: Caret
Symbol:
Pitch = 139.1304 Hz, F1 = 500 Hz, F2 = 1500 Hz

Figure 1



Vowel: e
Symbol: e

Pitch = 141.5929 Hz, F1 = 390 Hz, F2 = 2300 Hz

Figure 1



Vowel: i
Symbol: i
Pitch = 146.789 Hz, F1 = 240 Hz, F2 = 2200 Hz

Figure 1



Vowel: o
Symbol: o
Pitch = 142.8571 Hz, F1 = 240 Hz, F2 = 2200 Hz

Figure 1



Vowel: u
Symbol: u

Pitch = 149.5327 Hz, F1 = 250 Hz, F2 = 595 Hz

Figure 1



Vowel: U (girl's voice)
The sound clip is of a girl's voice, so we expect a higher pitch than for the male voice tested earlier.
Pitch = 237.0968 Hz, F1 = 300 Hz, F2 = 600 Hz

Figure 1



Vowel: U (my voice)

Pitch = 146.5116 Hz, F1 = 250 Hz, F2 = 600 Hz

Figure 1



Vowel: Y
Symbol: y

Pitch = 155.3398 Hz, F1 = 235 Hz, F2 = 2100 Hz

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

There are many applications of this kind of speech deconvolution. It is widely used in speech recognition, since different speakers have different pitch, and vowels can easily be recognised by their formant frequencies, as shown above.
Pitch estimation is also used to distinguish anger from neutral emotion, in lie detection, and so on; angry speech generally has a higher pitch.
The concept is also used in automatic music transcription.

References:
http://www.linguistics.ucla.edu/people/hayes/103/Charts/VChart/ - vowel formant frequencies
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.17.2453&rep=rep1&type=pdf - homomorphic deconvolution
https://www.wikipedia.org/ - general definitions
http://www.freesound.org/ - vowel .wav files
