Contents:
Topic
Aim
Motivation
Theory
MATLAB Program
Plots & Results
Applications
There are many algorithms for detecting the pitch of speech signals. The method used here is the cepstral method, which is more reliable than several older, computationally heavy approaches.
The main aim of this project is to understand the motivation behind cepstral analysis of speech, to understand the basic cepstral analysis approach used to separate vocal tract and source information, to understand the liftering concept, and to develop a pitch determination method.
Human speech production follows a source-filter model. When a speech signal is produced, it passes through two stages: excitation ("source") and signal shaping ("filter"). The objective of cepstral analysis is to separate the speech into its source and filter components without any prior knowledge about either, so that this information can be used in various speech processing applications.
There are two types of sounds: voiced and unvoiced. We will mainly concentrate on the voiced segments, which chiefly include the vowels. Voiced sounds are produced by exciting the time-varying vocal tract system with a periodic impulse sequence, while unvoiced sounds are produced by exciting it with a random noise sequence. The resulting speech can be considered the convolution of the excitation sequence with the vocal tract filter characteristics. Let x(n) be the excitation sequence and h(n) the vocal tract filter sequence; then the speech sequence y(n) can be expressed as follows:
y(n) = x(n) * h(n)        (1)
Or, equivalently in the frequency domain,
Y(w) = X(w) · H(w)        (2)
As we can see, this is the familiar convolution in the time domain, but in this case we have no knowledge of either X(w) or H(w). This is why we use cepstral analysis: it converts convolution in the time domain into addition in the quefrency domain, so we obtain a linear combination of the contributions of x(n) and h(n) in the quefrency domain.
The main idea of homomorphic deconvolution is to convert the product X(w)·H(w) [= Y(w)] into a sum by applying a logarithmic function. The complex cepstrum is defined as the inverse Fourier transform of the logarithm of the Fourier transform of the input signal, which returns the signal to the time (quefrency) domain. In the transform domain the relation can therefore be written as

log|Y(w)| = log|X(w)| + log|H(w)|        (3)
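Equation (3) can be verified numerically. The sketch below (Python/NumPy rather than the MATLAB used later, with arbitrary test sequences) uses circular convolution so that the DFT convolution theorem holds exactly:

```python
import numpy as np

# Two arbitrary test sequences standing in for excitation x(n) and filter h(n).
rng = np.random.default_rng(0)
x = rng.standard_normal(64)
h = rng.standard_normal(64)

# Circular convolution y = x (*) h, computed via the convolution theorem.
y = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

# Convolution in time has become addition of log-magnitude spectra.
log_Y = np.log(np.abs(np.fft.fft(y)))
log_sum = np.log(np.abs(np.fft.fft(x))) + np.log(np.abs(np.fft.fft(h)))
print(np.allclose(log_Y, log_sum))  # True
```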
In the quefrency domain, the vocal tract components are represented by slowly varying components concentrated in the lower quefrency region, while the excitation (pitch) components are represented by fast varying components in the higher quefrency region.
The figure below gives a pictorial view of the steps to convert a speech signal to cepstral
domain representation.
y(n) → Windowing → y_w(n) → DFT → Y_W(w) → Log|·| → Log|Y_W(w)| → IDFT → c(n)
In the figure above, y(n) is the speech signal and y_w(n) is the windowed frame, obtained by multiplying the speech signal y(n) with a Hamming window. We then perform an N-point DFT of the windowed frame to obtain Y_W(w). Log|Y_W(w)| is the log-magnitude spectrum, obtained by taking the logarithm of |Y_W(w)|. Finally, we perform the IDFT of Log|Y_W(w)| to obtain the cepstrum c(n) of the speech signal. The cepstrum thus obtained contains the vocal tract information linearly combined with the pitch information, which can be separated using the liftering technique discussed next.
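The pipeline above can be sketched in a few lines. This is an illustrative Python/NumPy version (not the project's MATLAB code), with a small epsilon added before the logarithm to avoid log(0):

```python
import numpy as np

def real_cepstrum(y):
    """Real cepstrum: IDFT of the log-magnitude DFT of a Hamming-windowed frame."""
    y_w = y * np.hamming(len(y))                       # windowing: y_w(n)
    log_mag = np.log(np.abs(np.fft.fft(y_w)) + 1e-12)  # Log|Y_W(w)|
    return np.fft.ifft(log_mag).real                   # c(n) via the IDFT

# Example: cepstrum of a 512-sample frame of a 100 Hz tone at fs = 8000 Hz.
frame = np.sin(2 * np.pi * 100 * np.arange(512) / 8000)
c = real_cepstrum(frame)
print(c.shape)  # (512,)
```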
Liftering separates these two regions. The low-time lifter L[n] is 1 for the first L0 samples of the cepstrum and 0 for the rest; we have taken L0 = 15, so L[n] = 1 for the first 15 samples and 0 for the remaining samples (out of 5000 samples in total). Its complement, the high-time lifter H[n], is multiplied with the cepstrum c(n), and we then look for the highest peak. The liftered cepstrum contains many periodic peaks, but the sample at which the highest peak occurs gives the pitch period of the speech signal; since the sampling frequency is known, the pitch can be calculated directly from this sample value.

The low-time liftered cepstrum is

Cl[n] = c(n) · L[n]
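As a hedged illustration of this pitch-estimation idea (Python/NumPy on a synthetic signal whose true period is known, not one of the project's recordings):

```python
import numpy as np

fs = 8000
period = 64                  # true pitch period in samples -> fs/64 = 125 Hz
N = 1024
excitation = np.zeros(N)
excitation[::period] = 1.0   # periodic impulse train (the "source")
h = 0.98 ** np.arange(64)    # simple decaying filter (a stand-in "vocal tract")
speech = np.convolve(excitation, h)[:N]

# Real cepstrum of the Hamming-windowed frame.
c = np.fft.ifft(np.log(np.abs(np.fft.fft(speech * np.hamming(N))) + 1e-12)).real

# High-time liftering: keep quefrencies from ln up to N/2, then take the peak.
ln = 15
H = np.zeros(N)
H[ln:N // 2] = 1.0
pitch_period = np.argmax(c * H)  # sample index of the strongest cepstral peak
print(pitch_period, fs / pitch_period)  # expected near 64 samples, 125 Hz
```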
The vocal tract characteristics can be obtained by multiplying the cepstrum c(n) with the low-time lifter L[n] and then taking the DFT. The DFT of the low-time liftered cepstrum gives the log-magnitude of the vocal tract spectrum, where H(w) is the frequency response of the vocal tract. So,

log|H(w)| = DFT[Cl[n]]
All the details about the vocal tract, such as the formant frequencies, can be read from this vocal tract spectrum. The spectrum has peaks at regular intervals, as mentioned in the first paragraph of this topic; each local peak represents a different formant frequency, and these frequencies differ from vowel to vowel.
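A hedged sketch of this formant-estimation step (Python/NumPy on a synthetic frame whose spectral envelope peaks near 700 Hz, an illustrative value rather than one of the measured vowels):

```python
import numpy as np

fs, N = 8000, 1024
t = np.arange(N) / fs
# Harmonics of 125 Hz with a Gaussian amplitude envelope centred at 700 Hz,
# imitating a single vocal-tract resonance (a stand-in "formant").
f0 = 125.0
speech = sum(np.exp(-((k * f0 - 700.0) / 400.0) ** 2) *
             np.sin(2 * np.pi * k * f0 * t) for k in range(1, 30))

# Real cepstrum of the Hamming-windowed frame.
c = np.fft.ifft(np.log(np.abs(np.fft.fft(speech * np.hamming(N))) + 1e-12)).real

# Low-time liftering: Cl[n] = c(n) . L[n], then DFT -> log|H(w)| envelope.
ln = 15
Cl = c.copy()
Cl[ln:] = 0.0
envelope = np.fft.fft(Cl).real

# Local maxima of the smooth envelope (first half of the spectrum) are the
# formant candidates; convert each bin index to Hz.
half = envelope[:N // 2]
peaks = [i for i in range(1, len(half) - 1) if half[i - 1] < half[i] > half[i + 1]]
formants_hz = [p * fs / N for p in peaks]
print(formants_hz)  # should include a peak near 700 Hz
```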
Main Function:
% *************************************************************************
% This function separates the formant frequency (vocal tract information) *
% from the pitch information in voiced human speech.                      *
% parameters:                                                             *
%   SYNOPSIS:                                                             *
%   ------------------------------------------------                      *
%   final_op = speech_synth(fs,p,N,rN,vowel,method)                       *
%   ------------------------------------------------                      *
% *************************************************************************
clear all;
clc;
close all;
%% Uploading the file here.
[FileName,PathName] = uigetfile('*.wav','Select the WAV file to process');
ln=input('Please enter the liftering window length. The typical value is between 10-40: ');
FilePath=strcat(PathName,FileName);
[y,fs]=audioread(FilePath);
[PATHSTR,NAME,EXT] = fileparts(FilePath);
%% Checking the validity of the file.
if (strcmp(EXT,'.wav'))||(strcmp(EXT,'.mp3'))
    %% Dual channel to single channel audio conversion.
    y=y(:,1);
    %% Taking a small frame of audio data.
    y=y(2000:5000);
    y=double(y);
    N=length(y);
    t=(0:N-1)/fs;               % Time in seconds.
    figure;
    subplot(2,2,1);
    plot(t,y);                  % Speech signal.
    legend('Speech Signal');
    xlabel('Time (s)');
    ylabel('Amplitude');
    y_th=y./(1.1*abs(max(y)));  % Amplitude normalization.
    y_th=y_th(1:N);
    subplot(2,2,2);
    plot(t,y_th);
    legend('Normalized Signal');
    xlabel('Time (s)');
    ylabel('Amplitude');
    w=hamming(N);               % Hamming window.
Liftering Function:
function [y_formant,y_Pitch,p_frequency,Magnitude_F,formant_ceps,formant]=liftering(c,fs,N,ln)
%% Low-quefrency lifter of length ln quefrencies for obtaining the vocal tract estimation.
t=1;
L=zeros(1,length(c));
L=L';
L(1:ln)=1;
y_formant=real(c.*L);   % Multiplying the cepstrum with the low-time lifter to obtain the vocal tract estimation.
%% High-time lifter.
H=zeros(1,length(c));
H=H';
H(ln:N/2)=1;
y_Pitch=real(c.*H);     % Multiplying the cepstrum with the high-time lifter to obtain the pitch estimation.
[y_Pitchvalue,y_Pitchlocation]=max(y_Pitch); % The location of the maximum of y_Pitch gives the pitch period in samples.
p_period=y_Pitchlocation;
p_frequency=(1/p_period)*fs;
%% Formant estimation.
yy_formant=y_formant(1:ln);
formant_ceps=fft(y_formant,10000);
formant_ceps=formant_ceps(1:5000);
formant_ceps=real(formant_ceps);
for i=2:length(formant_ceps)-1
    if (formant_ceps(i-1)<formant_ceps(i) && formant_ceps(i+1)<formant_ceps(i))
        Magnitude_F(t)=formant_ceps(i);  % Local peak magnitude.
        formant(t)=i;                    % Local peak location (formant frequency bin).
        t=t+1;
    end
end
end
I ran the program for different vowels of the English alphabet. Below are the pitch estimation spectra and the formant (vocal tract) estimation spectra, along with their fundamental frequencies, presented systematically.
Vowel: A_Front
Symbol: a
Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.
Figure 2: Cepstrum c(n) of the original signal.
Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.
Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.
Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.
From Figures 2 and 5 we can determine the pitch and formant frequencies; for this case:
Pitch = 135.5932 Hz, F1 = 850 Hz, F2 = 1600 Hz
Vowel: ae
Symbol:
Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.
Figure 2: Cepstrum c(n) of the original signal.
Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.
Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.
Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.
From Figures 2 and 5 we can determine the pitch and formant frequencies; for this case:
Pitch = 129.0323 Hz, F1 = 500 Hz, F2 = 1400 Hz
Vowel: BackwardSchwa
Symbol:
Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.
Figure 2: Cepstrum c(n) of the original signal.
Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.
Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.
Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.
From Figures 2 and 5 we can determine the pitch and formant frequencies; for this case:
Pitch = 140.3509 Hz, F1 = 250 Hz, F2 = 1400 Hz
Vowel: BackwardsEpsilon
Symbol:
Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.
Figure 2: Cepstrum c(n) of the original signal.
Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.
Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.
Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.
From Figures 2 and 5 we can determine the pitch and formant frequencies; for this case:
Pitch = 139.1304 Hz, F1 = 350 Hz, F2 = 1250 Hz
Vowel: Barred I
Symbol:
Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.
Figure 2: Cepstrum c(n) of the original signal.
Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.
Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.
Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.
From Figures 2 and 5 we can determine the pitch and formant frequencies; for this case:
Pitch = 140.3509 Hz, F1 = 1500 Hz
Vowel: BarredO
Symbol:
Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.
Figure 2: Cepstrum c(n) of the original signal.
Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.
Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.
Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.
From Figures 2 and 5 we can determine the pitch and formant frequencies; for this case:
Pitch = 146.789 Hz, F1 = 350 Hz, F2 = 1200 Hz
Vowel: BarredU
Symbol:
Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.
Figure 2: Cepstrum c(n) of the original signal.
Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.
Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.
Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.
From Figures 2 and 5 we can determine the pitch and formant frequencies; for this case:
Pitch = 148.1481 Hz, F1 = 1350 Hz
Vowel: CapitalOE
Symbol:
Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.
Figure 2: Cepstrum c(n) of the original signal.
Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.
Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.
Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.
From Figures 2 and 5 we can determine the pitch and formant frequencies; for this case:
Pitch = 139.1304 Hz, F1 = 370 Hz, F2 = 1900 Hz
Vowel: Capital U
Figure 1: Original signal, normalized signal, windowed signal, and the DFT of the signal.
Figure 2: Cepstrum c(n) of the original signal.
Figure 3: Low-time liftered version of the cepstrum, used for estimating the formant frequencies.
Figure 4: High-time liftered version of the cepstrum, used for pitch estimation.
Figure 5: Formant spectrum, containing peaks corresponding to the formant frequencies F1, F2, F3.
From Figures 2 and 5 we can determine the pitch and formant frequencies; for this case:
Pitch = 148.1481 Hz, F1 = 400 Hz, F2 = 1400 Hz
There are many applications of this speech deconvolution. It is widely used in speaker and speech recognition, since different people have different pitch, and vowels can easily be recognized by their formant frequencies, as shown above.
Pitch estimation is used to distinguish emotions such as anger from neutral speech, and in lie detection; pitch is generally higher in angry speech.
The concept is also used in automatic music transcription.
References:
http://www.linguistics.ucla.edu/people/hayes/103/Charts/VChart/ for vowel formant frequencies
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.17.2453&rep=rep1&type=pdf for homomorphic deconvolution
https://www.wikipedia.org/ for general definitions
http://www.freesound.org/ for vowel .wav files