ABSTRACT
This paper presents an investigative analysis of spoken Urdu numbers from siffar (zero) to
nau (nine), laying a concrete foundation for Urdu speech recognition. Sound samples
from multiple speakers were used to extract features by means of Fourier descriptors and
neural networks. Initial processing of the data, i.e. normalization and time-slicing, was done
with Simulink in MATLAB. MATLAB toolbox commands were then used to
calculate Fourier descriptors and correlations. The correlation allowed comparison of the
same words spoken by the same and by different speakers.
The analysis presented in this paper is a foundational step in exploring the Urdu language
toward developing an Urdu speech recognition system. Feed-forward neural network
models for speech recognition were developed in MATLAB. The models and algorithm exhibited high
training and testing accuracies. Future work involves the use of the TI
TMS320C6000 DSK series or linear predictive coding. Such a system could potentially be used to
implement a voice-driven help setup in different systems, such as multimedia and voice-controlled
tele-customer services.
Keywords: Spoken number, Fourier, Correlation, Feature extraction, Feed-forward neural networks, Learning rate
1. INTRODUCTION
Volume 3 Number 2
Page 73
www.ubicc.org
2. REVIEW OF DISCRETE FOURIER TRANSFORMATION & ITS MATLAB IMPLEMENTATION
|X(k)| = [R^2(k) + I^2(k)]^(1/2)                              (2.1)

X(k) = sum_{n=0}^{N-1} x(nT) e^{-j k 2 pi n / N}              (2.3)

In MATLAB, the DFT is computed with the built-in FFT routines:

X = fft(x)
X = fft(x, N)

FFT computing time / DFT computing time = (log2 N) / (2N)     (2.6)
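As a cross-check on Eqs. (2.1) and (2.3), the direct DFT and the FFT can be compared numerically. A NumPy sketch follows (Python is used for illustration; the paper's own implementation is MATLAB):

```python
import numpy as np

def dft(x):
    """Direct evaluation of Eq. (2.3): X(k) = sum_n x(n) e^{-j k 2 pi n / N}."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    return np.exp(-2j * np.pi * k * n / N) @ x

x = np.array([1.0, 2.0, 3.0, 4.0])
X_slow = dft(x)                  # O(N^2) direct sum
X_fast = np.fft.fft(x)           # equivalent of MATLAB's X = fft(x)

# Eq. (2.1): magnitude from the real and imaginary parts of X(k)
mag = np.sqrt(X_fast.real**2 + X_fast.imag**2)
```

Both routines agree to floating-point precision; the speed advantage quantified in Eq. (2.6) only becomes visible at much larger N.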
rxy = sum_{n=0}^{N-1} x(n) y(n)                               (2.7)

rxx = sum_{n=0}^{N-1} x(n) x(n)                               (2.8)

[Figure: Simulink model for the spoken number aik (one) — a From Wave File block (s1_w1.wav, 22050 Hz/1 ch/16-bit) feeds a Digital Filter Design (FDATool) block and an S-Function, with output to a To Wave Device block and the signal logged to the workspace (yout)]
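Eqs. (2.7) and (2.8) are plain inner-product sums; a minimal NumPy sketch (a Python stand-in for the MATLAB toolbox calls):

```python
import numpy as np

def r_xy(x, y):
    """Eq. (2.7): cross-correlation sum r_xy = sum_n x(n) y(n)."""
    return float(np.sum(x * y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, 1.0, -1.0])

rxy = r_xy(x, y)   # cross-correlation of two different signals
rxx = r_xy(x, x)   # Eq. (2.8): autocorrelation of x with itself
```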
3. FEEDFORWARD VS RECURRENT NETWORKS
[Figure: Simulink model of the ten spoken numbers — each From Wave File block (s10_w2.wav through s10_w10.wav, 22050 Hz/1 ch/16-bit) feeds Digital Filter Design (FDATool), |FFT| magnitude, Autocorrelation and LPC blocks; Mean, Median, RMS, Standard Deviation, Convolution and Matrix Sum statistics are displayed and logged to the workspace via Buffer/Frame Conversion and Matrix Concatenation]
4.1 Pre-Emphasis
4.4 Windowing
Speech signal analysis also involves applying a window whose duration is shorter than the
complete signal. The window is first aligned with the beginning of the signal and is then
shifted until it reaches the end. Each application of the window to a part of the speech
signal yields one spectral vector.
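This sliding-window procedure can be sketched as follows; the window length, hop size, and the choice of a Hamming window are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def spectral_vectors(signal, win_len=256, hop=128):
    """Slide a window over the signal; each position yields one spectral vector."""
    w = np.hamming(win_len)
    vecs = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = signal[start:start + win_len] * w
        # keep the lower half of the spectrum, since the input is real
        vecs.append(np.abs(np.fft.fft(frame))[:win_len // 2])
    return np.array(vecs)

# 440 Hz tone sampled at 22050 Hz, matching the wave files' sample rate
sig = np.sin(2 * np.pi * 440 * np.arange(2048) / 22050.0)
S = spectral_vectors(sig)
```

Each row of S is the spectral vector for one window position.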
Fig. 4 Correlation of the spoken Urdu number aik (one)
Fourier Transform
fft2(x) = fft(fft(x).').'
Thus the two-dimensional FFT is computed by first
computing the FFT of each column of x, and then
computing the FFT of each row of the result. Note
that since the fft2 command applied to our real-valued
data produces a symmetric spectrum, we show only the
lower half of the frequency spectrum in our graphs.
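The column-then-row decomposition can be verified numerically (a NumPy sketch mirroring the MATLAB fft2 behaviour):

```python
import numpy as np

x = np.arange(12.0).reshape(3, 4)

# FFT of each column of x, then FFT of each row of the result
cols_then_rows = np.fft.fft(np.fft.fft(x, axis=0), axis=1)
full_2d = np.fft.fft2(x)   # the two-dimensional FFT in one call
```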
4.7 Correlation
Correlation coefficients of different speakers were
calculated [2]-[3]. As expected, the cross-correlation
of the same speaker for the same word came out to be 1.
The correlation matrix of a spoken number was
generated in three-dimensional form for generating
different simulations and graphs.
Figures 4 - 8 show the combined correlation
extracted for fifteen different speakers for the Urdu
numbers siffar (zero) to paanch (five), which
differentiates the numbers spoken by different speakers
for detailed analysis.
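A sketch of the correlation-matrix construction (NumPy; random vectors stand in for the speakers' spectra, and the normalized correlation coefficient is assumed):

```python
import numpy as np

def corr_coef(x, y):
    """Normalized correlation coefficient of two vectors."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2)))

rng = np.random.default_rng(0)
spectra = rng.random((4, 256))   # stand-in for four speakers' FFT magnitudes
R = np.array([[corr_coef(a, b) for b in spectra] for a in spectra])
```

The diagonal of R is exactly 1, matching the observation that the cross-correlation of the same speaker saying the same word is 1.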
[Figures: correlation plots for speaker s1, panels SPEAKER: s1 w0, s1 w1, s1 w3, s1 w4, s1 w5]
[Figures: FFT magnitude plots, panels SPEAKER ONE: s1 w0, s1 w4, s1 w5; SPEAKER TWO: s2 w0; SPEAKER THREE: s3 w0]

4.8 Creating a Network
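The network-creation code is only partially preserved in the source. As a purely illustrative sketch of one of the 64-input, 10-output feed-forward architectures evaluated below (a NumPy stand-in, not the paper's MATLAB model; the weights are random and untrained):

```python
import numpy as np

# A 64-20-10 feed-forward network: 64 log-FFT inputs, 20 hidden
# neurons, 10 output classes (one per spoken digit).
rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((20, 64)) * 0.1, np.zeros(20)
W2, b2 = rng.standard_normal((10, 20)) * 0.1, np.zeros(10)

def forward(x):
    h = np.tanh(W1 @ x + b1)      # hidden layer with tanh activation
    z = W2 @ h + b2
    e = np.exp(z - z.max())
    return e / e.sum()            # softmax: probabilities over 10 digits

x = rng.standard_normal(64)       # one feature vector (stand-in data)
probs = forward(x)
```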
% Read and normalize speaker 15's recording of the word.
s15_w0 = wavread('s15_w0.wav');
s15_w0_norm = s15_w0/max(s15_w0);
s15_w0_fft0 = abs(fft2(s15_w0_norm));
min_arr_size = min(min_arr_size, size(s15_w0_norm))
% Each row of alldat: column 1 = speaker id, column 2 = word label,
% columns 3:258 = the first 256 FFT magnitude values.
alldat(1,1) = 1;
alldat(1,2) = 1;
alldat(1,3:258) = s1_w1_fft0(1:256);
alldat(2,1) = 1;
alldat(2,2) = 2;
alldat(2,3:258) = s1_w2_fft0(1:256);
alldat(3,1) = 1;
alldat(3,2) = 3;
alldat(3,3:258) = s1_w3_fft0(1:256);
cutoff = 64;
% Use the log of the first 64 FFT bins of every sample as network inputs.
for i = 1:150
    for j = 3:cutoff+2
        ptrans(i,j-2) = log(alldat(i,j));
    end
end
p = transpose(ptrans);          % inputs
t = transpose(alldat(:,2));     % target array
% Per-input min/max ranges, as required when creating the network.
for i = 1:cutoff
    minmax_of_p(i,1) = min(p(i,:));
    minmax_of_p(i,2) = max(p(i,:));
end
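The same feature preparation can be mirrored in NumPy (a sketch; the random alldat matrix is a stand-in for the real FFT data, with column 2 holding the word label):

```python
import numpy as np

cutoff = 64
rng = np.random.default_rng(2)
alldat = rng.random((150, 258)) + 0.1         # stand-in: one row per sample
alldat[:, 1] = rng.integers(1, 11, size=150)  # word labels 1..10

ptrans = np.log(alldat[:, 2:cutoff + 2])  # log of MATLAB columns 3..cutoff+2
p = ptrans.T                              # inputs: one column per sample
t = alldat[:, 1]                          # target array
minmax_of_p = np.column_stack([p.min(axis=1), p.max(axis=1)])
```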
% min_arr_size = min_arr_size/2;
min_arr_size = 11000;
% 15 speakers have spoken the same word, pronounced as "siffar".
% Find the minimum array length, so all arrays are clipped to it.
s1_w0_fft(1:min_arr_size,1) = s1_w0_fft0(1:min_arr_size,1);
s2_w0_fft(1:min_arr_size,1) = s2_w0_fft0(1:min_arr_size,1);
s3_w0_fft(1:min_arr_size,1) = s3_w0_fft0(1:min_arr_size,1);
s4_w0_fft(1:min_arr_size,1) = s4_w0_fft0(1:min_arr_size,1);
[Figures: neural-network training performance plots (Training-Blue, Goal-Black) showing convergence in 8, 10, 11, 12, 65, 1246, 1569, 2285 and 2459 epochs]
S No    Training speakers    Testing speakers    Accuracy rate %
1       15                   5                   94
2       12                   4                   95
3       10                   3                   96.7
4       7                    2                   95
5       5                    2                   100

Table 2. Training of the neural network with different numbers of speakers, showing the accuracy rate.
[Figure: neural-network training performance plot (Training-Blue, Goal-Black) showing convergence in 1243 epochs]
Neuron count    Learning accuracy
64, 10, 10      81.48%
64, 20, 10      81.48%
64, 30, 10      72.59%
64, 40, 10      83.70%
64, 60, 10      62.96%
64, 80, 10      45.93%
CONCLUSION
REFERENCES