
Combination of LPC & ANN for Speaker Recognition

Rohini B. Shinde (1), Dr. V. P. Pawar (2)
Abstract: Speech processing systems play a vital role in the man-machine interface. Speaker recognition technology can be used to restrict services to authorized persons. The technique identifies a speaker by voice and can be used to control the display of information, reservation services, financial transactions, and entrance into reserved areas, buildings, etc. This paper deals with a technique for developing a continuous speech database of the Marathi language for speaker recognition. For the experiment, speech from each speaker was recorded in Marathi. We also discuss the design and methodology of database collection for speaker recognition. We tested the speaker recognition rate using Linear Prediction Coefficients; the resulting recognition rate is 98.5%.
Keywords: ANN (Artificial Neural Network), Discrete Cosine Transform (DCT), Fast Fourier Transform (FFT), Linear Predictor Coefficients (LPC), Speaker Recognition System (SRS).


1. Introduction:
The Speaker Recognition System (SRS) promotes the voice biometric technique. A voice is unique to each person, including twins, and cannot be exactly replicated. Speech includes two components: a physiological component and a behavioral component. The physiological component deals with the vocal tract, while the behavioral component deals with the accent of speech.[2] It is almost impossible to imitate anyone's voice exactly as the voice of the original speaker, and an SRS can discriminate between two very similar voices. The voice print generated upon enrolment is characterized by the vocal tract, which is a unique physiological trait. A cold does not affect the vocal tract, so it has no adverse effect on the accuracy rate; laryngitis, however, may prevent the user from using the system.
There are many languages used for communication all over the world. The Indian constitution has recognized the following 17 languages as regional languages: 1) Assamese 2) Tamil 3) Malayalam 4) Gujarati 5) Telugu 6) Oriya 7) Urdu 8) Bengali 9) Sanskrit 10) Kashmiri 11) Sindhi 12) Punjabi 13) Konkani 14) Marathi 15) Manipuri 16) Kannada and 17) Nepali.[2] Marathi is one of the recognized regional languages in India. It is an Indo-Aryan language spoken by 90 million people all over the world and mainly used in the state of Maharashtra in India.[1] There is a lot of scope to develop systems using Indian languages in different aspects and variations; some work has been done in the direction of isolated words in languages like Bengali, Tamil, Telugu, Marathi, and Hindi. The data-collection method used in this research focuses on continuous speech in the Marathi language and a preliminary approach to speaker recognition.

The paper has five sections:
1) Introduction
2) Continuous speech database generation
3) Speaker Recognition System
4) Result
5) Conclusion
2. Continuous Speech Database Generation:
The first step we followed in creating the continuous speech database for building the SRS is the preparation of the textual sentences to be recorded from native Marathi speakers. The selected sentences were minimal in number while having enough occurrences of each sound. The various stages involved in the generation of the optimal text are described in this section.[2]
2.1. Text corpus collection
It is very difficult to select a set of phonetically correct sentences for any research study. To minimize adverse effects in pronunciation and stay on a correct path, we selected the national pledge of India in a neutral mode, so that the colloquial pronunciation of native speakers from the various regions of Maharashtra would have little adverse effect.

2.2. Speaker selection
We selected speakers who can speak and read the Marathi language in an appropriate manner, deliberately maintaining a variety of gender and age in the selection process. The speakers range from 18 to 65 years old. A total of 13 speakers participated in the experiment.
Fig. 1: Block structure for the SRS (Speech → A/D → Pre-emphasis → Frame Blocking → Windowing → FFT → LPC → DCT → Statistical Parameter → Features Extracted → ANN → Results)

Out of the 13 speakers, 6 were male and 7 were female. For the experiment, 5 speech samples were taken from each speaker, giving 65 samples for the research work. The statistical data of the speaker selection is as follows.
Table 1: Age-wise and gender-wise speaker distribution

Gender   Age   No. of speakers
Male     18    4
Male     44    1
Male     30    1
Female   16    1
Female   18    1
Female   20    1
Female   24    1
Female   29    1
Female   58    1
Female   65    1
2.3. Speech Acquisition Setup
To achieve good audio quality, we recorded the speakers' sentences in the college computer laboratory after college hours, to avoid noisy surroundings. The speakers were relaxed in a chair, with the microphone at an equal distance from each speaker's mouth. The sampling frequency for all recordings was 11025 Hz. We collected the speech data with the help of sound recording software. The sound files were recorded in PCM format and saved with the extension .wav. The saved files were labeled properly and stored in memory for further processing.
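As an illustration, the snippet below is a minimal sketch of how such a corpus of PCM .wav files could be loaded for processing; the directory layout and the file-naming pattern are assumptions for the example, not the exact setup used here.

```python
import os
import numpy as np
from scipy.io import wavfile  # reads PCM .wav files

CORPUS_DIR = "marathi_corpus"  # hypothetical folder of labeled recordings

def load_corpus(corpus_dir):
    """Load every PCM .wav file; return (speaker_label, rate, signal) tuples."""
    corpus = []
    for fname in sorted(os.listdir(corpus_dir)):
        if not fname.endswith(".wav"):
            continue
        rate, signal = wavfile.read(os.path.join(corpus_dir, fname))
        # Assumed naming convention: <speaker>_<sample>.wav, e.g. spk03_2.wav
        speaker = fname.split("_")[0]
        corpus.append((speaker, rate, signal.astype(np.float64)))
    return corpus

samples = load_corpus(CORPUS_DIR)
assert all(rate == 11025 for _, rate, _ in samples)  # recordings were at 11025 Hz
```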
3. Building the Speaker Recognition System (SRS):
During the first step, i.e. speech acquisition, speech samples are obtained from the speaker in real time and stored in memory for preprocessing. Building the SRS comprises two main parts: the first deals with feature extraction and the second with the classifier. The steps for feature extraction and classification are presented in Fig. 1.
3.1. Feature Extraction
Feature extraction is the most important phase in speech processing. Speaker recognition is the process of automatically recognizing who is speaking based on unique characteristics contained in speech waves. This technique makes it possible to use the speaker's voice to verify their identity and to control access to services such as voice dialing, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.[12]
There are many techniques used to parametrically represent a voice signal for speaker recognition tasks. These techniques include Linear Predictor Coefficients (LPC), the Auditory Spectrum-Based Speech Feature (ASSF), and the Mel-Frequency Cepstrum Coefficients (MFCC).[8],[7],[10] The LPC technique is used in this paper. The following are the steps involved in feature extraction (code sketches illustrating the main steps follow the list).[9]
1. Speech acquisition: The speech acquisition process is already explained in the speech acquisition setup.
2. Analog-to-digital converter: For converting the analog speech to a digital signal we use headphones and sound recording software.
3. Pre-emphasis: The digitized speech signal, s(n), is put through a low-order digital system to spectrally flatten the signal and to make it less susceptible to finite precision effects later in the signal processing. The pre-emphasis network is a fixed first-order system:

   H(z) = 1 - a z^{-1},   0.9 ≤ a ≤ 1.0   (1.1)

   In this case, the output of the pre-emphasis network, \tilde{s}(n), is related to the input s(n) by the difference equation

   \tilde{s}(n) = s(n) - a s(n-1)   (1.2)

   The most common value for a is around 0.95. A simple example of a first-order adaptive pre-emphasizer is the transfer function

   H(z) = 1 - a_n z^{-1}   (1.3)

   where a_n changes with time n according to the chosen adaptation criterion.[10]
4. Sampling: Sampling is the process of converting a continuous-time signal into a discrete-time signal. It
is convenient to represent the sampling operation by a fictitious switch. The switch closes for a very short interval of time, during which the signal is presented at the output. The time interval between successive samples is T seconds, and the sampling frequency is given by

   f = 1/T Hz.   (1.4)
5. Framing: The speech samples obtained from the ADC are segmented into small frames with a length in the range of 10 to 20 msec. The voice signal is divided into frames of N samples, with adjacent frames separated by M samples, i.e., each frame is shifted by about 10 ms relative to the previous one. The values used for N and M correspond to roughly 20 ms and 10 ms at the 11025 Hz sampling rate.
6. Windowing: The next step is to apply a window to each individual frame so as to minimize the signal discontinuities at the beginning and end of each frame. A Hamming window is used with the autocorrelation method of LPC. The Hamming window has the form

   w(n) = 0.54 - 0.46 cos( 2 \pi n / (N - 1) ),   0 ≤ n ≤ N-1.   (1.5)
7. Fast Fourier Transform: The FFT is used to convert each frame of N samples from the time domain into the frequency domain, where the convolution of the glottal pulse and the vocal-tract impulse response in the time domain becomes a multiplication. Equations (1.6) and (1.7) give the value of the FFT and its inverse:

   S(k) = \sum_{j=1}^{N} s(j) \omega_N^{(j-1)(k-1)}   (1.6)

   s(j) = (1/N) \sum_{k=1}^{N} S(k) \omega_N^{-(j-1)(k-1)}   (1.7)

   where \omega_N = e^{-2\pi i / N} is an Nth root of unity.

8. Linear Predictor Coefficients: LPC determines the coefficients of a forward linear predictor by minimizing the prediction error in the least-squares sense. It finds the coefficients of a p-th order linear predictor that predicts the current value of the real-valued time series s(n) from past samples:[16]

   \hat{s}(n) = -a(2) x(n-1) - a(3) x(n-2) - ... - a(p+1) x(n-p)   (1.8)

   Here p is the order of the prediction filter polynomial, a = [1 a(2) ... a(p+1)]. If p is unspecified, LPC uses p = length(x) - 1 as a default. If x is a matrix containing a separate signal in each column, LPC returns a model estimate for each column in the rows of a matrix, together with a column vector of prediction error variances. The order p must be less than or equal to the length of x.
9. Discrete Cosine Transform: The DCT is used to obtain the final coefficients. The DCT can reconstruct a sequence very accurately from only a few DCT coefficients, a useful property for applications requiring data reduction. The DCT of x is a vector y of the same size as x containing the discrete cosine transform coefficients:[15]

   y(k) = w(k) \sum_{n=1}^{N} x(n) cos( \pi (2n-1)(k-1) / (2N) ),   k = 1, ..., N   (1.9)

   where

   w(k) = 1/\sqrt{N} for k = 1, and w(k) = \sqrt{2/N} for 2 ≤ k ≤ N.
10. Applying statistical parameters: For this work we use a simple statistical parameter, the standard deviation. Applying the DCT to the speech signal produces a matrix, and it is generally very difficult to operate on such large data. We therefore reduce the data using the standard deviation and extract 10 features per sample for further work. With 65 samples from the 13 different speakers, we obtain a total of 845 features from our 13 speakers.
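To make step 8 concrete, the following is a minimal sketch of LPC computed via the autocorrelation method and the Levinson-Durbin recursion, the classic way such coefficients are obtained; the function name and interface are illustrative choices, not something fixed by the paper.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Levinson-Durbin recursion on the frame's autocorrelation.

    Returns a = [1, a(2), ..., a(order+1)] such that the predictor of
    eq. (1.8) is s_hat(n) = -a(2)x(n-1) - ... - a(order+1)x(n-order).
    """
    n = len(frame)
    # Autocorrelation lags 0..order of the (windowed) frame
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for this recursion step
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]  # update previous coefficients
        a[i] = k
        err *= (1.0 - k * k)                 # remaining prediction error
    return a
```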
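Building on that, here is a hedged end-to-end sketch of the feature-extraction chain of Fig. 1 (pre-emphasis, framing, Hamming windowing, per-frame LPC, DCT, and standard-deviation reduction). The frame sizes, the LPC order of 12, and the exact way the standard deviation collapses the DCT matrix to 10 features are our assumptions about the intended procedure, since the paper does not pin them down.

```python
import numpy as np
from scipy.fftpack import dct  # type-II DCT, as in eq. (1.9)

def extract_features(signal, rate=11025, order=12, n_features=10):
    # Pre-emphasis, eq. (1.2), with the common value a = 0.95
    emphasized = np.append(signal[0], signal[1:] - 0.95 * signal[:-1])

    # Framing: ~20 ms frames shifted by ~10 ms (assumed values)
    N = int(0.020 * rate)  # frame length in samples
    M = int(0.010 * rate)  # frame shift in samples
    n_frames = 1 + (len(emphasized) - N) // M
    frames = np.stack([emphasized[i * M : i * M + N] for i in range(n_frames)])

    # Hamming window, eq. (1.5)
    frames = frames * np.hamming(N)

    # Per-frame LPC (lpc_coefficients from the previous sketch),
    # dropping the leading 1 of each coefficient vector
    coeffs = np.stack([lpc_coefficients(f, order)[1:] for f in frames])

    # DCT of the coefficient matrix, then reduce with the standard
    # deviation to a fixed-length feature vector (assumed reading)
    transformed = dct(coeffs, type=2, norm='ortho', axis=0)
    stds = transformed.std(axis=0)  # one value per coefficient track
    return stds[:n_features]        # keep 10 features per sample
```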

3.2. Artificial Neural Network
Artificial Neural Networks are composed of simple elements operating in parallel. These elements are inspired by the biological nervous system. Commonly, neural networks are adjusted, or trained, so that a particular input leads to a specific target output. In this paper we used a two-layer feed-forward network with sigmoid hidden and output neurons, which can classify vectors arbitrarily well given enough neurons in its hidden layer. The network is trained with scaled conjugate gradient backpropagation. In this algorithm, input vectors and the corresponding target vectors are used to train the network until it can approximate a function, associate input vectors with specific output vectors, or classify input vectors in an approximate way. The architecture of the two-layer feed-forward network is given in Fig. 2. An elementary neuron with inputs X1, X2, ... is shown; each input is weighted with an appropriate value w_ij.
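For readers without MATLAB, the following is a rough scikit-learn analogue of the classifier described above, assuming the samples list and extract_features function from the earlier sketches. Note that scikit-learn's MLPClassifier does not offer scaled conjugate gradient, so L-BFGS is used here as a stand-in solver; the 20-neuron hidden layer mirrors Fig. 2.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Hypothetical: build the design matrix from the earlier sketches
X = np.stack([extract_features(sig) for _, _, sig in samples])  # shape (65, 10)
y = np.array([spk for spk, _, _ in samples])                    # 13 speaker labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(20,),  # one hidden layer of 20 neurons
                    activation='logistic',     # sigmoid units, as in the paper
                    solver='lbfgs',            # stand-in for scaled conjugate gradient
                    max_iter=1000)
clf.fit(X_train, y_train)
print("recognition rate:", clf.score(X_test, y_test))
```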


Fig. 2: A two-layer feed-forward network with 20 hidden neurons.
4. Results:
The speaker recognition results were obtained using the generated database. 10 features from each speaker's speech were extracted using LPC. A total of 65 samples were used for pattern recognition, classified into 13 classes. In pattern recognition problems, a neural network is used to classify inputs into a set of target categories. The proposed features were tested on an Artificial Neural Network using a MATLAB tool. The Neural Network Pattern Recognition Tool helps to select data, create and train a network, and evaluate its performance using the mean squared error and confusion matrices. The results of the speaker recognition are shown below in Fig. 3(a) and Fig. 3(b), in the form of a confusion matrix and the mean squared error respectively.


Fig. 3(a): Confusion matrix displaying a 98.5% recognition rate

Fig. 3(b): Mean Squared Error (MSE) displaying best validation performance at epoch 102
From the confusion matrix it is clear that, of the 65 samples from the 13 speakers, all but one are correctly classified into the 13 classes; only one sample is misclassified, giving a 98.5% recognition rate and a 1.5% error rate. The Mean Squared Error is the average squared difference between outputs and targets; lower values are better, and zero means no error.
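As a small illustration of how these two figures of merit can be reproduced outside MATLAB (a sketch, assuming the clf, X_test, and y_test objects from the previous snippet):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

pred = clf.predict(X_test)
cm = confusion_matrix(y_test, pred)
recognition_rate = np.trace(cm) / cm.sum()  # diagonal = correct classifications
print(f"recognition rate: {recognition_rate:.1%}, "
      f"error rate: {1 - recognition_rate:.1%}")

# Mean squared error between one-hot targets and predicted class probabilities
proba = clf.predict_proba(X_test)
onehot = (clf.classes_ == y_test[:, None]).astype(float)
print("MSE:", np.mean((proba - onehot) ** 2))
```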
5. Conclusion:
Result of the experiment: the recognition rate of a person through the selected SRS is 98.5%.
Application of the result: the selected SRS can be applied to all Marathi-speaking people, anywhere in the world, who can speak and read the Marathi language in its appropriate form. We hope the SRS created will serve as a baseline system for further research on improving it.

6. References:
[1] Gopalakrishna Anumanchipalli, Rahul Chitturi, "Development of Indian Language Speech Databases for Large Vocabulary Speech Recognition Systems".
[2] www.omniglot.com
[3] Singh, S. P., et al., "Building Large Vocabulary Speech Recognition Systems for Indian Languages", International Conference on Natural Language Processing, 1:245-254, 2004.
[4] Evgeniy Gabrilovich, Alberto D. Berstin, "Speaker recognition: using a vector quantization approach for robust text-independent speaker identification", Technical report DSPG-95-9-001, September 1995.
[5] Tridibesh Dutta, "Text dependent speaker identification based on spectrograms", Proceedings of Image and Vision Computing, New Zealand, 2007.
[6] D. A. Reynolds, "An overview of automatic speaker recognition technology", Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP'02), 2002, pp. IV-4072-IV-4075.
[7] Jamel Price and Ali Eydgahi, "Design of Matlab-Based Automatic Speaker Recognition Systems", 9th International Conference on Engineering Education, T4J-1, July 23-28, 2006.
[8] "Isolated Word Speech Recognition using Dynamic Time Warping", Dynamic Time Warping, 14 June 2005.
[9] Jiehua Dai, Zhengzhe Wei, "Study and Implementation of Feature Extraction and Comparison in Voice Recognition".
[10] Bharti W. Gawali, Santosh Gaikwad, Pravin Yannawar, Suresh C. Mehrotra, "Marathi Isolated Word Recognition System using MFCC and DTW Features", Proc. of Int. Conf. on Advances in Computer Science, 2010.
[11] Qi Li, Frank K. Soong, and Olivier Siohan, "A High-Performance Auditory Feature for Robust Speech Recognition".
[12] Dr. H. B. Kekre, Vaishali Kulkarni, "Speaker Identification using Row Mean of DCT and Walsh Hadamard Transform", International Journal on Computer Science and Engineering (IJCSE), ISSN: 0975-3397, Vol. 3, No. 3, March 2011.
[13] P. Ramesh Babu, Digital Signal Processing, Scitech Publications (India) Pvt. Ltd.
[14] Lawrence Rabiner, Biing-Hwang Juang, Fundamentals of Speech Recognition, Pearson Education (Singapore) Pte. Ltd., Indian Branch.

R. B. Shinde received the M.Sc. (CS) degree from Dr. B. A. M. University, Aurangabad, in 2001. She is currently working as a lecturer in the College of Computer Science and Information Technology, Latur, Maharashtra, and is pursuing a Ph.D. degree at S.R.T.M. University, Nanded.

Dr. Vrushsen V. Pawar received MS and Ph.D. (Computer) degrees from the Dept. of CS & IT, Dr. B. A. M. University, and a PDF from ES, University of Cambridge, UK. He also received MCA (SMU) and MBA (VMU) degrees. He has received prestigious fellowships from DST, UGRF (UGC), the Sakaal foundation, ES London, ABC (USA), etc. He has published more than 90 research papers in reputed national and international journals and conferences. He is a recognized Ph.D. guide of the University of Pune, S.R.T.M. University, and Singhaniya University (India). He is a senior IEEE member and a member of other reputed societies. He is currently working as an Associate Professor in the CS Dept. of SRTMU, Nanded.

