
IVR (Interactive Voice Response) with Pattern Recognition

ABSTRACT

In the present era of information technology, information is just a telephone call away. However, applications such as telephone banking need extra security to make them reliable services for the public. Entering a PIN code/password via telephone is not enough; additional user-specific information is required to protect the user's identity more effectively. In this paper, we propose an approach that combines Interactive Voice Response (IVR) with pattern recognition based on neural networks. After entering the correct password, the user is asked to provide a voice sample, which is used to verify his identity. Adding voice pattern recognition to the authentication process can further enhance the security level, and since both checks are applied together, there is less scope for misuse. The developed system is fully compatible with the landline phone system.

INTRODUCTION
In telephony, Interactive Voice Response, or IVR, is a computerized system that allows a person, typically a telephone caller, to select an option from a voice menu. Applications such as checking bank account balances, making transfers, and accessing the databases of strategic organizations require a high level of security. In such applications the information is secured by the use of a Personal Identification Number (PIN). However, this approach alone is not secure and is prone to tampering and misuse.

To overcome this problem, a pattern recognition approach based on neural networks is proposed. User-specific patterns such as fingerprints, retina scans, facial features, DNA sequences and voice can be used for authentication. Among these, however, voice authentication is the most readily available and the most suitable for this application. The speaker recognition area has a long and rich scientific basis, with over 30 years of research, development and evaluations.

Inherent in attempts at speaker identity verification is the general assumption that, at some level of scrutiny, no two individuals have exactly the same voice characteristics. In the proposed approach, besides entering the PIN code, the user is also asked to be recognized through his voice signature, which further enhances the security of access to various applications.

The results, evaluated on false-accept and false-reject criteria, are promising and offer quick response times. The approach can potentially play an effective role alongside existing authentication techniques used for identity verification when accessing secured services through the telephone or similar media. In the proposed model, speaker-specific features are extracted and a Multilayer Perceptron (MLP) is used for feature matching.

INTERACTIVE VOICE RESPONSE SYSTEM


2. WHAT IS IVRS
The INTERACTIVE VOICE RESPONSE SYSTEM (IVRS) is an important development in the field of interactive communication which makes use of the most modern technology available today. IVRS is a unique blend of the communication and software fields, incorporating the best features of both streams of technology. IVRS is an electronic system through which information related to any topic about a particular organization can be obtained over telephone lines from anywhere in the world.

IVRS provides a friendly and faster self-service alternative to speaking with customer service agents. It finds large-scale use in the enquiry systems of railways, banks, universities, tourism, industry, etc. It is the easiest and most flexible mode of interactive communication, because pressing a few numbers on the telephone set provides the user with a wide range of information on the desired topic. IVRS reduces the cost of servicing customers. In telecommunications, IVRS allows customers to interact with a company's database via a telephone keypad or by speech recognition, after which they can service their own inquiries by following the IVR dialogue. IVR systems can respond with prerecorded or dynamically generated audio to further direct users on how to proceed. IVR applications can be used to control almost any function where the interface can be broken down into a series of simple interactions. IVR systems deployed in the network are sized to handle large call volumes.

The use of IVR and voice automation enables a company to improve its customer service and lower its costs: callers' queries can be resolved without queuing and without incurring the cost of a live agent, who in turn can be directed to deal with more demanding areas of the service. If the caller does not find the information they need, or requires further assistance, the call can then be transferred to an agent. This makes for a more efficient system in which agents have more time to deal with complex interactions. When an IVR system answers multiple phone numbers, the use of DNIS (Dialed Number Identification Service) ensures that the correct application and language is executed. A single large IVR system can handle calls for thousands of applications, each with its own phone numbers and script. IVR also enables customer prioritization: in a system where individual customers may have different status, the service will automatically prioritize the individual's call and move such customers to the front of a specific queue. Prioritization could also be based on the DNIS and call reason. IVR technology is also being introduced into automobile systems for hands-free operation.

2.1 IVRS Block Diagram

Fig. 2.1 IVRS block diagram

The IVRS as a whole consists of the user's telephone, the telephone connection between the user and the IVRS, and the personal computer which stores the database. The interactive voice response system consists of the following parts.

2.1.1 Hardware Section
1. Relay: For switching between the ring detector and the DTMF decoder.
2. Ring detector: To detect the presence of incoming calls.
3. DTMF decoder: To convert the DTMF tones to 4-bit BCD codes.
4. Microcontroller: To accept the BCD codes, process them and transmit them serially to the PC.
5. Level translator: To provide the interface between the PC and the microcontroller.
6. Personal computer: To store the database and to carry out the text-to-speech conversion.
7. Audio amplifier: To provide audio amplification of the sound card output and to act as a buffer between the telephone line and the sound card.

2.1.2 Software Selection
1. Visual Basic 6.0
2. Oracle 8.0
3. Microsoft Agent

2.2 Operations of IVRS


The user dials the phone number connected to the IVRS. The call is taken over by the IVRS after a delay of 12 seconds, during which the call can be attended by the operator. If after 12 seconds the ring detector output is low, it is ensured that the phone has not been picked up by the operator. The microcontroller then switches the relay to the DTMF decoder and sends a signal via RS-232 to the PC to play the wave file welcoming the user to the IVRS. The user is also informed of the various codes present in the system, which the user can dial in order to access the necessary information. Thirty seconds are given to the user to press the codes, failing which the relay is switched back. The DTMF decoder converts the codes pressed by the user to BCD; these are presented to the input pins of the microcontroller and stored in its memory. After the codes have been received, they are transmitted serially to the serial port of the PC via the MAX232 IC. Any hardware failure in transmission results in the lighting of an LED, and the relay is switched back. The serial port of the PC is continually polled by the software (Visual Basic together with the Microsoft Agent program), and the received code words are moved from the input buffer into a text box. The received personal identification number (PIN) is compared with the stored database to determine the result, and the corresponding wave file is played by the Sound Blaster card. The card is coupled to the telephone line through the audio amplifier, which amplifies the card's output, drives the telephone line, and acts as a buffer for the card. A sketch of the PC-side logic follows.
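The PC-side flow just described (poll the serial port for BCD codes, assemble the PIN, look it up, play the corresponding wave file) can be sketched as follows. This is a minimal illustration in Python rather than the Visual Basic 6.0 actually used; the port name, baud rate, BCD framing, four-digit PIN length, and the play_wave helper are all assumptions.

```python
# Hypothetical sketch of the PC-side logic; not the actual VB6 program.
import serial  # pyserial

PIN_DATABASE = {"1234": "balance.wav", "9876": "transfer.wav"}  # illustrative

def play_wave(path):
    # Stand-in for playing a wave file through the sound card / amplifier.
    print(f"[playing {path} over the telephone line]")

ser = serial.Serial("COM1", baudrate=9600, timeout=30)  # 30 s entry window
play_wave("welcome.wav")

digits = ""
while len(digits) < 4:
    byte = ser.read(1)             # one 4-bit BCD code per byte via MAX232
    if not byte:                   # timeout: the relay would be switched back
        play_wave("timeout.wav")
        break
    digits += str(byte[0] & 0x0F)  # assume the digit sits in the low nibble

if len(digits) == 4:
    play_wave(PIN_DATABASE.get(digits, "invalid_pin.wav"))
```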

2.3 Advantages of IVRS


1. The addition of speech recognition capabilities helps IVRS owners derive more benefit from their investment in existing IVRS resources.
2. What motivates organizations to embrace speech solutions is the potential for dramatic reductions in operational cost.
3. Increased automation frees the customer service agents from routine administrative tasks and reduces costs related to customer service staffing; that is, fewer agents are able to serve more customers.
4. Resources that have been developed to support an Internet presence can support an IVRS as well. Organizations can thus reuse some of the same data modules built for their intranets in speech-enabled IVRS applications, which can deliver a high degree of code reuse.

PATTERN RECOGNITION

INTRODUCTION
Automatic (machine) recognition, description, classification, and grouping of patterns are important problems in a variety of engineering and scientific disciplines such as biology, psychology, medicine, marketing, computer vision, artificial intelligence, and remote sensing. A pattern could be a fingerprint image, a handwritten cursive word, a human face, or a speech signal. Given a pattern, its recognition/classification may consist of one of the following two tasks: 1) supervised classification (e.g., discriminant analysis), in which the input pattern is identified as a member of a predefined class; 2) unsupervised classification (e.g., clustering), in which the pattern is assigned to a hitherto unknown class. The recognition problem here is posed as a classification or categorization task, where the classes are either defined by the system designer (in supervised classification) or learned based on the similarity of patterns (in unsupervised classification). Applications of pattern recognition include data mining (identifying a pattern, e.g., a correlation or an outlier, in millions of multidimensional patterns), document classification (efficiently searching text documents), financial forecasting, organization and retrieval of multimedia databases, and biometrics.

The rapidly growing available computing power, while enabling faster processing of huge data sets, has also facilitated the use of elaborate and diverse methods for data analysis and classification. At the same time, demands on automatic pattern recognition systems are rising enormously due to the availability of large databases and stringent performance requirements (speed, accuracy, and cost). The design of a pattern recognition system essentially involves the following three aspects: 1) data acquisition and preprocessing, 2) data representation, and 3) decision making.

The problem domain dictates the choice of sensor(s), preprocessing technique, representation scheme, and the decision making model. It is generally agreed that a well-defined and sufficiently constrained recognition problem (small intraclass variations and large interclass variations) will lead to a compact pattern representation and a simple decision making strategy. Learning from a set of examples (training set) is an important and desired attribute of most pattern recognition systems. The four best known approaches for pattern recognition are: 1) template matching, 2) statistical classification, 3) syntactic or structural matching, and 4) neural networks.

3.1 Voice Recognition


Voice recognition is different from speech recognition. Speech recognition (also known as automatic speech recognition or computer speech recognition) converts spoken words to text. The term "voice recognition" (also called speaker recognition) refers to recognition systems that must be trained to a particular speaker. Speaker recognition, which can be classified into identification and verification, is the process of automatically recognizing who is speaking on the basis of individual information included in the speech waves. This technique makes it possible to use the speaker's voice to verify their identity and control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.


Fig. 3.1 shows the basic components of speaker identification and verification systems. Speaker identification is the process of determining which registered speaker provides a given utterance. Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim of a speaker. Most applications in which a voice is used as the key to confirm the identity of a speaker are classified as speaker verification. Speaker recognition methods can also be divided into text-dependent and text-independent methods. The former require the speaker to say key words or sentences having the same text for both training and recognition trials, whereas the latter do not rely on a specific text being spoken.

(a) Speaker identification

(b) Speaker Verification


Fig. 3.1. Basic structure of speaker recognition systems

Both text-dependent and text-independent methods share a problem, however. These systems can be deceived by someone who plays back the recorded voice of a registered speaker saying the key words or sentences, and is then accepted as the registered speaker. To cope with this problem, there are methods in which a small set of words, such as digits, are used as key words, and each user is prompted to utter a given sequence of key words that is randomly chosen every time the system is used. Yet even this method is not completely reliable, since it can be deceived with advanced electronic recording equipment that can reproduce the key words in the requested order. Therefore, a text-prompted speaker recognition method has recently been proposed.


NEURAL NETWORKS
4.1 What is a Neural Network?
An Artificial Neural Network (ANN) is an information processing paradigm that is inspired by the way biological nervous systems, such as the brain, process information. The key element of this paradigm is the novel structure of the information processing system. It is composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems. ANNs, like people, learn by example. An ANN is configured for a specific application, such as pattern recognition or data classification, through a learning process. Learning in biological systems involves adjustments to the synaptic connections that exist between the neurons. This is true of ANNs as well.

4.2 Why use neural networks?


Neural networks, with their remarkable ability to derive meaning from complicated or imprecise data, can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. A trained neural network can be thought of as an "expert" in the category of information it has been given to analyse. This expert can then be used to provide projections given new situations of interest and to answer "what if" questions. Other advantages include:
1. Adaptive learning: an ability to learn how to do tasks based on the data given for training or initial experience.
2. Self-organization: an ANN can create its own organization or representation of the information it receives during learning time.
3. Fault tolerance via redundant information coding: partial destruction of a network leads to a corresponding degradation of performance; however, some network capabilities may be retained even with major network damage.

4.3 Pattern Recognition - an example


An important application of neural networks is pattern recognition. Pattern recognition can be implemented by using a feed-forward neural network (Fig. 4.1) that has been trained accordingly. During training, the network learns to associate outputs with input patterns. When the network is used, it identifies the input pattern and tries to produce the associated output pattern. The power of neural networks comes to life when a pattern that has no associated output is given as an input. In this case, the network gives the output that corresponds to the taught input pattern least different from the given pattern.

Fig. 4.1

For example, the network of Fig. 4.1 is trained to recognise the patterns T and H. The associated output patterns are all black and all white respectively, as shown below.


If we represent black squares with 0 and white squares with 1, then a truth table can be drawn up for each of the three neurones after generalisation. Each table lists the neurone's output for all eight combinations of its three inputs (X11, X12, X13 for the top neurone; X21, X22, X23 for the middle neurone; X31, X32, X33 for the bottom neurone). [The truth tables for the top, middle and bottom neurones appeared here; entries marked 0/1 denote input combinations for which the output is undefined (random).]

From the tables, the following associations can be extracted:

In the first case, it is obvious that the output should be all black, since the input pattern is almost the same as the 'T' pattern.

In the second case, it is likewise obvious that the output should be all white, since the input pattern is almost the same as the 'H' pattern.

In the third case, the top row is two errors away from a T and three from an H, so the top output is black. The middle row is one error away from both T and H, so the output is random. The bottom row is one error away from T and two away from H, so the output is black. The total output of the network is still in favor of the T shape. A small numeric sketch of such a network follows.
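The behaviour described above can be reproduced in a few lines of numpy. The sketch below trains a tiny fully connected feed-forward layer on the two 3x3 patterns (black = 0, white = 1); the architecture, learning rate, and iteration count are illustrative assumptions, not the exact network of Fig. 4.1.

```python
# Toy feed-forward pattern recogniser for the T/H example (assumptions noted above).
import numpy as np

T = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1], dtype=float)  # 'T' pattern
H = np.array([0, 1, 0, 0, 0, 0, 0, 1, 0], dtype=float)  # 'H' pattern
X = np.stack([T, H])
Y = np.array([[0, 0, 0], [1, 1, 1]], dtype=float)        # all black / all white

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(9, 3))  # 9 inputs -> 3 row neurones
b = np.zeros(3)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):                    # plain gradient descent on squared error
    out = sigmoid(X @ W + b)
    grad = (out - Y) * out * (1 - out)   # dE/dz at each output
    W -= 0.5 * X.T @ grad
    b -= 0.5 * grad.sum(axis=0)

# A distorted 'T' (one flipped square) should still map close to all black (0s).
noisy_T = T.copy()
noisy_T[4] = 1
print(sigmoid(noisy_T @ W + b).round(2))
```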

4.4 Feed-forward networks


Feed-forward ANNs (Fig. 4.1) allow signals to travel one way only: from input to output. There is no feedback (no loops), i.e. the output of any layer does not affect that same layer. Feed-forward ANNs tend to be straightforward networks that associate inputs with outputs. They are extensively used in pattern recognition. This type of organisation is also referred to as bottom-up or top-down.

4.5 The Back-Propagation Algorithm


In order to train a neural network to perform some task, we must adjust the weights of each unit in such a way that the error between the desired output and the actual output is reduced. This process requires that the neural network compute the error derivative of the weights (EW); in other words, it must calculate how the error changes as each weight is increased or decreased slightly. The back-propagation algorithm is the most widely used method for determining the EW. The algorithm is easiest to understand if all the units in the network are linear. It computes each EW by first computing the EA, the rate at which the error changes as the activity level of a unit is changed. For output units, the EA is simply the difference between the actual and the desired output. To compute the EA for a hidden unit in the layer just before the output layer, we first identify all the weights between that hidden unit and the output units to which it is connected. We then multiply those weights by the EAs of those output units and add the products; this sum equals the EA for the chosen hidden unit. After calculating all the EAs in the hidden layer just before the output layer, we can compute in like fashion the EAs for other layers, moving from layer to layer in a direction opposite to the way activities propagate through the network; this is what gives back-propagation its name. Once the EA has been computed for a unit, it is straightforward to compute the EW for each incoming connection of the unit: the EW is the product of the EA and the activity through the incoming connection. Note that for non-linear units, the back-propagation algorithm includes an extra step: before back-propagating, the EA must be converted into the EI, the rate at which the error changes as the total input received by a unit is changed. A sketch of these computations for linear units is given below.
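The EA/EW bookkeeping for the linear case can be written out directly. This sketch uses a one-hidden-layer linear network with made-up sizes and data; it mirrors the rules stated above (output EA = actual minus desired, hidden EA = weighted sum of downstream EAs, EW = EA times incoming activity).

```python
# Illustrative EA/EW computation for linear units (sizes and data assumed).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)           # input activities
W1 = rng.normal(size=(4, 3))     # input -> hidden weights
W2 = rng.normal(size=(3, 2))     # hidden -> output weights
target = np.array([1.0, -1.0])

h = x @ W1                       # hidden activities (linear units)
y = h @ W2                       # output activities

# EA for output units: difference between actual and desired output.
EA_out = y - target
# EA for hidden units: weights to the outputs times the output EAs, summed.
EA_hid = W2 @ EA_out
# EW for each connection: EA of the receiving unit times the incoming activity.
EW2 = np.outer(h, EA_out)        # gradients for hidden -> output weights
EW1 = np.outer(x, EA_hid)        # gradients for input -> hidden weights

lr = 0.01                        # gradient-descent weight update
W2 -= lr * EW2
W1 -= lr * EW1
```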

SPEAKER RECOGNITION USING ANN

5.1 Modes
At the highest level, all speaker recognition systems contain two modules: feature extraction and feature matching. They also operate in two modes: a training mode and a recognition/testing mode. Both modes include feature extraction and feature matching. In the training mode, speaker models are created for the database; this is also called the enrollment mode, in which speakers are enrolled in the database. In this mode, useful features are extracted from the speech signal and the model is trained.

The objective of the model is generalization of the speaker's voice beyond the training material, so that any unknown speech signal can be classified as the intended speaker or an imposter. In the recognition mode, the system makes a decision about the unknown speaker's identity claim. Features are extracted from the speech signal of the unknown speaker using the same technique as in the training mode, and the speaker model from the database is used to calculate a similarity score. Finally, a decision is made based on the similarity score. For speaker verification, the decision is either to accept or reject the identity claim.

Two types of errors occur in a speaker verification system: False Reject (FR) and False Accept (FA). When a true speaker is rejected by the speaker recognition system, it is called an FR. Similarly, an FA occurs when an imposter is recognized as a true speaker. The input pattern used for verification can be either text-dependent or text-independent. For a text-dependent speech pattern, the speaker is asked to utter a prescribed text, whereas in the text-independent case the user is free to speak any text. The text-independent speech pattern is considered more flexible, as the user does not need to memorize the text.

5.2 Methodology Adopted

Fig. 5.1 Methodology Adopted


5.3 Feature Extractor


The speech samples from a single speaker are recorded. Five samples for each word are used for training the neural networks. The LPC Cepstrum coefficients of each word are extracted and the K-means vector quantization is applied to get the reduced trajectories. The feature extraction consists of the following steps:

5.3.1 Speech Sampling: The speech was recorded and sampled using an off-the-shelf, relatively inexpensive dynamic microphone and a standard PC sound card. The incoming signal was sampled at 22,050 Hz with 16 bits of precision.

5.3.2 Endpoint Detection: A fast and robust technique for accurately locating the endpoints of isolated words has been used. This technique uses frame energy to acquire the reference points. The algorithm takes frames of 100 samples each, calculates the energy of every frame, and averages it over all the frames to get the reference energy value. The energy of frame i is calculated as

P[i] = Σ_{k=1..j} s[k]²    (1)

where s[k] are the speech data in the frame and j is the frame length. P is calculated for all m frames and the average is taken as the final energy value E:

E = ( Σ_{k=1..m} P[k] ) / m    (2)

The threshold is set at (constant × E) as the detection criterion. A sketch of this detector is given below.
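A minimal sketch of the frame-energy detector, assuming 100-sample frames as stated; the threshold constant is an illustrative choice.

```python
# Frame-energy endpoint detection per Eqs. (1) and (2); constant is assumed.
import numpy as np

def detect_endpoints(signal, frame_len=100, constant=0.3):
    """Return (start, end) sample indices of the spoken word, or None."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames.astype(float) ** 2).sum(axis=1)   # P[i], Eq. (1)
    threshold = constant * energy.mean()               # constant * E, Eq. (2)
    active = np.flatnonzero(energy > threshold)        # frames above threshold
    if active.size == 0:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len
```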

5.3.3 Pre-emphasis: As is common in speech recognizers, a pre-emphasis filter was applied to the digitized speech to spectrally flatten the signal and diminish the effects of finite numeric precision in further calculations. This type of filter boosts the magnitude of the high-frequency components while leaving the lower ones relatively untouched.

5.3.4 Framing and Windowing:


After the signal was sampled, the utterances isolated, and the spectrum flattened, each signal was divided into a sequence of data blocks, each spanning 300 samples and separated by 100 samples. Each block was then multiplied by a Hamming window of the same width as the block, to lessen leakage effects. A sketch of these steps follows.
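A short sketch of the pre-emphasis, framing, and windowing steps. The pre-emphasis coefficient 0.95 is an assumption (a common choice); the paper does not state the value used.

```python
# Pre-emphasis, 300-sample blocks with a 100-sample shift, Hamming window.
import numpy as np

def preemphasize(x, coeff=0.95):
    # y[n] = x[n] - coeff * x[n-1]; coeff is an assumed value
    return np.append(x[0], x[1:] - coeff * x[:-1])

def frame_and_window(x, block=300, shift=100):
    n_blocks = 1 + (len(x) - block) // shift
    idx = np.arange(block) + shift * np.arange(n_blocks)[:, None]
    return x[idx] * np.hamming(block)   # apply a Hamming window to each block

# usage: blocks = frame_and_window(preemphasize(speech.astype(float)))
```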

5.3.5 LPC Analysis: A vector of 12 Linear Predictive Coding (LPC) cepstrum coefficients was then obtained from each data block using Durbin's method, as sketched below.
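A minimal numpy sketch of Durbin's (Levinson-Durbin) recursion on the block autocorrelation, followed by the standard LPC-to-cepstrum recursion; this illustrates the method, not the exact implementation used in the paper.

```python
# LPC analysis via Levinson-Durbin, then LPC-to-cepstrum conversion.
import numpy as np

def lpc(block, order=12):
    """LPC polynomial coefficients a[0..order] (a[0] = 1), Durbin's method."""
    n = len(block)
    r = np.correlate(block, block, mode="full")[n - 1:n + order]  # autocorrelation
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err  # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]         # update inner coefficients
        a[i] = k
        err *= 1.0 - k * k                          # residual prediction error
    return a

def lpc_cepstrum(a, n_ceps=12):
    """Standard recursion from the LPC polynomial to cepstrum coefficients."""
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        c[n] = -a[n] - sum((k / n) * c[k] * a[n - k] for k in range(1, n))
    return c[1:]

# usage on one Hamming-windowed block (hypothetical variable name):
# ceps = lpc_cepstrum(lpc(windowed_block))
```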

5.3.6 Vector Quantization: The dimensionality of the LPC cepstrum vectors is reduced using the vector quantization technique; a total of 36 coefficients is obtained after vector quantization. The K-means algorithm is used, clustering a set of L training vectors into a set of M codebook vectors as follows (see the sketch after this list):
1. Initialization: Arbitrarily choose M vectors (initially out of the training set of L vectors) as the initial set of code words in the codebook.
2. Nearest-Neighbor Search: For each training vector, find the code word in the current codebook that is closest (in terms of spectral distance) and assign that vector to the corresponding cell.
3. Centroid Update: Update the code word in each cell using the centroid of the training vectors assigned to that cell.
4. Iteration: Repeat steps 2 and 3 until the average distance falls below a preset threshold.
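A minimal K-means codebook trainer matching the four steps above, clustering the L x 12 cepstrum vectors into M = 3 codewords (3 x 12 = 36 final coefficients). The initialisation and convergence threshold are illustrative choices.

```python
# K-means vector quantization following steps 1-4 above (expects float data).
import numpy as np

def kmeans_codebook(train, M=3, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    codebook = train[rng.choice(len(train), M, replace=False)]   # step 1
    prev = np.inf
    while True:
        d = np.linalg.norm(train[:, None, :] - codebook[None], axis=2)
        nearest = d.argmin(axis=1)                               # step 2
        for m in range(M):                                       # step 3
            members = train[nearest == m]
            if len(members):
                codebook[m] = members.mean(axis=0)
        avg = d[np.arange(len(train)), nearest].mean()
        if prev - avg < tol:                                     # step 4
            return codebook
        prev = avg
```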

After the VQ stage only 3 vectors of size 12 are left. The output of this last stage is the final feature used throughout.


5.4 Recognizer
The recognizer block is built using the neural network approach. The two types of neural networks used are the Multilayer Perceptron (MLP) and the Recurrent Neural Network (RNN). A neural network is a collection of layers of "neurons" simulating the structure of the human brain. Each neuron takes input from every neuron in the previous layer (or from the outside world, if it is in the first layer), forms a weighted sum of these inputs, and passes the result on to the next layer; each connection between layers carries a certain weight. During training, every time the neural network processes an input, it adjusts these weights to bring the output closer to the desired value. After several repetitions (each repetition is an iteration), the network can produce the correct output even from a loose approximation of the input.

5.5 Results
Two different approaches were used for recognition. For each word, five different training samples were used and the networks were trained. The recognition accuracies were then calculated by recording further samples of the words.

5.5.1 MLP Approach

Fig. 5.2 Architecture of a multi-layer perceptron with two hidden layers

The MLP had 36 input nodes, 36 hidden neurons, and 1 output neuron. The output of the neurons lay inside the interval [-1, +1]; a tan-sigmoid was used as the threshold function. Each neuron had an extra connection whose input was kept constant and equal to one (the literature usually refers to this connection as the bias or threshold). The weights were initialized with random values selected within a small interval. The MLP was trained using the error back-propagation method. A sketch of this architecture follows.
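Under these assumptions (36-36-1 topology, tan-sigmoid units, bias inputs fixed at one, small random initial weights), the forward pass can be sketched as:

```python
# Forward pass of the 36-36-1 MLP; np.tanh stands in for the tan-sigmoid.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.uniform(-0.1, 0.1, size=(36, 36))  # input -> hidden: 1296 weights
b1 = rng.uniform(-0.1, 0.1, size=36)        # bias ("extra connection" per neuron)
W2 = rng.uniform(-0.1, 0.1, size=36)        # hidden -> output: 36 weights
b2 = rng.uniform(-0.1, 0.1)

def mlp_output(features):
    """features: the 36 VQ cepstrum values for one utterance."""
    hidden = np.tanh(features @ W1 + b1)    # hidden outputs in [-1, +1]
    return np.tanh(hidden @ W2 + b2)        # scalar score in [-1, +1]
```

Training these weights would use the back-propagation updates sketched earlier.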

TEST: Around fifty seconds of speech data of the intended speaker was collected for training the neural network. In the testing phase, a 10% tolerance is allowed for the intended speaker, i.e. if the output of the network is within 10% of the target value, the speaker is still recognized as the intended speaker; otherwise he is rejected. The test data consist of fifty (50) speech samples of the speaker for whom the network was trained (other than those used for training) and 125 samples of imposter speech. The imposter speech data was collected from 13 persons (male). Out of the 50 samples of the intended speaker, 41 were recognized, so the false reject rate is only 18%. Similarly, for the imposter data, only 17 out of 125 trials were falsely accepted, so the false accept rate is about 14%. Table 1 summarizes the details of the test results.

The performance measure is expressed as the half total error rate (HTER), defined as

HTER = (FA + FR)/2    (3)

where FA and FR are the false acceptance and false rejection rates, respectively. It was introduced into practice in the international speaker recognition evaluation campaigns organized by NIST and NFI/TNO. In our case, the HTER comes out to 16%, which is very promising.
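The reported figures can be checked directly from Eq. (3):

```python
# Quick arithmetic check of the reported error rates.
false_reject = 1 - 41 / 50        # 9 of 50 true-speaker samples rejected: 18%
false_accept = 17 / 125           # 17 of 125 imposter trials accepted: ~14%
hter = (false_accept + false_reject) / 2
print(f"HTER = {hter:.1%}")       # 15.8%, rounded to 16% in the text
```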

5.5.2 Recurrent Neural Network Approach


Fig. 5.3 Structure of a recurrent neural network

While training an Elman network, the following occurs at each epoch (a minimal sketch follows the list):
1. The entire input sequence is presented to the network, and its outputs are calculated and compared with the target sequence to generate an error sequence.
2. For each time step, the error is back-propagated to find the gradients of the errors for each weight and bias. This gradient is actually an approximation, since the contributions of the weights and biases to the errors via the delayed recurrent connection are ignored.
3. This gradient is then used to update the weights with the back-propagation training function chosen by the user.
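Below is a minimal forward pass of an Elman network, showing the delayed recurrent (context) connection referred to in step 2. The layer sizes and random weights are illustrative assumptions (10 hidden and 1 output neuron as in the comparison below; a 12-dimensional input per time step).

```python
# Elman network forward pass: previous hidden state feeds back as context.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 12, 10
W_in = rng.normal(scale=0.1, size=(n_in, n_hidden))
W_ctx = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # recurrent weights
W_out = rng.normal(scale=0.1, size=n_hidden)

def elman_forward(sequence):
    """sequence: iterable of 12-dim feature vectors (one per time step)."""
    context = np.zeros(n_hidden)
    outputs = []
    for x in sequence:
        context = np.tanh(x @ W_in + context @ W_ctx)  # delayed recurrence
        outputs.append(np.tanh(context @ W_out))       # scalar score per step
    return outputs
```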

5.6 Comparison

Some comments can be made concerning the MLP approach:
1. Its recognition accuracies were better than those obtained with the RNN approach. Even so, its performance is still below the limits required for practical applications.
2. The input layer consists of 36 neurons. The hidden layer was defined by 36 hidden neurons, giving 36 × 36 = 1296 weights (1296 floating-point values). The output layer consists of only 1 neuron with 36 weights (36 floating-point values). In total, each MLP requires 1332 floating-point values.

A few comments can also be made concerning the RNN, in the light of the items used for comparison above:
1. It achieved only 80% recognition accuracy.
2. In terms of memory requirements, it is the best: the fully connected RNN with 10 hidden neurons and 1 output neuron requires only 360 floating-point values.

CONCLUSION

An Interactive Voice Response (IVR) system based on a neural network approach has been proposed, incorporating user-specific features extracted from the voice, with a Multilayer Perceptron (MLP) used for feature matching. The preliminary results show the promise of an approach that can potentially bring added security to applications involving access to bank services and the like via telephone. Further work will focus on reducing the recognition errors measured by the false-accept and false-reject criteria.
