
Voice Command Based Wheelchair


Abstract— This paper describes our project to control a wheelchair based on voice commands. The system recognizes the basic spoken commands required to move the wheelchair, using a model trained with convolutional neural networks on a speech dataset. Each voice input is compared against the predefined command set to find the best match, and the wheelchair moves accordingly.

Keywords—convolutional neural networks; wheelchair; voice; speech; dataset

I. INTRODUCTION
The project aims to benefit persons with disabilities or injuries who use a wheelchair and find it difficult to control it by themselves, by providing a smart voice-command recognition system that is interfaced with the motors of the wheelchair. The two main aspects of this project are the training and the testing of the input voice data. The voice recognition is done by convolutional neural networks for small-footprint keyword spotting. This method is quick to train and easy to understand. For training we use the Speech Commands dataset, which was collected and released by Google (under a CC BY licence). Using this dataset, we build a model that classifies an input audio sample as either silence or one of the basic commands required to move the wheelchair, such as “left”, “right” and “stop”. This model is then run on an Arduino.

II. NEURAL NETWORKS

A. DEEP NEURAL NETWORKS
A deep neural network is a feedforward neural network with more than one hidden layer. Every layer consists of neurons, each of which takes all outputs of the lower layer as its input, multiplies them by a weight vector, sums the result, and passes it through a non-linear activation function. The neuron activations in each layer are therefore given by the following matrix equation [1]:

$\mathbf{a}^{(l)} = \sigma\left(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\right)$

where $\mathbf{W}^{(l)}$ and $\mathbf{b}^{(l)}$ are the weight matrix and bias vector of layer $l$, and $\sigma$ is the non-linear activation function.
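As a minimal sketch of this layer-by-layer computation (the layer sizes, random parameters and sigmoid activation below are illustrative assumptions, not the configuration used in this project):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
sizes = [40, 128, 128, 4]                  # input dim, two hidden layers, output dim
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]

a = rng.normal(size=sizes[0])              # activations of the input layer
for W, b in zip(weights, biases):          # a^(l) = sigma(W^(l) a^(l-1) + b^(l))
    a = sigmoid(W @ a + b)
print(a)                                   # output-layer activations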
B. CONVOLUTIONAL NEURAL NETWORKS
Convolutional neural networks are very similar to ordinary neural networks: they are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity.

a) CNN architecture
A CNN consists of three types of layers: convolutional layers, pooling layers, and fully-connected layers. In a convolutional layer, each neuron takes its inputs from a small rectangular section of the previous layer, multiplying those local inputs against the weight matrix W. The weight matrix, or localized filter, is replicated across the entire input space to detect a specific kind of local pattern. All neurons sharing the same weights compose a feature map. A complete convolutional layer is composed of many feature maps, generated with different localized filters, to extract multiple kinds of local patterns at every location. In our case of speech recognition, the input space is a 2-D plane with frequency and time axes. We apply convolution only along the frequency axis, leaving HMMs to handle temporal variations, because most recent works show that shift invariance in frequency is more important than shift invariance in time [6] [7].
After each convolutional layer, there may be a pooling layer. The pooling layer similarly takes inputs from a local region of the previous convolutional layer and down-samples it to produce a single output from that region. One common pooling operator used for CNNs is max-pooling, which outputs the maximum value within each sub-region. By down-sampling, we not only reduce the computational complexity for the upper layer but also achieve a degree of robustness to slight position changes of local patterns [13].
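As a minimal numeric illustration of max-pooling (the region size and values are assumed for the example):

import numpy as np

def max_pool_1d(feature_map, pool=3):
    # Keep only as many values as fill whole pooling regions,
    # then take the maximum within each region.
    n = len(feature_map) // pool * pool
    return feature_map[:n].reshape(-1, pool).max(axis=1)

fm = np.array([0.1, 0.9, 0.4, 0.2, 0.8, 0.3])
print(max_pool_1d(fm))                     # -> [0.9 0.8]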
Finally, after one or more convolutional-pooling building blocks, fully-connected layers take the output of all neurons from the previous layer and apply high-level reasoning on these “invariant” features. While it is possible to stack multiple building blocks of convolutional and pooling layers, our experiments found that additional convolutional blocks do not result in further improvement. Therefore our architecture adopts one convolutional layer followed by one max-pooling layer and then four fully-connected layers.
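A sketch of this architecture in Keras is given below; the filter count, kernel size, pooling size, layer widths and input shape are assumptions for illustration, since they are not specified here. The (1, 8) kernel convolves along the frequency axis only, as described above.

import tensorflow as tf

NUM_CLASSES = 4  # e.g. "left", "right", "stop" and silence

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 40, 1)),        # (time frames, frequency bands, 1 channel)
    tf.keras.layers.Conv2D(64, kernel_size=(1, 8),   # filter spans frequency only
                           activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=(1, 3)),  # down-sample along frequency
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),   # four fully-connected layers
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="sgd", loss="categorical_crossentropy")
model.summary()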
III. KEYWORD SPOTTING TASK [2]
In the feature-extraction module, 40-dimensional log-Mel filterbank features are extracted [3]. These are computed every 25 ms with a time shift of 10 ms, and 23 frames are stacked to the left and 8 to the right. This stacked vector is given as the input to the DNN. The baseline deep neural network architecture contains 3 hidden layers with 128 hidden units per layer and a softmax output layer [4]. In each layer we use a rectified linear unit (ReLU) non-linearity. In the output layer, there is one output target for each of the sounds in the keyword phrase [5]. Using asynchronous gradient descent, we train the network weights to optimize a cross-entropy criterion. Finally, we have a posterior-handling module in which we combine the individual frame-level posterior scores to form a single score for the keyword [6].
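A small sketch of this 23-left/8-right context stacking (edge padding by repeating the boundary frames is an assumption; the padding scheme is not stated here):

import numpy as np

def stack_context(frames, left=23, right=8):
    # frames: (num_frames, 40) log-Mel features. Returns an array of
    # shape (num_frames, 40 * (left + 1 + right)); edges are padded by
    # repeating the first/last frame.
    padded = np.concatenate([
        np.repeat(frames[:1], left, axis=0),
        frames,
        np.repeat(frames[-1:], right, axis=0),
    ])
    return np.stack([padded[i:i + left + 1 + right].ravel()
                     for i in range(len(frames))])

feats = np.random.rand(100, 40)            # 100 frames of 40-dim features
print(stack_context(feats).shape)          # -> (100, 1280)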

IV. WORKING
The convolutional neural network method used here is similar to that used for image recognition. But since audio is a one-dimensional signal, we need to convert it into a 2-D matrix. We therefore define a time window that is long enough to fit the spoken words and convert the audio in that window into an image. The incoming audio signal is grouped into these windows, each a few milliseconds long, and the strength of the frequencies across a set of bands is calculated. These strengths are treated as a vector, and the vectors are arranged in time to form a 2-D matrix, which can be considered a single-channel image. This image is called a spectrogram [7].
The human ear does not show the same sensitivity to all frequencies. Therefore, we further process the matrix and turn the values into MFCCs, or Mel-Frequency Cepstral Coefficients, because MFCCs mimic the characteristics of the human ear [2]. This image is then given to a multi-layered convolutional neural network [8].

<insert the spectrogram images>
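A rough sketch of how such an MFCC “image” could be computed, here with the librosa library (an assumed tooling choice; the file name and sample rate are hypothetical):

import numpy as np
import librosa

SR = 16000                                 # assumed sample rate
y, _ = librosa.load("command.wav", sr=SR)  # hypothetical recording
mfcc = librosa.feature.mfcc(
    y=y, sr=SR, n_mfcc=40,
    n_fft=int(0.025 * SR),                 # 25 ms analysis window
    hop_length=int(0.010 * SR),            # 10 ms time shift
)
image = mfcc.T[..., np.newaxis]            # (time frames, coefficients, 1 channel)
print(image.shape)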
V. CONFUSION MATRIX
To analyse the mistakes that the network makes, we use the confusion matrix [9]. Each column represents the set of samples that were predicted to be each label, and each row represents their ground-truth value. For example, in this model the first column represents all clips that were predicted as "stop" and the first row contains all the clips that actually were "stop". If the model is ideal, the confusion matrix produced will contain all zeroes except for the diagonal elements.
VI. STREAMING ACCURACY
Our model is trained on individual clips, but audio-recognition applications run on a continuous stream. Therefore a general way to use the model in such a system is to apply it repeatedly at different offsets in time. By averaging the results over a short window, a smooth prediction can be produced. Our input is like an image; therefore, we need a series of images sampled at a high rate to increase the chances of having an alignment that captures most of the spoken word in a single time window that we feed into the model.

<insert pics of accuracy>

By modifying the averaging parameters we can produce the desired results for our application. For example, some applications may require a high recall value whereas others may require a high precision. Generating an ROC curve can aid in understanding this trade-off [10].
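A minimal sketch of this repeated-offset averaging (the window length and score layout are assumptions):

import numpy as np

def smooth_stream_scores(scores, window=5):
    # scores: (num_offsets, num_classes) softmax outputs from running the
    # clip model at successive time offsets; returns scores of the same
    # shape, averaged over a trailing window of offsets.
    smoothed = np.empty_like(scores)
    for i in range(len(scores)):
        lo = max(0, i - window + 1)
        smoothed[i] = scores[lo:i + 1].mean(axis=0)
    return smoothed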
VII. RESULT
<insert image of result>
<write some conclusion from result>

VIII. CONCLUSION
Thus, in this project we have been able to successfully use convolutional neural networks to build a model that can recognize basic voice commands. The model was trained to identify commands such as "left", "right" and "stop". This was then interfaced with an Arduino to control the motors of the wheelchair.

Acknowledgment
Perfect and precise guidance, hard work, dedication and full encouragement are needed to complete a project successfully. In the life of every student, project work is like engraving a diamond.
We take this opportunity, on the successful completion of our project, to thank all the staff members for their valuable guidance, for devoting their precious time, for sharing their knowledge, and for their co-operation throughout the course of development of our project and our academic years of education. We owe deep gratitude to our project guide, Dr. M. Deshpande, whose valuable guidance has been a key factor in the successful completion of the project.
References
[1] A. K. Jain, Jianchang Mao and K. M. Mohiuddin, "Artificial neural networks: a tutorial," in Computer, vol. 29, no. 3, pp. 31-44, Mar. 1996. doi: 10.1109/2.485891
[2] H. Bahi and N. Benati, "A new keyword spotting approach," 2009 International Conference on Multimedia Computing and Systems, Ouarzazate, 2009, pp. 77-80. doi: 10.1109/MMCS.2009.5256728
[3] S. K. Kopparapu and M. Laxminarayana, "Choice of Mel filter bank in computing MFCC of a resampled speech," 10th International Conference on Information Science, Signal Processing and their Applications (ISSPA 2010), Kuala Lumpur, 2010, pp. 121-124. doi: 10.1109/ISSPA.2010.5605491
[4] L. Thomas, Manoj Kumar M V and Annappa B, "Discovery of optimal neurons and hidden layers in feed-forward Neural Network," 2016 IEEE International Conference on Emerging Technologies and Innovative Business Practices for the Transformation of Societies (EmergiTech), Balaclava, 2016, pp. 286-291.
[5] M. D. Zeiler et al., "On rectified linear units for speech processing," 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, 2013, pp. 3517-3521. doi: 10.1109/ICASSP.2013.6638312
[6] S. Soldo, M. Magimai-Doss, J. Pinto and H. Bourlard, "Posterior features for template-based ASR," 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, 2011, pp. 4864-4867. doi: 10.1109/ICASSP.2011.5947445
[7] J. Dennis, H. D. Tran and H. Li, "Spectrogram Image Feature for Sound Event Classification in Mismatched Conditions," in IEEE Signal Processing Letters, vol. 18, no. 2, pp. 130-133, Feb. 2011. doi: 10.1109/LSP.2010.2100380
[8] S. Albawi, T. A. Mohammed and S. Al-Zawi, "Understanding of a convolutional neural network," 2017 International Conference on Engineering and Technology (ICET), Antalya, 2017, pp. 1-6. doi: 10.1109/ICEngTechnol.2017.8308186
[9] N. D. Marom, L. Rokach and A. Shmilovici, "Using the confusion matrix for improving ensemble classifiers," 2010 IEEE 26th Convention of Electrical and Electronics Engineers in Israel, Eilat, 2010, pp. 000555-000559. doi: 10.1109/EEEI.2010.5662159
[10] R. C. Prati, G. E. A. P. A. Batista and M. C. Monard, "Evaluating Classifiers Using ROC Curves," in IEEE Latin America Transactions, vol. 6, no. 2, pp. 215-222, June 2008. doi: 10.1109/TLA.2008.4609920
[11] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn and D. Yu, "Convolutional neural networks for speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533-1545, 2014.
[12] T. N. Sainath, A. Mohamed, B. Kingsbury and B. Ramabhadran, "Deep convolutional neural networks for LVCSR," in Proc. IEEE ICASSP, 2013.
[13] T. N. Sainath, B. Kingsbury, A. Mohamed, G. E. Dahl, G. Saon, H. Soltau, T. Beran, A. Y. Aravkin and B. Ramabhadran, "Improvements to deep convolutional neural networks for LVCSR," in Proc. IEEE ASRU, 2013.
