ROORKEE
DEPARTMENT OF ELECTRICAL ENGINEERING
Project No - 8
Objectives:
Build neural networks that can classify different kinds of urban
sounds.
Compare the performance of each neural network on the classification task.
Introduction
Unlike speech and music signals, urban sounds are usually unstructured.
They include real-life noises generated by human activities ranging from
transportation to leisure. Automatic urban sound classification can
identify the noise source, benefiting urban livability through applications
such as noise control, audio surveillance, soundscape assessment, and
acoustic environment planning.
Dataset:
We need a labelled dataset that we can feed into a machine learning
algorithm. Fortunately, researchers have published the UrbanSound8K dataset.
It contains 8,732 labelled sound clips (up to 4 seconds each) from ten
classes: air conditioner, car horn, children playing, dog bark, drilling,
engine idling, gunshot, jackhammer, siren, and street music. By default the
dataset is divided into 10 folds. The sound files in this dataset are in
.wav format; if you have files in another format such as .mp3, it is good
practice to convert them to .wav first, because .mp3 is a lossy compression
format.
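As a minimal sketch of handling .wav files in Python (using only the standard-library wave module; a synthetic tone stands in for an actual dataset clip, whose filename here is a placeholder):

```python
import math
import struct
import wave

# Write a short synthetic .wav file (1 second, 440 Hz sine, 16-bit mono)
# so the example is self-contained; real clips come from the dataset.
sample_rate = 22050
frames = b"".join(
    struct.pack("<h", int(32767 * math.sin(2 * math.pi * 440 * n / sample_rate)))
    for n in range(sample_rate)
)
with wave.open("tone.wav", "wb") as wf:
    wf.setnchannels(1)            # mono
    wf.setsampwidth(2)            # 16-bit samples
    wf.setframerate(sample_rate)  # samples per second
    wf.writeframes(frames)

# Read it back and inspect the properties a classifier relies on.
with wave.open("tone.wav", "rb") as wf:
    print(wf.getframerate(), wf.getnframes())  # → 22050 22050
```

With real data the read step is usually delegated to a library such as librosa, which also handles resampling.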
Data Handling:
As with all unstructured data formats, audio data requires a few
preprocessing steps before it can be analysed. The first step is to load
the data into a machine-understandable format. For this, we simply record
the signal's amplitude at fixed time steps; for example, in a 2-second
audio file we could extract a value every half second. This is called
sampling of audio data, and the rate at which values are taken is called
the sampling rate. Another way of representing audio data is to convert it
into a different domain, namely the frequency domain. When we sample audio
data in the time domain, many data points are needed to represent the whole
signal, and the sampling rate should be as high as possible. If we instead
represent the audio data in the frequency domain, much less computational
space is required.
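This trade-off can be illustrated with NumPy: a one-second 440 Hz tone takes thousands of time-domain samples, but its frequency-domain representation concentrates the same information in a single dominant bin (the 8000 Hz sampling rate here is an assumed example, not a value from the report):

```python
import numpy as np

# Sample one second of a 440 Hz tone: the time-domain representation
# needs sr data points for a single second of audio.
sr = 8000                                # assumed sampling rate in Hz
t = np.arange(sr) / sr                   # one second of time stamps
signal = np.sin(2 * np.pi * 440 * t)     # 8000 time-domain samples

# Frequency-domain representation: the energy collapses into one bin.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)
print(freqs[np.argmax(spectrum)])        # → 440.0 (the dominant frequency)
```

The spectrum has only one large coefficient, which is why frequency-domain features are so much more compact for analysis.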
Feature Extraction:
To extract useful features from the sound data, we will use the Librosa
library. It provides several methods for extracting different features from
the sound clips, with a default sampling rate of 22050 Hz. We are going to
use the methods mentioned below to extract various features:
Using the librosa package for feature extraction and the matplotlib package
for plotting, we can visualize the extracted features and differentiate the
different kinds of sounds.
After extracting the feature data from the sounds using the librosa
library, we label the input data using a one-hot encoder. Once that is
done, we randomly split the data into train and test sets, with the
training set taking slightly less than 70% of the data. We then trained a
neural network initialised with random weights, consisting of 280 hidden
units in the first layer with the tanh activation function; the output of
this hidden layer is fed forward to a second hidden layer of 300 hidden
units with the sigmoid activation function, and the output layer uses the
softmax function. We then used a gradient descent optimizer with a learning
rate of 0.01 to minimize the cross-entropy cost function,
-Σ target·log(output). The cost was plotted against the number of
iterations to monitor training progress.
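The pipeline above can be sketched in plain NumPy (a minimal sketch: synthetic feature vectors stand in for the librosa features, and only the forward pass and cost computation are shown, not the full gradient-descent training loop):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: 100 clips, 193-dim feature vectors, 10 classes.
n_samples, n_features, n_classes = 100, 193, 10
X = rng.normal(size=(n_samples, n_features))
labels = rng.integers(0, n_classes, size=n_samples)

# One-hot encode the labels.
Y = np.eye(n_classes)[labels]

# Random ~70/30 train/test split.
idx = rng.permutation(n_samples)
split = int(0.7 * n_samples)
X_train, X_test = X[idx[:split]], X[idx[split:]]
Y_train, Y_test = Y[idx[:split]], Y[idx[split:]]

# The network described above: 280 tanh units, then 300 sigmoid units,
# then a 10-way softmax output; weights are randomly initialised.
W1 = rng.normal(scale=0.1, size=(n_features, 280)); b1 = np.zeros(280)
W2 = rng.normal(scale=0.1, size=(280, 300));        b2 = np.zeros(300)
W3 = rng.normal(scale=0.1, size=(300, n_classes));  b3 = np.zeros(n_classes)

def forward(X):
    h1 = np.tanh(X @ W1 + b1)                    # first hidden layer: tanh
    h2 = 1.0 / (1.0 + np.exp(-(h1 @ W2 + b2)))   # second hidden layer: sigmoid
    logits = h2 @ W3 + b3
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)  # softmax output

# Cross-entropy cost, -sum(target * log(output)); training would minimise
# this with gradient descent at learning rate 0.01.
output = forward(X_train)
cost = -np.mean(np.sum(Y_train * np.log(output + 1e-12), axis=1))
print(round(cost, 3))  # roughly log(10) ≈ 2.3 before any training
```

In practice the one-hot encoding and split would come from scikit-learn (`OneHotEncoder`, `train_test_split`) and the network from a framework such as TensorFlow, but the shapes and cost are exactly those described above.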
Architecture of Convolutional Neural Networks
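As a generic sketch of the convolution → pooling → dense softmax pattern used by such networks (an illustration under assumed layer sizes, not the report's exact architecture), in plain NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)

def conv2d(x, kernels):
    """Valid convolution of a 2-D input with a stack of 2-D kernels."""
    kh, kw = kernels.shape[1:]
    H, W = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((kernels.shape[0], H, W))
    for k, ker in enumerate(kernels):
        for i in range(H):
            for j in range(W):
                out[k, i, j] = np.sum(x[i:i + kh, j:j + kw] * ker)
    return out

def maxpool2(x):
    """2x2 max pooling over each feature map."""
    C, H, W = x.shape
    return x[:, :H - H % 2, :W - W % 2].reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))

# A spectrogram-like input (e.g. 40 mel bands x 44 frames) stands in for a clip.
x = rng.normal(size=(40, 44))

feat = np.maximum(conv2d(x, rng.normal(scale=0.1, size=(8, 3, 3))), 0)  # conv + ReLU
feat = maxpool2(feat)                                                    # downsample
flat = feat.reshape(-1)                                                  # flatten

W = rng.normal(scale=0.1, size=(flat.size, 10))                          # dense layer
logits = flat @ W
probs = np.exp(logits - logits.max()); probs /= probs.sum()              # softmax
print(probs.shape)  # → (10,) class probabilities
```

Unlike the MLP, the convolutional layers operate on the 2-D time-frequency structure of the clip rather than on a flattened feature vector, which is what makes CNNs well suited to spectrogram inputs.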
Future Work and Goals:
Having created one neural network, we will try to classify the data using
different kinds of neural networks and compare their performance. Further,
we may try to create a new neural network architecture by combining the MLP
and CNN architectures and analyse the performance of that network.
Conclusion:
We successfully classified the data using a Multilayer Perceptron model and
a Convolutional Neural Network. Due to the huge amount of data used to
train the CNN, we could not achieve the same results with the MLP because
of performance issues.
References:
S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C.
Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J.
Weiss, and K. Wilson, "CNN Architectures for Large-Scale Audio
Classification," Google, Inc., New York, NY, and Mountain View, CA, USA.
https://www.kaggle.com/pavansanagapati/urban-sound-classification
https://towardsdatascience.com/urban-sound-classification-part-1-99137c6335f9