
8th IEEE International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment and Management (HNICEM)
The Institute of Electrical and Electronics Engineers Inc. (IEEE) – Philippine Section
9-12 December 2015, Waterfront Hotel, Cebu City, Philippines

Neural Network Classification for Detecting Abnormal Events in a Public Transport Vehicle
Cristina P. Dadula and Elmer P. Dadios
Gokongwei College of Engineering
De La Salle University
Manila, Philippines
cristina_dadula@dlsu.edu.ph
elmer.dadios@dlsu.edu.ph

Abstract - A method to detect an abnormal situation inside a public transport bus using audio signals is presented. Mel Frequency Cepstral Coefficients (MFCC) were used as the feature vector and a multilayer backpropagation neural network as the classifier. Audio samples were taken inside buses running along Epifanio Delos Santos Avenue (EDSA), Metro Manila, Philippines. The audio samples depict sounds under normal operation inside the bus. The abnormal situation was represented by superimposing the sound of normal operation with the sounds of gunshots, crowd in panic and screams at signal-to-noise ratios of 10, 20, 30, and 40 dB. The sounds were divided into 3-second audio clips, and the clips were divided into frames; each 3-second audio clip produced 594 frames. Each frame is represented by 12 MFCCs. The accuracy of the system was tested for all frames and for all 3-second audio clips. When accuracy is measured as the number of correctly classified frames divided by the total number of frames tested, the system achieves 99.41%. When measured on 3-second audio clips, the proposed system correctly classified all events at 20, 30 and 40 dB signal-to-noise ratios. Errors occurred in the classification of abnormal events at 10 dB signal-to-noise ratio and in the classification of normal events, where the accuracy is 97% and 93%, respectively.

Keywords - event detection, neural network backpropagation, audio classification, public transport bus, MFCC

I. INTRODUCTION

A surveillance system is a necessity in every public place, especially places where crimes and tragedies are common. Traditionally, a video surveillance system is used, wherein several cameras installed in strategic locations are monitored simultaneously in one room; the efficiency of such a system relies on the person in charge of monitoring. These days, however, video surveillance has been developed to employ artificial intelligence, making the system capable of classifying, tracking and counting objects in real time as well as analyzing the behavior of a crowd for possible abnormal situations. Video surveillance systems with artificial intelligence are valuable yet very expensive. Therefore, in order to cover most areas at minimum cost, cameras must be installed in the best strategic locations, and installation is complicated in some cases. In such situations, audio surveillance perfectly complements video surveillance. Audio sensors are cheap, small in size, often consume less power than video cameras, and can easily be installed.

Most audio analysis research has focused on speech recognition and identification. For more than a decade, attention has also been given to the detection of abnormal events for the development of audio surveillance. Basically, the detection of abnormal events is an audio classification problem that consists of two main processes: feature extraction and classification. Features vary from time-domain to frequency-domain techniques.

A study to detect shout events in public transport, particularly in a railway embedded environment, was proposed by [1]. They classified noisy acoustical segments similarly to an audio indexing framework and used a Gaussian Mixture Model (GMM) and a Support Vector Machine (SVM) for classification. Different numbers of Mel Frequency Cepstral Coefficients (MFCC) were used as the feature vector. The two classification methods, GMM and SVM, were compared in detecting shout events, and both approaches achieved promising performance.

Detection of gunshots versus the normal condition was proposed by [2]. The following sound features were used with a GMM classifier: short-time energy, the first 8 MFCCs, spectral centroid, and spectral spread. Classification of non-speech normal and abnormal audio events was presented by [3]. The abnormal events include glass breaking, dog barks, screams, and gunshots, while the normal events include engine noise and rain noise. MFCC and pitch range-based features were used with an artificial neural network as the classifier.

A method to detect audio events in highly noisy environments using a bag-of-words approach, commonly used to classify textual documents, was proposed by [4]. The audio signal is divided into segments of some milliseconds using a time window. As the time window moves along the audio stream, a histogram of the occurrences of aural words is formed. This histogram is used as a feature vector fed to a pool of Support Vector Machine (SVM) classifiers.

This paper addresses the detection of normal and abnormal events for a public transport vehicle. Normal events


are sounds recorded inside a public transport bus travelling along Epifanio Delos Santos Avenue (EDSA), Metro Manila. Abnormal events include screams and gunshots with the background noise of normal events. Different signal-to-noise ratios of abnormal events were considered: 10, 20, 30, and 40 dB. The output of this work will serve as one module of an audio surveillance system for a public transport vehicle.

The rest of the paper is organized as follows: Section II is the methodology, which discusses data gathering, feature extraction, neural network design and testing; Section III presents results and discussion; and Section IV concludes.

II. METHODOLOGY

A. Data Gathering

Normal sounds were obtained from recordings made inside air-conditioned public transport buses traveling along Epifanio Delos Santos Avenue (EDSA), Metro Manila, using an Alcatel™ mobile phone. The recording was done on a Saturday, when traffic is expected to be less congested, in order to capture more active transport sounds in which the buses are really moving. The recorded sound depicts the following: the conductor asking passengers who do not yet have tickets, passengers talking, the opening and closing of the bus door, other vehicles passing by, the bus accelerating and slowing down, the Metro Rail Transit (MRT) train passing overhead and, in some cases, a television set playing a movie or TV program.

There were 22 recorded files from 8 different buses, with a total recorded length of 93 minutes. The files were in 3GPP format, stereo, with a 48000 Hz sampling frequency, and were down-sampled to 44100 Hz using a free online converter. The recorded sounds represent the normal condition of the public transport bus and serve as the background sounds of abnormal events.

The sounds of screams, gunshots, and a crowd in panic were downloaded for free from the internet. The energy of these sounds was adjusted so that they would have the same energy, and then they were concatenated. They were attenuated or amplified such that the signal-to-noise ratios (SNRs) would be 10 dB, 20 dB, 30 dB and 40 dB, and they were superimposed on the normal sounds recorded inside the bus to represent abnormal events. The amplification or attenuation ensures that the energy of the screams, gunshots and crowd in panic is greater than that of the background noise, which would normally be the case. Table 1 shows the power ratio and amplitude ratio for a given dB value. The amplification or attenuation constant can be derived from the formula of the signal-to-noise ratio,

$$SNR = \frac{P_{signal}}{P_{noise}} = \left(\frac{S_{rms}}{N_{rms}}\right)^2, \qquad (1)$$

where $P_{signal}$ and $P_{noise}$ are the signal and noise power, and $S_{rms}$ and $N_{rms}$ are the root mean square values of the signal and noise, respectively. The root mean square of a signal x(n) of length N is

$$X_{rms} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}|x(n)|^2}. \qquad (2)$$

The SNR in dB is

$$SNR_{dB} = 10\log\left(SNR\right) = 10\log\left(\frac{P_{signal}}{P_{noise}}\right) = 20\log\left(\frac{S_{rms}}{N_{rms}}\right). \qquad (3)$$

The amplification or attenuation K needed to produce a given SNR in dB satisfies

$$SNR_{dB} = 20\log\left(K\,\frac{S_{rms}}{N_{rms}}\right). \qquad (4)$$

Extracting K from (4),

$$K = 10^{SNR_{dB}/20}\left(\frac{N_{rms}}{S_{rms}}\right). \qquad (5)$$
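As a rough illustration of how Eq. (5) is applied, the scaling constant and the superposition could be computed as follows. This is a minimal NumPy sketch, not the authors' Matlab code; the array names are hypothetical.

```python
import numpy as np

def mix_at_snr(event, background, snr_db):
    """Scale an event sound (scream/gunshot) against a background
    recording so the mixture has the requested SNR, per Eqs. (2)-(5)."""
    s_rms = np.sqrt(np.mean(event ** 2))         # S_rms, Eq. (2)
    n_rms = np.sqrt(np.mean(background ** 2))    # N_rms, Eq. (2)
    k = 10 ** (snr_db / 20.0) * (n_rms / s_rms)  # K, Eq. (5)
    return background + k * event                # superimposed abnormal clip

# e.g. abnormal_10db = mix_at_snr(scream, bus_noise, 10)
```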

The recorded sounds were cut into 3-second audio clips. The sounds representing abnormal situations at 10, 20, 30 and 40 dB are likewise 3-second audio clips.

TABLE 1. POWER AND AMPLITUDE RATIO
dB | Power ratio | Amplitude ratio
10 | 10          | 3.162
20 | 100         | 10
30 | 1000        | 31.62
40 | 10000       | 100
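For illustration, the 3-second segmentation could be done as follows (a NumPy sketch, assuming a mono signal at 44100 Hz; the variable names are hypothetical).

```python
import numpy as np

def cut_into_clips(x, fs=44100, seconds=3):
    """Cut a long recording into non-overlapping 3-second clips."""
    n = seconds * fs                       # 132300 samples per clip
    n_clips = len(x) // n                  # discard the incomplete tail
    return x[: n_clips * n].reshape(n_clips, n)
```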
B. Feature Extraction

The process of reducing data into a smaller representation is referred to as feature extraction. In audio processing, features can be extracted in the time or spectral domain. The feature used here to represent the audio signal is the MFCC, the most popular feature extraction method in speech recognition and identification. Obtaining the MFCCs of a discrete audio signal involves the stages shown in Fig. 1. In the first stage, the signal undergoes pre-emphasis, where the high-frequency components of the signal are boosted.

Fig. 1. Extracting MFCCs (discrete audio signal → pre-emphasis → framing → windowing → DFT → mel filter bank → DCT → delta energy and spectrum → MFCCs)

The signal is then divided into frames with a duration of 25 ms. An overlap between frames is optional; a 5-ms frame shift is used here. Each frame has N samples, with a frame shift of M samples. For a sampling rate of 44100 Hz, a 25-ms frame duration and a 5-ms frame shift, N = 1102 and M = 220. Typically M is less than N.

The next step is windowing, an important process that corrects the problem called leakage in the power spectrum of a non-periodic signal and is important in determining the frequency content of a signal. The window type used in this step is the Hamming window, defined by

$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1, \qquad (6)$$
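A minimal sketch of the framing and windowing steps (NumPy, assuming a mono float signal; the window of Eq. (6) is computed explicitly rather than via np.hamming):

```python
import numpy as np

FS = 44100
N = int(0.025 * FS)   # frame length: 25 ms -> 1102 samples
M = int(0.005 * FS)   # frame shift:   5 ms ->  220 samples

def frame_and_window(x):
    """Split x into overlapping frames and apply the Hamming window."""
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))  # Eq. (6)
    n_frames = 1 + (len(x) - N) // M
    frames = np.stack([x[i * M : i * M + N] for i in range(n_frames)])
    return frames * w   # each row is one windowed frame, ready for the DFT
```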

where N is the window length. The Fast Fourier Transform (FFT) is then applied to convert the signal x[n] to the frequency domain X(ω). The spectrum X(ω) is fed into the mel filter banks. Each filter has a center frequency called the mel frequency, defined as

$$mel(f) = 2595\log_{10}\left(1 + \frac{f}{700}\right). \qquad (7)$$
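For reference, Eq. (7) in code (a one-liner; the base-10 logarithm is assumed, as is standard for this formula):

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel-scale mapping of Eq. (7); e.g. hz_to_mel(1000) ~ 1000 mel."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)
```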

The output of each mel filter is the sum of its spectral components, and the aggregate of the outputs forms an array called the mel spectrum. Next, the Discrete Cosine Transform (DCT) is applied to the mel spectrum. The output coefficients of the DCT are called Mel Frequency Cepstral Coefficients (MFCC).

MFCC features were obtained using the source code written by K. Wojcicki [6], implemented in Matlab. Table 2 shows the input parameters used in obtaining the MFCCs. Twelve coefficients are kept for each frame.

TABLE 2. MFCC PARAMETERS
Parameter                        | Symbol | Value
Analysis Frame Duration (ms)     | Tw     | 25
Analysis Frame Shift (ms)        | Ts     | 5
Pre-emphasis Coefficient         | alpha  | 0.97
Frequency Range to Consider (Hz) | R      | [100 5000]
Number of Filterbank Channels    | M      | 20
Number of Cepstral Coefficients  | C      | 13
Cepstral Sine Lifter Parameter   | L      | 22
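The paper extracts MFCCs with the HTK-style Matlab code of [6]; an approximate Python equivalent using librosa, with the Table 2 settings mapped onto its parameters, might look like the sketch below. It is not bit-identical to the Matlab implementation, and the file name is hypothetical.

```python
import librosa

y, sr = librosa.load("clip_001.wav", sr=44100)   # one 3-second clip
mfcc = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=13,                   # C = 13 cepstral coefficients
    n_mels=20,                   # M = 20 filterbank channels
    fmin=100, fmax=5000,         # R = [100 5000] Hz
    win_length=int(0.025 * sr),  # Tw = 25 ms analysis frames
    hop_length=int(0.005 * sr),  # Ts = 5 ms frame shift
    window="hamming",
)
# (pre-emphasis alpha=0.97 and liftering L=22 from Table 2 omitted here)
features = mfcc[1:, :].T         # drop c0, keeping 12 coefficients per frame
```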

C. Neural Network

A neural network is a computing model of how the human brain processes information. It has been known to solve various non-linear problems such as curve fitting, pattern recognition and classification. The model consists of three layers: an input layer, a hidden layer, and an output layer. Each layer consists of neurons represented by nodes, and each node is connected to all nodes in the next layer. Each connection is defined by a scalar value called a weight. The basic architecture is shown in Fig. 2. The nodes in the input layer are passive: they simply copy the input value and propagate it to the nodes of the next layer. The nodes in the hidden and output layers are active nodes. They are computing units with the function illustrated in Fig. 3: the value at each node is computed by summing the products of the signals from the previous nodes and the line weights, plus a constant called the bias. This value is fed to a transfer function defined for the given layer, and the output of the transfer function serves as the node's activation value, which is propagated to the next node or, if the neuron is in the output layer, forms part of the output. The process of adjusting the weights is called learning, which is achieved by training the neural network. One of the most popular training techniques is backpropagation.

Fig. 2. Neural network basic architecture (input layer, hidden layer, output layer, with bias)

Fig. 3. Neuron as a computing element (inputs x1 … xn weighted by w1j … wnj, summed with a bias into u, passed through f(u))

The design of the neural network was implemented in Matlab™. The size of the network is 12 x 12 x 1: there are 12 inputs, or neurons in the input layer, 1 hidden layer with 12 neurons, and one output neuron. The architecture is the same as in Fig. 2. The transfer function defined for the hidden and output layers is tansig, of the form

$$f(x) = \mathrm{tansig}(x) = \frac{2}{1+e^{-2x}} - 1. \qquad (8)$$

The network type was set to feedforward backpropagation, with a learning rate of 0.001 and performance measured as the mean square error.
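The tansig transfer function and a forward pass through the 12 x 12 x 1 network can be sketched in NumPy as follows; the weight matrices are placeholders, since in the paper they come from backpropagation training in Matlab.

```python
import numpy as np

def tansig(x):
    """Matlab's tansig transfer function, Eq. (8)."""
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def forward(frame_mfcc, W1, b1, W2, b2):
    """One forward pass: 12 MFCCs in, one score in [-1, 1] out.
    W1: (12, 12), b1: (12,), W2: (1, 12), b2: (1,) are hypothetical
    placeholders for the weights learned by backpropagation."""
    hidden = tansig(W1 @ frame_mfcc + b1)   # 12 hidden neurons
    return tansig(W2 @ hidden + b2)         # single output neuron
```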
D. Training and Testing

The 93-minute recording was cut into 3-second audio clips, and a total of 1756 clips were saved. The same number of clips was created for the 10 dB, 20 dB, 30 dB and 40 dB abnormal events. When features were extracted from a 3-second audio clip using the parameters shown in Table 2, 594 frames were produced.

Table 3 shows the number of audio clips and the number of frames for the normal and abnormal conditions; each frame is represented by 12 MFCCs. Also shown are the numbers of frames used for training and testing and the corresponding desired output, which is 1 for the normal condition and -1 for the abnormal condition.

The test applied the rule that if the network output is less than or equal to zero, the audio input is classified as the abnormal condition; otherwise it is the normal condition. All the frames were tested and the percent accuracy was computed as

$$\%accuracy_{frames} = \frac{\text{no. of correctly classified frames}}{\text{total number of frames tested}} \times 100. \qquad (9)$$

Also tested was the accuracy of the system when considering 3-second audio clips. The number of audio clips not perfectly classified was recorded, and those clips were examined by looking at the percentage of their frames that were correctly classified.
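The decision rule and the frame-level accuracy of Eq. (9) reduce to a few lines (a NumPy sketch; `outputs` and `desired` are hypothetical arrays of network outputs and target labels):

```python
import numpy as np

def classify(outputs):
    """Output <= 0 -> abnormal (-1); otherwise normal (+1)."""
    return np.where(outputs <= 0.0, -1, 1)

def percent_accuracy(predicted, desired):
    """Eq. (9): correctly classified frames over total frames, in percent."""
    return 100.0 * np.mean(predicted == desired)

# A 3-second clip (594 frames) counts as "perfectly classified"
# only when every one of its frames matches the desired label:
# perfect = np.all(classify(clip_outputs) == clip_label)
```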

TABLE 3. NUMBER OF TRAINING AND TESTING SAMPLES
Audio classification | 3-second audio clips | Total frames | Desired output | Frames used for training | Frames used for testing
Normal               | 1756 | 1043064 | 1  | 10000 | 521532
Abnormal, 10 dB      | 1756 | 1043064 | -1 | 10000 | 521532
Abnormal, 20 dB      | 1756 | 1043064 | -1 | 10000 | 521532
Abnormal, 30 dB      | 1756 | 1043064 | -1 | 10000 | 521532
Abnormal, 40 dB      | 1756 | 1043064 | -1 | 10000 | 521532
Total                |      |         |    | 50000 | 2607660

III. RESULTS AND DISCUSSION


A sample 3-second audio clip in the time domain for a normal and an abnormal sound with an SNR of 10 dB is shown in Fig. 4. The time-domain representations for 20, 30 and 40 dB are not shown because their shapes are similar to that of 10 dB; they differ only in magnitude.

Fig. 4. A 3-second clip of normal and abnormal sound in time domain

The neural network created in Matlab is shown in Fig. 5, and its performance in Fig. 6. The best validation performance is at epoch 14.

Fig. 5. A 12 x 12 x 1 neural network

Fig. 6. Neural network performance

The result of simulating the neural network classifier using all frames for testing is shown in Table 4.

TABLE 4. FRAME CLASSIFICATION RESULTS
Audio classification | Frames tested | Frames correctly identified | % accuracy
Normal               | 521532 | 521061 | 99.907
Abnormal, 10 dB      | 521532 | 506669 | 97.150
Abnormal, 20 dB      | 521532 | 521532 | 100
Abnormal, 30 dB      | 521532 | 521532 | 100
Abnormal, 40 dB      | 521532 | 521505 | 99.995
Average accuracy: 99.41

The accuracy of the system in classifying normal conditions is 99.9%. The 0.1% error may be due to sounds in the normal condition that are similar to abnormal conditions. As observed during the recording process inside the bus, the sounds most similar to abnormal sounds are those

produced by a passenger or conductor signaling the driver to stop by knocking on a metal part of the bus with a coin, which produces a short high-frequency sound. As illustrated in Fig. 7, the spectrum of sound under the normal condition consists mostly of low frequencies (c), while the spectra of gunshots and of a crowd in panic with screams contain high-frequency components, illustrated in (a) and (b), respectively.

Fig. 7. Spectra of normal and abnormal sounds

When the system was tested with 3-second audio clips, almost all frames in each 3-second audio clip were perfectly classified. The number of audio clips not classified perfectly is indicated in Table 5. All clips representing abnormal situations with an SNR of 20 and 30 dB were perfectly classified. The percentage of audio clips perfectly classified at 40 dB is 98.86%; only 10 clips were not perfectly classified. However, upon inspecting the frames in those 10 3-second audio clips, at least 99% of the frames were correctly classified, so the 40 dB abnormal sound can be considered correctly classified. Similarly, for the sound representing the normal condition, although only 93% of the clips were perfectly classified, the minimum frame accuracy of the remaining clips is 87.71%, which can also be considered correct. The notable incorrect classifications were in the 10 dB abnormal sound. The 26 3-second audio clips that were not classified correctly have the accuracies shown in Fig. 8, which gives the percentage of frames correctly classified in each 3-second audio clip. Of the 26 clips, only 2 have an accuracy greater than 50%; the rest are very low. It can therefore be considered that the system is weak in classifying abnormal sounds with a 10 dB signal-to-noise ratio.

TABLE 5. SIMULATION RESULTS OF 3-SECOND AUDIO CLIPS
Audio classification | Number of 3-second audio clips | Clips not perfectly classified | % clips perfectly classified
Normal               | 878 | 61 | 93
Abnormal, 10 dB      | 878 | 26 | 97
Abnormal, 20 dB      | 878 | 0  | 100
Abnormal, 30 dB      | 878 | 0  | 100
Abnormal, 40 dB      | 878 | 10 | 98.86
Average accuracy: 97.72

Fig. 8. Accuracy of the 26 3-second audio clips not perfectly classified (10 dB)

IV. CONCLUSION

The proposed method of classifying audio events in a public transport bus was successfully implemented, where the inputs are 12 MFCCs of a 25-ms frame with a 5-ms frame shift, and the classifier is a 12 x 12 x 1 neural network. The abnormal events are gunshots, crowd in panic and screams at 10, 20, 30, and 40 dB signal-to-noise ratios, where the background noise is the sound recorded inside public transport buses traveling along EDSA. The accuracy of the system, when measured as the number of correctly classified frames divided by the total number of frames tested, is 99.41%. When measured on 3-second audio clips, the proposed system correctly classified all events at 20, 30 and 40 dB signal-to-noise ratios. Errors occurred at 10 dB signal-to-noise ratio, where the accuracy is 97% for abnormal events and 93% for normal events. The errors may be due to some normal events appearing as short abnormal events and some 10 dB abnormal events appearing as normal events.

ACKNOWLEDGMENT

The authors would like to acknowledge the financial support extended by the Commission on Higher Education (CHED), De La Salle University-Manila, and Mindanao State University-General Santos City.

V. REFERENCES

[1] J. L. Rouas, J. Louradour and S. Ambellouis, "Audio events detection in public transport vehicle," in Intelligent Transportation Systems Conference, 2006.



[2] C. Clavel, T. Ehrette and G. Richard, "Events detection for an audio-based surveillance system," in International Conference on Multimedia and Expo, 2005.
[3] B. Barkana, B. Uzkent and I. Saricicek, "Normal and Abnormal Non-Speech Audio Event Detection Using MFCC and PR-Based Features," Advanced Materials Research, vol. 601, pp. 200-208, 2012.
[4] P. Foggia, N. Petkov, A. Saggese and N. Strisciuglio, "Reliable Detection of Audio Events in Highly Noisy Environments," Pattern Recognition Letters, pp. 22-28, 2015.
[5] V. Carletti, P. Foggia, G. Percannella and A. Saggese, "Audio surveillance using a bag of aural words classifier," in 10th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Krakow, 2013.
[6] K. Wojcicki, "HTK MFCC MATLAB [Source code]," 30 July 2015. [Online]. Available: http://www.mathworks.com/matlabcentral/fileexchange/32849-htk-mfcc-matlab.
[7] B. Liang, H. Yaali, L. Songyang, C. Jianyun and W. Li, "Feature analysis and extraction for audio automatic classification," in International Conference on Systems, Man and Cybernetics, 2005.
[8] M. Navratil, P. Dostalek and V. Kresalek, "Classification of Audio Sources Using Neural Network Applicable in Security or Military Industry," 2010.
[9] R. Maher and J. Studniarz, "Automatic Search of Sound Sources in Long-term Surveillance Recordings," in AES 46th International Conference, Denver, USA, 2012.
[10] G. Valenzise, L. Gerosa, M. Tagliasacchi and F. Antonacci, "Scream and gunshot detection and localization for audio-surveillance systems," in IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS 2007), London, 2007.
[11] S. L. Pinjare and A. Kumar, "Implementation of Neural Network Back Propagation Training Algorithm on FPGA," International Journal of Computer Applications, vol. 52, no. 6, pp. 1-7, 2012.
[12] H. Yu and B. M. Wilamowski, "C++ Implementation of Neural Networks Trainer," in International Conference on Intelligent Engineering Systems, 2009.
