Abstract - A method to detect an abnormal situation inside a public transport bus using audio signals is presented. Mel Frequency Cepstral Coefficients (MFCC) were used as the feature vector and a multilayer backpropagation neural network as the classifier. Audio samples were taken inside a bus running along Epifanio Delos Santos Avenue (EDSA), Metro Manila, Philippines. These samples depict sounds under normal operation inside the bus. The abnormal situation was represented by superimposing the sound of normal operation with the sounds of gunshots, a crowd in panic, and screams at signal-to-noise ratios of 10, 20, 30, and 40 dB. The sounds were divided into 3-second audio clips, and each clip was divided into frames; each 3-second clip produced 594 frames, and each frame is represented by 12 MFCCs. The accuracy of the system was tested for all frames and for all 3-second audio clips. When accuracy is measured as the number of frames that were correctly classified divided by the total number of frames tested, the system achieves 99.41%. When the measurement is based on 3-second audio clips, the proposed system correctly classified all the events at 20, 30, and 40 dB signal-to-noise ratios. Errors occurred in the classification of abnormal events at a 10 dB signal-to-noise ratio and in the classification of normal events; the accuracies are 97% and 93%, respectively.

Keywords—event detection, neural network backpropagation, audio classification, public transport bus, MFCC

I. INTRODUCTION

A surveillance system is a necessity in every public place, especially where crimes and tragedies are common. Traditionally, a video surveillance system is used, wherein several cameras installed in strategic locations are monitored simultaneously in one room, so the efficiency of the system relies on the person in charge of monitoring. These days, however, video surveillance has been developed to employ artificial intelligence. With this advanced ability, the system is capable of classifying, tracking, and counting objects in real time, as well as analyzing the behavior of a crowd for possible abnormal situations. Video surveillance systems with artificial intelligence are valuable yet very expensive; therefore, to cover most areas at minimum cost, cameras must be installed in the best strategic locations. Hence, installation is complicated in some cases. In such situations, audio surveillance perfectly complements video surveillance: audio sensors are cheap, small in size, often consume less power than video cameras, and can easily be installed.

Most audio analysis research has focused on speech recognition and identification. For more than a decade, attention has also been given to the detection of abnormal events for the development of audio surveillance. Basically, the detection of abnormal events is an audio classification problem that consists of two main processes: feature extraction and classification. Features vary from time-domain to frequency-domain techniques.

A study to detect shout events in public transport, particularly in a railway embedded environment, was proposed by [1]. They classified noisy acoustical segments similarly to an audio indexing framework and used a Gaussian Mixture Model (GMM) and a Support Vector Machine (SVM) for classification. Different numbers of Mel Frequency Cepstral Coefficients (MFCC) were used as the feature vector. The two classification methods, GMM and SVM, were compared in detecting shout events, and both approaches achieved promising performance.

Detection of gunshots versus the normal condition was proposed by [2]. The following sound features were used with a GMM classifier: short-time energy, the first 8 MFCCs, spectral centroid, and spectral spread. Classification of non-speech normal and abnormal audio events was presented by [3]. The abnormal events include glass breaking, dog barks, screams, and gunshots, while the normal events include engine noise and rain noise. MFCC and pitch-range-based features were used with an artificial neural network as the classifier.

A method to detect audio events in highly noisy environments using a bag-of-words approach, commonly used to classify textual documents, was proposed by [4]. The audio signal is divided into segments of some milliseconds using a time window. As the time window moves along the audio stream, a histogram of the occurrences of the aural words is formed. This histogram is used as a feature vector fed to a pool of Support Vector Machine (SVM) classifiers.

This paper addresses the detection of normal and abnormal events for a public transport vehicle. Normal events
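As an illustrative sketch of the bag-of-aural-words idea only (not the implementation of [4]), assuming a pre-trained codebook of aural words and Euclidean nearest-codeword assignment, the histogram feature could be computed as follows; the function names `nearest_word` and `bag_of_aural_words` are hypothetical:

```python
import math
from collections import Counter

def nearest_word(frame_vec, codebook):
    """Index of the codeword closest (Euclidean distance) to this
    frame's feature vector, i.e. the frame's 'aural word'."""
    return min(range(len(codebook)),
               key=lambda i: math.dist(frame_vec, codebook[i]))

def bag_of_aural_words(frames, codebook):
    """Normalized histogram of aural-word occurrences over a window of
    frames; this histogram is the feature vector fed to the classifier."""
    counts = Counter(nearest_word(f, codebook) for f in frames)
    n = len(frames)
    return [counts[i] / n for i in range(len(codebook))]
```

In practice the codebook would be learned from training data (e.g. by clustering frame features); here it is simply assumed to exist.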
M is less than N. The next step is windowing, an important process to correct the problem called leakage in the power spectrum of a non-periodic signal; windowing is important in determining the frequency content of a signal. The window type used in this step is the Hamming window, defined by

    w(n) = 0.54 − 0.46 cos(2πn / (N − 1)),  0 ≤ n ≤ N        (6)

weights is called learning. Learning is achieved by training the neural network. One of the most popular training techniques is backpropagation.
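A minimal Python sketch of Eq. (6), not the authors' code; a frame is assumed to be a plain list of samples:

```python
import math

def hamming(N):
    """Hamming window of length N, per Eq. (6):
    w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1))
            for n in range(N)]

def window_frame(frame):
    """Multiply a frame of samples by the Hamming window,
    reducing spectral leakage before the spectrum is computed."""
    w = hamming(len(frame))
    return [s * wn for s, wn in zip(frame, w)]
```

The window tapers each frame toward its edges (w(0) = 0.08, peak 1.0 at the center), which suppresses the discontinuities that cause leakage in the power spectrum.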
produced by a passenger or conductor signaling the driver to stop by knocking on the metal part of the bus using a coin, which produces a short, high-frequency sound. As illustrated in Fig. 7, the spectrum of the sound under the normal condition consists mostly of low frequencies (c), while the spectra of gunshots, crowd in panic, and screams consist of high-frequency components, illustrated in (a) and (b), respectively.

When the system was tested with 3-second audio clips, almost all frames in each 3-second clip were perfectly classified. The number of audio clips that were not perfectly classified is indicated in Table 5. All clips representing abnormal situations with an SNR of 20 and 30 dB were perfectly classified. The percentage of audio clips perfectly classified for the 40 dB abnormal sound is 98.86%; only 10 clips were not perfectly classified. However, upon inspecting the frames in those 10 3-second audio clips, at least 99% of the frames were correctly classified, so the 40 dB abnormal sound can be considered correctly classified. Similarly, for the sound representing the normal condition, although only 93% of the clips were perfectly classified, the minimum frame accuracy among the 3-second clips classified correctly is 87.71%, which can also be considered correct. The notable incorrectly classified frames were in the 10 dB abnormal sound. The 26 3-second audio clips that were not classified correctly have the accuracies shown in Fig. 8, which gives the percentage of frames correctly classified in each 3-second clip. Of the 26 clips, only 2 have an accuracy greater than 50%; the rest are very low. It can therefore be considered that the system is weak in classifying abnormal sounds at a 10 dB signal-to-noise ratio.
Fig. 7. Spectra of normal and abnormal sounds
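The paper does not state the exact clip-level decision rule, so as a hedged sketch only: assuming each 3-second clip is scored by the fraction of its frames classified correctly and labeled by a majority vote over its frame labels, the clip-level measurements described above could be computed as:

```python
def clip_accuracy(frame_preds, frame_truth):
    """Fraction of frames in one clip whose predicted label (0 = normal,
    1 = abnormal) matches the ground-truth label; 1.0 means 'perfectly
    classified' in the sense used in Table 5."""
    correct = sum(p == t for p, t in zip(frame_preds, frame_truth))
    return correct / len(frame_preds)

def clip_label(frame_preds):
    """Clip-level decision by majority vote over frame labels
    (an assumption, not the rule stated in the paper)."""
    return 1 if 2 * sum(frame_preds) >= len(frame_preds) else 0
```

Under this rule a 40 dB clip with 99% of frames correct would still receive the correct clip label, matching the interpretation given in the text.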
TABLE 5. SIMULATION RESULTS OF 3-SECOND AUDIO CLIPS

  Audio              Number of 3-second   Clips not perfectly   % of clips perfectly
  classification     audio clips          classified            classified
  Normal             878                  61                     93
  Abnormal, 10 dB    878                  26                     97
  Abnormal, 20 dB    878                   0                    100
  Abnormal, 30 dB    878                   0                    100
  Abnormal, 40 dB    878                  10                     98.86
  Average accuracy                                               97.72

IV. CONCLUSION

The proposed method of classifying audio events in a public transport bus was successfully implemented, where the inputs are the 12 MFCCs of a 25-ms frame with 5-ms frame overlap, and the classifier is a 12 x 12 x 1 neural network. The abnormal events are gunshots, crowd in panic, and screams at 10, 20, 30, and 40 dB signal-to-noise ratios, where the background noise is the sound recorded inside public transport buses traveling along EDSA. The accuracy of the system, when measured as the number of frames correctly classified divided by the total number of frames tested, is 99.41%. When the measurement is based on 3-second audio clips, the proposed system correctly classified all the events at 20, 30, and 40 dB signal-to-noise ratios. Errors occurred at a 10 dB signal-to-noise ratio, where the accuracy is 97% for abnormal events and 93% for normal events. The errors may be due to the fact that some normal events appeared to be abnormal short events and some abnormal 10 dB events appeared to be normal events.
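As a sketch of the 12 x 12 x 1 classifier described above, under the assumption of sigmoid activations (the paper does not state the activation function) and an output threshold of 0.5 for normal versus abnormal, the forward pass could look like:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, W1, b1, W2, b2):
    """Forward pass of a 12-12-1 multilayer network: 12 MFCC inputs,
    12 hidden sigmoid units, one sigmoid output unit. W1 is a 12x12
    weight matrix, b1 the hidden biases, W2 the output weights, b2 the
    output bias; these would be learned by backpropagation."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    return sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
```

A clip's frame would then be called abnormal when, say, the output exceeds 0.5; both the activation and the threshold are assumptions for illustration.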
ACKNOWLEDGMENT
The authors would like to acknowledge the financial support
extended by the Commission on Higher Education (CHED), De
La Salle University-Manila, and Mindanao State University-
General Santos City.
V. REFERENCES
[2] C. Clavel, T. Ehrette and G. Richard, "Events detection for an audio-based surveillance system," in International Conference on Multimedia and Expo, 2005.
[3] B. Barkana, B. Uzkent and I. Saricicek, "Normal and Abnormal Non-Speech Audio Event Detection Using MFCC and PR-Based Features," Advanced Materials Research, Vol. 601, pp. 200-208, 2012.
[4] P. Foggia, N. Petkov, A. Saggese and N. Strisciuglo, "Reliable Detection of Audio Events in Highly Noisy Environments," Pattern Recognition Letters, pp. 22-28, 2015.
[5] V. Carletti, P. Foggia, G. Percannella and A. Saggese, "Audio surveillance using a bag of aural words classifier," in 10th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Krakow, 2013.
[6] K. Wojcicki, "HTK MFCC MATLAB [Source code]," 30 July 2015. [Online]. Available: http://www.mathworks.com/matlabcentral/fileexchange/32849-htk-mfcc-matlab.
[7] B. Liang, H. Yaali, L. Songyang, C. Jianyun and W. Li, "Feature analysis and extraction for audio automatic classification," in International Conference on Man and Cybernetics Systems, 2005.
[8] M. Navratil, P. Dostalek and V. Kresalek, "Classification of Audio Sources Using Neural Network Applicable in Security or Military Industry," 2010.
[9] R. Maher and J. Studniarz, "Automatic Search of Sound Sources in Long-term Surveillance Recordings," in AES 4th International Conference, Denver, USA, 2012.
[10] G. Valenzise, L. Gerosa, M. Tagliasacchi and F. Anton, "Scream and gunshot detection and localization for audio-surveillance systems," in IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS 2007), London, 2007.
[11] S. L. Pinjare and A. Kumar, "Implementation of Neural Network Back Propagation Training Algorithm on FPGA," International Journal of Computer Applications, Vol. 52, No. 6, pp. 1-7, 2012.
[12] H. Yu and B. Bogdan, "C++ Implementation of Neural Networks Trainer," in International Conference on Intelligent Engineering Systems, 2009.