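[Figure 1: block diagram of the proposed method. A frame-wise feature extractor computes descriptors from the movie. The descriptors feed a bank of N concept classifiers (concept 1 ... concept N), each trained and optimized against its own concept ground truth. Their outputs (predictions 1 ... predictions N) feed a violence classifier, trained and optimized against the violence ground truth, which yields the violence/non-violence decision.]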
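The two-stage scheme of Figure 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `random_model` and `predict_frame` are hypothetical, the weights are untrained random values standing in for saved classifier states, the layer sizes are toy values rather than the 512 hidden units used in the paper, and every concept is given a single output unit.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """Layer of logistic sigmoid units; weights[j] holds the incoming weights of unit j."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def mlp(inputs, model):
    """Multi-layer perceptron with a single hidden layer."""
    w_hidden, b_hidden, w_out, b_out = model
    return layer(layer(inputs, w_hidden, b_hidden), w_out, b_out)


def random_model(rng, n_in, n_hidden, n_out):
    """Hypothetical stand-in for a saved classifier state (untrained random weights)."""
    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return matrix(n_hidden, n_in), [0.0] * n_hidden, matrix(n_out, n_hidden), [0.0] * n_out


def predict_frame(features, concept_models, violence_model, threshold):
    # Stage 1: one classifier per concept maps frame features to a prediction in [0, 1].
    concept_preds = [mlp(features, m)[0] for m in concept_models]
    # Stage 2: the violence classifier fuses the concept predictions; its output
    # is thresholded into a yes/no decision for the frame.
    score = mlp(concept_preds, violence_model)[0]
    return concept_preds, score, score > threshold


rng = random.Random(0)
n_features, n_concepts, n_hidden = 8, 3, 4  # toy sizes, not the paper's
concept_models = [random_model(rng, n_features, n_hidden, 1) for _ in range(n_concepts)]
violence_model = random_model(rng, n_concepts, n_hidden, 1)
frame = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]
concept_preds, score, is_violent = predict_frame(frame, concept_models, violence_model, 0.5)
```

The point of the sketch is the data flow: the violence classifier never sees the low-level features, only the concept predictions.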
• thanks to the fusion of mid-level concept predictions, the method is feature-independent in the sense that it does not require the design of adapted features;
• violence is predicted at frame level, which facilitates detection of arbitrary length segments, not only fixed length ones (e.g., shots);
• evaluation is carried out on a standard data set [2], making the results both relevant and reproducible.

3. PROPOSED APPROACH
Instead of focusing on engineering a best content description approach suitable for this task, as most of the existing approaches do, we propose a novel perspective. Given the high variability in appearance of violent scenes in movies and the low amount of training data that is usually available, training a classifier to predict violent frames directly from visual and auditory features seems rather ineffective. We propose instead to use high-level concept ground truth obtained from manual annotation to infer mid-level concepts as a stepping stone towards the final goal. Predicting mid-level concepts from low-level features should be more feasible than directly predicting all forms of violence (highly semantic). Also, predicting violence from mid-level concepts should be easier than using the low-level content features directly.

A diagram of the proposed method is shown in Figure 1. First, we perform feature extraction. Features are extracted at frame level (see Section 5.2). The resulting data is then fed into a multi-classifier framework that operates in two steps. The first step consists of training the system using ground truth data. Once the data characteristics have been captured, we may classify unseen video frames into one of the two categories: “violence” and “non-violence”. Subsequently, violent frames are aggregated into segments. Each of the two steps is presented in the sequel.

3.1 Concept and violence training
To train the system we use ground truth data at two levels: ground truth related to concepts that are usually present in violent scenes, such as presence of “fire”, presence of “gunshots”, or “gory” scenes (more information is presented in Section 5), and ground truth related to the actual violence segments. We used the data set provided in [2].

The mid-level concept detection consists of a bank of classifiers that are trained to respond to each of the target violence-related concepts. At this level, the response of each classifier is optimized for best performance. Tests are repeated for different parameter setups until the classifier yields the highest accuracy. Each classifier state is then saved. With this step, the initial features are transformed into concept predictions (real-valued in [0, 1]).

The high-level concept detection is ensured by a final classifier that is fed with the previous concept predictions and acts as a final fusion scheme. The output of the classifier is thresholded to achieve the labeling of each frame as “violent” or “non-violent” (yes/no decision). As in the previous case, we use the violence ground truth to tune the classifier to its optimal results (e.g., setting the best threshold). The classifier state is again saved.

3.2 Violence classification
Equipped with a violence frame predictor, we may proceed to label new unseen video sequences. Based on the previous classifier states, new frames are now labeled as “violent” or “non-violent”. Depending on the final usage, aggregation into segments can be performed at two levels: arbitrary length segments and video shot segments. The segment-level tagging forms segments of consecutive frames that our predictor tagged as violent or non-violent, and the shot-level tagging uses preliminary shot boundary detection [20].

For both approaches, each segment (whether obtained at the level of frames or shots) is assigned a violence score corresponding to the highest predictor output for any frame within the segment. The segments are then tagged as “violent” or “non-violent” depending on whether their violence score exceeds the optimal threshold found previously in the training of the violence classifier.

4. NEURAL NETWORK CLASSIFIER
To choose the right classification scheme for this particular fusion task, we conducted several preliminary experimental tests using a broad variety of classifiers, from function-based classifiers (e.g., Support Vector Machines) and decision trees to neural networks. Most of the classifiers failed to provide relevant results when coping with the large amount of input data, i.e., labeling of individual frames rather than video segments. The inherently parallel architecture of neural networks fitted these requirements well, in particular the use of multi-layer perceptrons. Therefore, for the concept and violence classifiers (see Figure 1) we employ a multi-layer perceptron with a single hidden layer of 512 logistic sigmoid units and as many output units as required for the respective concept. Some of the concepts from [2] consist of independent tags, for instance “fights” encompasses the five tags “1vs1”, “default”, “distant attack”, “large” and “small” (in which case we use five independent output units).

Networks are trained by gradient descent on the cross-entropy error with backpropagation [26], using a recent idea by Hinton et al. [25] to improve generalization: for each presented training case, a fraction of input and hidden units is omitted from the network and the remaining weights are scaled up to compensate. Figure 2 visualizes this for a small network, with 20% of input units and 50% of hidden units “dropped out”. The set of dropped units is chosen at random for each presentation of a training case, such that many different combinations of units will be trained during an epoch.

[Figure 2: Illustrating random dropouts of network units: (a) a feed-forward neural network, shown as the full classifier; (b) the same network with randomly dropped units (20% of input units, 50% of hidden units), one of the possible versions trained on a single example.]

This helps generalization in the following way: by randomly omitting units from the network, a higher-level unit cannot rely on all lower-level units being present and thus cannot adapt to very specific combinations of a few parent units only. Instead, it is driven to find activation patterns of a larger group of correlated units, such that dropping a fraction of them does not hinder recognizing the pattern. For example, when recognizing written digits, a network trained without dropouts may find that three particular pixels are enough to tell apart ones and sevens on the training data. A network trained with input dropouts is forced to take into account several correlated pixels per hidden unit and will learn more robust features resembling strokes.

Hinton et al. [25] showed that features learned with dropouts generalize better, improving test set performance on three very different machine learning tasks. This encouraged us to try their idea for our data as well, and indeed we observed an improvement of up to 5 percentage points in F1-score in all our experiments. As an additional benefit, a network trained with dropouts does not severely overfit to the training data, eliminating the need for early stopping on a validation set to regularize training.

5. EXPERIMENTAL RESULTS
The experimental validation of our approach was carried out in the context of the 2012 MediaEval Benchmarking Initiative for Multimedia Evaluation, Affect task: Violent Scenes Detection [17]. It proposes a corpus of 18 Hollywood movies of different genres, from extremely violent movies to movies without violence. Movies are divided into a development set consisting of 15 movies: “Armageddon”, “Billy Elliot”, “Eragon”, “Harry Potter 5”, “I am Legend”, “Leon”, “Midnight Express”, “Pirates of the Caribbean 1”, “Reservoir Dogs”, “Saving Private Ryan”, “The Sixth Sense”, “The Wicker Man”, “Kill Bill 1”, “The Bourne Identity”, and “The Wizard of Oz” (total duration of 27h 58min, 26,108 video shots and violence duration ratio 9.39%); and a test set consisting of 3 movies: “Dead Poets Society”, “Fight Club”, and “Independence Day” (total duration 6h 44min, 6,570 video shots and violence duration ratio 4.92%). Overall, the entire data set contains 1,819 violence segments [2].

Ground truth is provided at two levels. Frames are annotated according to 10 violence-related high-level concepts, namely: “presence of blood”, “fights”, “presence of fire”, “presence of guns”, “presence of cold weapons”, “car chases” and “gory scenes” (for the video modality); “presence of screams”, “gunshots” and “explosions” (for the audio modality) [1]; and frame segments are labeled as “violent” or “non-violent”. Ground truth was created by 9 human assessors.

For evaluation, we use classic precision and recall:

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN} \tag{1}$$

where TP stands for true positives (good detections), FP are the false positives (false detections) and FN the false negatives (the misdetections). To have a global measure of performance, we also report F1-scores:

$$F_1\text{-score} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{2}$$

Values are averaged over all experiments.

5.1 Parameter tuning and preprocessing
The proposed approach involves the choice of several parameters and preprocessing steps.

For the multi-layer perceptron, we follow the dropout scheme in [25, Sec. A.1] with minor modifications: weights are initialized to all zeroes, mini-batches are 900 samples, the learning rate starts at 1.0, momentum is increased from 0.45 to 0.9 between epochs 10 and 20, and we train for 100 epochs only. We use a single hidden layer of 512 units. These settings worked well in preliminary experiments on 5 movies.

To avoid multiple scale values for the various content features, the input data of the concept predictors is normalized by subtracting the mean and dividing by the standard deviation of each input dimension. Also, as the concept predictors are highly likely to yield noisy outputs, we employ a sliding median filter for temporal smoothing of the predictions. Trying a selection of filter lengths, we end up smoothing over 125 frames, i.e., 5 seconds.

5.2 Video descriptors
We experimented with several video description approaches that proved to perform well in various video and audio classification scenarios [17, 22, 23, 27]. Given the specificity of the task, we derive audio, color, feature description and temporal structure information. Descriptors are extracted at frame level as follows:
• audio (196 dimensions): we use a general-purpose set of audio descriptors: Linear Predictive Coefficients (LPCs), Line Spectral Pairs (LSPs), MFCCs, Zero-Crossing Rate (ZCR), and spectral centroid, flux, rolloff, and kurtosis, augmented with the variance of each feature over a window of 0.8 s centered at the current frame [24, 27];
• color (11 dimensions): to describe global color contents, we use the Color Naming Histogram proposed in [22]. It maps colors to 11 universal color names: “black”, “blue”, “brown”, “grey”, “green”, “orange”, “pink”, “purple”, “red”, “white”, and “yellow”;
• features (81 dimensions): we use an 81-dimensional Histogram of Oriented Gradients (HoG) [23];
• temporal structure (single dimension): to account for temporal information we use a measure of visual activity. We use the cut detector in [21] that measures visual discontinuity by means of the difference between color histograms of consecutive frames. To account for a broader range of significant visual changes, while still rejecting small variations, we lower the threshold used for cut detection. Then, for each frame we determine the number of detections in a certain time window centered at the current frame (e.g., for violence detection good performance is obtained with 2 s windows). High values of this measure will account for important visual changes that are typically related to action.

5.3 Cross-validation training results
In this experiment we aim to train and evaluate the performance of the neural network classifiers according to concept and violence ground truth. We used the development set of 15 movies. Training and evaluation are performed using a leave-one-movie-out cross-validation approach.

5.3.1 Concept prediction
First, we train 10 multi-layer perceptrons to predict each of the violence-related high-level concepts. The results of the cross-validation are presented in Table 1. For each concept, we list the input features (visual, auditory, or both) and average precision, recall and F1-score at the binarization threshold giving the best F1-score (real-valued outputs of the perceptrons are thresholded to achieve yes/no decisions).

Table 1: Evaluation of concept predictions.
concept      visual  audio  precision  recall  F1-score
blood        X       -      7%         100%    12%
coldarms     X       -      11%        100%    19%
firearms     X       -      17%        45%     24%
gore         X       -      5%         33%     9%
gunshots     -       X      10%        14%     12%
screams      -       X      8%         19%     12%
carchase     X       X      1%         8%      1%
explosions   X       X      8%         17%     11%
fights       X       X      14%        29%     19%
fire         X       X      24%        30%     26%

Results show that the highest precision and recall are up to 24% and 100%, respectively, while the highest F1-score is 26%. Detection of fire performs best, presumably because it is always accompanied by prominent yellow tones captured well by the visual features. The purely visual concepts (first four rows) obtain a high F1-score only because they are so rare that setting a low threshold gives a high recall without hurting precision. Manually inspecting some concept predictions shows that fire and explosions are accurately detected, and that screams and gunshots are mostly correct (although singing is frequently mistaken for screaming, and accentuated fist hits in fights are often mistaken for gunshots).

5.3.2 Violence prediction
Given the previous set of concept predictors of different qualities, we proceed to train the frame-wise violence predictor. Using the concept ground truth as a substitute for concept predictions will likely yield poor results: the system would learn, for example, to associate blood with violence, then provide inaccurate violence predictions on the test set where we only have highly inaccurate blood predictions. Instead, we train on the real-valued concept predictor outputs obtained during the cross-validation described in Section 5.3.1. This allows the system to learn which predictions to trust and which to ignore.

The violence predictor achieves precision and recall values of 28.29% and 37.64%, respectively, and an F1-score of 32.3% (results are obtained for the optimal binarization threshold). The results are very promising considering the difficulty of the task and the diversity of movies. The fact that we obtain better performance compared to the detection of individual concepts may be due to the fact that violent scenes often involve the occurrence of several concepts, not only one, which may compensate for some low concept detection performance. A comparison with other techniques is presented in the following section.

5.4 MediaEval 2012 results
In the final experiment we present a comparison of the performance of our violence predictor in the context of the 2012 MediaEval Benchmarking Initiative for Multimedia Evaluation, Affect task: Violent Scenes Detection [2, 17].

In this task, participants were provided with the development data set (15 movies) for training their approaches, while the official evaluation was carried out on 3 test movies: “Dead Poets Society” (34 violent scenes), “Fight Club” (310 violent scenes) and “Independence Day” (371 violent scenes), a total of 715 violent scenes (ground truth for the test set was released after the competition). A total of 8 teams participated, providing 36 runs. Evaluation was conducted both at video shot level and at segment level (arbitrary length). The results are discussed in the sequel.

5.4.1 Shot-level results
In this experiment, video shots (shot segmentation was provided by the organizers [1, 2]) are tagged as being “violent” or “non-violent”. Frame-to-shot aggregation is carried out as presented in Section 3.2. Performance assessment is conducted on a per-shot basis. To highlight the contribution of the concepts, our approach is assessed with different feature combinations (see Table 3). A summary of the best team runs is presented in Table 2 (results are presented in decreasing order of F1-score values).

Table 2: Violence shot-level detection results at MediaEval 2012 [2][17].

The use of mid-level concept predictions and a multi-layer perceptron (see ARF-(c)) ranked first and achieved the highest F1-score of 49.94%, that is, an improvement of more than 6 percentage points over the other teams’ best run, i.e., team ShanghaiHongkong [30] with an F1-score of 43.73%. For our approach, the lowest discriminative power is provided by using only the visual descriptors (see ARF-(v)), where the F1-score is only 35.65%. Compared to visual features, audio features seem to show better descriptive power, providing the second best F1-score of 46.27%. The combination of descriptors (early fusion) tends to reduce their efficiency and yields lower performance than the use of concepts alone, e.g., audio-visual (see ARF-(av)) yields an F1-score of 44.58%, while audio-visual-concepts (see ARF-(avc)) yields 42.44%.

Another observation is that, despite the use of general purpose descriptors (see Section 5.2), the representation of feature information via mid-level concepts allows better performance than other, more elaborate content description or classification approaches, such as the use of SIFTs, B-o-AW of MFCC or motion information.

Figure 3 details the precision-recall curves for our approach. One may observe that the use of concepts alone (red line) provides significantly higher recall than the sole use of audio-visual features or the combination of all for a precision of 25% and above.

[Figure 3: Precision-recall curves for the proposed approach (horizontal axis: recall).]

5.4.2 Arbitrary segment-level results
The final experiment is conducted at segment level. Video segments of arbitrary length are tagged as “violent” or “non-violent”. Frame-to-segment integration is carried out as presented in Section 3.2. The performance assessment in this case is conducted on a per-unit-of-time basis.

Using the mid-level concepts, we achieve average precision
and recall values of 42.21% and 40.38%, respectively, while the F1-score amounts to 41.27%. This yields a miss rate (at time level) of 50.69% and a very low false alarm rate of only 6%. These results are also very promising considering the difficulty of detecting precisely the exact time interval of violent scenes, but also the subjectivity of the human assessment (reflected in the ground truth). Comparison with other approaches was not possible in this case as all other teams provided only shot-level detection.

5.4.3 Violence detection examples
Figure 4 illustrates an example of violent segments detected by our approach in the movie “Independence Day”. For visualization purposes, some of the segments are depicted with a small vignette of a representative frame.

[Figure 4: Examples of violent segment detection in the movie “Independence Day” (the horizontal axis is the time axis in frames, covering frames 87000 to 92000; the values on the vertical axis are arbitrary; ground truth is depicted in red and the detected segments in blue).]

In general, the method performed very well on the movie segments related to action (e.g., involving explosions, firearms, fire, screams) and tends to be less efficient for segments where violence is encoded in the meaning of human actions (e.g., fist fights or car chases). Examples of false detections are due to visual effects that share similar audio-visual signatures with the violence-related concepts. Common examples include accentuated fist hits, loud sounds or the presence of fire not related to violence (e.g., see the rocket launch or the fighter flight formation in Figure 4, first two images). Misdetection is in general caused by limited accuracy of the concept predictors (see last image in Figure 4, where some local explosions filmed from a distance have been missed).

6. CONCLUSIONS
We presented a naive approach to the issue of violence detection in Hollywood movies. Instead of using content descriptors to learn directly how to predict violence, as most of the existing approaches do, the proposed approach relies on an intermediate step consisting of predicting mid-level violence concepts. Predicting mid-level concepts from low-level features seems to be more feasible than directly predicting all forms of violence. Predicting violence from mid-level concepts proves to be much easier than using the low-level content features directly. Content classification is performed with a multi-layer perceptron whose parallel architecture fits well the target of labeling individual video frames. The approach is naive in the sense of its simplicity. Nevertheless, its efficiency in predicting arbitrary length violence segments is remarkable. The proposed approach ranked first in the context of the 2012 Affect Task: Violent Scenes Detection at MediaEval Multimedia Benchmark (out of 36 total submissions). However, the main limitation of the method is its dependence on a detailed annotation of violent concepts, inheriting at some level its human subjectivity. Future improvements will include exploring the use of other information sources, such as text (e.g., subtitles that are usually provided with movie DVDs).

7. ACKNOWLEDGMENTS
This work was supported by the Austrian Science Fund (FWF): Z159 and P22856-N23, and by the research grant EXCEL POSDRU/89/1.5/S/62557. We acknowledge the 2012 Affect Task: Violent Scenes Detection of the MediaEval Multimedia Benchmark http://www.multimediaeval.org for providing the test data set that has been supported, in part, by the Quaero Program http://www.quaero.org. We also acknowledge the violence annotations, shot detections and key frames made available by Technicolor [1].

8. REFERENCES
[1] Technicolor, http://www.technicolor.com, last accessed 2012.
[2] C.-H. Demarty, C. Penet, G. Gravier, M. Soleymani, “The MediaEval 2012 Affect Task: Violent Scenes Detection in Hollywood Movies”, Working Notes Proc. of the MediaEval 2012 Workshop [17], http://ceur-ws.org/Vol-927/mediaeval2012_submission_3.pdf.
[3] Y. Gong, W. Wang, S. Jiang, Q. Huang, W. Gao, “Detecting Violent Scenes in Movies by Auditory and Visual Cues”, 9th Pacific Rim Conf. on Multimedia: Advances in Multimedia Information Processing, pp. 317-326, Springer-Verlag, 2008.
[4] A. Hanjalic, L. Xu, “Affective Video Content Representation and Modeling”, IEEE Trans. on Multimedia, pp. 143-154, 2005.
[5] M. Soleymani, G. Chanel, J. J. Kierkels, T. Pun, “Affective Characterization of Movie Scenes based on Multimedia Content Analysis and User’s Physiological Emotional Responses”, IEEE Int. Symp. on Multimedia, Berkeley, California, USA, 2008.
[6] M. Xu, J. Wang, X. He, J. S. Jin, S. Luo, H. Lu, “A Three-level Framework for Affective Content Analysis and its Case Studies”, Multimedia Tools and Applications, 2012.
[7] E. Açar, S. Albayrak, “DAI Lab at MediaEval 2012 Affect Task: The Detection of Violent Scenes using
Affective Features”, Working Notes Proc. of the MediaEval 2012 Workshop [17], http://ceur-ws.org/Vol-927/mediaeval2012_submission_33.pdf.
[8] A. Datta, M. Shah, N. Da Vitoria Lobo, “Person-on-Person Violence Detection in Video Data”, Int. Conf. on Pattern Recognition, 1, p. 10433, Washington, DC, USA, 2002.
[9] W. Zajdel, J. Krijnders, T. Andringa, D. Gavrila, “CASSANDRA: Audio-Video Sensor Fusion for Aggression Detection”, IEEE Conf. on Advanced Video and Signal Based Surveillance, pp. 200-205, London, UK, 2007.
[10] E. Bermejo, O. Deniz, G. Bueno, R. Sukthankar, “Violence Detection in Video using Computer Vision Techniques”, Int. Conf. on Computer Analysis of Images and Patterns, LNCS 6855, pp. 332-339, 2011.
[11] Fillipe D. M. de Souza, Guillermo C. Chávez, Eduardo A. do Valle Jr., Arnaldo de A. Araújo, “Violence Detection in Video Using Spatio-Temporal Features”, 23rd SIBGRAPI Conf. on Graphics, Patterns and Images, pp. 224-230, 2010.
[12] J. Nam, M. Alghoniemy, A.H. Tewfik, “Audio-Visual Content-Based Violent Scene Characterization”, IEEE Int. Conf. on Image Processing, 1, pp. 353-357, 1998.
[13] A.F. Smeaton, B. Lehane, N.E. O’Connor, C. Brady, G. Craig, “Automatically Selecting Shots for Action Movie Trailers”, 8th ACM Int. Workshop on Multimedia Information Retrieval, pp. 231-238, 2006.
[14] T. Giannakopoulos, A. Makris, D. Kosmopoulos, S. Perantonis, S. Theodoridis, “Audio-Visual Fusion for Detecting Violent Scenes in Videos”, Artificial Intelligence: Theories, Models and Applications, LNCS 6040, pp. 91-100, 2010.
[15] J. Lin, W. Wang, “Weakly-Supervised Violence Detection in Movies with Audio and Video based Co-training”, 10th Pacific Rim Conf. on Multimedia: Advances in Multimedia Information Processing, pp. 930-935, Springer-Verlag, 2009.
[16] C. Penet, C.-H. Demarty, G. Gravier, P. Gros, “Multimodal Information Fusion and Temporal Integration for Violence Detection in Movies”, IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Kyoto, 2012.
[17] MediaEval Benchmarking Initiative for Multimedia Evaluation Workshop, Pisa, Italy, October 4-5, 2012, CEUR-WS.org, ISSN 1613-0073, http://www.multimediaeval.org/, last accessed 2012.
[18] W.-H. Cheng, W.-T. Chu, J.-L. Wu, “Semantic Context Detection based on Hierarchical Audio Models”, ACM Int. Workshop on Multimedia Information Retrieval, pp. 109-115, 2003.
[19] L.-H. Chen, H.-W. Hsu, L.-Y. Wang, C.-W. Su, “Violence Detection in Movies”, 8th Int. Conf. Computer Graphics, Imaging and Visualization, pp. 119-124, 2011.
[20] A. Hanjalic, “Shot-Boundary Detection: Unraveled and Resolved?”, IEEE Trans. on Circuits and Systems for Video Technology, 12(2), pp. 90-105, 2002.
[21] B. Ionescu, V. Buzuloiu, P. Lambert, D. Coquin, “Improved Cut Detection for the Segmentation of Animation Movies”, IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, France, 2006.
[22] J. Van de Weijer, C. Schmid, J. Verbeek, D. Larlus, “Learning color names for real-world applications”, IEEE Trans. on Image Processing, 18(7), pp. 1512-1523, 2009.
[23] O. Ludwig, D. Delgado, V. Goncalves, U. Nunes, “Trainable Classifier-Fusion Schemes: An Application To Pedestrian Detection”, IEEE Int. Conf. on Intelligent Transportation Systems, 1, pp. 432-437, St. Louis, 2009.
[24] Yaafe core features, http://yaafe.sourceforge.net/, last accessed 2012.
[25] G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, “Improving Neural Networks by Preventing Co-Adaptation of Feature Detectors”, arXiv.org, http://arxiv.org/abs/1207.0580, 2012.
[26] D. E. Rumelhart, G. E. Hinton, R. J. Williams, “Learning Representations by Back-Propagating Errors”, Nature, 323, pp. 533-536, 1986.
[27] C. Liu, L. Xie, H. Meng, “Classification of Music and Speech in Mandarin News Broadcasts”, Int. Conf. on Man-Machine Speech Communication, China, 2007.
[28] V. Lam, D.-D. Le, S.-P. Le, Shin’ichi Satoh, D.A. Duong, “NII Japan at MediaEval 2012 Violent Scenes Detection Affect Task”, Working Notes Proc. of the MediaEval 2012 Workshop [17], http://ceur-ws.org/Vol-927/mediaeval2012_submission_21.pdf.
[29] F. Eyben, F. Weninger, N. Lehment, G. Rigoll, B. Schuller, “Violent Scenes Detection with Large, Brute-forced Acoustic and Visual Feature Sets”, Working Notes Proc. of the MediaEval 2012 Workshop [17], http://ceur-ws.org/Vol-927/mediaeval2012_submission_25.pdf.
[30] Y.-G. Jiang, Q. Dai, C.C. Tan, X. Xue, C.-W. Ngo, “The Shanghai-Hongkong Team at MediaEval2012: Violent Scene Detection Using Trajectory-based Features”, Working Notes Proc. of the MediaEval 2012 Workshop [17], http://ceur-ws.org/Vol-927/mediaeval2012_submission_28.pdf.
[31] N. Derbas, F. Thollard, B. Safadi, G. Quénot, “LIG at MediaEval 2012 Affect Task: use of a Generic Method”, Working Notes Proc. of the MediaEval 2012 Workshop [17], http://ceur-ws.org/Vol-927/mediaeval2012_submission_39.pdf.
[32] V. Martin, H. Glotin, S. Paris, X. Halkias, J.-M. Prevot, “Violence Detection in Video by Large Scale Multi-Scale Local Binary Pattern Dynamics”, Working Notes Proc. of the MediaEval 2012 Workshop [17], http://ceur-ws.org/Vol-927/mediaeval2012_submission_43.pdf.
[33] C. Penet, C.-H. Demarty, M. Soleymani, G. Gravier, P. Gros, “Technicolor/INRIA/Imperial College London at the MediaEval 2012 Violent Scene Detection Task”, Working Notes Proc. of the MediaEval 2012 Workshop [17], http://ceur-ws.org/Vol-927/mediaeval2012_submission_26.pdf.
[34] J. Schlüter, B. Ionescu, I. Mironică, M. Schedl, “ARF @ MediaEval 2012: An Uninformed Approach to Violence Detection in Hollywood Movies”, Working Notes Proc. of the MediaEval 2012 Workshop [17], http://ceur-ws.org/Vol-927/mediaeval2012_submission_36.pdf.
[35] S. Paris, H. Glotin, “Pyramidal Multi-level Features for the Robot Vision @ICPR 2010 Challenge”, 20th Int. Conf. on Pattern Recognition, pp. 2949-2952, Marseille, France, 2010.
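
As a supplementary illustration, the temporal smoothing of Section 5.1 and the frame-to-segment and frame-to-shot aggregation of Section 3.2 might be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the function names `median_smooth`, `frames_to_segments` and `tag_shots` are hypothetical, the filter length is assumed to be odd, windows are simply truncated at the sequence borders (the paper does not say how borders are handled), and shot boundaries are assumed to be given as (start, end) frame indices with an exclusive end.

```python
from statistics import median


def median_smooth(values, length):
    """Sliding median filter of odd `length`; windows are truncated at the borders."""
    half = length // 2
    return [median(values[max(0, i - half):i + half + 1]) for i in range(len(values))]


def frames_to_segments(scores, threshold):
    """Split frames into runs of equal label.

    Returns (start, end, segment_score, violent) tuples with an exclusive end.
    """
    segments, start = [], 0
    for i in range(1, len(scores) + 1):
        if i == len(scores) or (scores[i] > threshold) != (scores[start] > threshold):
            segment_score = max(scores[start:i])  # highest predictor output in the segment
            segments.append((start, i, segment_score, segment_score > threshold))
            start = i
    return segments


def tag_shots(scores, shot_bounds, threshold):
    """Tag each shot (start, end) by the highest frame score inside it."""
    return [(s, e, max(scores[s:e]) > threshold) for s, e in shot_bounds]


# Toy frame-wise violence scores; the paper smooths concept predictions over 125 frames.
scores = [0.1, 0.2, 0.9, 0.8, 0.1, 0.1]
segments = frames_to_segments(scores, threshold=0.5)
shots = tag_shots(scores, shot_bounds=[(0, 3), (3, 6)], threshold=0.5)
```

With these toy scores, the frames split into three runs (non-violent, violent, non-violent), while both shots are tagged as violent because each contains at least one frame above the threshold. This shows how taking the maximum score favors recall at shot level.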