Emotion Recognition using Deep Convolutional Neural Networks TU Delft IN4015

Abstract

A key step in the humanization of robotics is the ability to classify the emotion of the human operator. In this paper we present the design of an artificially intelligent system capable of emotion recognition through facial expressions. Three promising neural network architectures are customized, trained, and subjected to various classification tasks, after which the best performing network is further optimized. The applicability of the final model is demonstrated in a live video application that can instantaneously return the emotion of the user.

1 Introduction

Ever since computers were developed, scientists and engineers have thought of artificially intelligent systems that are mentally and/or physically equivalent to humans. In the past decades, the increase of generally available computational power provided a helping hand for developing fast learning machines, whereas the internet supplied an enormous amount of data for training. These two developments boosted the research on smart self-learning systems, with neural networks among the most promising techniques.

1.1 Backgrounds

One of the current top applications of artificial intelligence using neural networks is the recognition of faces in photos and videos. Most techniques process visual data and search for general patterns present in human faces. Face recognition can be used for surveillance purposes by law enforcers as well as in crowd management. Other present-day applications involve automatic blurring of faces in Google Streetview footage and automatic recognition of Facebook friends in photos.

An even more advanced development in this field is emotion recognition. In addition to only identifying faces, the computer uses the arrangement and shape of e.g. eyebrows and lips to determine the facial expression and hence the emotion of a person. One possible application for this lies in the area of surveillance and behavioural analysis by law enforcement. Furthermore, such techniques are used in digital cameras to automatically take pictures when the user smiles. However, the most promising applications involve the humanization of artificially intelligent systems. If computers are able to keep track of the mental state of the user, robots can react upon this and behave appropriately. Emotion recognition therefore plays a key role in improving human-machine interaction.

1.2 Research objective

In this research we mainly focus on neural network based artificially intelligent systems capable of deriving the emotion of a person through pictures of his or her face. Different approaches from existing literature will be experimented with, and the results of various choices in the design process will be evaluated. The main research question therefore reads as follows: How can an artificial neural network be used for interpreting the facial expression of a human?

The remainder of this article describes the several steps taken to answer the main research question, i.e. the sub-questions. In section 2, a literature survey will clarify what the role of facial expressions is
in emotion recognition and what types of networks are suitable for automated image classification. The third section explains how the neural networks under consideration are structured and how the networks are trained. Section 4 describes how the final model performs, after which a conclusion and some recommendations follow in the last section. It may be noted that the aim of our work is not to design an emotion recognizer from scratch, but rather to review design choices and enhance existing techniques with some new ideas.

2 Literature review

For the development of a system that is able to recognize emotions through facial expressions, previous research on the way humans reveal emotions as well as the theory of automatic image categorization is reviewed. In the first part of this section, the role of interpreting facial expressions in emotion recognition will be discussed. The latter part surveys previous studies on automatic image classification.

2.1 Human emotions

A key feature in human interaction is the universality of facial expressions and body language. Already in the nineteenth century, Charles Darwin published on globally shared facial expressions that play an important role in non-verbal communication [3]. In 1971, Ekman & Friesen declared that facial behaviours are universally associated with particular emotions [5]. Apparently humans, but also animals, develop similar muscular movements belonging to a certain mental state, regardless of their place of birth, race, education, etcetera. Hence, if properly modelled, this universality can be a very convenient feature in human-machine interaction: a well-trained system can understand emotions, independent of who the subject is.

One should keep in mind that facial expressions are not necessarily directly translatable into emotions, nor vice versa. Facial expression is additionally a function of e.g. mental state, while emotions are also expressed via body language and voice [6]. More elaborate emotion recognition systems should therefore also include these latter two contributions. However, this is out of the scope of this research and will remain a recommendation for future work. Readers interested in research on emotion classification via speech recognition are referred to Nicholson et al. [14]. As a final point of attention, emotions should not be confused with mood, since mood is considered to be a long-term mental state. Accordingly, mood recognition often involves longstanding analysis of someone's behaviour and expressions, and will therefore be omitted in this work.

2.2 Image classification techniques

The growth of available computational power on consumer computers in the beginning of the twenty-first century gave a boost to the development of algorithms used for interpreting pictures. In the field of image classification, two starting points can be distinguished. On the one hand, pre-programmed feature extractors can be used to analytically break down several elements in the picture in order to categorize the object shown. Directly opposed to this approach, self-learning neural networks provide a form of 'black-box' identification technique. In the latter concept, the system itself develops rules for object classification by training upon labelled sample data.

An extensive overview of analytical feature extractors and neural network approaches for facial expression recognition is given by Fasel and Luettin [6]. It can be concluded that at the time of writing, at the beginning of the twenty-first century, both approaches worked approximately equally well. However, given the current availability of training data and computational power, it is expected that the performance of neural network based models can be significantly improved by now. Some recent achievements are listed below.

(i) A breakthrough publication on automatic image classification in general is given by Krizhevsky and Hinton [9]. This work shows a deep neural network that resembles the functionality of the human visual cortex. Using a self-developed labelled collection of 60000 images over 10 classes, called the CIFAR-10 dataset, a model to categorize objects from pictures is obtained. Another important outcome of the research is the visualization of the filters in the network, such that it can be assessed how the model breaks down the
Figure 1: Samples from the FERC-2013 (left), CK+ (center), and RaFD (right) datasets
3 Experiment setup

To assess the three approaches mentioned previously on their capability of emotion recognition, we developed three networks based on the concepts from [9], [10], and [7]. This section describes the data used for training and testing, explains the details of each network, and evaluates the results obtained with all three models.

3.1 Dataset

Neural networks, and deep networks in particular, are known for their need for large amounts of training data. Moreover, the choice of images used for training is responsible for a big part of the performance of the eventual model. This implies the need for a dataset of both high quality and large quantity. For emotion recognition, several datasets are available for research, varying from a few hundred high resolution photos to tens of thousands of smaller images. The three we will discuss are the Facial Expression Recognition Challenge (FERC-2013) [8], Extended Cohn-Kanade (CK+) [12], and Radboud Faces Database (RaFD) [11], all shown in figure 1.

The datasets differ mainly in quantity, quality, and 'cleanness' of the images. The FERC-2013 set for example has about 32000 low resolution images, whereas the RaFD provides 8000 high resolution photos. Furthermore, it can be noticed that the facial expressions in the CK+ and RaFD are posed (i.e. 'clean'), while the FERC-2013 set shows emotions 'in the wild'. This makes the pictures from the FERC-2013 set harder to interpret, but given the large size of the dataset, the diversity can be beneficial for the robustness of a model.

We reason that, once trained upon the FERC-2013 set, images from 'clean' datasets can easily be classified, but not vice versa. Hence for the three networks under consideration, training will be done using 9000 samples from the FERC-2013 data (see figure 2) with another 1000 new samples for validation. Subsequently, testing will be done with 1000 images from the RaFD set to get an indication of performance on clean high quality data. This latter set has an even distribution over all emotions.

Figure 2: Number of images per emotion in the training set

Please note that non-frontal faces and pictures with the label contemptuous are taken out of the RaFD data, since these are not represented in the FERC-2013 training set. Furthermore, with use of the Haar Feature-Based Cascaded Classifier inside
the OpenCV framework [15], all data is preprocessed. For every image, only the square part containing the face is taken, rescaled, and converted to an array of 48x48 grey-scale values.

3.2 Networks

The networks are programmed with use of the TFLearn library on top of TensorFlow, running on Python. This environment lowers the complexity of the code, since only the neuron layers have to be created, instead of every neuron. The program also provides real-time feedback on training progress and accuracy, and makes it easy to save and reuse the model after training. More details on this framework can be found in reference [16].

(A) The first network to test is based on the previously described research by Krizhevsky and Hinton [9]. This is the smallest network of the three, which means that it has the lowest computational demands. Since one of the future applications might be in the form of live emotion recognition in embedded systems, fast working algorithms are beneficial.

The network consists of three convolutional layers and two fully connected layers, combined with maxpooling layers for reducing the image size and a dropout layer to reduce the chance of overfitting. The hyperparameters are chosen such that the number of calculations in each convolutional layer remains roughly the same. This ensures that information is preserved throughout the network. Training is performed using different numbers of convolutional filters to evaluate their effect on the performance.

(B) [...] layers; the number of nodes of each fully connected layer was reduced from 4096 to 1024. While the original network was divided for parallel training, it was observed that this was not necessary for the smaller version. The network also makes use of local normalization to speed up the training, and dropout layers in order to reduce overfitting.

(C) The last experiments are performed on a network based on the work of Gudi [7]. Since this research also aimed at recognizing 7 emotions using the FERC-2013 dataset, the architecture should be a good starting point for our research. The original network starts with an input layer of 48 by 48, matching the size of the input data. This layer is followed by one convolutional layer, a local contrast normalization layer, and a maxpooling layer, respectively. The network is finished with two more convolutional layers and one fully connected layer, connected to a softmax output layer. Dropout was applied to the fully connected layer, and all layers contain ReLU units.

For our research, a second maxpooling layer is applied to reduce the number of parameters. This lowers the computational intensity of the network, while the reduction in performance is claimed to be only 1-2%. Furthermore, the learning rate is adjusted. Instead of linearly decreasing the learning rate as done by Gudi [7], we believe an update rule which makes use of momentum would converge faster, as momentum increases the effective step size while the gradient keeps pointing in the same direction.
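The benefit of the extra maxpooling layer can be illustrated with a quick shape calculation. The filter count and node count below are hypothetical placeholders (the text only fixes the 48x48 input and the layer order), but the arithmetic shows why halving each spatial dimension of the final feature map cuts the fully connected layer's weights by a factor of four:

```python
# Sketch: effect of a second 2x2 max-pooling layer on the size of the
# fully connected layer that follows ('same' padding assumed for convolutions,
# so only pooling changes the spatial dimensions).
def pool(size, k=2):
    # 2x2 max-pooling halves each spatial dimension
    return size // k

def fc_weights(side, channels, fc_nodes):
    # weight count of a fully connected layer reading a side x side x channels map
    return side * side * channels * fc_nodes

F, FC = 64, 1024              # hypothetical filter and node counts
side = pool(48)               # first max-pooling: 48 -> 24
one_pool = fc_weights(side, F, FC)

side = pool(side)             # added second max-pooling: 24 -> 12
two_pool = fc_weights(side, F, FC)

print(one_pool // two_pool)   # the FC layer shrinks by a factor of 4
```

The convolutional layers themselves keep the same number of weights; the saving comes almost entirely from the fully connected layer, which dominates the parameter count.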
5
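The momentum argument can be made concrete with a few lines. This is a generic sketch of classical momentum (not necessarily the exact optimizer used in our experiments): while successive gradients keep the same sign, the velocity term accumulates towards lr*grad/(1 - mu), i.e. up to ten times the plain gradient step for mu = 0.9.

```python
# Sketch: velocity build-up of classical momentum under a constant gradient,
# i.e. a gradient that "keeps pointing in the same direction".
def momentum_velocity(grad, steps, lr=0.01, mu=0.9):
    v = 0.0
    for _ in range(steps):
        v = mu * v - lr * grad   # velocity accumulates while the sign persists
    return v

plain_step = 0.01 * 1.0                      # lr * grad, the plain SGD step
built_up = abs(momentum_velocity(1.0, 200))  # approaches lr*grad/(1-mu)
print(built_up / plain_step)                 # close to 10
```

In flat, consistent regions of the loss surface this larger effective step is what speeds up convergence compared to a fixed or linearly decreasing learning rate.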
Emotion Recognition using Deep Convolutional Neural Networks TU Delft IN4015
Figure 3: Accuracy on the validation set during training epochs (validation accuracy of networks A, B, and C, ranging from about 0.25 to 0.45 over 60 epochs)

[...] on the processing time. This means that fast models can be made with very reasonable performance.

Table 1: Details of the trained networks

Network   Validation accuracy   RaFD accuracy   Size
A         63%                   50%             Small
B         53%                   46%             Large
C         63%                   60%             Medium

The last described network from section 3.2 was observed to have the most promising performance for practical applications. An overview of its architecture is shown in figure 4. The source files for this network, as well as other scripts used for this project, can be found on https://github.com/isseu/emotion-recognition-neural-networks.

As can be seen from figure 3, the accuracy seems to still increase in the last epochs. We will therefore train the network for 100 epochs in the final run, to make sure the accuracy converges to the optimum. In an attempt to improve the final model even more, the network will be trained on a larger set than the one described previously. Instead of 9000 pictures, training will be done with 20000 pictures from the FERC-2013 dataset. The ratios of the emotions present in this set are given in figure 5. Newly composed validation (2000 images) and test sets (1000 images) from the FERC-2013 dataset are used as well, together with the well-balanced RaFD test set from the previous experiment.
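Composing these enlarged splits can be sketched as follows. The selection procedure is our illustration (the paper does not specify how the indices were drawn); only the set sizes come from the text:

```python
import random

def make_splits(n_total=32000, n_train=20000, n_val=2000, n_test=1000, seed=0):
    """Draw disjoint train/validation/test index sets from one labelled dataset."""
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)                      # randomize before slicing
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

train, val, test = make_splits()
print(len(train), len(val), len(test))  # 20000 2000 1000
```

Keeping the three index sets disjoint is essential here: validation steers the 100-epoch run, so any overlap with the test images would inflate the reported accuracy.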
[...] more data and longer training can improve the performance of a network. Given that the state-of-the-art networks from previous research obtained about 67% on test sets, and keeping in mind our limited resources, the results are in fact quite good. Notable is the accuracy on the RaFD test set, which contains completely different pictures from the training data. This illustrates the powerful generalizing capabilities of this final model.

Table 2: Accuracy of the networks

Network   FERC-2013 validation   FERC-2013 test   RaFD
A         63%                    -                50%
B         53%                    -                46%
C         63%                    -                60%
Final     66%                    63%              71%

To see how the model performs per emotion, a table is generated, depicted in figure 6. Very high accuracy rates are obtained on happy (90%), neutral (80%), and surprised (77%). These are in fact the most distinguishable facial expressions according to humans as well. Sad, fearful, and angry, however, are often misclassified as neutral; apparently these emotions look very much alike. The lowest accuracy is obtained on sad (28%) and fearful (37%). Finally, it is noteworthy that even though the percentage of data with label disgusted in the training set is low, the classification rate is very reasonable. In general the main diagonal, showing the right classification, can be distinguished clearly.

4.2 Live application

As already mentioned, live emotion recognition through video is one of the most important key points in human-machine interaction. To show the capabilities of the obtained network, an application is developed that can directly process webcam footage through the final model.

With use of the aforementioned OpenCV face recognition program [15], the biggest appearing face in real-time video is tracked, extracted, and scaled to usable 48x48 input. This data is then fed to the input of the neural network model, which in turn returns the values of the output layer. These values represent the likelihood that each emotion is depicted by the user. The output with the highest value is assumed to be the current emotion of the user, and is depicted by an emoticon on the left of the screen. Figures 7 to 13 show the live application and the interaction with the authors of this research and with live television.

Although it is hard to objectively assess, the live application shows promising performance. However, it encounters problems when shadows are present on the face of the subject. All emotions are easily recognized when acted by the user, and when pointing the camera at the television, most emotions in the wild can be classified. This once again emphasizes the power of using neural network based models for future applications in emotion recognition.
Figure 6: Performance matrix of the final model. Vertically the input, horizontally the output.
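The performance matrix of figure 6 can be read as a confusion matrix: the per-emotion accuracy quoted in section 4 is each row's diagonal entry divided by the row sum. A minimal sketch with a made-up three-class matrix (not the paper's actual counts):

```python
# Per-class accuracy from a confusion matrix laid out as in figure 6:
# rows = input (true label), columns = output (predicted label).
def per_class_accuracy(matrix):
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

# hypothetical 3-emotion matrix with 100 samples per true label
conf = [
    [90, 5, 5],    # first emotion recognized correctly 90% of the time
    [10, 80, 10],
    [30, 30, 40],  # weakest class: often confused with the other two
]
print(per_class_accuracy(conf))  # [0.9, 0.8, 0.4]
```

Off-diagonal entries show where the mass leaks, e.g. the reported tendency of sad, fearful, and angry inputs to be classified as neutral.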