
Available online at www.sciencedirect.com

ScienceDirect

Cognitive Systems Research 52 (2018) 212–222
www.elsevier.com/locate/cogsys
https://doi.org/10.1016/j.cogsys.2018.06.017

Wild facial expression recognition based on incremental active learning

Minhaz Uddin Ahmed a, Kim Jin Woo a, Kim Yeong Hyeon a, Md. Rezaul Bashar b, Phill Kyu Rhee a,*

a Computer Engineering Department, Inha University, 100 Inha-ro, Nam-gu 22212, Incheon, Republic of Korea
b Science, Technology and Management Crest, Sydney, Australia

* Corresponding author. E-mail address: pkrhee@inha.ac.kr (P.K. Rhee).

Received 29 March 2018; received in revised form 15 June 2018; accepted 28 June 2018
Available online 6 July 2018

Abstract

Facial expression recognition in a wild situation is a challenging problem in computer vision research due to different circumstances, such as pose dissimilarity, age, lighting conditions, and occlusions. Numerous methods, such as point tracking, piecewise affine transformation, compact Euclidean space, the modified local directional pattern, and dictionary-based component separation, have been applied to solve this problem. In this paper, we propose a deep learning–based automatic wild facial expression recognition system in which we implement an incremental active learning framework using the VGG16 model developed by the Visual Geometry Group. We gathered a large amount of unlabeled facial expression data from Intelligent Technology Lab (ITLab) members at Inha University, Republic of Korea, to train our incremental active learning framework. We collected these data under five different environments: good lighting, average lighting, close to the camera, far from the camera, and natural lighting, and with seven facial expressions: happy, disgusted, sad, angry, surprised, fearful, and neutral. Our face detection framework is adapted from a multi-task cascaded convolutional network detector. Repeating the entire process helps obtain better performance. Our experimental results demonstrate that incremental active learning improves the starting baseline accuracy of 63% to an average of 88% on the ITLab dataset in a wild environment. We also present extensive results on a facial expression benchmark, the Extended Cohn-Kanade Dataset, as well as on the ITLab face dataset captured in a wild environment, and obtain better performance than state-of-the-art approaches.
© 2018 Published by Elsevier B.V.

Keywords: Expression recognition; Emotion classification; Face detection; Convolutional neural network; Active learning

1. Introduction

Facial expressions are a form of nonverbal communication for social interactions. These days, the self-portrait photograph (called a selfie) is used on many different social networking websites, where images are captured in uncontrolled situations due to the availability of cheaper digital cameras and smartphones. Many of these images contain different facial expressions, such as happiness, sadness, disgust, anger, fear, and surprise. In our daily conversations, 55 percent of our feelings are expressed nonverbally or in a facial expression (Ekman, 1993; Mehrabian, 1968; Zhang, Zhao, Morvan, & Chen, 2018). The main challenge in facial expression recognition (FER) is that facial expressions vary in different situations, such as with the person's mood, age, and skin color, and under different lighting conditions. Frontal views of faces with different expressions have a number of applications, such as human-computer interfaces, surveillance systems, census systems, virtual reality, and customer satisfaction, all of which rely greatly on efficient face detection accuracy (Osuna, Freund, & Girosi, 1997).
A survey on FER by Fasel and Luettin (2003) covered different prominent automatic facial expression analysis methods. Moreover, Sharma (2011) obtained 63.33% classification accuracy with a feature point tracking technique, including 50% accuracy for angry and 60% accuracy for sad images. Different feature extraction methods can be applied to solve FER challenges, such as geometric measurement, locally linear discriminant embedding (LLDE) (Li, Zheng, & Huang, 2008), principal component analysis (PCA), quadtree decomposition, adaptive boosting (AdaBoost) (Fasel & Luettin, 2003), and the modified census transform classifier (Froba & Ernst, 2004). Recent research has shown that deep convolutional neural networks increase recognition performance (Simonyan & Zisserman, 2014). However, a large volume of labeled data is necessary for training in order for a machine to accurately recognize facial expressions (Taigman, Yang, Ranzato, & Wolf, 2014). Moreover, creating a dataset with a large number of labeled emotions is tedious work due to the manual effort involved. Although manually labeled facial expression images increase training reliability, labeling is very time-consuming. Our incremental active learning framework is suitable for tackling this challenge. In our research, we have considered five different environments (good lighting, average lighting, close to the camera, far from the camera, and natural lighting) with seven facial expressions: happy, disgusted, sad, angry, surprised, afraid, and neutral. Generally, most facial expression recognition research considers six emotions (Ekman, 1993; Fasel & Luettin, 2003), omitting the neutral facial expression. In face appearance, a number of challenges exist, such as pose, occlusions, age, gender, and expression-intensity changes. In addition, most research used little data for training; therefore, another challenge is dealing with a massive amount of data.

The main contributions of this work are summarized below.

(I) This facial expression recognition method successfully works under different lighting conditions for both real-time and offline data.
(II) We have five different wild-environment datasets (good lighting, average lighting, face close to the camera, face far from the camera, and natural light), which is unique, and we also include experimental support in our work.
(III) Our incremental active learning method improved performance on both the ITLab and Extended Cohn-Kanade datasets (Kanade & Cohn, 2000).

We have organized this paper as follows. In Section 2, we review related work on facial expression recognition. In Section 3, we describe the proposed facial expression recognition system. In Section 4, we explain details about the dataset, the experimental environment, and the results. Finally, in Section 5, we discuss the conclusions drawn from the experiments and suggest future work.

2. Related work

A multilayer perceptron network (MLPN) using a constrained learning algorithm (CLA) is an effective approach to speed up the training process in neural connectionist approaches (Huang, 2004). The proposed method used an adaptive learning parameter, which aids initial weight selection for root finding in the MLPN and increases accuracy. Recently, convolutional neural networks (CNNs) have markedly advanced facial expression recognition research (Simonyan & Zisserman, 2014). Commonly, CNNs enhance traditional features by learning from millions of training samples (Taigman et al., 2014). A large amount of labeled data and a deep network are mandatory for accurate facial expression recognition using a CNN. In practice, however, it is hard to work with millions of annotated images and train on them from scratch. Therefore, it is common to train the network from a pre-trained CNN model.

In this paper, we have explored many different deep learning methods used in recent facial expression recognition work, such as Inception (Szegedy, Vanhoucke, Ioffe, Shlens, & Wojna, 2016) and VGG (Simonyan & Zisserman, 2014). Taigman et al. (2014) proposed DeepFace, a nine-layer framework that contains 120 million parameters; DeepFace reaches a fairly high accuracy of 97.35% on the Labeled Faces in the Wild dataset. Schroff, Kalenichenko, and Philbin (2015) proposed FaceNet, a data-driven method that directly learns a mapping from face images into a compact Euclidean space where distances correspond to face similarity. Liu (2015) proposed a two-stage combined approach based on a multi-patch deep CNN and metric learning, which outperformed different state-of-the-art methods.

Deep supervised auto-encoders were used by Gao, Zhang, Jia, Lu, and Zhang (2015), who extracted features robust to variances in illumination, different expressions, occlusions, and face poses for recognition. Zhu and Ramanan (2012) obtained surprising results on the "in the wild" face benchmark dataset. Their model used mixtures of trees with a shared pool of parts but was trained with only hundreds of faces. Kim, Lee, Roh, and Lee (2015) modified a deep network architecture, input normalization, and random weight initialization while training deep models. In order to classify six facial expressions (anger, happy, sad, surprise, neutral, and disgust), they constructed a hierarchical architecture with exponentially weighted decision fusion. Uddin et al. (2017) considered modified local directional pattern features of the face processed with generalized discriminant analysis and then applied a deep-belief network for better performance. A face-detection module based on an ensemble of three state-of-the-art face detectors was followed by a classification module with an ensemble of multiple deep CNNs. Yu (2015) assembled
three state-of-the-art face detectors followed by a classification module with a combination of multiple deep convolutional neural networks.

Taheri, Patel, and Chellappa (2013) proposed joint face and facial-expression recognition with a dictionary-based component separation algorithm. They decomposed an expressive test face into building components using two data-driven dictionaries, one for neutral components and another for expression components. Different morphological elements of the test face, together with the dictionaries, were used for face and expression recognition. Wang, Hu, and Deng (2017) applied a compressed Fisher vector for robust facial expression recognition. First, they put forward a new compact Fisher vector by zeroing out small posteriors, and they then calculated first-order statistics and reweighted them. Secondly, light iterative quantization and the compact Fisher vector (CFV) were applied together to encode convolutional activations of a CNN. Rifai, Bengio, Courville, Vincent, and Mirza (2012, chp. 58) applied a multi-scale contractive convolutional network to determine facial traits in an image. Burkert, Trier, Afzal, Dengel, and Liwicki (2015) proposed a CNN architecture with four parts (convolution, pooling, parallel feature extraction, and fully connected layers) that achieves high accuracy and is the most similar to our approach.

3. Overview of the proposed system

Our proposed system uses an incremental active learning method with a CNN for precise facial expression recognition. Labeling a large volume of FER data manually is a challenging issue for a deep-learning platform. On the other hand, inclusion of new FER data incurs the overfitting problem. A trained model infected with noisy data and substantial errors loses prediction power in ways a computer cannot correct on its own. In order to solve this problem, a better learning approach (labeling new data using the incremental active learning method) reduces the network training time due to less noisy data and correctly labeled FER.

Fig. 1. Illustration of the proposed incremental active learning approach.

A webcam (Logitech HD Webcam C270) is used to gather facial-expression images, as seen in Fig. 1. In the second step, we normalized the image intensity values and eliminated noise in the image by cropping. Here, color images are converted to grayscale, and contrast is enhanced by histogram equalization.

3.1. Noise removal and pre-processing

Fig. 2 shows intensity normalization, where the brightness of the light is a noise element that is a big barrier to facial recognition. The reason is that, when a shadow occurs, the pixel values at that position change. Dark, occluded images and bright images without occlusion have completely different pixel values, even if the facial expression in the image is the same. Adopting an intensity normalization method (Pizer et al., 1987) minimizes interference from illumination (Zhang et al., 2018).

Fig. 2. Intensity normalization.
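This pre-processing stage (crop, grayscale conversion, histogram equalization) can be sketched with OpenCV. This is a minimal illustration, not the paper's actual tool; the function name, box coordinates, and file name are hypothetical.

```python
import cv2

def preprocess_face(bgr_image, box):
    """Crop the detected face region, convert it to grayscale, and
    equalize the histogram to reduce illumination differences."""
    x, y, w, h = box                      # face box from the detector (illustrative)
    face = bgr_image[y:y + h, x:x + w]    # cropping removes background noise
    gray = cv2.cvtColor(face, cv2.COLOR_BGR2GRAY)
    return cv2.equalizeHist(gray)         # global histogram equalization

# Example usage (hypothetical file and box):
# img = cv2.imread("sample.jpg")
# normalized = preprocess_face(img, (40, 30, 128, 128))
```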
Our primary model is created with a large number of good facial expression sample images, trained against the Visual Geometry Group (VGG) pre-trained model. A large batch and a mini batch of image datasets were created using an image gathering tool developed by the


Intelligent Technology Lab (ITLab) at Inha University, Korea, as shown in Fig. 3. Each mini batch of images contains the seven facial expressions in five sets each, for a total of 35 images per batch. In order to get better performance, we increased the number of images of each facial expression by 10 and 15 per dataset; in that case, the total number of images becomes 70 and 105 per batch, respectively. We then trained on each mini batch, evaluated each batch of the dataset against the pre-trained model, and checked the performance score.

Fig. 3. Facial expression gathering and the dataset creation tool.

After the performance evaluation, if the predicted score was less than 0.9, we applied active learning to eliminate the low-score facial expression labels and replaced them with new images to improve the performance of the learning model. If the training score was higher than the previous training result, then we combined the batch with another mini batch of the dataset and trained again. If the performance was lower than the previous training result, the training data were discarded and a new batch of training datasets was used for training. This process continued until we reached saturation. Fig. 4 shows the training

tool's graphical user interface (GUI), where we can set fine-tuning parameters like the learning rate, batch size, number of epochs, and central processing unit (CPU) or graphics processing unit (GPU) execution. Active learning is performed when the result on the training data does not improve smoothly. A confidence value between 0 and 1 is considered, and the threshold value was set to 0.9 because a lower value biases the outcome of our experiment. If the labeled score is below the threshold value, we do not consider that image, replacing it with a new facial expression image.

Fig. 4. Face image dataset training tool.
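The 0.9 confidence rule can be expressed as a simple filter over softmax outputs; a minimal sketch, with illustrative names:

```python
import numpy as np

THRESHOLD = 0.9  # confidence threshold used in our experiments

def split_by_confidence(probs):
    """probs: (N, 7) softmax outputs for the seven expressions.
    Returns indices of images to keep (auto-labeled) and to replace."""
    scores = probs.max(axis=1)                  # highest class confidence per image
    keep = np.flatnonzero(scores >= THRESHOLD)
    replace = np.flatnonzero(scores < THRESHOLD)
    return keep, replace

# e.g. keep, replace = split_by_confidence(np.random.dirichlet(np.ones(7), size=35))
```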
Fig. 5 shows the active learning tool, which performs supervised learning that automatically assigns a new label and then iteratively learns unlabeled data through the model. This method greatly reduces the effort required to label new data. If a new dataset includes wrong labels, the training process can drastically reduce performance. Finally, the active learning method overcomes the overfitting problem by attaching the correct label (Cohn, Ghahramani, & Jordan, 1996).

Fig. 5. Active learning tool for face image manipulation.

3.2. Deep convolutional neural network

The convolutional neural network is one of the machine learning schemes that has received attention in recent years due to its performance in resolving computer vision problems (Burkert et al., 2015; Huang, Liu, van der Maaten, & Weinberger, 2016; Huynh, Tran, & Kim, 2016; Lawrence, Giles, Tsoi, & Back, 1997; Rifai et al., 2012, chp. 58). The CNN extracts feature vectors according to the network structure determined by the user (Gao et al., 2015; Rifai et al., 2012, chp. 58) and classifies them against ground-truth labels; the features from the trained network are well suited to the training data. The VGG deep learning 16-layer network was used for facial expression recognition (Simonyan & Zisserman, 2014). Instead of designing the network from scratch, we used the pre-trained VGG16 network for transfer learning (Li, Sun, & Xu, 2017). In order to recognize facial expressions, it is necessary to extract feature vectors that are classified by modifying the final layer of the VGG network so they can be identified through facial feature representation. When receiving a new face image, the classifier assigns one of seven values to the extracted feature vectors, where 1 denotes an angry expression, 2 denotes a disgusted expression, 3 denotes fear, 4 denotes smiling, 5 denotes sadness, 6 denotes a surprised facial expression, and 7 denotes neutral. The network adds very small (3 × 3) convolution filters to the existing VGG network, resulting in better performance for large-image recognition. Fig. 6 shows the structure of the VGG very deep 16 network used in our experiments (Huynh et al., 2016; Simonyan & Zisserman, 2014). The overall algorithm for facial expression classification in a wild environment using incremental active learning is presented in Algorithm 1.


Fig. 6. The VGG very deep 16 network model.
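The paper's models are trained with Caffe; as a hedged illustration of the same transfer-learning idea, the sketch below loads an ImageNet pre-trained VGG16 from torchvision and swaps the final fully connected layer for a seven-class expression head. The variable names are ours, not the authors'.

```python
import torch.nn as nn
from torchvision import models

# 1 angry, 2 disgusted, 3 fear, 4 smiling, 5 sadness, 6 surprised, 7 neutral
NUM_EXPRESSIONS = 7

# Load VGG16 pre-trained on ImageNet (ILSVRC) and replace the last
# fully connected layer so it outputs the seven expression classes.
model = models.vgg16(pretrained=True)
model.classifier[6] = nn.Linear(4096, NUM_EXPRESSIONS)

# Optionally freeze the convolutional feature extractor so only the
# new classifier head is fine-tuned on the facial expression data.
for param in model.features.parameters():
    param.requires_grad = False
```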

Algorithm 1. Wild facial expression classification using incremental active learning

Input: Labeled dataset, unlabeled dataset, pre-trained model
Output: Correctly labeled wild FER dataset and classification model
Method:
Step 1: Select pre-processed FE features from a training set
Repeat:
  Step 2: Train a model using Steps 3 and 4
  Step 3: For each image, predict the FE and find the maximum predicted score
  Step 4: If the prediction is false, apply AL:
    If the predicted score is more than the threshold value,
      combine with the previous dataset
    Else,
      replace the FE image set with a new image set
  Step 5: Train; go to Step 4
Step 6: Select the final model
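A compact sketch of Algorithm 1's accept/discard logic follows, with a scikit-learn classifier standing in for the fine-tuned VGG16 (the paper's implementation is Caffe-based); function and variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # stand-in for the fine-tuned CNN

def incremental_active_learning(X, y, unlabeled_batches, threshold=0.9):
    """Grow the labeled set mini batch by mini batch, keeping only
    confident auto-labels and discarding batches that hurt the previous
    training result. X, y are the seed labeled features and labels."""
    model = LogisticRegression(max_iter=1000).fit(X, y)
    best = model.score(X, y)                      # previous training result
    for batch in unlabeled_batches:
        probs = model.predict_proba(batch)
        conf = probs.max(axis=1) >= threshold     # drop low-score labels
        if not conf.any():                        # nothing reliable: try a new batch
            continue
        X_try = np.vstack([X, batch[conf]])       # combine with the previous dataset
        y_try = np.concatenate(
            [y, model.classes_[probs[conf].argmax(axis=1)]])
        score = model.fit(X_try, y_try).score(X_try, y_try)
        if score >= best:                         # improved: keep the enlarged set
            X, y, best = X_try, y_try, score
        else:                                     # worse: discard batch, refit on old data
            model.fit(X, y)
    return model, X, y
```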
4. Experimental results and analysis

4.1. Dataset overview

4.1.1. Extended Cohn-Kanade dataset
In our experiments, we used the Cohn-Kanade dataset. It has two versions, version 1 and version 2, and includes 65% female, 15% African-American, and 3% Asian or Latino facial expression images. We used version 2, known as CK+, for our experiments due to its sufficient labeled facial expression data and frontal face poses. The six common facial expressions in the CK+ dataset, anger, disgust, fear, happiness, sadness, and surprise, are considered. The first few images of each video clip, which show no expression, are considered neutral. We did not consider other expressions, such as contempt. In order to train the network models, we used 760 images for training and 180 for testing. Training samples before and after pre-processing are shown in Fig. 7.

Fig. 7. Cohn-Kanade dataset: the face portion before and after pre-processing.

4.1.2. ITLab facial expression datasets
Our dataset includes facial expressions mainly from East and South Asian people who are members of the Intelligent

Technology Lab (ITLab) at Inha University, Republic of Korea. Generally, seven emotions are considered: anger, disgust, fear, happiness, sadness, surprise, and neutral. We consider these facial expressions in five different atmospheres: good lighting, average lighting, close to the camera, far from the camera, and natural lighting. We collected 30 images from 30 sequential frames for each batch. For each expression, on average, there are 1050 images. For each environment, we gathered both test and training datasets. Fig. 6 shows the VGG16 network model used to train on the ITLab database to create the pre-trained model.

4.2. Facial expressions in different environments

Fig. 8 shows different facial expression images in wild environments: good lighting, average lighting, with the face close to the camera, with the face far from the camera, and in natural light.

Fig. 8. Samples of different ITLab facial expressions in a wild environment.
As outlined in Fig. 8, our approach takes advantage of an integrated learning framework and detects facial expressions under different lighting conditions. The input to our framework is a sequence of images. We consider a diverse and large amount of training data, which helps a new dataset be classified well in real-time detection. We thus trained our network for 50,000 iterations and set the learning rate to 0.001 with a batch size of 8. Fig. 9 shows the real-time online evaluation of facial expression recognition on different ITLab members.
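Assuming a PyTorch-style setup, the stated hyperparameters (50,000 iterations, learning rate 0.001, batch size 8) would correspond to a loop like the following; `model` and `train_dataset` are placeholders, and the optimizer choice is our assumption since the paper does not name one.

```python
import torch
from torch.utils.data import DataLoader

# `model` is the seven-class VGG16 from the earlier sketch and
# `train_dataset` is a placeholder for the ITLab training data.
loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)   # optimizer choice assumed
criterion = torch.nn.CrossEntropyLoss()

iteration = 0
while iteration < 50_000:              # 50,000 iterations, as stated above
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        iteration += 1
        if iteration == 50_000:
            break
```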
Our experiments use the publicly available VGG net, which has five convolutional blocks and three fully connected layers. The network is pre-trained on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC-2014) dataset (Deng, Berg, & Fei-Fei, 2010), which includes 1.2 million training images labeled into 1000 classes. For the detection system, we used a multi-task cascaded convolutional network (Zhang et al., 2016), which is a convolutional neural network–based face detector, together with an incremental active learning model (Lopes, de Aguiar, De Souza, & Oliveira-Santos, 2017) implemented on the popular Caffe deep learning library (Jia et al., 2014). All implementations were run on a single server with the Compute Unified Device Architecture (CUDA) deep neural network library (cuDNN) (Chetlur & Woolley, 2014) and a single NVIDIA GeForce GTX 970 graphics card running Ubuntu 14.04.
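The paper runs MTCNN inside Caffe; for illustration, the same detection step with an off-the-shelf MTCNN port (here the facenet-pytorch package) might look like this. The file name is hypothetical.

```python
from PIL import Image
from facenet_pytorch import MTCNN  # PyTorch MTCNN port; the paper used a Caffe implementation

detector = MTCNN(keep_all=True)            # keep every face in the frame
image = Image.open("frame.jpg")            # illustrative file name
boxes, probs = detector.detect(image)      # bounding boxes and confidences
if boxes is not None:
    for (x1, y1, x2, y2), p in zip(boxes, probs):
        print(f"face at ({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f}), confidence {p:.2f}")
```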
Table 1 shows various facial expression image accuracies in a wild environment. We can see that FE detection performance for the close-to-the-camera and natural-light environments is higher than for the other environments due to the less noisy conditions; the correctly classified facial expressions were 94 and 96 images, respectively. The face far from the camera had the worst FE detection rate compared to the other environments.

Table 2 shows the average facial expression recognition accuracy for sadness, neutral, disgust, happy, fear, angry, and surprise in the different environments. Our experiments show that natural light and average lighting conditions produce better results, with average FER accuracy at 81% and 74.83%, respectively.
Fig. 9. Example images from real-time online evaluation of facial expression recognition.

Table 1
Correctly classified facial expressions. (A higher result indicates better performance.)

Environment                    Avg. correctly classified    Avg. incorrectly classified
                               facial expression images     facial expression images
Good lighting conditions       91                           9
Average lighting conditions    92.5                         7.5
Close to the camera            94                           5.5
Far from the camera            90                           10
Natural light                  96                           3.5

Table 3 shows the online evaluation results. We compare our incremental active learning method against baseline and random facial expression recognition. Here, baseline selection is the initial data model without adding additional facial expression images, whereas for random selection, training data samples are selected randomly from the pool of facial expression images in order to avoid sample selection bias (Zadrozny, 2004). Without incremental active learning, the average facial expression recognition performance is only 63%, whereas with incremental active learning, performance is over 88%. When the random-sampling facial expression data model is considered, the average performance is only 72.4%. Quite remarkably, incremental active learning produces better performance than random and baseline selection because the finest samples are considered for training.

In Fig. 10, we show the performance of the baseline, random, and incremental active learning approaches. Each

group of bars shows the performance from facial expression recognition in online evaluation. Each group shows three bars, corresponding to baseline accuracy, random-sampling accuracy, and the proposed incremental active learning method's accuracy.

In our experiment, the active learning model starts from an environment with a high initial performance. If we start active learning from a high initial performance, supervised learning predicts incorrect labels with high confidence. Therefore, to minimize this effect, we gradually attempted to learn from the environment most similar to the data used to construct the initial learning model. Experimental results show that the convergence of the initial performance decreases as the environment changes, because the learning of various environmental data is automatic through active learning. Our method is most useful when capturing facial expressions under better lighting conditions in similar environments.

Table 2
Facial expression recognition in different environments. (A higher result indicates better performance.)

Environment                    Average FER recognition (%)
Good lighting conditions       71.19
Average lighting conditions    74.83
Close to the camera            73.30
Far from the camera            64.50
Natural light                  81.00

Table 3
Facial expression recognition accuracy in online evaluation.

Method                        Average        Neutral      Angry        Happy        Disgust      Fear         Sadness      Surprise
Baseline                      63.0 ± 0.2%    81 ± 0.2%    55 ± 0.3%    49 ± 0.3%    50 ± 0.2%    75.0 ± 0.1%  69 ± 0.2%    58 ± 0.2%
Random                        72.4 ± 0.2%    76 ± 0.1%    51 ± 0.2%    73 ± 0.2%    91 ± 0.1%    73.0 ± 0.2%  80.5 ± 0.1%  62.5 ± 0.2%
Incremental Active Learning   88.0 ± 0.03%   92 ± 0.02%   89 ± 0.01%   89 ± 0.02%   90 ± 0.04%   78.0 ± 0.1%  86 ± 0.03%   76 ± 0.02%

Fig. 10. Facial expression performance analysis comparing the proposed incremental active learning with the baseline and random data results from online evaluation.

4.2.1. Benchmark dataset comparison
A few popular deep learning approaches in facial expression recognition, such as Inception (Szegedy et al., 2016), VGG (Simonyan & Zisserman, 2014), and GoogLeNet (Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, & Rabinovich, 2015), are compared with ours. These models are trained with both CK+ (Kanade & Cohn, 2000) and ITLab face data. Table 4 shows a few representative comparison methods and their performance on the CK+ dataset. Our model trained with CK+ data outperforms the other methods.

Table 4
A few representative comparison methods and their performance on the Cohn-Kanade dataset.

Method                                      Accuracy on CK+ (%)
KNN (Shan, Guo, You, Lu, & Bie, 2017)       77.27
CNN (Shan et al., 2017)                     80.30
GM (Wu & Lin, 2018)                         86.83
GM + AFM (Wu & Lin, 2018)                   87.78
GM + W-AFM (Wu & Lin, 2018)                 88.25
GM + W-CR-AFM (Wu & Lin, 2018)              89.84
Ours                                        91.80

Table 5
Comparison results of state-of-the-art approaches and our method (accuracy).

Approach          Test set     Test set      Test set      Test set      Test set       Test set
                  (180, 0)     (180, 250)    (180, 500)    (180, 750)    (180, 1000)    (0, 1000)
VGG [ADAM]        0.912        0.415         0.300         0.225         0.222          0.215
VGG [RMS]         0.927        0.514         0.391         0.304         0.312          0.233
Inception [RMS]   0.916        0.514         0.380         0.300         0.301          0.234
Ours              0.91         0.76          80.77         82.22         82.75          72.96

In Table 5, "Test set (N, M)" means the test data combining N test images from the CK+ dataset and M test images from the ITLab facial expression dataset. The common incremental active learning parameters are the batch size, the total number of epochs, and the learning rate. Table 5 compares different state-of-the-art approaches using the ITLab facial expression dataset and CK+. It can be noticed that our method outperforms the other techniques on the mixed test data of the ITLab and CK+ datasets. Our method has its lowest accuracy for Test set (0, 1000) and obtains its best accuracy for Test set (180, 0), which is significantly higher than that of Wu and Lin (2018).
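A "Test set (N, M)" can be assembled as below; this sketch assumes the two test pools are lists of (image, label) pairs, and all names are illustrative.

```python
import random

def make_test_set(ck_pool, itlab_pool, n_ck, m_itlab, seed=0):
    """Build a 'Test set (N, M)' as described above: N samples drawn
    from the CK+ test pool and M from the ITLab test pool."""
    rng = random.Random(seed)              # fixed seed for reproducible draws
    return rng.sample(ck_pool, n_ck) + rng.sample(itlab_pool, m_itlab)

# e.g. test_180_250 = make_test_set(ck_test, itlab_test, 180, 250)
```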
5. Conclusion

In this paper, we propose an incremental active learning framework that can work with facial expressions in various wild environments. We successfully label the right candidate images using the active learning method. During rigorous experiments on the ITLab and CK+ datasets, several issues affected performance, such as a person's skin color, a camera too far from the face (more than 10 m in our case), and various lighting conditions. However, by adjusting the experimental order with respect to camera distance (less than 5 m), comparing against the initial performance, and excluding data from the experiment when the performance is not satisfactory, these issues can be addressed. The proposed system is expected to help considerably when a large amount of training data from various environments is required. Our future research direction is to find more flexible ways to deal with the learning sequence. In addition, by reliably reducing the number of erroneous labels that occur in the active learning process, minimizing the discarded learning data will maximize the framework's efficiency. Our experimental results show that the proposed method is applicable to different domains, including cognitive systems, surveillance systems, and security systems, where the environment is not friendly.

Acknowledgement

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (NRF-2016R1D1A1B03935440). The GPUs used in this research were generously donated by NVIDIA.

References

Burkert, P., Trier, F., Afzal, M. Z., Dengel, A., & Liwicki, M. (2015). DeXpression: Deep convolutional neural network for expression recognition, pp. 1–8. http://arxiv.org/abs/1509.05371.
Chetlur, S., & Woolley, C. (2014). cuDNN: Efficient primitives for deep learning. arXiv preprint, pp. 1–9. http://arxiv.org/abs/1410.0759.
Cohn, D. A., Ghahramani, Z., & Jordan, M. I. (1996). Active learning with statistical models. Journal of Artificial Intelligence Research, 4, 129–145. https://doi.org/10.1613/jair.295.
Deng, J., Berg, A. C., Li, K., & Fei-Fei, L. (2010). What does classifying more than 10,000 image categories tell us? In European Conference on Computer Vision (ECCV) (pp. 71–84).
Ekman, P. (1993). Facial expression and emotion. American Psychologist. https://doi.org/10.1037/0003-066X.48.4.384.
Fasel, B., & Luettin, J. (2003). Automatic facial expression analysis: A survey. Pattern Recognition, 36(1), 259–275. https://doi.org/10.1016/S0031-3203(02)00052-3.
Froba, B., & Ernst, A. (2004). Face detection with the modified census transform. In IEEE International Conference on Automatic Face and Gesture Recognition (FGR'04).
Gao, S., Zhang, Y., Jia, K., Lu, J., & Zhang, Y. (2015). Single sample face recognition via learning deep supervised auto-encoders. IEEE Transactions on Information Forensics and Security, 6013(c), 11. https://doi.org/10.1109/TIFS.2015.2446438.
Huang, D. S. (2004). A constructive approach for finding arbitrary roots of polynomials by neural networks. IEEE Transactions on Neural Networks, 15(2).
Huang, G., Liu, Z., van der Maaten, L., & Weinberger, K. Q. (2016). Densely connected convolutional networks. https://doi.org/10.1109/CVPR.2017.243.
Huynh, X., Tran, T., & Kim, Y. (2016). Information Science and Applications (ICISA), 2016(376), 441–442. https://doi.org/10.1007/978-981-10-0557-2.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., ... Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In ACM International Conference on Multimedia (pp. 675–678). https://doi.org/10.1145/2647868.2654889.
Kanade, T., & Cohn, J. F. (2000). Comprehensive database for facial expression analysis. In Proceedings Fourth IEEE International Conference on Automatic Face and Gesture Recognition (Cat. No. PR00580) (pp. 46–53). https://doi.org/10.1109/AFGR.2000.840611.
Kim, B., Lee, H., Roh, J., & Lee, S. (2015). Hierarchical committee of deep CNNs with exponentially-weighted decision fusion for static facial expression recognition. In Proceedings of the 2015 ACM International Conference on Multimodal Interaction (pp. 427–434). https://doi.org/10.1145/2818346.2830590.
Lawrence, S., Giles, C. L., Tsoi, A. C., & Back, A. D. (1997). Face recognition: A convolutional neural-network approach. IEEE Transactions on Neural Networks, 8(1), 98–113. https://doi.org/10.1109/72.554195.
Li, B., Zheng, C. H., & Huang, D. S. (2008). Locally linear discriminant embedding: An efficient method for face recognition. Pattern Recognition, 41, 3813–3821.
Li, H., Sun, J., & Xu, Z. (2017). Multimodal 2D + 3D facial expression recognition with deep fusion convolutional neural network. IEEE Transactions on Multimedia, 9210(c), 1–16. https://doi.org/10.1109/TMM.2017.2713408.
Liu, J. (2015). Targeting ultimate accuracy: Face recognition via deep embedding. CVPR, 4–7.
Lopes, A. T., de Aguiar, E., De Souza, A. F., & Oliveira-Santos, T. (2017). Facial expression recognition with convolutional neural networks: Coping with few data and the training sample order. Pattern Recognition, 61, 610–628. https://doi.org/10.1016/j.patcog.2016.07.026.
Mehrabian, A. (1968). Communication without words. Psychology Today, 2, 53–55.
Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: An application to face detection. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. ISSN: 1063-6919.
Pizer, S. M., Amburn, E. P., Austin, J. D., Cromartie, R., Geselowitz, A., Greer, T., ter Haar Romeny, B., Zimmerman, J. B., & Zuiderveld, K. (1987). Adaptive histogram equalization and its variations. Computer Vision, Graphics, and Image Processing, 39(3), 355–368.
Rifai, S., Bengio, Y., Courville, A., Vincent, P., & Mirza, M. (2012). Disentangling factors of variation for facial expression recognition. In Computer Vision – ECCV 2012, 7577 (pp. 808–822). https://doi.org/10.1007/978-3-642-33783-3_58.
Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 815–823). https://doi.org/10.1109/CVPR.2015.7298682.
Shan, K., Guo, J., You, W., Lu, D., & Bie, R. (2017). Automatic facial expression recognition based on a deep convolutional-neural-network structure. In IEEE SERA 2017.
Sharma, P. (2011). Feature based method for human facial emotion detection using optical flow based analysis. International Journal of Research in Computer Science, 1(1), 31–38. eISSN 2249-8265.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. pp. 1–14.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2818–2826).
Taheri, S., Patel, V. M., & Chellappa, R. (2013). Component-based recognition of faces and facial expressions. IEEE Transactions on Affective Computing, 4(4), 360–371. https://doi.org/10.1109/T-AFFC.2013.28.
Taigman, Y., Yang, M., Ranzato, M., & Wolf, L. (2014). DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 1701–1708). https://doi.org/10.1109/CVPR.2014.220.
Uddin, M. Z., Hassan, M. M., Almogren, A., Zuair, M., Fortino, G., & Torresen, J. (2017). A facial expression recognition system using robust face features from depth videos and deep learning. Computers and Electrical Engineering, 63, 114–125. https://doi.org/10.1016/j.compeleceng.2017.04.019.
Wang, H., Hu, J., & Deng, W. (2017). Compressing Fisher vector for robust face recognition. IEEE Access, 5, 23157–23165. https://doi.org/10.1109/ACCESS.2017.2749331.
Wu, B. F., & Lin, C. H. (2018). Adaptive feature mapping for customizing deep learning based facial expression recognition model. IEEE Access, 12451–12461.
Yu, Z. (2015). Image based static facial expression recognition with multiple deep network learning. In ACM International Conference on Multimodal Interaction – ICMI (pp. 435–442). https://doi.org/10.1145/2823327.2823341.
Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. In Twenty-first International Conference on Machine Learning – ICML (p. 114).
Zhang, K., Zhang, Z., Li, Z., & Qiao, Y. (2016). Joint face detection and alignment using multi-task cascaded convolutional networks. IEEE Signal Processing Letters, 1–5. https://doi.org/10.1109/LSP.2016.2603342.
Zhang, W., Zhao, X., Morvan, J. M., & Chen, L. (2018). Improving shadow suppression for illumination robust face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–14. https://doi.org/10.1109/TPAMI.2018.2803179.
Zhu, X., & Ramanan, D. (2012). Face detection, pose estimation, and landmark localization in the wild. In CVPR.
