How Popular CNNs Perform in Real Applications of Face Recognition

Conference Paper · November 2016

DOI: 10.1109/TELFOR.2016.7818876



Pouya Ahmadvand, Reza Ebrahimpour, Payam Ahmadvand

Pouya Ahmadvand is with the Faculty of Computer Engineering, Shahid Rajaee Teacher Training University, Shabanlou, Lavizan, Tehran, Iran.
Reza Ebrahimpour is an Associate Professor at the Faculty of Computer Engineering, Shahid Rajaee Teacher Training University, Shabanlou, Lavizan, Tehran, Iran.
Payam Ahmadvand is with the School of Computer Science, Simon Fraser University, 8888 University Dr, Burnaby, BC V5A 1S6, Canada.

Abstract: In this paper we evaluate the performance of CNNs for face recognition in real-world applications. In recent years, many high-performance deep neural networks have been proposed for face recognition. These deep networks were trained on images collected from the internet, which are commonly of good quality and show facial expressions and postures that are not particularly complex. However, this is not the case in real-world applications; the images vary a lot and do not reflect ideal conditions. We collect and introduce a new dataset in which the images come from two different cameras and scenes. Well-known CNNs are trained on the dataset and then tested on its two collections. The results reveal that, although the performance of all of the CNNs drops dramatically, VGG-Face can still perform acceptably despite image degradations.

Keywords: CNN, Face Recognition, image degradations, deep learning.

I. INTRODUCTION

Face recognition is one of the most challenging and attractive problems in computer vision, and it has become a popular research area because of its impact on applications such as surveillance, security systems, and biometrics. Face recognition algorithms have been introduced in order to answer the question: given any image that has a face in it, who is this person? While we can easily recognize our friends and family members, it has been very challenging for a machine to do this [1].

With the introduction of deep learning techniques [2], the accuracy of face recognition has skyrocketed. Since then, CNN-based deep learning systems [2] have continuously improved their performance by using more and more training data. Because further improvement can be achieved by leveraging more training data, most of the recent advances have been restricted to Internet giants such as Google and Facebook [3, 4].

The state of the art has achieved close to 100% recognition rates by applying convolutional neural networks (CNNs) [5], as these networks are more robust to image variation than other methods [6]. However, in an uncontrolled environment with illumination changes, extreme poses, and variations in facial expression, the face recognition problem is far from being solved [7]. In addition, the datasets used in the successful works are much larger than any realistic dataset would be for practical applications: roughly three orders of magnitude larger, which is beyond the capabilities of most research groups.

This paper has two goals. The first is to introduce the Iranian face dataset, which has a reasonable size for real applications. The data consists of two sets of pre-processed face images captured by two different cameras in different scenes. The second goal is to explore the performance of well-known CNN models for face recognition on the provided dataset. We evaluate the recent works on face recognition and assess their performance on real applications with a high variety of images.

In section II, a new face dataset is introduced. The experimental setup for training CNNs is presented in section III. Finally, section IV is dedicated to the experiments and results.

II. DATASET COLLECTION

In this section, we introduce the dataset with which we propose to evaluate the performance of the CNNs. We provide two collections captured by two different cameras in two different scenes. The first collection covers 268 unique identities, drawn from 100,000 captured images, and the second collection covers 59 unique identities.

For the first collection, a fixed IP camera was placed in an indoor scene, and Full HD resolution images were captured. As a preprocessing step, the method of Viola and Jones [8] was used to detect and crop the faces so that they are aligned to the center of the image. 100,000 images were captured in four weeks, and after omitting the small and low-quality pictures, the collection ended up with 24,000 pictures of 268 different Iranians. Fig. 1 shows the collected images, in which people exhibit various facial expressions and postures (e.g. talking on the phone, eating, drinking, etc.). In the first collection all of the women wore a veil.

The second collection was captured by a Y50-70 notebook equipped with a 720p HD webcam. Subjects were asked to sit in front of the notebook camera and rotate their head slowly left-right and up-down; then, the method of Viola and Jones [8] was employed to crop the faces. This collection contains 10,000 pictures of 59 Iranians, including males and females of different ages; some of the women wore a veil and some did not.
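The cropping and filtering step described for the two collections can be illustrated with a minimal sketch. It assumes face boxes have already been produced by a detector (the Viola-Jones cascade [8] itself is not reproduced here). The `Box` type, the function names, and the `MIN_FACE_SIDE` threshold are hypothetical; the paper does not state what size counted as "small".

```python
from typing import List, NamedTuple, Optional, Tuple


class Box(NamedTuple):
    """A detected face in pixel coordinates (hypothetical helper type)."""
    x: int
    y: int
    w: int
    h: int


# Assumed threshold: the paper only says that small pictures were omitted.
MIN_FACE_SIDE = 64


def centered_square_crop(
    box: Box, img_w: int, img_h: int
) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) of a square crop centred on the face.

    Returns None when the detection is smaller than MIN_FACE_SIDE.
    The crop is shifted inward if it would otherwise leave the frame.
    """
    side = max(box.w, box.h)
    if side < MIN_FACE_SIDE:
        return None
    side = min(side, img_w, img_h)
    cx = box.x + box.w // 2
    cy = box.y + box.h // 2
    left = min(max(cx - side // 2, 0), img_w - side)
    top = min(max(cy - side // 2, 0), img_h - side)
    return (left, top, left + side, top + side)


def filter_and_crop(
    boxes: List[Box], img_w: int, img_h: int
) -> List[Tuple[int, int, int, int]]:
    """Keep only detections that are large enough and return their crops."""
    crops = (centered_square_crop(b, img_w, img_h) for b in boxes)
    return [c for c in crops if c is not None]
```

Using a square crop keeps the aspect ratio fixed, so a later resize to a square network input does not distort the face. This is a design choice of the sketch, not something the paper specifies.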
The first and second collections have 16 identities in common. Fig. 2 shows 64 of the 268 identities of the first collection, including the 16 people who appear in both collections. The dataset is available online at

Fig. 1. Example images from the first dataset, showing challenging facial expressions and postures.

Fig. 2. A snapshot of the two collections. (a) 64 unique identities from the first collection. (b) 59 unique identities from the second collection. Note that the identities common to the two collections have a green border.

CNNs have recently achieved top performance in many classification tasks. This section describes the architectures of the well-known CNNs used in the experiments.

A. CNNs Architecture

Each CNN has its own architecture, with a different number of layers, nodes, activation functions, training images, etc.

1) AlexNet
In terms of architecture, AlexNet [9] is similar to LeNet [10] (which was the first successful architecture); however, AlexNet is deeper and employs more layers. This CNN has 60 million parameters and 650,000 neurons. It consists of 5 convolutional layers ("volumes") and 3 fully connected layers connected to the classifier. AlexNet was trained on 1.2M images from 1000 categories of the ImageNet [11] dataset. It is worth mentioning that AlexNet won the ILSVRC 2012 competition (its top-1 and top-5 error rates on ILSVRC 2010 were 37.5% and 17.0%) [9].

Fig. 3. AlexNet architecture, which is composed of 5 volumes, 3 fully connected layers and the final

2) Places205-AlexNet
The architecture of this network is similar to AlexNet, but the network was trained on 2.5M images from 205 scene categories of the Places Database [13].

3) Hybrid-CNN
The architecture of this CNN is similar to AlexNet, but it was trained on 3.6M images from 1183 categories (205 scene categories from the Places Database and 978 object categories from the training data of ILSVRC 2012 (ImageNet)) [13].

4) VGG Network
This network has a different architecture (13, 16, and 19 layers). VGG [14] ranked first and second in the localization and classification tasks, respectively, of the ILSVRC 2014 competition. The network was subsequently improved further to achieve better classification accuracy on ImageNet. We used the VGG-16-layer and VGG-19-layer networks in our experiments [14].

5) VGG-Face
This network was introduced after the VGG network described above. It uses the VGG-16-layer architecture, but was trained on 2.6M images of 2622 unique identities. It has achieved results comparable to the state-of-the-art networks while using fewer images [15].

B. Training

Before training, histogram equalization [16] was applied to all images in order to enhance contrast, and the images were resized to 255x255 pixels. The last layer (the classifier layer) of each pre-trained network was trained by backpropagation for 3000 epochs with a learning rate of 0.001 (Fig. 4). The popular open-source deep learning framework Caffe was used to build the classifier on an NVIDIA GTX 980 Ti GPU with 6 GB of onboard memory, with the training accelerated by the cuDNN library.
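The contrast-enhancement step can be sketched in a few lines. The paper cites adaptive histogram equalization [16]; the sketch below swaps in the simpler global variant, so it illustrates the idea rather than the authors' exact preprocessing. The function name and the list-of-rows image representation are assumptions made for a dependency-free example.

```python
from typing import List


def equalize_histogram(image: List[List[int]], levels: int = 256) -> List[List[int]]:
    """Global histogram equalization of a grayscale image given as rows of ints.

    Note: this is the plain global method, not the adaptive variant of [16].
    """
    # Count how often each gray level occurs.
    hist = [0] * levels
    for row in image:
        for value in row:
            hist[value] += 1

    # Cumulative distribution of gray levels.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    total = running
    cdf_min = next((c for c in cdf if c > 0), 0)
    if total == cdf_min:
        # Empty or constant image: nothing to stretch.
        return [row[:] for row in image]

    # Map each level so that the output levels spread over [0, levels - 1].
    scale = (levels - 1) / (total - cdf_min)
    lut = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [[lut[value] for value in row] for row in image]
```

The lookup table is built once per image, so the per-pixel work is a single list index.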

Fig. 4. Training the VGG-Face network by replacing the last layer (fc8), which has 2622 dimensions, with a new layer (my_fc8) whose dimension equals the number of unique identities (268) [15].
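Training only a replacement last layer amounts to fitting a softmax classifier on features produced by the frozen layers beneath it. The sketch below shows that idea with plain full-batch gradient descent on toy feature vectors. It is not the authors' Caffe configuration: the function names and the toy features are hypothetical, and only the default epoch count and learning rate are taken from the text.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def layer_scores(weights, biases, x):
    """Linear layer: one score per class."""
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
            for w, b in zip(weights, biases)]


def train_last_layer(
    features: List[List[float]],
    labels: List[int],
    num_classes: int,
    epochs: int = 3000,           # value reported in the paper
    learning_rate: float = 0.001  # value reported in the paper
) -> Tuple[List[List[float]], List[float], List[float]]:
    """Fit a new classifier layer on frozen features (stand-ins for fc7 outputs).

    Returns (weights, biases, mean cross-entropy loss per epoch).
    """
    dim = len(features[0])
    n = len(features)
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    losses = []
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        loss = 0.0
        for x, y in zip(features, labels):
            probs = softmax(layer_scores(weights, biases, x))
            loss -= math.log(probs[y])
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == y else 0.0)
                grad_b[c] += err
                for j in range(dim):
                    grad_w[c][j] += err * x[j]
        # Gradient step on the new layer only; earlier layers stay untouched.
        for c in range(num_classes):
            biases[c] -= learning_rate * grad_b[c] / n
            for j in range(dim):
                weights[c][j] -= learning_rate * grad_w[c][j] / n
        losses.append(loss / n)
    return weights, biases, losses


def predict(weights, biases, x) -> int:
    """Index of the highest-scoring class."""
    scores = layer_scores(weights, biases, x)
    return max(range(len(scores)), key=scores.__getitem__)
```

In the setting of Fig. 4, `num_classes` would be 268 and each feature vector would be the output of the layer feeding fc8. Here two-dimensional toy vectors are enough to exercise the loop.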
IV. EXPERIMENTS AND RESULTS

This section evaluates the performance of the trained CNNs, described in section III.A, in two experiments.

A. The first experiment

The first experiment assesses the robustness of the models with regard to the number of classes. For this purpose, we trained and tested our networks with a varying number of unique identities (45, 90, 135, 180, 225 and 268). From each identity, 14 samples were used to train the networks, and between 4 and 10 images were selected for testing. Fig. 5 shows some training images from the first collection.

Fig. 5. Example images of 6 different identities from the first collection, used for training the networks.

Although all networks converged, the number of epochs needed increased as the number of identities increased. As can be inferred from Fig. 7, VGG-Face converged faster and was more accurate than the other networks during training. VGG-Face had the best accuracy (98.52%) and Hybrid-CNN had the worst accuracy (91.75%) for the 268 identities after 3000 epochs.

Fig. 7. Accuracy of different CNNs on the first dataset. (a) AlexNet (b) VGG-16 (c) Hybrid-CNN (d) VGG-19 (e) Places-CNN (f) VGG-Face

As the two collections have 16 identities in common, training on the first collection and testing on the other was an interesting opportunity to check the robustness of the models to image variation. Therefore, between 20 and 50 images for each of the 16 common identities from the second collection were tested on the networks trained on the first collection.

When the networks were tested on the new images from the other collection, the accuracy of all of the models dropped suddenly (Fig. 6). Adding more identities during training decreased the accuracy further, with VGG-Face having the best (58.62%) and Hybrid-CNN the worst (0.97%) performance for the 268 identities.

Fig. 6. Accuracy of the models trained on the first collection with different numbers of unique identities (45, 90, 135, 180, 225 and 268) on the test samples from the second collection.

B. The second experiment

In this experiment we trained the networks on the second collection, which consists of 59 unique identities. The networks were trained on between 14 and 34 samples from each identity in order to evaluate the effect of the number of samples per identity on the training. As in the first experiment, 3000 epochs were enough for all of the networks to converge.

In order to compare the accuracy of the trained models, the networks were tested on samples from both collections. In the first stage, the trained networks were tested on the same collection using 10 images per identity. In the second stage, the test samples came from the second collection, with 20 to 50 images for each of the 16 common identities.

Table 1.
Model         Trained on 1st
              10 Samples                  34 Samples
              Test on 1st   Test on 2nd   Test on 2nd   Test on 1st
AlexNet       99.83         9.59          100           10.37
Places205     99.15         4.94          99.15         6.72
Hybrid-CNN    99.49         7.33          99.49         8.70
VGG-16        99.66         12.52         100           13.25
VGG-19        99.83         14.48         100           15.73
VGG-Face      99.83         56.04         100           57.99

The results of the two experiments are summarized in Table 1. The performance of all of the networks dropped considerably when they were tested on new images from the other collection. VGG-Face outperformed the other networks by far, with roughly 42 percentage points higher accuracy than the next-best network (VGG-19).

V. CONCLUSION

In this paper, we introduced a standard dataset for assessing the performance of face recognition algorithms in practical applications. The proposed dataset enables researchers to evaluate their methods under various image degradations. In addition, we evaluated the top-performing CNNs on the proposed dataset. Experimental results show that although the accuracy remains high as the number of classes increases, the performance of all CNNs decreases considerably when the test samples come from a new dataset. According to the results, the trained VGG-Face model by far outperforms the other models in the experiments, and it is the most robust method under various image degradations.
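The size of the VGG-Face lead can be re-derived from the Table 1 figures. The sketch below copies the two low-accuracy columns of the table (the second and fourth numeric columns, i.e. the cross-collection tests) and computes the gap to the runner-up. The dictionary and function names are hypothetical.

```python
from typing import Dict, Tuple

# Second and fourth numeric columns of Table 1 (accuracy in percent).
CROSS_COLLECTION_ACCURACY: Dict[str, Tuple[float, float]] = {
    "AlexNet": (9.59, 10.37),
    "Places205": (4.94, 6.72),
    "Hybrid-CNN": (7.33, 8.70),
    "VGG-16": (12.52, 13.25),
    "VGG-19": (14.48, 15.73),
    "VGG-Face": (56.04, 57.99),
}


def margin_over_runner_up(
    scores: Dict[str, Tuple[float, float]], column: int
) -> Tuple[str, str, float]:
    """Return (best model, runner-up, gap in percentage points) for one column."""
    ranked = sorted(scores.items(), key=lambda item: item[1][column], reverse=True)
    (best, best_acc), (second, second_acc) = ranked[0], ranked[1]
    return best, second, round(best_acc[column] - second_acc[column], 2)


if __name__ == "__main__":
    for col in (0, 1):
        print(margin_over_runner_up(CROSS_COLLECTION_ACCURACY, col))
```

Both columns give a gap of about 42 percentage points between VGG-Face and VGG-19, which matches the margin quoted in the results.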


REFERENCES

[1] A. Farzmahdi, K. Rajaei, M. Ghodrati, R. Ebrahimpour, and S.-M. Khaligh-Razavi, "A specialized face-processing model inspired by the organization of monkey face patches explains several face-specific phenomena observed in humans," Scientific Reports, vol. 6,
[2] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436-444, 2015.
[3] F. Schroff, D. Kalenichenko, and J. Philbin, "Facenet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815-823.
[4] Y. Taigman, M. Yang, M. A. Ranzato, and L. Wolf, "Deepface: Closing the gap to human-level performance in face verification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1701-1708.
[5] C. Ding and D. Tao, "Robust face recognition via multimodal deep face representation," IEEE Transactions on Multimedia, vol. 17, pp. 2049-2058, 2015.
[6] M. Ghodrati, A. Farzmahdi, K. Rajaei, R. Ebrahimpour, and S.-M. Khaligh-Razavi, "Feedforward object-vision models only tolerate small image variations compared to human," Frontiers in Computational Neuroscience, vol. 8, p. 74, 2014.
[7] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, et al., "Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1931-1939.
[8] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), 2001, pp. I-511-I-518, vol. 1.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp.
[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, pp. 2278-2324, 1998.
[11] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, et al., "Imagenet large scale visual recognition challenge," International Journal of Computer Vision, pp. 1-42, 2014.
[12] F. Hu, G.-S. Xia, J. Hu, and L. Zhang, "Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery," Remote Sensing, vol. 7, pp. 14680-14707, 2015.
[13] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, "Learning deep features for scene recognition using places database," in Advances in Neural Information Processing Systems, 2014, pp. 487-495.
[14] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[15] O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," Proceedings of the British Machine Vision, vol. 1, p. 6, 2015.
[16] S. M. Pizer, E. P. Amburn, J. D. Austin, R. Cromartie, A. Geselowitz, T. Greer, et al., "Adaptive histogram equalization and its variations," Computer Vision, Graphics, and Image Processing, vol. 39, pp. 355-368, 1987.
