
Informatics Research Review
Emotion recognition through facial expressions

Konstantinos Kanellis
January 19, 2012

Introduction

In the last two decades, many efforts have been made to improve Human-Computer Interaction (HCI). These efforts focus on making the interaction more natural, so that it resembles Human-Human Interaction. Humans use speech as the main way of communicating with each other, but they also use non-verbal signals such as gestures, body postures and facial expressions to emphasize a certain part of the speech and to display emotions. The most expressive way humans display emotions is through facial expressions [3, 23]. Studies have shown [38, 14] that there are six basic emotions that are universal and displayed cross-culturally through facial expressions: anger, disgust, fear, joy, sadness and surprise. Even though emotion recognition is a natural process for humans that happens effortlessly [15], for computers it is a difficult and computationally intensive process. The applications of emotion recognition systems are numerous and span almost every interaction between a human and a computer or robot. For instance, emotion recognition systems can be used in the game and entertainment industry; an example would be a more elaborate interactive process in games, closely resembling human-like interaction and communication. Other areas that could take advantage of emotion recognition systems are lie detection, psychiatric and neuropsychiatric studies, emotion-sensitive automatic tutoring systems, tour-guide robots, face image compression and synthetic face animation [24], and video surveillance and security systems [20, 12, 39, 22]. Due to the importance of facial expression recognition and the wide range of its applications, several efforts have been made. One of the pioneers was Suwa [42] in 1978, followed by Mase [30] in the early 1990s, who used optical flow for facial expression recognition. Continued efforts in the following 20 years produced further improvements in the field [44, 26, 40, 18, 37, 1, 29, 13, 32, 36], introducing novel approaches in the process. In this review, three different methods of emotion recognition through facial expressions will be analyzed and compared based on their results on the same database. Furthermore, this review will propose improvements to existing methods so that they become more suitable and precise for real-life situations.

Basic Structure of Facial Expression Analysis Systems

Facial expression analysis consists of three stages: face acquisition, feature extraction and representation, and facial expression recognition (Figure 1).

Figure 1: Basic structure of facial expression analysis systems [27]

Face acquisition is the stage where the face's position is detected, so that the face can be distinguished from the background and feature extraction becomes possible. The face acquisition stage is affected by the illumination and the pose of the face. Small variations in the illumination of the face do not affect face acquisition much; badly illuminated faces, on the other hand, are a significant problem [6]. Pose variations may distort the facial expression or even make it partially disappear [18]. Feature extraction and representation is the next stage, where the facial changes of the located face are extracted and represented by features. There are two approaches: geometric feature-based methods, which focus on the shape and location of facial components such as the eyes and mouth to extract features that describe the face geometry, and appearance-based methods, which apply image filters to the face or parts of the face to extract a feature vector [27]. The last stage is facial expression recognition. Expression recognition methods can be divided into two categories: frame-based recognition, which recognizes the expression in each frame separately (static classifiers, e.g. Bayesian networks, Neural Networks, Support Vector Machines), and sequence-based recognition, which uses information from a sequence of frames to recognize the expression (temporal classifiers, e.g. HMMs) [27, 1, 36, 37].
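To make the three stages concrete, the following is a minimal Python sketch of how such a pipeline could be wired together; the stage functions are illustrative placeholders, not implementations from any of the reviewed systems.

# Minimal sketch of the three-stage pipeline described above.
# The stage functions are placeholders for concrete techniques.

def acquire_face(frame):
    """Locate the face in a frame and return the cropped face region."""
    raise NotImplementedError  # e.g. a Viola-Jones style detector

def extract_features(face):
    """Represent the cropped face as a feature vector."""
    raise NotImplementedError  # e.g. Gabor responses, LBP histograms, motion units

def classify_expression(features):
    """Map the feature vector to one of the six basic emotions."""
    raise NotImplementedError  # e.g. SVM, Bayesian network, or HMM over a sequence

def recognize(frame):
    face = acquire_face(frame)
    features = extract_features(face)
    return classify_expression(features)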

Related studies

In this part, three emotion recognition methods will be examined. Each method introduces a different approach, but all use the same dataset. The first two are appearance-based methods, while the last one is a geometric feature-based method.

3.1 First Method

The first of the examined approaches is [5] by M. Bartlett, G. Littlewort, M. Frank et al. The method with the best results among the techniques they presented was AdaSVM, a combination of the Adaptive Boosting algorithm (AdaBoost) as a feature selection technique and an SVM classifier. First, the faces in the dataset are detected and resized to 48 x 48 pixels, with the distance between the eyes being approximately 24 pixels. For the feature extraction step, a bank of Gabor filters at 8 orientations and 9 spatial frequencies, producing 72 filters, is used [25, 28, 4]. A Gabor filter is a complex sinusoid modulated by a 2D Gaussian function; Gabor filters can be configured to extract a particular band of frequency features from an image. The features of an image are obtained by convolving the Gabor filters with the image, and the result is 8 x 9 x 48 x 48 = 165888 Gabor features. The number of features is reduced to 900 using the AdaBoost [4, 41, 46] feature selection algorithm, which selects the appropriate Gabor features for different image locations based on their importance to classification accuracy. AdaBoost is an iterative algorithm that treats the Gabor filters as weak classifiers. In every iteration, AdaBoost selects the classifier with the lowest weighted classification error through exhaustive search. The error is then used to update the weights so that wrongly classified samples get increased weights. The features are selected based on the features that have already been selected in previous iterations and on the goal of reducing the error of the previous filter. After feature extraction, a Support Vector Machine (SVM) classifier is used to classify the features. SVM is a supervised classifier that constructs a hyperplane to separate input data which are not linearly separable in the initial data space. In order to do so, SVM uses kernel functions to nonlinearly transform the initial data into a high-dimensional feature space, where the data can be optimally separated by a hyperplane. The dataset used with the AdaSVM classifier (the combination of AdaBoost and SVM) was the Cohn-Kanade dataset. Other classifiers were also tested: AdaBoost as a classifier, SVM without pre-processing of the features, and Linear Discriminant Analysis (LDA). The results in Table 1 show that the best classifier was AdaSVM, with a 93.3% recognition rate. Remarkably, the full expression recognition process runs in real time.
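As an illustration of this pipeline, the sketch below builds a Gabor filter bank with OpenCV, extracts the 165888 convolution features from a 48 x 48 face, and trains an SVM on a reduced feature set. The filter parameters and the univariate selector are assumptions for illustration only; the paper selects the 900 features with AdaBoost, which is not reproduced here.

# Sketch of the Gabor-bank + feature-selection + SVM pipeline of Section 3.1.
import numpy as np
import cv2
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

def gabor_bank(n_orientations=8, n_frequencies=9, ksize=15):
    """Build an 8-orientation x 9-frequency bank of Gabor kernels."""
    kernels = []
    for o in range(n_orientations):
        theta = o * np.pi / n_orientations
        for f in range(n_frequencies):
            lambd = 4.0 * (2 ** (f / 2.0))  # assumed half-octave wavelength spacing
            kernels.append(cv2.getGaborKernel((ksize, ksize), sigma=4.0, theta=theta,
                                              lambd=lambd, gamma=0.5, psi=0,
                                              ktype=cv2.CV_32F))
    return kernels

def gabor_features(face, kernels):
    """Convolve a 48 x 48 face with every kernel and flatten the responses
    (8 * 9 * 48 * 48 = 165888 features per image)."""
    face = face.astype(np.float32)
    responses = [cv2.filter2D(face, cv2.CV_32F, k) for k in kernels]
    return np.concatenate([np.abs(r).ravel() for r in responses])

# X: one row of Gabor features per face image, y: emotion labels
# selector = SelectKBest(f_classif, k=900).fit(X, y)   # stand-in for AdaBoost selection
# clf = SVC(kernel="linear").fit(selector.transform(X), y)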

Table 1: Leave-one-out generalization performance of AdaBoost, SVMs, AdaSVMs and LDA [5]

3.2 Second Method

The second of the examined expression recognition methods was introduced by C. Shan, S. Gong and P.W. McOwan [31]. It is based on the idea that facial images can be represented by micro-patterns, which can be described by Local Binary Patterns (LBP). LBP is a fast, low computational cost method that can extract facial features effectively even from low resolution images [33, 34]. The system takes a face image as input, so there is no need for face detection. Instead, the image is processed to scale the face to a desirable size based on the distance between the centres of the eyes, which should be 55 pixels; the width of the cropped face is roughly twice the distance between the eyes, and its height roughly three times that distance. In the end, the facial image is cropped to 110 x 150 pixels based on the position of the eyes. For the feature extraction step, the LBP method mentioned above is used. In the LBP method the image is converted to grayscale and divided into sub-regions. For each pixel of each sub-region a binary number is calculated. The binary number is the output of comparing a pixel's value with the values of some of the surrounding pixels: if we consider the pixel as the centre of a circle, the comparison is made with the pixels that lie on the perimeter of the circle. The basic LBP operator uses a small neighbourhood of 3 x 3 pixels, i.e. a circle of radius 1 pixel (Figure 2, left). The extended LBP operator allows any radius (R) and any number of pixels (P) in the neighbourhood and is denoted LBP_{P,R} [35] (Figure 2, right). A further extension of LBP uses only those binary numbers that contain at most two bitwise transitions between 0 and 1 when the pattern is considered circular. These LBPs are called uniform patterns and are used to reduce the number of labels; their usage is denoted LBP^{u2} [35]. For instance, 00000000, 00110000 and 11100001 are uniform patterns. In the end, a histogram of the uniform patterns is created for each of the sub-regions, and all the histograms are concatenated to produce a global histogram, which is the description of the face. In this implementation the images were divided into 42 regions (a 6 x 7 matrix of 18 x 21 pixel regions), which gives a good ratio of recognition performance to computational cost [2] (Figure 3, left). The operator used is LBP^{u2}_{8,2}, which has 59 labels.
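A minimal sketch of the basic 3 x 3 LBP operator and the concatenated per-region histogram descriptor is given below; it implements only the basic operator, not the extended uniform LBP^{u2}_{8,2} used in the paper, and the region grid is passed in as a parameter.

# Minimal sketch of the basic LBP operator and the per-region histogram descriptor.
import numpy as np

def lbp_basic(gray):
    """Compute the 8-bit LBP code of every interior pixel of a grayscale image."""
    g = gray.astype(np.int32)
    centre = g[1:-1, 1:-1]
    # 8 neighbours of the centre pixel, starting top-left and going clockwise
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(centre)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        codes |= (neighbour >= centre).astype(np.int32) << bit
    return codes

def lbp_descriptor(gray, rows=6, cols=7, bins=256):
    """Concatenate per-region LBP histograms into one global face descriptor."""
    codes = lbp_basic(gray)
    h, w = codes.shape
    hists = []
    for r in range(rows):
        for c in range(cols):
            region = codes[r * h // rows:(r + 1) * h // rows,
                           c * w // cols:(c + 1) * w // cols]
            hists.append(np.histogram(region, bins=bins, range=(0, bins))[0])
    return np.concatenate(hists)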

Figure 2: Left: The basic LBP operator [2]. Right: Two examples of the extended LBP operator [35]: a circular (8, 1) neighbourhood and a circular (12, 1.5) neighbourhood [31]

Figure 3: Left: A face image divided into 6 x 7 sub-regions. Right: The weights set for the weighted dissimilarity measure. Black squares indicate weight 0.0, dark gray 1.0, light gray 2.0 and white 4.0. [31]

For the classification of emotions, two different techniques are compared: Template Matching with a Nearest Neighbour classifier, and an SVM classifier. In template matching, during training the LBP global histograms of the images that belong to the same class are averaged to build a histogram template of that class. In testing, the LBP histogram of the input facial image is matched with the closest class template. For the matching, the chi-square statistic ($\chi^2$) with different weights for each face region is used (Figure 3, right), according to the importance of the information contained in each region. Facial expressions are expressed mostly by the eye and mouth areas, so the corresponding regions receive more weight. For the SVM classifier, the classification function for a set of labelled examples $T = \{(x_i, y_i)\}$, $i = 1, \ldots, l$, where $x_i \in R^n$ and $y_i \in \{-1, 1\}$, is given by:

$f(x) = \mathrm{sgn}\left( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \right)$

where $\alpha_i$ are Lagrange multipliers, $b$ is the parameter of the optimal hyperplane and $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$ is a kernel function. An SVM can decide between two classes; to accomplish multi-class classification, a cascade of binary classifiers combined with a voting scheme is used.
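A small sketch of the weighted chi-square template matching step is given below, assuming the per-class histogram templates and the region weights of Figure 3 are supplied by the caller.

# Sketch of weighted chi-square template matching over concatenated region histograms.
import numpy as np

def weighted_chi_square(hist_a, hist_b, region_weights, bins_per_region=59):
    """Chi-square distance between two concatenated region histograms,
    with each region's contribution scaled by its weight."""
    dist = 0.0
    for j, w in enumerate(region_weights):
        a = hist_a[j * bins_per_region:(j + 1) * bins_per_region].astype(float)
        b = hist_b[j * bins_per_region:(j + 1) * bins_per_region].astype(float)
        denom = a + b
        denom[denom == 0] = 1.0            # avoid division by zero on empty bins
        dist += w * np.sum((a - b) ** 2 / denom)
    return dist

def nearest_template(hist, class_templates, region_weights):
    """Return the class whose averaged template is closest to the input histogram."""
    return min(class_templates,
               key=lambda c: weighted_chi_square(hist, class_templates[c], region_weights))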

Table 2: Results of LBP with Template Matching and LBP with SVM [31]

Experiments were run on the Cohn-Kanade dataset for both classifiers. The best results were given by the combination of LBP with an SVM using a polynomial kernel function. The recognition results are shown in Table 2. Another experiment was run to evaluate LBP over different image resolutions, including very low ones (Table 3). The results reflect the effectiveness of LBP in real-world environments, where low-resolution video input is often all that is available.

Table 3: LBP-based algorithm results on various image resolutions. [31]

3.3 Third Method

A different approach to the problem was given by I. Cohen et al. [9]. This method uses Piecewise Bezier Volume Deformation (PBVD) as the face tracking method [43]. In the first frame of the image sequence, the eye corners and mouth corners are detected, and their positions are used as landmarks to fit a face model consisting of 16 surface patches embedded in Bezier volumes (Figure 4(a)). This resembles a wireframe wrapped around the face that can track the changes of facial features such as the eyebrows and mouth. To calculate the magnitude of feature motion in the 2D images, template matching between frames at different resolutions is used. These magnitudes are translated into 3D motion vectors called Motion Units (Figure 4(b)), which are used as features for the classification process. The Motion Units are similar to Ekman's Action Units [16].
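The frame-to-frame matching step can be sketched as follows for a single landmark, using OpenCV's normalized cross-correlation; the multi-resolution scheme and the mapping of 2D displacements onto the 3D Motion Units of the PBVD model are not reproduced here, and the patch and search-window sizes are illustrative.

# Sketch of frame-to-frame template matching for one facial landmark.
import cv2

def track_point(prev_gray, next_gray, point, patch=11, search=31):
    """Estimate the 2D displacement of a landmark between two grayscale frames.
    Assumes the landmark is far enough from the image border."""
    x, y = point
    hp, hs = patch // 2, search // 2
    template = prev_gray[y - hp:y + hp + 1, x - hp:x + hp + 1]
    window = next_gray[y - hs:y + hs + 1, x - hs:x + hs + 1]
    scores = cv2.matchTemplate(window, template, cv2.TM_CCOEFF_NORMED)
    _, _, _, best = cv2.minMaxLoc(scores)      # location of the best correlation
    dx = best[0] + hp - hs
    dy = best[1] + hp - hs
    return dx, dy                              # landmark motion in pixels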

Figure 4: (a) The wireframe model and (b) the facial motion measurements. [9]

The idea behind the selection of the classifier is to find a structure that takes into account the dependencies among the features. Tree-Augmented Naive Bayes (TAN) [21] is a classifier that partially fulfils this: the class node has no parent, and each feature has as parents the class node and at most one other feature (Figure 5).

Figure 5: An example of a TAN classifier. [9]

During the learning process the classifier has no fixed Bayesian network structure; instead, it tries to find the structure that maximizes the likelihood function given the training data. The method used to find the best TAN structure is a modified Chow-Liu algorithm [7, 21]. The algorithm (Figure 6) calculates the conditional probabilities of pairs of features given the class and, using these probabilities as weights, constructs a maximum weighted spanning tree. The spanning tree is built with Kruskal's algorithm [11] (Figure 7). Another interesting point is that the learning algorithm for the TAN [21] assumes discrete features, while the feature space of this method is continuous, which makes the computation of the pairwise probabilities complicated for a general feature distribution. This problem was solved by assuming the distribution of the features to be Gaussian, with the joint distribution given by:

$p(c, x_1, x_2, \ldots, x_n) = p(c) \prod_{i=1}^{n} p(x_i \mid \mathrm{pa}_{x_i}, c)$

They also used Gaussian Naive Bayes, Cauchy Naive Bayes and HMM classifiers. The best results on the Cohn-Kanade dataset were obtained with the Gaussian TAN. The HMM was tested only against their own dataset, for technical reasons. The tests were performed five times with the leave-one-out cross-validation method. The recognition rate for the Cohn-Kanade database, with a 95% confidence interval, is 73.22% ± 1.24. Table 4 displays the results of all the methods on the Cohn-Kanade dataset. The authors showed that frame-based (static) classifiers are easier to implement and train than dynamic classifiers (HMM), which operate on sequential data. On the other hand, they noted that static classifiers can be unreliable for video sequences because of the misclassification of frames that are not at the peak of the expression.

Figure 6: TAN learning algorithm. [9]

Figure 7: Kruskal's Maximum Weighted Spanning Tree algorithm. [9]
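A minimal sketch of the maximum weighted spanning tree step (Kruskal's algorithm) is given below; the edge weights would be the class-conditional pairwise quantities computed by the Chow-Liu step, which are assumed here to be precomputed.

# Sketch of Kruskal's algorithm for a maximum weighted spanning tree over features.
def max_weight_spanning_tree(n_features, edges):
    """edges: iterable of (i, j, weight); returns the tree as a list of (i, j) pairs."""
    parent = list(range(n_features))

    def find(v):                               # union-find with path compression
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for i, j, w in sorted(edges, key=lambda e: e[2], reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:                           # adding the edge creates no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Example with assumed pairwise weights (e.g. conditional mutual information):
# tree = max_weight_spanning_tree(4, [(0, 1, 0.9), (1, 2, 0.4), (0, 2, 0.7), (2, 3, 0.2)])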

Table 4: Recognition rates for the Cohn-Kanade database together with their 95% confidence intervals. [9]

Conclusion

In this review we examined three different approaches to facial expression recognition. These methods were trained and tested on the Cohn-Kanade dataset. The method with the best result is the combination of the AdaBoost feature selection technique with an SVM classifier (with a linear or RBF kernel function), with a recognition accuracy of 93.3% and real-time operation. In comparison, the combination of the LBP feature extraction technique with an SVM classifier (with a polynomial kernel function) is less accurate, with a recognition rate of 88.4%, but it performs decently on low-resolution images and also operates in real time. The TAN classifier had the worst recognition performance, at 73.22%.

4.1 Improvement suggestions

Real-life emotional facial expressions are difficult to gather because they are short-lived and greatly affected even by slight context-based changes [47]. Furthermore, labelling the data is a difficult, time-consuming and expensive process [47]. One solution is to create datasets with acted emotions. However, some of the 6 basic emotions are difficult to elicit in a lab environment, which may lead to wrong labelling of the data [8]. Moreover, the facial expressions of acted emotions differ in intensity, duration and order of occurrence from the facial expressions of the natural, spontaneous emotions that occur in daily-life situations [10, 17, 45]. Additionally, there are real-life situations where two or more expressions of natural spontaneous emotions may be blended or may occur in sequence without being clearly separated by a neutral expression. All of the aforementioned difficulties make the creation of datasets a very challenging task, which significantly affects the progress of emotion recognition research. The lack of datasets that fully comply with these requirements very often leads researchers to use their own datasets, which have no homogeneity among them. As a result, the direct comparison of results obtained on different test beds is impossible [19], so the effectiveness of the various methods cannot be evaluated objectively. Databases should be improved by including authentic spontaneous facial expressions and by recording the context of each situation in which an expression was captured, so that labelling can be accurate.

Most of the time, a facial expression by itself is not enough to fully interpret an emotion and more information is needed. Moreover, depending on the context, body gestures, voice and cultural dissimilarities, a facial expression can express intention, cognitive processes, physical effort or other interpersonal meanings [27]. A way to overcome this problem, and also to improve recognition rates, is to use additional inputs that provide more information about the expressed emotion wherever possible. For example, an additional input could be the voice of a person during a conversation, or hand gestures in situations where no voice is present. The combination of different modalities would resemble the way a human uses different senses simultaneously to recognize an expressed emotion [39, 47]. Another idea is to use the Facial Action Coding System (FACS) to classify facial actions prior to any interpretation attempts, instead of classifying facial expressions directly into basic emotion categories. FACS is the most widely accepted technique for measuring the facial muscle movements corresponding to different expressions. It is a framework for describing facial expressions and codes the 6 basic universal emotions as combinations of visually distinct facial muscular motions known as Action Units (AUs). FACS is also suitable for labelling datasets because the FACS AUs are objective descriptors, independent of interpretation. [16, 19]

4.2 Future work

The ideal facial expression analysis system must perform all stages of the process automatically and in real time, and analyze facial actions regardless of context, culture, gender, age and so on. Furthermore, in addition to the type of the expression, it should consider the intensity and dynamics of the facial actions.

References
[1] J.J.J. Lien. Automatic Recognition of Facial Expressions Using Hidden Markov Models and Estimation of Expression Intensity. PhD thesis, Washington University, 1998.
[2] T. Ahonen, A. Hadid, and M. Pietikäinen. Face recognition with local binary patterns. Computer Vision - ECCV 2004, pages 469-481, 2004.
[3] N. Ambady and R. Rosenthal. Thin slices of expressive behavior as predictors of interpersonal consequences: A meta-analysis. Psychological Bulletin, 111(2):256, 1992.
[4] M.S. Bartlett, G. Littlewort, I. Fasel, and J.R. Movellan. Real time face detection and facial expression recognition: Development and applications to human computer interaction. In Computer Vision and Pattern Recognition Workshop, 2003. CVPRW'03. Conference on, volume 5, page 53. IEEE, 2003.
[5] M.S. Bartlett, G. Littlewort, M. Frank, C. Lainscsek, I. Fasel, and J. Movellan. Recognizing facial expression: Machine learning and application to spontaneous behavior. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pages 568-573. IEEE, June 2005.
[6] P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 19(7):711-720, 1997.
[7] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. Information Theory, IEEE Transactions on, 14(3):462-467, 1968.
[8] J.A. Coan and J.J.B. Allen. Handbook of Emotion Elicitation and Assessment. Oxford University Press, USA, 2007.
[9] I. Cohen. Facial expression recognition from video sequences: temporal and static modeling. Computer Vision and Image Understanding, 91(1-2):160-187, August 2003.
[10] J.F. Cohn and K.S. Schmidt. The timing of facial motion in posed and spontaneous smiles. In Proceedings of the 2nd International Conference on Active Media Technology (ICMAT 2003), pages 57-72, 2003.
[11] T.H. Cormen. Introduction to Algorithms. The MIT Press, 2001.
[12] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J.G. Taylor. Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, 18(1):32-80, 2001.
[13] G. Donato, M.S. Bartlett, J.C. Hager, P. Ekman, and T.J. Sejnowski. Classifying facial actions. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 21(10):974-989, 1999.
[14] P. Ekman. Strong evidence for universals in facial expressions: a reply to Russell's mistaken critique. Psychological Bulletin, 115(2):268-287, 1994.
[15] P. Ekman and W.V. Friesen. The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica, 1(1):49-98, 1969.

[16] P. Ekman and W.V. Friesen. Investigator's Guide to the Facial Action Coding System. Palo Alto, 1978.
[17] P. Ekman and E.L. Rosenberg. What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS). Oxford University Press, USA, 1997.
[18] I.A. Essa and A.P. Pentland. Coding, analysis, interpretation, and recognition of facial expressions. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 19(7):757-763, 1997.
[19] B. Fasel and J. Luettin. Automatic facial expression analysis: a survey. Pattern Recognition, 36(1):259-275, January 2003.
[20] N. Fragopanagos and J.G. Taylor. Emotion recognition in human-computer interaction. Neural Networks, 18(4):389-405, May 2005.
[21] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning, 29(2):131-163, 1997.
[22] W. Hu, T. Tan, L. Wang, and S. Maybank. A survey on visual surveillance of object motion and behaviors. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, 34(3):334-352, 2004.
[23] D. Keltner and P. Ekman. Facial expression of emotion. In Handbook of Emotions.
[24] R. Koenen. MPEG-4 project overview. International Organisation for Standardisation, ISO/IEC JTC1/SC29/WG11, La Baule, 2000.
[25] M. Lades, J.C. Vorbruggen, J. Buhmann, J. Lange, C. von der Malsburg, R.P. Wurtz, and W. Konen. Distortion invariant object recognition in the dynamic link architecture. Computers, IEEE Transactions on, 42(3):300-311, 1993.
[26] A. Lanitis, C.J. Taylor, and T.F. Cootes. A unified approach to coding and interpreting face images. In Computer Vision, 1995. Proceedings., Fifth International Conference on, pages 368-373. IEEE, 1995.
[27] Stan Z. Li, Anil K. Jain, Ying-Li Tian, Takeo Kanade, and Jeffrey F. Cohn. Handbook of Face Recognition. Springer-Verlag, New York, 2005.

[28] G. Littlewort, M.S. Bartlett, I. Fasel, J. Susskind, and J. Movellan. Dynamics of facial expression extracted automatically from video. Image and Vision Computing, 24(6):615-625, 2006.
[29] A. Martinez. Face image retrieval using HMMs. In Content-Based Access of Image and Video Libraries, 1999 (CBAIVL'99), Proceedings, IEEE Workshop on, pages 35-39. IEEE, 1999.
[30] K. Mase. Recognition of facial expression from optical flow. Trans. IEICE, 74(10):3474-3483, 1991.
[31] C. Shan, S. Gong, and P.W. McOwan. Robust facial expression recognition using local binary patterns. In IEEE International Conference on Image Processing 2005, pages II-370. IEEE, 2005.
[32] A. Nefian and M. Hayes. Face recognition using an embedded HMM. In IEEE Conference on Audio and Video-based Biometric Person Authentication, pages 19-24, 1999.
[33] T. Ojala, M. Pietikäinen, and D. Harwood. Performance evaluation of texture measures with classification based on Kullback discrimination of distributions. In Pattern Recognition, 1994, Vol. 1 - Conference A: Computer Vision & Image Processing, Proceedings of the 12th IAPR International Conference on, volume 1, pages 582-585. IEEE, 1994.
[34] T. Ojala, M. Pietikäinen, and D. Harwood. A comparative study of texture measures with classification based on featured distributions. Pattern Recognition, 29(1):51-59, 1996.
[35] T. Ojala, M. Pietikäinen, and T. Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(7):971-987, 2002.
[36] N. Oliver, A. Pentland, and F. Bérard. LAFTER: A real-time face and lips tracker with facial expression recognition. Pattern Recognition, 33(8):1369-1382, 2000.
[37] T. Otsuka and J. Ohya. Recognizing multiple persons' facial expressions using HMM based on automatic extraction of significant frames from image sequences. In Image Processing, 1997, Proceedings, International Conference on, volume 2, pages 546-549. IEEE, 1997.

[38] P. Ekman. Universals and cultural differences in facial expressions of emotion, 1971.
[39] Maja Pantic and Leon J.M. Rothkrantz. Toward an affect-sensitive multimodal human-computer interaction. Proceedings of the IEEE, 91(9), 2003.
[40] M. Rosenblum, Y. Yacoob, and L.S. Davis. Human expression recognition from motion using a radial basis function network architecture. Neural Networks, IEEE Transactions on, 7(5):1121-1138, 1996.
[41] L. Shen and L. Bai. AdaBoost Gabor feature selection for classification. In Proc. of Image and Vision Computing New Zealand, pages 77-83. Citeseer, 2004.
[42] M. Suwa, N. Sugie, and K. Fujimora. A preliminary note on pattern recognition of human emotional expression. In International Joint Conference on Pattern Recognition, pages 408-410, 1978.
[43] H. Tao and T.S. Huang. Connected vibrations: a modal analysis approach for non-rigid motion tracking. In Computer Vision and Pattern Recognition, 1998, Proceedings, 1998 IEEE Computer Society Conference on, pages 735-740. IEEE, 1998.
[44] N. Ueki, S. Morishima, H. Yamada, and H. Harashima. Expression analysis/synthesis system based on emotion space constructed by multilayered neural network. Systems and Computers in Japan, 25(13):95-107, 1994.
[45] M.F. Valstar, H. Gunes, and M. Pantic. How to distinguish posed from spontaneous smiles using geometric features. In Proceedings of the 9th International Conference on Multimodal Interfaces, pages 38-45. ACM, 2007.
[46] P. Viola and M.J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137-154, 2004.
[47] Zhihong Zeng, Maja Pantic, Glenn I. Roisman, and Thomas S. Huang. A survey of affect recognition methods: audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1):39-58, January 2009.
