
Recognizing Facial Expressions in the Intelligent Space

Leon Palafox, Laszlo A. Jeni, Hideki Hashimoto
Institute of Industrial Science, The University of Tokyo
{leon, laszlo}@hlab.iis.u-tokyo.ac.jp

B.H. Lee
School of Electrical Engineering, Seoul National University

Abstract: In human-human communication we use verbal, vocal and non-verbal signals to communicate with others. Facial expressions are a form of non-verbal communication, and recognizing them helps to improve human-machine interaction. This paper proposes an approach for recognizing facial expressions from camera images. The proposed system tracks the human face using high-speed cameras, extracts a pose-normalized image of the face and learns a dynamic model of the facial expressions.

1 INTRODUCTION
Nowadays the need for a personal relationship between humans and artificial systems is growing. Human-human communication happens on several levels. When important information has to be communicated, we feel that personal presence is obligatory, because many communication channels beyond speech are in operation.
Much previous research has analyzed how humans interact with their environment. An example is 4W1H (Where, When, Who, What and How), which decomposes any interaction between a human and the environment into five simple variables that provide enough information to assess the current situation of the human [8].
During the evolution of the human race several communication methods have developed, along with communication channels. These can be categorized into two main groups: verbal and non-verbal communication channels. Verbal communication can be captured easily and transferred to another medium, so it came into existence early in the human-machine relationship.
There is more than one way to communicate between humans (meta-communication), such as expressions, gestures and postures. Nowadays the need for this kind of personal, non-verbal communication between humans and artificial tools is growing as well.
Different messages require different communication
channels. We use verbal, vocal and non-verbal signals to
describe our emotional state. Facial expressions are a form of
non-verbal communication; they are an outward reflection of a person's emotional condition. Recognizing these expressions helps us to estimate the emotional state of a person.
We propose a system with the following two main goals:
- The system is able to track human faces using cameras.
- It is able to learn a dynamic model of the facial expressions.
An overview of the proposed system can be seen in Figure 1.
This paper is organized as follows: in Section 2 we introduce the concept and creation of the Intelligent Space, as well as the importance of recognizing emotions within this space. Afterwards, in Section 3, we discuss facial expressions and how their study can provide insight into the human condition. Finally, we present an overview of the learning and capture system, together with some preliminary results.

Figure 1. Overview of the System

2 THE INTELLIGENT SPACE
2.1 Concept
The Intelligent Space (iSpace) is a distributed sensor network in a limited area, where the monitoring is realized through Distributed Intelligent Network Devices (DINDs) [1].
Figure 2. The iSpace concept.
Figure 3. The iSpace collects various information about the users.

A DIND has a sensing function through devices such as cameras and microphones, which are networked to process the information in the Intelligent Space. The DINDs monitor the space, gather data and distribute it through the network [2]. The iSpace consists not only of sensors, cameras and robots, but also of humans.
The iSpace is a system for supporting the people in it, and the events which happen in it are understood. However, to support people physically, the Intelligent Space needs robots to handle real objects. Mobile robots become physical agents of the Intelligent Space and execute tasks in the physical domain to support people in the space. Moreover, robots can understand requests (e.g. gestures) from people more effectively. The applicable tasks include the movement of objects and providing help, for example to aged or disabled persons.

2.2 The 4W1H System
In iSpace the distributed sensors collect different information about the users and the objects used by them. The currently implemented architecture is able to recognize people (Who information), localize them (Where information), and classify which objects they are using and how they are interacting with them in the space (When, What and How information). We refer to this system as the 4W1H architecture [8].
The relation between a human and an object is described by observing the use history of the object. The use history is observed by focusing on the object's movement, which occurs when it is used by the human. The name, size, color, shape etc. of the object are information given beforehand.
On the other hand, there is information which arises only after a person uses certain elements, such as the use history or the movement history. Such information is vast, and considering the cost it is not realistic to describe the use history information of the wide range of objects that exist in the space. Therefore, it is necessary that the object's information is written automatically, without human intervention, when the object is used by a person [3].
The current problem is human-human interaction recognition and the fifth "W": Why is the user doing something, and how is the user being affected by the current interaction with the object?

3 FACIAL EXPRESSIONS
Facial expressions result from one or more motions or positions of the muscles of the face. These movements convey the emotional state of the individual to observers. Facial expressions are a form of nonverbal communication. They are a primary means of conveying social information among humans, but they also occur in other mammals as well as some other animal species. Facial expressions and their significance to the perceiver can, to some extent, vary between cultures.
The focus of our interest is facial macro-expressions. We display these facial expressions in our daily interactions with other people all of the time, when we do not want to conceal our emotions. Usually they last from half a second to 4 seconds.
There are seven universal facial expressions [4], which are present in every culture. These are anger, contempt, disgust, surprise, fear, happiness and sadness.

Figure 4. Action Units of the upper region.
To describe these emotions, Ekman et al. proposed an anatomically oriented coding scheme, the Facial Action Coding System (FACS) [4]. This system is based on the definition of action units (AUs) of a face that cause facial movements. Each action unit may correspond to several muscles that together generate a certain facial action.
As some muscles give rise to more than one action unit, the correspondence between action units and muscle units is only approximate. 46 AUs were considered responsible for expression control and 12 for gaze direction and orientation.

4 OVERVIEW OF THE SYSTEM
The system consists of two main parts: a face tracking unit to extract the human faces and a learning system to learn and recognize the facial expressions.

4.1 Face Tracking Unit
For the face tracking and extraction we used the Dragonfly2 CCD Camera from Point Grey [5]. This is an OEM-style board-level camera designed for imaging product development. It offers a 30 FPS frame rate, auto-iris lens control and on-board color processing (Figure 5).

Figure 5. Dragonfly2 CCD Camera.

In our research we used frontal images of the subjects; however, their distance from the camera varies, and therefore so does the size of the facial area. Furthermore, some rigid head motion sometimes occurs during tracking. This appears as rotation in the images, as well as some offsets, that need a solution.
To deal with these variations and to project the face to a fixed region, we can use a face tracking software called faceAPI [6]. This software provides six degree-of-freedom tracking information by tracking the position and rotation of the head in X, Y and Z, relative to the camera.
Furthermore, faceAPI outputs the pose-normalized image of the face upon commencement of tracking, allowing the system to perform a planar track of the face disregarding its rotation and position relative to the camera. FaceAPI is able to track a slightly rotated face and map it as a planar 2D image that is easy to process further in the recognition stage.

4.2 Learning System
The learning system provides a dynamic model of the human face and classifies the input image into AU codes.
Let us assume that we have d simple computational units called 'neurons' in a recurrent neural network:

r(t+1) = f( Σ_{i=0}^{I} F_i r(t−i) + Σ_{j=0}^{J} B_j u(t+1−j) + e(t) )    (1)

where e(t), the driving noise of the RNN, denotes temporally independent and identically distributed (i.i.d.) stochastic variables with P(e(t)) = N(0, V), and r(t) ∈ R^d represents the observed activities of the neurons at time t. Let u(t) ∈ R^c denote the control signal at time t. The neural network is formed by the weighted delays represented by the matrices F_i (i = 0, ..., I) and B_j (j = 0, ..., J), which connect the neurons to each other and the control components to the neurons, respectively.
Control can also be seen as the means of interrogation, or the stimulus to the network [7]. We assume that the function f : R^d → R^d in (1) is known and invertible.
The computational units, the neurons, sum up weighted previous neural activities as well as weighted control inputs. These sums are then passed through identical non-linearities according to (1). The goal is to estimate the parameters F_i (i = 0, ..., I), B_j (j = 0, ..., J) and the covariance matrix V, as well as the driving noise e(t), by means of the control signals. We introduce the following notations:

x_{t+1} = [r_{t−I}; ...; r_t; u_{t−J+1}; ...; u_{t+1}],
y_{t+1} = f^{−1}(r_{t+1}),
A = [F_I, ..., F_0, B_J, ..., B_0] ∈ R^{d×m}.

Using these notations, the original model (1) reduces to a linear equation:

y_t = A x_t + e_t.    (2)

The InfoMax learning relies on Bayes' method for the online estimation of the unknown quantities (the parameter matrix A, the noise e(t) and its covariance matrix V). One iteration consists of the steps summarized below.

Optimal control calculation:
u_{t+1} = argmax_{u ∈ U} x̂_{t+1}^T K_t^{−1} x̂_{t+1},
where x̂_{t+1} = [r_{t−I}; ...; r_t; u_{t−J+1}; ...; u_t; u] and x_{t+1} = [r_{t−I}; ...; r_t; u_{t−J+1}; ...; u_t; u_{t+1}].

Observation:
observe r_{t+1}, and let y_{t+1} = f^{−1}(r_{t+1}).

Bayesian update:
M_{t+1} = (M_t K_t + y_{t+1} x_{t+1}^T)(x_{t+1} x_{t+1}^T + K_t)^{−1},
K_{t+1} = x_{t+1} x_{t+1}^T + K_t,
n_{t+1} = n_t + 1,
γ_{t+1} = 1 − x_{t+1}^T (x_{t+1} x_{t+1}^T + K_t)^{−1} x_{t+1},
Q_{t+1} = Q_t + (y_{t+1} − M_t x_{t+1}) γ_{t+1} (y_{t+1} − M_t x_{t+1})^T.

Figure 6. Steps of the InfoMax update.
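To make the update concrete, the following is a minimal NumPy sketch of one InfoMax iteration, assuming a finite candidate set U of control vectors and taking f to be the identity (so y_{t+1} = r_{t+1}). The variable names mirror Figure 6; the initialization, dimensions and candidate set are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def select_control(K, r_hist, u_hist, candidates):
    """Optimal control calculation: pick u in U maximizing x_hat^T K^{-1} x_hat."""
    K_inv = np.linalg.inv(K)
    xs = [np.concatenate(r_hist + u_hist + [u]) for u in candidates]
    scores = [x @ K_inv @ x for x in xs]
    best = int(np.argmax(scores))
    return candidates[best], xs[best]            # chosen u_{t+1} and the resulting x_{t+1}

def infomax_update(M, K, Q, n, x, y):
    """Bayesian update step of Figure 6 for the linear model y = A x + e."""
    S = np.outer(x, x) + K                       # x_{t+1} x_{t+1}^T + K_t
    S_inv = np.linalg.inv(S)
    M_new = (M @ K + np.outer(y, x)) @ S_inv     # posterior mean of the parameter matrix A
    gamma = 1.0 - x @ S_inv @ x
    resid = y - M @ x
    Q_new = Q + gamma * np.outer(resid, resid)   # accumulated noise statistics
    return M_new, S, Q_new, n + 1                # K_{t+1} = S

# Toy usage: d = 3 neurons, c = 2 control dimensions, I = J = 1 delays.
d, c, I, J = 3, 2, 1, 1
m = (I + 1) * d + (J + 1) * c
M, K, Q, n = np.zeros((d, m)), np.eye(m), np.eye(d), 0
r_hist = [np.zeros(d) for _ in range(I + 1)]          # r_{t-I}, ..., r_t
u_hist = [np.zeros(c) for _ in range(J)]              # u_{t-J+1}, ..., u_t
candidates = [np.zeros(c), np.ones(c), -np.ones(c)]   # finite control set U (assumed)

u_next, x = select_control(K, r_hist, u_hist, candidates)
r_next = np.random.randn(d)                           # stand-in for the observed r_{t+1}
y = r_next                                            # f is assumed to be the identity
M, K, Q, n = infomax_update(M, K, Q, n, x, y)
```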
Figure 7. Frontal and 30-degree views from the Cohn-Kanade database.
Each sequence begins with a neutral expression and proceeds to a target
expression. (©Jeffrey Cohn)

It assumes that prior knowledge is available and updates the posterior knowledge on the basis of the observations. Control is chosen at each instant to provide maximal expected information concerning the quantities we have to estimate.
Starting from an arbitrary prior distribution of the parameters, the posterior distribution needs to be computed. This estimation can be highly complex, so approximations are common in the literature. For example, assumed density filtering, in which the computed posterior is projected onto simpler distributions, has been suggested. The steps of the InfoMax update are summarized in Figure 6.

Figure 8. The original image (upper), the extracted face (bottom left) and the downscaled image (bottom right). (©Jeffrey Cohn)
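As a rough illustration of the extraction and downscaling shown in Figure 8: the paper uses faceAPI for tracking, which is not reproduced here, so the sketch below substitutes OpenCV's stock Haar-cascade detector to crop the largest frontal face and resize it to the 40×40 input used by the learning system. The detector choice and its parameters are assumptions for illustration only.

```python
import cv2
import numpy as np

# Stand-in for the faceAPI-based extraction: detect, crop and downscale a frontal face.
_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_face_40x40(image_path):
    """Return a 40x40 grayscale face crop (float in [0, 1]) or None if no face is found."""
    img = cv2.imread(image_path)
    if img is None:
        return None
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = _CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # Keep the largest detection, assumed to be the subject's face.
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    crop = gray[y:y + h, x:x + w]
    small = cv2.resize(crop, (40, 40), interpolation=cv2.INTER_AREA)
    return small.astype(np.float32) / 255.0
```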
4.3 Database
During the simulation we used the Cohn-Kanade AU-Coded Facial Expression Database [9]. This database is intended for research in automatic facial image analysis and synthesis and for perceptual studies. It provides a standardized set of faces that allows researchers to compare the performance of their models (see Fig. 7).
Although the database is widely used, it lacks high-definition imagery, which affects the performance of the training phase of any learning system. In addition, cropping and face detection have to be applied, since the images are not normalized and require intensive preprocessing.

5 PRELIMINARY RESULTS
We were interested in recognizing independent facial components on the Cohn-Kanade database using the proposed method.
Using the database we extracted the face region of the subjects, then generated 5000 facial image sequences, with different expressions and different subjects. All images were sized to 40×40 pixels, which is equal to the input size of the learning system (see Fig. 8).
The algorithm can identify four different facial components; these correspond mostly to the mouth, the eyes, the eyebrows and the head shape.

6 CONCLUSION
In this paper we have proposed a system for learning and recognizing human facial expressions using camera images. The system tracks the human face using high-speed cameras, extracts the pose-normalized image of the face and feeds it to a Bayesian learning system.
The preliminary results of the simulations illustrate that it is possible to extract some facial features that correspond with AU codes. Concerning future work, the main open question is how to build a generative model of the faces to provide more accurate and complex recognition of the AU codes.

REFERENCES
[1] H. Hashimoto: Intelligent Space - How to Make Space Intelligent by Using DIND. In: Proceedings of the 2004 TRS Conference on Robotics and Industrial Technology, Thailand, 2004, pp. 1-11.
[2] J. Lee, K. Morioka, N. Ando, H. Hashimoto: Cooperation of Distributed Intelligent Sensors in Intelligent Environment. IEEE/ASME Transactions on Mechatronics, Vol. 9, No. 3, 2004, pp. 535-543, ISSN 1083-4435.
[3] M. Niitsuma, H. Hashimoto: Spatial Memory as an Aid System for Human Activity in the Intelligent Space. IEEE Transactions on Industrial Electronics, Vol. 54, Issue 2, 2007, pp. 1122-1131, ISSN 0278-0046.
[4] P. Ekman, E. L. Rosenberg (Eds.): What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (2nd ed.). New York: Oxford University Press, 2005.
[5] Point Grey Research, Inc., http://www.ptgrey.com
[6] Seeing Machines Inc., http://www.seeingmachines.com/product/faceapi/
[7] J. Lewi, R. Butera, L. Paninski: Real-time Adaptive Information-theoretic Optimization of Neurophysiology Experiments. In: Advances in Neural Information Processing Systems, Vol. 19, 2007.
[8] L. Palafox, H. Hashimoto: Human Action Recognition Using 4W1H and Particle Swarm Optimization Clustering. In: Proceedings of the International Conference on Human System Interaction, Rzeszow, Poland, 2010.
[9] T. Kanade, J. F. Cohn, Y. Tian: Comprehensive Database for Facial Expression Analysis. In: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition (FG'00), Grenoble, France, 2000, pp. 46-53.
