
VISION-BASED SKIN-COLOUR SEGMENTATION OF MOVING HANDS

FOR REAL-TIME APPLICATIONS


S. Askar, Y. Kondratyuk, K. Elazouzi, P. Kauff, O. Schreer
Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut
Germany

ABSTRACT
We present a robust vision-based skin-colour
segmentation method for moving hands in a real-time
application. Segmentation of hands is an important
processing step in gesture recognition applications,
where the general shape and position of the hands are of
interest. In contrast to these approaches, the presented
method concentrates on an accurate segmentation,
which is required for further processing steps in a
real-time videoconferencing application. A hand tracking
procedure is applied to improve the segmentation in
terms of accuracy, robustness and processing speed.
Furthermore, the presented approach can cope with
difficult situations like contact between the hands or
contact between face and hands. This is important for
many real-time applications, e.g. for the presented
videoconferencing system, to allow the conferees a
natural behaviour. Moreover, we present an approach
for an automatic initialisation of the skin-colour range
for the specific user. We show experimental results
proving the efficiency and reliability of our approach.
The proposed hand segmentation method is capable of
processing TV-sized (CCIR 601, 576x720 pixels) video
images in real-time at 25 Hz on a common PC. The
presented approach will support any video processing in
visual media production where segmentation accuracy
and real-time capability are required.

1 INTRODUCTION
Numerous applications use skin colour as one of the
basic features for detecting or analysing the human face
or hands. They have different aims and different
constraints under which the face or hands are analysed.
One crucial point common to most applications in this
context is an accurate segmentation of the human face
or hands. Many applications deal with segmentation of
hands, such as hand sign recognition, human-vehicle
interaction and human-computer interfaces, but common
to all is a rough segmentation result, as other features
are derived from it (refer to Cui (1), Imagawa (2),
Guo (3), Zhu (4), Starner (5)). In some hand
segmentation approaches marked gloves are used, which
are not applicable in videoconferencing systems (see
Dorfmueller (6)). In other approaches infrared cameras
are used, or depth information based on multiple views
is exploited, e.g. Sato (7), Malassiotis (8), Jennings (9).
The real-time constraint is considered as well, but only
in gesture recognition applications, e.g. in Lovell (10),
Herpers (11). The combination of accurate segmentation
of both hands, robust tracking including overlap of
hands and head, and real-time capability on
high-resolution video has not been considered in any
publication before.
The presented method for segmentation of hands using
skin colour results from a project on immersive 3D
videoconferencing, which is being developed at
FhG/HHI (see Kauff (12)). In this system, segmentation
of hands is an important part of improving succeeding
processing steps such as disparity estimation or the
synthesis of virtual views.
The approach is a new robust segmentation method for
moving hands, which can also handle contact between
the hands and contact between hands and head. The
fundamental algorithm is based on skin-colour
segmentation and uses so-called bounding boxes, which
track the hands and head separately (Fig. 1). A spatial
sub-sampling inside the considered bounding boxes
guarantees a more robust and additionally a fast
segmentation. Furthermore, we present a new
initialisation algorithm, which determines the specific
skin-colour values automatically based on the first few
images. This is very important in order to limit the
skin-colour range to the specific person and to achieve
robustness and accuracy. Our algorithms process
TV-sized video images (CCIR 601, 576x720 pixels) in
real-time at 25 Hz.

Fig. 1: Hand contours and bounding boxes


In contrast to many approaches in the fields of e.g.
gesture recognition or the control of movements, we are
not interested in features of the hands like their
orientation or motion. The key problem in our
application is to find a closed region that coincides as
closely as possible with the real contour of the hand, in
order to specify depth discontinuities. In the proposed
method, an initialisation of the specific skin-colour
range is followed by the real-time segmentation, which
consists of two succeeding steps: the tracking of the
hands and the skin-colour-based segmentation (see
Fig. 2). The tracking of the hands is performed on a
sub-sampled QCIF image, whereas in the second step a
region-growing approach is applied to the
full-resolution image to extract the final hand segments.

Fig. 2: Block diagram of the presented method
(initialisation: determination of the skin-colour range,
and tracking: hand tracking, both on the sub-sampled
image; segmentation: skin-colour segmentation on the
original image size)


In the next section, the concept of immersive 3D
videoconferencing is described and the relevance of
accurate segmentation of hands in this context is
explained. Then, the automatic initialisation of the
specific skin-colour range of the participant is
presented. In sections 4 and 5, the tracking of hands and
the skin-colour segmentation are proposed. In section 6,
a solution in the case of overlapping skin-coloured
regions is presented. Experimental results are shown in
section 7. The paper ends with a conclusion.
2 IMMERSIVE 3D-VIDEOCONFERENCING
The basic idea of immersive 3D videoconferencing is
that the participants perceive the virtual conference
scene under the correct perspective. This
includes full eye contact with the remote participants,
although the cameras are mounted around the display. It
is achieved by a synthesis of virtual views in order to
simulate a virtual camera at the correct position on the
display (see Lei (13)). The current demonstrator of the
immersive 3D videoconferencing system is shown in
Fig. 3.
To achieve the correct perspective view of the remote
participants, a real-time capable disparity estimator has
been developed, calculating depth information from
stereo camera images (see Schreer (14)). Although this
disparity estimator provides convincing results, it fails
at depth discontinuities and in occluded areas, where
pixel correspondences cannot be calculated. This leads
to artefacts in the synthesized views. Due to the nature
of the videoconferencing application, depth
discontinuities mainly occur at the contours of the freely
gesticulating hands. Since artefacts in these areas
severely disturb the impression of immersiveness and
the natural representation of the remote participants, a
reduction of these effects is desired.

Fig. 3: Demonstrator of the immersive 3D videoconferencing system
Accurate segmentation masks of both hands provide
very helpful information for improving the disparity
estimation by replacing wrong or unknown disparities
with reliable values (see Schreer (14)). The colour of
human hands is a striking feature that offers a solution
to this problem: segmentation of skin colour can provide
information about the depth discontinuity at the contour
area of the hands.
3 INITIALISATION OF SKIN-COLOUR
Usually, TV and video data are available in the YUV
colour space. Hence, the investigations in this approach
have been made in the YUV colour space. Other colour
spaces (e.g. HSV, HSI) aim to provide a more uniform
and accurate representation of colour, interpreting it in
the same way as the human perceptual system does. But
the transformation of video signals to these colour
spaces is a very time-consuming process and needs to be
avoided in real-time applications. In our proposed
algorithm we consider only the chrominance (the U and
V channels), which fully represents the colour. The
skin colour is described as a quadruple consisting of
the mean values mu, mv and the tolerance values tu, tv.
The spectral reflectance of human skin is independent
of the human race and of the wavelength of the incident
light (see Anderson (15)). The same observation can be
made considering the transformed colour in common
video formats. Hence, human skin colour can be defined
as a global skin-colour cloud in the colour space (see
Störring (16)). The general thresholds for this striking
area are, however, still too large to obtain reasonable
segmentation results. Depending on different factors like
shadows, illumination, the colour distribution in the
particular video data, the different pigmentation of the
person's skin and so on, it is useful to adapt the
thresholds to the given illumination conditions and the
observed person. Therefore, it is assumed that the skin
colour of a human under certain conditions can be
considered a subset of the global skin-colour cloud.
Hence, we distinguish between the following two terms:
1) global skin colour, representing skin colour in a
general way with large tolerance values, and 2) skin
colour, representing the skin colour of the specific
person under certain illumination conditions, described
with specific mean values and reduced tolerances.
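To make the quadruple representation concrete, the following minimal sketch shows the per-pixel classification rule it implies; the rectangular tolerance region in the UV plane, the function name and the numpy formulation are our own illustration, not code from the original system.

```python
import numpy as np

def skin_mask(u, v, mu, mv, tu, tv):
    """Classify pixels as skin from the chrominance channels only.

    u, v   : 2D arrays holding the U and V channels of the image.
    mu, mv : mean skin-colour values (first half of the quadruple).
    tu, tv : tolerance values (second half of the quadruple).
    Returns a boolean mask that is True where the pixel lies inside
    the rectangular skin-colour region in the UV plane.
    """
    return (np.abs(u.astype(np.int16) - mu) <= tu) & \
           (np.abs(v.astype(np.int16) - mv) <= tv)
```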
Hence, an important question arises: how can
appropriate skin-colour parameters be determined for a
given scenario to achieve the best segmentation results?
Applying parameters from a general statistical analysis
of skin colour does not lead to optimal segmentation
results in the majority of cases. But they can often be
used as good coarse start values, from which appropriate
parameters are found by slightly varying them.
One option is to adapt the thresholds manually at the
beginning of the segmentation. This is obviously not
convenient in terms of usability and user-friendliness in
the case of a videoconferencing system. Therefore, a
quasi-automatic method is presented to find suitable
parameters. Nevertheless, in the case of extremely dark
or extremely bright illumination, an additional manual
adjustment is unavoidable. However, we have
experienced that for brighter illumination it is
reasonable to choose larger tolerance values than for
darker scenes.
The initialisation step is performed on the sub-sampled
image for real-time and stability purposes. Besides the
desired skin-colour range, it also provides the three
centres of gravity of the two hands and the head, which
are used as start positions for the bounding boxes. In the
first image, a pixelwise skin-colour segmentation is
performed. For the initial skin-colour range, threshold
values are obtained from a statistical analysis of a
number of images representing the global skin-colour
cloud. After applying the global thresholds, a rough
binary mask is obtained, which is filtered to reduce
noise.
The goal of the following process is to determine the
blob positions of both hands and the head and to
calculate new and more accurate skin-colour threshold
values in the distinct areas. Hence, the row and column
histograms of the binary image are calculated, which
represent the distribution of skin-coloured pixels in the
horizontal and vertical directions (Fig. 4).

Fig. 4: Row and column histogram of the binary image

The image is now divided into three equal stripes in the
horizontal and in the vertical direction, which leads to
nine equal areas in the whole image. For each stripe, the
maximum in the corresponding histogram interval is
determined (Fig. 5). The points of intersection of the
horizontal and vertical maxima yield nine potential
positions for the centres of gravity of possible hand or
head blobs (Fig. 6). Obviously, some of them must be
wrong. A neighbourhood analysis searching for the
points with the most skin-coloured neighbour pixels
removes the wrong points. The resulting three positions
mark the three skin-colour blobs: left hand, right hand
and face (Fig. 7).

Fig. 5: Determination of the maximum in each stripe

Fig. 6: Points of intersection based on horizontal and vertical maxima

Fig. 7: Resulting blob positions

The proposed histogram method is independent of the
orientation of the camera. Obviously, this approach
works reliably if the hands and head are in different
image regions; they need not, however, be at specific
positions.
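As an illustration of this histogram technique, here is a compact numpy sketch under our own assumptions (three stripes per axis, and a small square window for the neighbourhood plausibility check); it is not the authors' code.

```python
import numpy as np

def blob_candidates(mask):
    """Locate the three skin blobs (two hands, head) in a binary mask.

    mask : 2D boolean array, True at skin-coloured pixels.
    Returns the three candidate (row, col) positions whose local
    neighbourhoods contain the most skin-coloured pixels.
    """
    rows, cols = mask.shape
    row_hist = mask.sum(axis=1)          # skin pixels per image row
    col_hist = mask.sum(axis=0)          # skin pixels per image column

    # Maximum of each histogram inside three equal stripes.
    r_max = [k * (rows // 3) + np.argmax(row_hist[k*(rows//3):(k+1)*(rows//3)])
             for k in range(3)]
    c_max = [k * (cols // 3) + np.argmax(col_hist[k*(cols//3):(k+1)*(cols//3)])
             for k in range(3)]

    # Nine intersection points; rank them by the number of
    # skin-coloured neighbours in a small window around each point.
    def support(r, c, w=5):
        return mask[max(r - w, 0):r + w + 1, max(c - w, 0):c + w + 1].sum()

    points = [(int(r), int(c)) for r in r_max for c in c_max]
    points.sort(key=lambda p: support(*p), reverse=True)
    return points[:3]                    # best three: hands and head
```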

In order to distinguish between the face and the hands,
the assumption is made that the topmost object
corresponds to the face. In the case of larger rotations of
the camera, this information must be taken into account
to assign the blob position of the face correctly (e.g.
leftmost object, rightmost object, ...). In summary, a few
rules have to be observed by the person, but they are
very simple and general ones. The experiments have
proven that the correct blob positions are computed
reliably after a few frames.
In our online real-time application, the initialisation is
repeated until three separate and reliable blob positions
are determined. After successful computation of the
blob positions, the skin-coloured pixels around them are
analysed and the specific new mean values and
tolerances are derived. After the initialisation
parameters have been delivered, the vision-based hand
segmentation starts. During the hand segmentation
process the bounding boxes are kept under surveillance.
Certain indications, e.g. no segmented pixels in a box or
boxes leaving the image area, lead to a re-initialisation.
In these cases we assume either a failure of the tracking
process or a change in the environmental conditions
(shadows, illumination change).
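The paper does not spell out how the refined quadruple is computed from the pixels around the blobs; one plausible reading, sketched below purely as an assumption of ours, is to take the mean of the U and V values of those pixels and to derive the tolerances from their spread (the scale factor k is a hypothetical parameter, not specified in the paper).

```python
import numpy as np

def refine_skin_quadruple(u, v, mask, k=2.5):
    """Derive person-specific skin-colour parameters.

    u, v : chrominance channels of the image.
    mask : boolean mask of the skin-coloured pixels found around
           the three detected blob positions.
    k    : hypothetical spread factor mapping the standard
           deviation to a tolerance value.
    Returns the refined quadruple (mu, mv, tu, tv).
    """
    us, vs = u[mask], v[mask]
    mu, mv = us.mean(), vs.mean()
    tu, tv = k * us.std(), k * vs.std()
    return mu, mv, tu, tv
```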
4 TRACKING
An enormous optimisation in speed can be achieved if
the segmentation of the hands is limited to a specific
region. This is very important for real-time applications.
Therefore, we define two bounding boxes, which track
the hands of the conferee continuously during the
conference session; the search is performed only inside
these boxes. After a pixelwise skin-colour segmentation
inside each bounding box, we calculate the centre of
gravity of the obtained skin-coloured area in the whole
box. The purpose of the tracking phase is to determine
the centres of gravity of the hands using the previous
bounding box area. Then a new bounding box position is
calculated and delivered to the succeeding segmentation
step. In Fig. 8, left, the calculation of the new centre of
gravity inside the old box is depicted; the box shifted to
the new position is shown in Fig. 8, right.

Fig. 8: Tracking of the centre of gravity
Moreover, we perform the whole tracking process in the
sub-sampled image to achieve a further reduction of the
processing time. As only a blob tracking is performed in
this step, it is sufficient to apply the procedure to the
sub-sampled image. The usage of bounding boxes and
the tracking in the sub-sampled image have a further
benefit beside real-time capability: the segmentation
becomes extremely robust. Pixels that are spatially far
away from hands and head, but coloured similarly to
skin, do not have any influence on the segmentation
result. In addition, the tracking in the sub-sampled
image acts as a filter and thus reduces the noise in the
blob positions.
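A minimal sketch of one such tracking iteration, in our own formulation (the box is given as a centre plus half-sizes, and the skin test reuses the quadruple rule from section 3):

```python
import numpy as np

def track_box(u, v, centre, half, quad):
    """One tracking iteration for a single bounding box.

    centre : (row, col) of the current box centre in the
             sub-sampled image.
    half   : (half-height, half-width) of the box.
    quad   : skin-colour quadruple (mu, mv, tu, tv).
    Returns the new box centre: the centre of gravity of the
    skin-coloured pixels found inside the old box.
    """
    mu, mv, tu, tv = quad
    r0, r1 = max(centre[0] - half[0], 0), centre[0] + half[0] + 1
    c0, c1 = max(centre[1] - half[1], 0), centre[1] + half[1] + 1
    win_u, win_v = u[r0:r1, c0:c1], v[r0:r1, c0:c1]
    skin = (np.abs(win_u.astype(np.int16) - mu) <= tu) & \
           (np.abs(win_v.astype(np.int16) - mv) <= tv)
    if not skin.any():
        return centre          # no skin found: keep the old position
    rs, cs = np.nonzero(skin)  # centre of gravity of the skin area
    return (r0 + int(rs.mean()), c0 + int(cs.mean()))
```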
5 SKIN-COLOUR SEGMENTATION
For the succeeding accurate segmentation of the hands
in the high-resolution image, a skin-colour segmentation
method based on a region-growing technique has been
developed. The region-growing approach requires a
so-called seed point for the segmentation, which is
provided by the previous tracking process. Starting from
this point, the segmented area is enlarged by
continuously analysing the neighbours of the already
segmented pixels. The advantage of this technique is
that it leads to one closed region.
The region-growing approach accounts for the case that
the centre of gravity obtained from the tracking step
does not itself have skin colour, e.g. because it lies
between two fingers or because the hand boxes overlap
and mislead the tracking. Additionally, the situation that
a contact between hands and face occurs is taken into
account. Both cases are discussed in the following
section.

Fig. 9: Region growing
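The region growing itself can be sketched as a flood fill over the skin mask starting at the seed; the breadth-first formulation and the small fallback search for a skin-coloured seed are our own illustrative choices, not necessarily the authors' exact procedure.

```python
from collections import deque

def grow_region(skin, seed):
    """Region growing from a seed point over a boolean skin mask.

    skin : 2D boolean array (full-resolution skin classification).
    seed : (row, col) seed point provided by the tracking step.
    Returns a set of (row, col) pixels forming one closed region.
    """
    rows, cols = len(skin), len(skin[0])
    r, c = seed
    # If the seed is not skin-coloured (e.g. it fell between two
    # fingers), look for the nearest skin pixel in a small window.
    if not skin[r][c]:
        found = [(abs(dr) + abs(dc), r + dr, c + dc)
                 for dr in range(-4, 5) for dc in range(-4, 5)
                 if 0 <= r + dr < rows and 0 <= c + dc < cols
                 and skin[r + dr][c + dc]]
        if not found:
            return set()
        _, r, c = min(found)

    region, queue = {(r, c)}, deque([(r, c)])
    while queue:                      # breadth-first flood fill
        pr, pc = queue.popleft()
        for nr, nc in ((pr-1, pc), (pr+1, pc), (pr, pc-1), (pr, pc+1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and skin[nr][nc] and (nr, nc) not in region):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region
```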


6 CONTACT OF SKIN-COLOURED AREAS


In order to allow the participants of the conference
session a natural behaviour with free gestures, contact
between both hands and also contact between hands and
face has to be considered. This is done twice: in the
tracking step as well as in the segmentation step.
If the hands are very close to each other, the following
problem occurs in the tracking phase. For example, if
the left-hand box also detects a part of the right hand
inside the box (see Fig. 10, left), the tracking could be
misled. Due to the segmented parts of the other hand,
the centre of gravity could be shifted to a wrong position
with no skin colour.
In the worst case, the search for a skin-coloured region
in the neighbourhood of the determined non-skin-coloured
centre of gravity would lead to the wrong (right) hand,
and the left hand might be lost. In order to avoid this
case, the search directions, starting from the determined
centre of gravity, are limited to those pointing away
from the position of the right-hand box, and vice versa
(Fig. 10, right). A fast and sophisticated retrieval
strategy thus preserves the correct hand.

Fig. 10: Contact of hand boxes

If both hands are in contact with each other, the
bounding boxes overlap. When the hands come apart,
the bounding boxes obviously must be separated again,
which is not trivial. To overcome this problem, the
following approach has been implemented to separate
the bounding boxes when the hands part. For each
bounding box, preferred directions are defined, e.g. the
left-bottom edge for one box and the right-top edge for
the other. While the hands are in contact, the boxes are
only allowed to move in their preferred directions. Once
the hands are no longer connected, the preference is
switched off and the movement of the bounding boxes is
no longer restricted. Example images of such a sequence
are presented in the next section.
If a hand has contact with the face, the following is
performed: in addition to the hands, the head blob of the
participant resulting from the initialisation phase is
tracked as well in the sub-sampled image, using a third
bounding box. If one of the hand boxes overlaps with
the head box, then only the non-overlapping part of the
hand box is considered for tracking the centre of gravity.
This is shown in Fig. 11, left.
Thus, a wrong movement of the hand box towards the
head is avoided and the tracking becomes much more
robust without losing the hand. Otherwise, if a hand
disappears completely inside the head box, an area
surrounding the head box is observed, waiting for the
hand to come out of the head box again (see Fig. 11,
right). If skin-coloured pixels are detected in the
surrounding area, the tracking of the hand is continued.

Fig. 11: Contact of hand and head box
7 EXPERIMENTAL RESULTS
The presented methods run in real-time on a standard
PC (Pentium IV, 2 GHz) on full TV-resolution video
(576x720 pixels at 25 Hz). Various situations, such as
different behaviours and gestures, have been tested
under real conditions. The following images, extracted
from a sequence, show the robustness in several
situations and the accuracy of the segmentation.
In Fig. 12, an example is given where the hands contact
each other and come apart. After the contact, tracking is
still successful and the bounding boxes are separated
correctly.
In Fig. 13, a misled tracking is shown. In this case the
right-hand box gets lost after contact of the hand with
the face; instead, the face of the person is wrongly
tracked. The successful operation of our method is
shown in Fig. 14 and Fig. 15 for situations where a
single hand, and also both hands, have contact with the
face region.

Fig. 12: Contact of hands

Fig. 13: Contact of hand and head box, hand box is lost

Fig. 14: Contact between hand and head, correct tracking

Fig. 15: Contact of both hands and head together, correct tracking (order: left to right)
The image series (Fig. 14) shows that the right-hand box
still tracks the hand correctly after the contact, using the
head-box processing method. Despite the robust
tracking, it must be noted that our algorithm cannot
determine the contours of the objects while they are
connected. Only once they are separated does our
application make use of the contours determined in the
single boxes.
Finally, Fig. 15 gives an example where both hands
touch the head at the same time. After separation, each
box correctly tracks the corresponding object.
Some assumptions for our skin-colour segmentation
method have been made:
- no sudden change of illumination,
- long sleeves on the clothes worn,
- normal motion speed of the hands while gesticulating.
Minor changes in the illumination can easily be handled,
as the current skin-coloured pixels are determined in
every new image. Based on these pixels, new thresholds
can be derived.
The restriction to long sleeves is mainly determined by
the size of the bounding boxes. A larger bounding box
increases the computational effort, which may result in
a lower frame rate under certain circumstances.

Nevertheless, experiments have successfully shown the
segmentation of participants wearing T-shirts.
The speed of the moving hands may cause a misleading
tracking. However, online tests have shown that the
hands have to be moved very fast for this to happen,
which is not expected as normal gesticulating behaviour.
8 CONCLUSION
In this paper, a new robust method for the accurate
segmentation of hands has been presented, running
successfully in a real-time application and processing
TV-sized images (576x720 pixels at 25 Hz). A new
method was proposed to adjust the thresholds for
skin-colour segmentation automatically according to the
specific participant and the illumination conditions. The
required region of interest of skin-coloured pixels is
determined quite robustly using a new histogram
technique. The segmentation is performed in bounding
boxes surrounding the hands in order to reduce the
computational effort. These boxes are tracked
continuously, and the method is able to handle contact
between both hands and contact between hands and face
without losing the tracked objects, namely the hands. A
continued analysis strategy controls tracking and
segmentation and delivers accurate hand masks. Besides
videoconferencing, other applications are kept in mind
for successful use of the presented methods, such as
advanced gesture recognition tools and post-production
using 3D or photo-realistic rendering. The presented
approach can easily be extended for several specific
necessities, e.g. processing the hands of two or more
persons.

9 ACKNOWLEDGEMENT
This work is supported by the Deutsche Forschungsgemeinschaft (DFG) under grant number DD 20 9 11.

REFERENCES
1. Y. Cui, J. Weng, 1996, Int. Conf. on Pattern Recognition, 617-621.
2. K. Imagawa, S. Lu, S. Igi, 1998, Int. Conf. on Automatic Face and Gesture Recognition, 462-467.
3. D. Guo, Y. Yan, M. Xie, 1998, Int. Conf. on Control, Automation and Computer Vision.
4. X. Zhu, J. Yang, A. Waibel, 2000, Int. Conf. on Automatic Face and Gesture Recognition, 446-453.
5. T. Starner, B. Leibe, D. Minnen, T. Westyn, A. Hurst, J. Weeks, 2003, Machine Vision and Applications, Vol. 14(1), 59-71.
6. K. Dorfmueller-Ulhaas, D. Schmalstieg, 2001, ACM/IEEE Int. Symp. on Augmented Reality, 30-44.
7. Y. Sato, Y. Kobayashi, H. Koike, 2000, Int. Conf. on Automatic Face and Gesture Recognition, 462-467.
8. S. Malassiotis, F. Tsalakanidou, N. Mavridis, V. Giagourta, N. Grammalidis, M.G. Strintzis, 2001, Int. Conf. on Image Processing, 955-958.
9. C. Jennings, 1999, Int. Workshop on Recognition, Analysis and Tracking of Faces and Gestures in Real-Time Systems, 152-160.
10. B.C. Lovell, D. Heckenberg, 2002, Asian Conference on Computer Vision, 336-341.
11. R. Herpers, W. J. MacLean, C. Pantofaru, L. Wood, K. Derpanis, D. Topalovic, J. Tsotsos, 2001, Int. Workshop on Recognition, Analysis and Tracking of Faces and Gestures in Real-Time Systems, 133-144.
12. P. Kauff, O. Schreer, 2002, IEEE Conf. on Multimedia and Expo.
13. B. J. Lei, E. A. Hendriks, 2001, Vision, Modeling and Visualization, 185-192.
14. O. Schreer, N. Brandenburg, S. Askar, P. Kauff, 2001, Vision, Modeling and Visualization, 383-390.
15. R. R. Anderson, J. Hu, J. A. Parrish, 1981, Bioengineering and the Skin, chapter 28, 253-265.
16. M. Störring, H. J. Andersen, E. Granum, 1999, Symp. on Intelligent Robotics Systems, 187-195.