
372 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 10, NO. 3, APRIL 2008

Video-Based Human Movement Analysis and Its Application to Surveillance Systems
Jun-Wei Hsieh, Member, IEEE, Yung-Tai Hsu, Hong-Yuan Mark Liao, Senior Member, IEEE, and
Chih-Chiang Chen

Abstract—This paper presents a novel posture classification system that analyzes human movements directly from video sequences. In the system, each sequence of movements is converted into a posture sequence. To better characterize a posture in a sequence, we triangulate it into triangular meshes, from which we extract two features: the skeleton feature and the centroid context feature. The first feature is used as a coarse representation of the subject, while the second is used to derive a finer description. We adopt a depth-first search (DFS) scheme to extract the skeletal features of a posture from the triangulation result. The proposed skeleton feature extraction scheme is more robust and efficient than conventional silhouette-based approaches. The skeletal features extracted in the first stage are used to extract the centroid context feature, which is a finer representation that can characterize the shape of a whole body or of body parts. The two descriptors working together make human movement analysis a very efficient and accurate process because they generate a set of key postures from a movement sequence. The ordered key posture sequence is represented by a symbol string. Matching two arbitrary action sequences then becomes a symbol string matching problem. Our experimental results demonstrate that the proposed method is a robust, accurate, and powerful tool for human movement analysis.

Index Terms—Behavior analysis, event detection, string matching, triangulation.

Manuscript received July 12, 2006; revised August 4, 2007. This work was supported in part by the National Science Council of Taiwan, R.O.C., under Grant NSC95-2221-E-155-071 and by the Ministry of Economic Affairs under Contract 94-EC-17-A-02-S1-032. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Kiyoharu Aizawa.
J.-W. Hsieh, Y.-T. Hsu, and C.-C. Chen are with the Department of Electrical Engineering, Yuan Ze University, Chung-Li 320, Taiwan, R.O.C. (e-mail: shieh@saturn.yzu.edu.tw; s937114@mail.yzu.edu.tw; s929106@mail.yzu.edu.tw).
H.-Y. M. Liao is with the Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C. (e-mail: liao@iis.sinica.edu.tw).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TMM.2008.917403

I. INTRODUCTION

THE analysis of human body movements can be applied in a variety of application domains, such as video surveillance, video retrieval, human-computer interaction systems, and medical diagnoses. In some cases, the results of such analysis can be used to identify people acting suspiciously and other unusual events directly from videos. Many approaches have been proposed for video-based human movement analysis [1]–[6], [9], [10]. For example, Oliver et al. [9] developed a visual surveillance system that models and recognizes human behavior using hidden Markov models (HMMs) and a trajectory feature. Rosales and Sclaroff [10] proposed a trajectory-based recognition system to detect pedestrians in outdoor environments and recognize their activities from multiple views based on a mixture of Gaussian classifiers. Other approaches extract human postures or body parts (such as the head, hands, torso, or feet) to analyze human behavior [5], [6]. Mikic et al. [5] used complex 3-D models and multiple video cameras to extract 3-D information for 3-D posture analysis, while Werghi [6] extracted 3-D information and applied wavelet transforms to analyze 3-D human postures. Although 3-D features are more informative than 2-D features (like contours or silhouettes) for classifying human postures, the inherent correspondence problem and high computational cost make this approach inappropriate for real-time applications. As a result, a number of researchers have based their analyses of human behavior on 2-D postures. In [2], Cucchiara et al. proposed a probabilistic posture classification scheme to identify several types of movement, such as walking, running, squatting, or sitting. Ivanov and Bobick [11] presented a 2-D posture classification system that uses HMMs to recognize human gestures and behaviors. In [13], Wren et al. proposed a Pfinder system for tracking and recognizing human behavior based on a 2-D blob model. The drawback of this approach is that incorporating 2-D posture models into the analysis of human behavior raises the possibility of ambiguity between the adopted models and real human behavior due to the mutual occlusions between body parts and loose clothes. In [17] and [18], Guo et al. used a stick model that represents the parts of a human body as sticks linked by joints for human motion tracking. Then, in [19], Ben-Arie et al. used a voting technique to recover the parameters of the stick model from video sequences for action analysis. In addition to the stick model, the cardboard model [20] is widely recognized to be good for modeling articulated human motions. However, the prerequisite that body parts must be well segmented makes this model inappropriate for real-time analysis of human behavior.

To solve the problem with the cardboard model, Park and Aggarwal [15] used a dynamic Bayesian network to segment a body into different parts [16]. This blob-based approach is very promising for analyzing human behavior at the semantic level, but it is very sensitive to changes in illumination. In addition to blobs, several researchers have proposed classifying human postures based on silhouettes. Weik and Liedtke [12] traced the negative minimum curvatures along body contours to segment body parts and then identified body postures using a modified Iterative Closest Point (ICP) algorithm. Fujiyoshi et al. [8] presented a skeleton-based method that recognizes postures by extracting skeletal features based on curvature changes along the human silhouette. In addition, Chang and Huang [7] used



different morphological operations to extract skeletal features from postures and then identified movements within an HMM framework. Although the contour-based approach is simple and efficient for coarse classification of human postures, it is vulnerable to the effects of noise, imperfect contours, or occlusions. Another approach used to analyze human behavior is the Gaussian probabilistic model. In [2], Cucchiara et al. used a probabilistic projection map to model postures and performed frame-by-frame posture classification to recognize human behavior. The method uses the concept of a state-transition graph to integrate temporal information about postures in order to deal with the occlusion problem. However, the projection histogram is not robust because it changes dramatically under different lighting conditions.

In this paper, we propose a novel triangulation-based system that analyzes human behavior directly from videos. The detailed components of the system are shown in Fig. 1. First, we apply background subtraction to extract body postures from video sequences and then derive their boundaries by contour tracing. Next, a Delaunay triangulation technique [14] is used to divide a posture into different triangular meshes. From the triangulation result, two important features, the skeleton and the centroid context (CC), are extracted for posture recognition. The skeleton feature is used for a coarse search, while the centroid context, being a finer feature, is used to classify postures more accurately. To extract the skeleton feature, we propose a graph search method that builds a spanning tree from the triangulation result. The spanning tree corresponds to the skeletal structure of the analyzed body posture. The proposed skeleton extraction method is much simpler than silhouette-based techniques [8]–[13]. In addition to the skeletal structure, the tree provides important information for segmenting a posture into different body parts. Based on the result of body part segmentation, we construct a new posture descriptor, namely, the centroid context descriptor, for recognizing postures. This descriptor utilizes a polar labeling scheme to label every triangular mesh with a unique number. Then, for each body part, a feature vector, i.e., the centroid context, is constructed by recording all related features of the centroid of each triangular mesh according to this unique number. We can then compare different postures more accurately by measuring the distance between their centroid contexts. Each posture is assigned a semantic symbol so that each human movement can be converted and represented by a set of symbols. Based on this representation, we use a new string matching scheme to recognize different human movements directly from videos. The string-based method modifies the calculation of the edit distance by using different weights to measure the operations of insertion, deletion, and replacement. Because of this modification, even if two movements involve large-scale changes, their edit distance is still small. The slow increase in the edit distance substantially reduces the effect of the time warping problem. The experimental results demonstrate the feasibility and superiority of the proposed approach for analyzing human behavior.

Fig. 1. Flowchart of the proposed system.

The remainder of the paper is organized as follows. The Delaunay triangulation technique is introduced in Section II. In Section III, we describe the posture classification scheme, which is based on the skeletal feature. In Section IV, the centroid context for posture classification is discussed. The key posture selection and string matching techniques are detailed in Section V. The experimental results are given in Section VI. We then present our conclusions in Section VII.

II. DEFORMABLE TRIANGULATIONS

In this paper, it is assumed that all video sequences are captured by a stationary camera. When the camera is static, the background of the captured video sequence can be reconstructed using a mixture of Gaussian models [24]. This allows us to extract foreground objects by subtracting the background from each frame. We then apply some simple morphological operations to remove noise. After these pre-processing steps, we partition a foreground human posture into triangular meshes using the constrained Delaunay triangulation technique, and extract both the skeletal features and the centroid context directly from the triangulation result [25].

Assume that a posture has been extracted in binary form by image subtraction. To triangulate the posture, a set of control points must be extracted along its contour in advance. From the set of boundary points along the contour, we extract high-curvature points as the control points. As shown in Fig. 2(a), the angle at a boundary point is determined by two points selected from either side of it along the contour so as to satisfy

(1)

where the two thresholds in (1) are defined relative to the total length of the contour. With the two side points fixed, the angle at the boundary point can be determined by

(2)

If the angle is less than a threshold (here we set it at 150°), the point is selected as a control point. In addition to (2), it is required that two control points be far apart. This enforces the constraint that the distance between any two control points must be larger than the threshold defined in (1).

Fig. 2. Sampling of control points. (a) A point with a high curvature. (b) Two points with high curvatures that are too close to each other.

As shown in Fig. 2(b),
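The control-point sampling rule above can be sketched as follows. The 150° angle threshold comes from the text; the contour representation, the helper names, and the choice of a fixed arc offset (here a fraction of the contour length) for the two side points are illustrative assumptions, since the exact thresholds in (1) were not recoverable.

```python
import math

def sample_control_points(contour, angle_thresh_deg=150.0, spacing_frac=0.05):
    """Pick high-curvature contour points as control points (sketch).

    contour: list of (x, y) boundary points in anticlockwise order.
    A point is a candidate if the angle it forms with two neighbours taken
    a fixed arc distance away is below angle_thresh_deg; among candidates
    that are too close along the contour, only the sharper one is kept.
    """
    n = len(contour)
    offset = max(1, int(spacing_frac * n))  # arc distance to the two side points

    def angle_at(i):
        px, py = contour[i]
        ax, ay = contour[(i - offset) % n]
        bx, by = contour[(i + offset) % n]
        v1 = (ax - px, ay - py)
        v2 = (bx - px, by - py)
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        norm = math.hypot(*v1) * math.hypot(*v2)
        if norm == 0:
            return 180.0
        return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

    candidates = [i for i in range(n) if angle_at(i) < angle_thresh_deg]
    # Enforce the minimum spacing: visit candidates sharpest-first and keep
    # a candidate only if no already-kept point lies within `offset` of it.
    selected = []
    for i in sorted(candidates, key=angle_at):
        if all(min(abs(i - j), n - abs(i - j)) > offset for j in selected):
            selected.append(i)
    return sorted(selected)
```

On a densely sampled square contour, for example, only the four corners survive: straight-edge points form a 180° angle and are rejected, while near-corner points lose the spacing contest to the corners themselves.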

if two candidates are close to each other, i.e., their distance along the contour is below the spacing threshold, the candidate with the smaller angle is chosen as the control point.

Assume that the control points extracted along the boundary are labeled anticlockwise, with indices taken modulo the number of control points. If any two adjacent control points are connected by an edge, the resulting figure is a planar graph, i.e., a polygon. We adopt the constrained Delaunay triangulation technique proposed by Chew [26] to divide this polygon into triangular meshes.

Fig. 3. All the vertices are labeled anticlockwise such that the interior of V is located on their left.

As shown in Fig. 3, the interior of the polygon lies to the left of its anticlockwise-ordered vertices. A triangulation is said to be a constrained Delaunay triangulation of the polygon if a) each edge of the polygon is an edge of the triangulation and b) for each remaining edge of the triangulation, there exists a circle such that (1) the endpoints of the edge are on the boundary of the circle, and (2) if a vertex of the polygon is inside the circle, then it cannot be seen from at least one of the endpoints of the edge. More precisely, given three vertices of the polygon, the triangle they form belongs to the constrained Delaunay triangulation if and only if
(i) the three vertices are mutually visible within the polygon;
(ii) the interior of the circum-circle of the three vertices does not contain any other vertex of the polygon that is visible from the triangle.

Based on the above definition, Chew [26] proposed a divide-and-conquer algorithm that computes a constrained Delaunay triangulation of the polygon in O(n log n) time. The algorithm works recursively. When the polygon contains only three vertices, the polygon itself is the triangulation result. However, when the polygon contains more than three vertices, we choose an edge of the polygon and search for the corresponding third vertex that satisfies properties (i) and (ii). Then, we subdivide the polygon into two sub-polygons. The same procedure is applied recursively to the two sub-polygons until the processed polygon contains only one triangle. In summary, the four steps of the above algorithm are as follows:

1. Choose a starting edge.
2. Find the third vertex of the edge that satisfies properties (i) and (ii).
3. Subdivide the polygon into two sub-polygons.
4. Repeat Steps 1–3 on the two sub-polygons until the processed polygon consists of only one triangle.

Details of this algorithm can be found in Bern and Eppstein [23]. Fig. 4 shows an example of the triangulation result when the algorithm is applied to a human posture.

Fig. 4. Triangulation result of a body posture. (a) Input posture. (b) Triangulation result of (a).

III. SKELETON-BASED POSTURE RECOGNITION

In this work, we extract the skeleton and the centroid context as features from the triangulation result. Conventionally, the extracted skeleton is based on the contour of an object. For example, in [11], Fujiyoshi et al. extract feature points with minimum curvatures from the silhouette of a human body to define the skeleton. However, as the method is vulnerable to noise, we use a graph search scheme to solve the problem.

A. Triangulation-Based Skeleton Extraction

In Section II, we presented a technique for triangulating a human body image into triangular meshes. By connecting the centroids of every pair of connected meshes, a graph can be formed. In this section, we use a depth-first search technique to find the skeleton that will be used for posture recognition. Fig. 5 shows the block diagram of posture classification using skeletons. In what follows, the details of each block are described.

Fig. 5. Flowchart of posture classification using skeletons.

Assume that a binary posture has been decomposed into a set of triangular meshes using the above technique. Each triangular mesh has a centroid, and two meshes are connected if they share one common edge. Based on this connectivity, the set of meshes can be converted into an undirected graph in which the mesh centroids are the nodes, and an edge exists between two centroids if their meshes are connected. The degree of a node on the graph is defined by the number of edges connected to it. We then perform a graph search on this graph to extract the skeleton of the posture.

To derive a skeleton based on a graph search, we seek a node whose degree is one and whose position is the highest among all the nodes on the graph; this node is defined as the head of the posture. Then, we perform a depth-first search [21] from the head to construct a spanning tree in which all leaf nodes correspond to different limbs of the posture. The branch nodes (whose degree on the tree is three) are the key points used to decompose the posture into different body parts, such as the hands, feet, or torso. Taking the union of the head node, the leaf nodes, the branch nodes, and the centroid of the whole posture, the skeleton can be extracted by directly linking any two nodes in this union if they are connected through other nodes (i.e., if a path exists between the two nodes [21]). Next, we summarize the algorithm for skeleton extraction. Its time complexity is linear in the number of triangular meshes.
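The centroid-graph construction and the DFS spanning tree described above can be sketched as follows. The data layout (triangles as vertex-index triples) and the reading of "highest position" as the smallest y coordinate in image coordinates are assumptions; the function names are illustrative.

```python
from collections import defaultdict

def mesh_graph(vertices, triangles):
    """Build the centroid graph: nodes are triangle centroids, and an edge
    joins two triangles that share a common edge.
    vertices: list of (x, y); triangles: list of (i, j, k) index triples."""
    centroids = [tuple(sum(c) / 3.0 for c in zip(*[vertices[i] for i in tri]))
                 for tri in triangles]
    edge_owner = {}
    adj = defaultdict(set)
    for t, (i, j, k) in enumerate(triangles):
        for e in ((i, j), (j, k), (k, i)):
            e = tuple(sorted(e))
            if e in edge_owner:              # second triangle on this edge
                adj[t].add(edge_owner[e])
                adj[edge_owner[e]].add(t)
            else:
                edge_owner[e] = t
    return centroids, adj

def spanning_tree(centroids, adj):
    """DFS spanning tree rooted at the 'head': a degree-one node whose
    position is highest (smallest image y, by assumption)."""
    head = min((t for t in range(len(centroids)) if len(adj[t]) == 1),
               key=lambda t: centroids[t][1])
    parent = {head: None}
    stack = [head]
    while stack:
        t = stack.pop()
        for u in adj[t]:
            if u not in parent:
                parent[u] = t
                stack.append(u)
    return head, parent  # tree edges: (u, parent[u]) for every non-head u
```

Leaf nodes of the returned tree are nodes that appear as no one's parent; branch nodes are those with more than one child, which is how the decomposition into body parts proceeds in the text.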

Fig. 6. Extracting the skeleton of a human model. (a) Original image; (b) the
spanning tree of (a); and (c) a simple skeleton of (a). Fig. 7. Distance transform of a posture. (a) Result of triangulation of a human
posture; (b) skeleton extracted from (a); and (c) calculated distance map of (b).

where is the Euclidian distance between and . A more


Triangulation-based Simple Skeleton Extraction Algorithm efficient way of computing a distance map, in terms of preci-
(TSSE): sion and cost effectiveness, can be found in Borgefors [22]. To
enhance the effect of distance changes, (3) can be modified as
Input: a set of triangular meshes extracted from a human follows:
posture .
Output: the skeleton of . (4)

Step 1: Construct a graph from according to the where . The distance between the two distance maps of
connectivity of nodes in . In addition, get the centroid and can be defined as follows:
from .
(5)
Step 2: Find a node, , whose degree is one and whose
position is the highest among all nodes on .
where represents the image size of . When calcu-
Step 3: Apply a depth first search to to find its spanning
lating (5), and must be normalized to a unit size and their
tree.
centers must be set to the origins of and , respec-
Step 4: Get all leaf nodes and branch nodes from the tively. Fig. 7 shows the result of a distance transform of a posture
tree. Let be the union of , , , and . after skeleton extraction. Fig. 7(a) shows the original posture,
Fig. 7(b) is the result of skeleton extraction, and Fig. 7(c) shows
Step 5: Extract the skeleton from by linking any two
the resultant distance map based on Fig. 7(b).
nodes in if they are connected through other nodes.
IV. POSTURE RECOGNITION USING THE CENTROID CONTEXT
Note that, the spanning tree of obtained by the depth-first
search is also a skeleton. As shown in Fig. 6, (a) is the original The skeleton-based method presented in the previous section
posture, and (b) is the calculated spanning tree. From (b), we can provides a simple and efficient way to represent body postures.
get the corresponding skeleton, shown in (c), using the TSSE However, the skeleton is a coarse feature that can only be used in
algorithm. In fact, the linked interior of (b) is also a skeleton of a coarse search process. To better characterize human postures
(a), but it is not as straight as the skeleton in (c). In this paper, and improve the accuracy of the recognition result, we use the
we call the skeleton shown in Fig. 5(b) a “complex skeleton,” centroid context.
and the one shown in Fig. 5(c) a “simple skeleton.” The simple
A. Centroid Context-Based Description of Postures
skeleton is obtained just by connecting all the branch nodes of
. We compare their performances in Section VI. In this section, we introduce a centroid context-based shape
descriptor that can characterize the interior of a shape. In the
B. Posture Recognition Using a Skeleton literature, quite a number of global features [27]–[29] have
been proposed for shape recognition. For example, the moment
In the previous section, a triangulation-based method was descriptor [29] has good invariance properties for trademark in-
proposed for extracting skeletal features from a body posture. dexing but its calculation is very time-consuming. The Fourier
Assume and are two skeletal images extracted from a descriptor [28] is simple but easily affected by noise. Shape
posture and a posture , respectively. We use a distance trans- context [24] is a good descriptor for shape recognition but it
form to convert each skeleton into a gray level image. Then, requires a set of dense feature correspondences. Therefore,
based on the distance maps, we calculate the degree of simi- we propose to use a new type of descriptor-“centroid context”
larity between and . to tackle the above problems. Basically, the descriptor can
Assume is the distance map of . Then, the value be used in the fine search process. Since the triangulation
of a pixel in is the shortest distance between it and all results of human postures vary, we calculate the distribution
foreground pixels in , i.e., of every posture based on the relative positions of the meshs’
centroids. A descriptor of this form guarantees robustness and
(3) compactness. Assume all postures are normalized to a unit size.
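The distance-map comparison of (3) and (5) can be sketched as follows. This is a brute-force Euclidean distance transform and a mean-absolute-difference map comparison; the exponential enhancement of (4) and the exact normalization constants were not recoverable from the text, so they are omitted here, and Borgefors's chamfer transform [22] would replace the brute-force loop in practice.

```python
import math

def distance_map(skeleton, h, w):
    """Brute-force distance transform, per (3): each pixel receives the
    Euclidean distance to the nearest skeleton (foreground) pixel.
    skeleton: list of (x, y) foreground pixels; h, w: image size."""
    return [[min(math.hypot(x - sx, y - sy) for (sx, sy) in skeleton)
             for x in range(w)] for y in range(h)]

def map_distance(map_a, map_b):
    """Mean absolute difference of two equally sized distance maps:
    an L1 instance of the map comparison in (5) (an assumption)."""
    h, w = len(map_a), len(map_a[0])
    return sum(abs(map_a[y][x] - map_b[y][x])
               for y in range(h) for x in range(w)) / (h * w)
```

As in the text, both skeleton images should be scale-normalized and center-aligned before their maps are compared, so the measure reflects shape rather than position.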

Fig. 10. Multiple centroid contexts using different numbers of sectors and
shells: (a) 4 shells and 15 sectors, (b) 8 shells and 30 sectors, and (c) centroid
histogram of the gravity center of the posture.
Fig. 8. Polar transform of a human posture.

, , and . If we remove all the branch nodes from ,


it will be decomposed into different branch paths . Then,
by carefully collecting the set of triangular meshes along each
path, we find that each path corresponds to one body com-
ponent in . For example, in Fig. 9(b), if we remove from
, two branch paths will be formed, i.e., one from node
to and one from to node . The first path corresponds
to the head and neck of , and the second corresponds to the
left hand of . In some cases, the path may not correspond to
a high-level semantic body component exactly, as shown by the
path from to . However, if the length of a path is further
constrained, over-segmentation can be easily avoided. Since the
Fig. 9. Body component extraction. (a) Triangulation result of a posture; (b) the segmentation of body parts is an ill-posed problem, we do not
spanning tree of (a); and (c) centroids derived from different parts (determined propose a complete system for segmenting a human posture into
by removing all branch nodes). high-level semantic body parts. Instead, we use the current de-
composition result to recognize postures up to the middle level
(using components similar to the body parts, but not real ones).
Then, similar to the technique used in shape context [27], we
Given a path , we collect a set of triangular meshes
project a sample onto a polar coordinate and label each mesh.
along it. Let be the centroid of the triangular mesh closest to
Fig. 8 shows a polar transform of a human posture. We use
the center of the set of meshes. As shown in Fig. 9(c), is the
to represent the number of shells used to quantize the radial
centroid extracted from the path that begins at and ends at .
axis and to represent the number of sectors that we want to
Given a centroid , we can obtain its corresponding histogram
quantize in each shell. Therefore, the total number of bins used
using (6). Assume that the set of these path centroids is
to construct the centroid context is . For the centroid
. Then, based on , the centroid context of is defined as
of the triangular mesh of a posture, we construct a vector
follows:
histogram , in which
is the number of triangular mesh centroids in the th bin (8)
when is considered as the origin, i.e.,

(6) where is the number of elements in . Fig. 10 shows two


cases of multiple centroid contexts when the number of shells
where is the th bin of the log-polar coordinate. Then, and sectors is set to (4, 15) and (8, 30), respectively. The cen-
given two histograms, and , the distance between troid contexts are extracted from the head and the center of the
them can be measured by a normalized intersection [31] posture, respectively. (c) is the centroid histogram of the gravity
center of the posture. Given two postures, and , the distance
(7) between their centroid contexts is measured by

where is the number of bins and denotes the number


of meshes calculated from a posture. Using (6) and (7), we can
define a centroid context to describe the characteristics of an
arbitrary posture . (9)
In the previous section, we presented a tree search algorithm
that finds a spanning tree from a posture based on the tri-
angulation result. As shown in Fig. 9, (b) is the spanning tree de- where and are the area ratios of the th and th body
rived from (a). The tree captures the skeleton feature of . parts residing in and , respectively. Based on (9), an arbi-
Here, we call a node a branch node if it has more than one child. trary pair of postures can be compared. We now summarize, the
By this definition, there are three branch nodes in Fig. 9(b), i.e., algorithm for finding the centroid context of a posture .
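The log-polar histogram of (6) and the normalized intersection of (7) can be sketched as follows. The shell/sector counts follow Fig. 10; the log-spaced shell boundaries and the exact normalization in (7) (here, overlap divided by the mesh count) are assumptions, since those details were lost in extraction.

```python
import math

def centroid_histogram(origin, centroids, n_shells=4, n_sectors=15, r_max=1.0):
    """Count mesh centroids in each log-polar bin around `origin`, per (6).
    Shell boundaries are log2-spaced up to r_max (an assumption)."""
    hist = [0] * (n_shells * n_sectors)
    for (x, y) in centroids:
        dx, dy = x - origin[0], y - origin[1]
        r = math.hypot(dx, dy)
        if r == 0 or r > r_max:
            continue  # skip the origin itself and out-of-range points
        shell = min(n_shells - 1, max(0, int(n_shells + math.log2(r / r_max))))
        theta = math.atan2(dy, dx) % (2 * math.pi)
        sector = min(n_sectors - 1, int(theta / (2 * math.pi) * n_sectors))
        hist[shell * n_sectors + sector] += 1
    return hist

def histogram_distance(h1, h2, n_meshes):
    """Normalized intersection, per (7): one minus the overlap ratio."""
    return 1.0 - sum(min(a, b) for a, b in zip(h1, h2)) / float(n_meshes)
```

One such histogram is built per path centroid; the collection of these histograms is the centroid context of (8), and identical distributions yield a distance of zero.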

Algorithm for Centroid Context Extraction:

Input: the spanning tree of a posture.
Output: the centroid context of the posture.

Step 1: Recursively trace the tree using a depth-first search until the tree is empty. When a branch node (a node with more than one child) is found, collect all the visited nodes into a new path and remove these nodes from the tree.
Step 2: If a new path contains only two nodes, eliminate it; otherwise, find its path centroid.
Step 3: Using (6), find the centroid histogram of each path centroid.
Step 4: Collect all the histograms as the centroid context of the posture.

The time complexity of centroid context extraction is determined by the number of triangular meshes in the posture.

B. Posture Recognition Using the Skeleton and the Centroid Context

Given a posture, we can extract its skeleton feature and centroid context using the techniques described in Sections III-A and IV-A, respectively. The distance between any two postures can then be measured using (5) (for the skeleton) or (9) (for the centroid context). As noted previously, the skeleton feature is used for a coarse search, and the centroid context feature is used for a fine search. To obtain better recognition results, the two distance measures should be integrated. We use a weighted sum of the two distances to represent the total distance:

(10)

where the weight in (10) balances the skeleton-based and the centroid-context-based distances.

V. BEHAVIOR ANALYSIS USING STRING MATCHING

We represent each movement sequence of a human subject by a continuous sequence of postures that changes over time. To simplify the analysis, we convert a movement sequence into a set of posture symbols. Different movement sequences of a human subject can then be recognized and analyzed using a string matching process that uses selected key postures as its basis. In the following, we describe a method that automatically selects a set of key postures from video sequences, and then present our string matching scheme for behavior recognition.

A. Key Posture Selection

The movement sequences of a human subject are analyzed directly from videos. Some postures in a sequence may be redundant, so they should be removed from the modeling process. To do this, we use a clustering technique to select a set of key postures from a collection of surveillance video clips.

Assume that all the postures have been extracted from a video clip and that each frame contains only one posture. Given two postures extracted from adjacent frames, the distance between them is calculated using (10), with the weight set at 0.5. We compute the average of this distance over all pairs of adjacent postures; if the distance at a posture is greater than this average, we define the posture as a posture-change instance. By collecting all postures that correspond to posture-change instances, we derive a set of key posture candidates. However, this set may still contain many redundant postures, which would degrade the accuracy of the sequence modeling. To address this problem, we use a clustering technique to find a better set of key postures.

Initially, each element of the candidate set individually forms a cluster. Given two clusters, the distance between them is defined by

(11)

where the pairwise posture distance is defined in (10) and the normalization involves the numbers of elements in the two clusters. According to (11), we can execute an iterative merging scheme to find a compact set of key postures from the candidate set. At each iteration, we choose the pair of clusters whose distance is the minimum among all pairs in the current collection. If this distance is less than a preset threshold, the two clusters are merged to form a new cluster, and a new collection of clusters is constructed. The merging process is executed iteratively until no further merging is possible. From each cluster in the final set, we then extract a key posture that satisfies the following equation:

(12)

Based on (12) and checking all clusters in the final set, the set of key postures can be constructed for analyzing human movement sequences.

B. Human Movement Sequence Recognition Using String Matching

Based on the results of key posture selection and posture classification, we can model different human movement sequences with strings. For example, in Fig. 11, there are three kinds of movement sequence: walking, picking up, and falling. Let the symbols "s" and "e" denote the start and end points of a sequence. Then, the sequences shown in Fig. 11 can be represented by (a) "swwwe," (b) "swwppwwe," and (c) "swwwwffe," where "w" denotes walking, "p" denotes picking up, and "f" denotes falling. Based on this behavior parsing, different movement sequences can be compared using a string matching scheme.

Assume that two movement sequences have been converted into their string representations. We use the edit
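The iterative merging scheme of this subsection can be sketched as follows. Average-linkage is used as the assumed form of the cluster distance (11) (the text only states that it is built from (10) and normalized by cluster sizes), the merge threshold is a free parameter, and picking the cluster medoid is one natural reading of the key posture criterion (12).

```python
def cluster_key_postures(dist, threshold):
    """Iterative merging (sketch): start from singleton clusters and
    repeatedly merge the closest pair whose average pairwise distance
    is below `threshold`.

    dist: symmetric matrix, dist[i][j] between posture candidates i, j.
    Returns the final clusters as lists of candidate indices."""
    clusters = [[i] for i in range(len(dist))]

    def cluster_dist(a, b):
        # average-linkage version of (11), an assumption
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        d, x, y = min(((cluster_dist(a, b), i, j)
                       for i, a in enumerate(clusters)
                       for j, b in enumerate(clusters) if i < j),
                      key=lambda t: t[0])
        if d >= threshold:
            break  # no pair is close enough: merging stops
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

def medoid(cluster, dist):
    """Key posture of a cluster: the member minimizing the total distance
    to the other members (one reading of (12))."""
    return min(cluster, key=lambda i: sum(dist[i][j] for j in cluster))
```

With two tight pairs of candidates far from each other, for example, the scheme collapses each pair into one cluster and then stops, yielding two key postures.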

Fig. 11. Three kinds of movement sequence taken from: (a) Walking. (b) Picking up. (c) Falling.

distance [30] between S1 and S2 to measure the dissimilarity between V1 and V2. The edit distance between S1 and S2 is defined as the minimum number of edit operations required to change S1 into S2. The operations include replacements, insertions, and deletions. For any two strings S1 and S2, we define d(i, j) as the edit distance between S1 and S2. That is, d(i, j) is the minimum number of edit operations needed to transform the first i characters of S1 into the first j characters of S2. In addition, let δ(S1(i), S2(j)) be a function whose value is 0 if S1(i) = S2(j) and 1 if S1(i) ≠ S2(j). Then, we get d(0, 0) = 0, d(i, 0) = i, and d(0, j) = j.

d(i, j) is calculated by the following recursive equation:

d(i, j) = min{ d(i, j-1) + 1, d(i-1, j) + 1, d(i-1, j-1) + δ(S1(i), S2(j)) }    (13)

where the “insertion,” “deletion,” and “replacement” operations correspond, respectively, to the transitions from cell (i, j-1) to cell (i, j), from cell (i-1, j) to cell (i, j), and from cell (i-1, j-1) to cell (i, j).

Assume that V is a “walking” video clip whose string representation S is “swwwwwwe.” S is different from that in Fig. 11(a) due to the scaling problem along the time axis. According to (13), the edit distance between S and the sequence in Fig. 11(a) is 3, while the edit distance between S and the sequences shown in Fig. 11(b) and (c) is 2. However, S is more similar to Fig. 11(a) than to Fig. 11(b) and (c). Clearly, (13) cannot be directly applied to human movement sequence analysis and must be modified. Thus, we define a new edit distance to address this problem.

Let c_ins(i, j), c_rep(i, j), and c_del(i, j) be the respective costs of the “insertion,” “replacement,” and “deletion” operations performed at the ith and jth characters of S1 and S2, respectively. Then, (13) can be rewritten as

d(i, j) = min{ d(i, j-1) + c_ins(i, j), d(i-1, j) + c_del(i, j), d(i-1, j-1) + c_rep(i, j) }    (14)

Here, the “replacement” operation is considered more important than “insertion” and “deletion”, since it means a change of posture type. Thus, the weight of “insertion” and “deletion” should be scaled down to w, where 0 < w < 1. Following this rule, if an “insertion” is found when calculating d(i, j), its weight will be w if S1(i) ≠ S2(j); otherwise, its weight will be 1. This implies that c_ins(i, j) = 1 - (1 - w)δ(S1(i), S2(j)). Similarly, for the “deletion” operation, we obtain c_del(i, j) = 1 - (1 - w)δ(S1(i), S2(j)). If S1(i) ≠ S2(j), the weights of the “insertion” and “deletion” operations would be smaller than the weight of the “replacement” operation; therefore, it would be impossible to choose “replacement” as the next operation. This problem can be easily solved by setting the weight to 1 - (1 - w)δ(S1(i+1), S2(j+1)), so that S1(i) and S2(j) will not be compared when calculating d(i, j). However, they will be compared when calculating d(i+1, j+1). Since the same end symbol “e” appears in S1 and S2, the final distance will be equal to the desired value. Thus, the delay in comparison will not cause an error; however, the cost of “insertion” and “deletion” will increase if any incorrect editing operations are chosen. Therefore, (14) should be modified as follows:

d(i, j) = min{ d(i, j-1) + c_ins(i, j), d(i-1, j) + c_del(i, j), d(i-1, j-1) + c_rep(i, j) }    (15)

where c_ins(i, j) = c_del(i, j) = 1 - (1 - w)δ(S1(i+1), S2(j+1)) and c_rep(i, j) = δ(S1(i), S2(j)). In this work, w is consistently set to 0.1 because we consider the influence of replacement to be more significant than that of insertion and deletion. In the experiments, we obtained satisfactory results with this setting. The string matching algorithm that calculates the edit distance between S1 and S2 is summarized as follows.

Algorithm for String Matching:

Input: two strings S1 and S2.
Output: the edit distance between S1 and S2.
Step 1: m = |S1| and n = |S2|.
Step 2: d(0, 0) = 0, d(i, 0) = i, and d(0, j) = j for all 1 ≤ i ≤ m and 1 ≤ j ≤ n.
Step 3: For all 1 ≤ i ≤ m and 1 ≤ j ≤ n, compute d(i, j) with (15).
Step 4: Return d(m, n).

VI. EXPERIMENTAL RESULTS

To test the efficiency and effectiveness of the proposed approach, we constructed a large database of more than thirty thousand postures, from which we obtained over three hundred human movement sequences. Twenty percent of the subjects in the database are female. In each sequence, we recorded a specific type of human behavior: walking, sitting, lying, squatting, falling, picking up, jumping, climbing, stooping, or gymnastics. Fig. 12 illustrates the skeleton-based posture recognition system. The left-hand side of the figure shows a person walking, while the right-hand side shows twelve pre-determined key postures for matching. Among the twelve key postures, which were determined by a clustering algorithm, the one surrounded by a bold frame matches the current posture of the query action sequence [see (b)].
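As a concrete illustration, the scaled-cost edit distance and its use for sequence classification can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: it keeps insertion and deletion at the flat scaled weight w rather than using the deferred character comparison described above, and the classify_sequence helper with its template strings is hypothetical.

```python
def weighted_edit_distance(s1, s2, w=0.1):
    """Edit distance in which replacement (a change of posture type)
    keeps full cost 1, while insertion/deletion are scaled down to w
    so that time-axis stretching of a movement string stays cheap."""
    m, n = len(s1), len(s2)
    # d[i][j]: distance between the first i chars of s1 and first j of s2
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            rep = 0.0 if s1[i - 1] == s2[j - 1] else 1.0  # replacement cost
            d[i][j] = min(d[i][j - 1] + w,                # insertion
                          d[i - 1][j] + w,                # deletion
                          d[i - 1][j - 1] + rep)          # replacement/match
    return d[m][n]

def classify_sequence(query, templates):
    """Assign the query string to the behavior whose template is nearest."""
    return min(templates,
               key=lambda name: weighted_edit_distance(query, templates[name]))

# hypothetical templates: "s"/"e" mark start/end, "w" = walking, "p" = picking up
templates = {"walking": "swwwe", "picking up": "swppe"}
print(classify_sequence("swwwwwwe", templates))  # prints "walking"
```

With w = 0.1, the stretched clip “swwwwwwe” is only 0.3 away from the walking template (three cheap insertions), while reaching the picking-up template requires editing past posture symbols of a different type, so time scaling no longer dominates the distance.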
Fig. 12. Results of walking behavior analysis using the skeleton feature. The recognized posture is highlighted with a bold frame and shown in (b).

Fig. 13. Result of walking behavior analysis using the centroid context feature. The recognized type is highlighted with a bold frame and shown in (b).

Fig. 13 shows the centroid context-based posture recognition system. The current posture shown on the left-hand side matches the third key posture (surrounded by a bold frame) located on the right-hand side of the figure [see (b)].

In the first set of experiments, the performance of key posture selection was examined. Since there is no benchmark behavior database in the literature, twenty movement sequences (two sequences per type) were chosen to evaluate the accuracy of our proposed key posture selection algorithm. For each sequence, a set of key postures was manually selected to serve as the ground truth. Two postures were considered “correctly matched” if their distance was smaller than 0.1 [see (10)]. Let N_m and N_a be the numbers of key frames selected by the manual method and by our approach, respectively. We can then define the accuracy of key posture selection as follows:

Accuracy = (1/2) (N_ca / N_a + N_cm / N_m)

where N_ca is the number of key postures selected by our method, each of which matches one of the manually selected key postures, and N_cm is the number of manually selected key postures, each of which correctly matches one of the automatically selected key postures.

Fig. 14. Result of key posture selection from three action sequences: walking, squatting, and gymnastics.

Fig. 14 shows the results of key posture selection from sequences of walking, squatting, and gymnastics. Due to space limitations, only 36 key postures are shown in Fig. 14. With this set of key postures, different movement sequences can be converted into a set of symbols and then recognized through a string matching method. In the experiments, the average accuracy of key posture selection was 91.9%.

Fig. 15. Results of skeleton extraction using different sampling rates. The first row shows the skeletons extracted by connecting all triangle centroids, while the second row shows the skeletons extracted by connecting all branch nodes.

In the second set of experiments, we evaluated the performance of skeleton extraction using triangulation. Fig. 15 shows the results of skeleton extraction using different sampling rates. The top row illustrates the skeletons extracted by connecting all the triangle centroids, while the bottom row shows the skeletons extracted by connecting all the branch nodes. No matter which sampling rate was used, the skeletons were always extracted; however, a higher sampling rate achieved a better approximation. Fig. 16 shows another example of extracting a simple skeleton and a complex skeleton directly from a profile posture. In this paper, a simple skeleton means a skeleton derived by connecting all the branch nodes, and a complex skeleton is a skeleton derived by connecting the centroids of all the triangles. To classify a posture, we have to convert a skeleton into its corresponding distance map. Fig. 17(a) and (b) show the results of applying the distance transform to the skeletons shown in Fig. 16(a) and (b), respectively. Table I details the success rates of posture recognition when different numbers of meshes (resolutions) were used to extract both simple and complex skeletons.
Fig. 16. Skeleton extraction directly from a posture profile. (a) Simple skeleton derived by connecting branch nodes and (b) complex skeleton derived by connecting all triangle centroids.

Fig. 17. Result of distance transform. (a) Distance map extracted from Fig. 16(a) and (b) distance map extracted from Fig. 16(b).

TABLE I
SUCCESS RATES OF POSTURE RECOGNITION WHEN DIFFERENT NUMBERS OF MESHES WERE USED FOR SKELETON EXTRACTION

From the table, we observe that 1) the performance obtained using the complex skeleton representation was better than that derived using the simple skeleton and 2) the more meshes used for representation, the better the performance of posture recognition.

Fig. 18. Posture recognition results when the simple skeleton is used. The first row shows the skeletons of all query postures. The silhouettes in the second row show the corresponding posture type retrieved from the database.

We explain how the posture recognition process works with the following example. The top row of Fig. 18 shows a number of input postures represented by simple skeletons. Their corresponding binary maps are shown in the bottom row of the figure using dark shading. The associated silhouettes, shown in the second row, represent the corresponding posture types retrieved from the posture database.

Fig. 19. Success rates of posture recognition using the simple skeleton and complex skeleton on different movement types.

Fig. 19 compares the performance of the simple and complex skeletons in recognizing postures from different action (movement) types. From Fig. 19, it is obvious that the complex skeleton consistently outperforms the simple skeleton. This is because the simple skeleton loses more information than the complex skeleton; the complex skeleton therefore achieves better accuracy.

The next set of experiments was conducted to evaluate the performance of the centroid context feature. Since the results of this experiment were influenced by the number of predefined sectors and shells, we conducted six sets of experiments. They were (4, 15), (4, 20), (4, 30), (8, 15), (8, 20), and (8, 30), in which the first argument represents the number of shells and the second represents the number of sectors. Fig. 20 shows two types of polar partition.

Fig. 20. Different numbers of sectors and shells used to define the centroid context: (a) 4 shells and 15 sectors and (b) 8 shells and 30 sectors.

Table II lists the recognition rates when different (shell, sector) parameter sets were predefined and applied to the single centroid context and the multiple centroid context cases. In the former case, the center was set at the gravity center of a posture, but in the latter case, the center of each component was taken as the gravity center of every partitioned component.
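To make the distance-map step concrete, the sketch below computes a city-block distance map from a set of skeleton pixels with a multi-source breadth-first search. The chamfer-style distance transform cited in the paper [22] differs in its weights and scan order, and the grid size and pixel coordinates here are invented for the example.

```python
from collections import deque

def distance_map(skeleton, h, w):
    """City-block distance from every pixel of an h-by-w grid to the
    nearest skeleton pixel, via multi-source breadth-first search."""
    INF = float("inf")
    dist = [[INF] * w for _ in range(h)]
    queue = deque()
    for y, x in skeleton:            # skeleton pixels are the BFS sources
        dist[y][x] = 0
        queue.append((y, x))
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] == INF:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    return dist

# a short vertical "skeleton" segment inside a 5x5 grid
dmap = distance_map({(1, 2), (2, 2), (3, 2)}, 5, 5)
print(dmap[2][0])  # prints 2: pixel (2, 0) is two steps from the skeleton
```

Matching can then proceed by overlaying a query skeleton on such a map and accumulating the distances under its pixels, so small misalignments produce small, graded penalties instead of hard misses.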
TABLE II
RECOGNITION RATES OF POSTURES WHEN DIFFERENT NUMBERS OF SECTORS AND SHELLS WERE USED

From the recognition rates reported in Table II, it is obvious that using multiple centroid contexts is definitely better than using a single centroid context. Therefore, in our behavior analysis system, the multiple centroid context is implemented and adopted for posture classification. Among the different parameter sets, we found that the best setting for the (shell, sector) pair was (4, 30); we therefore used this setting in all the experiments.

Fig. 21. Posture recognition results using multiple centroid contexts.

Fig. 21 shows the results when multiple centroid contexts were adopted. The images in the top row show how multiple centroid contexts cover four different postures. In these experiments, we used only two centroid contexts to represent each posture. In the bottom row, we retrieved database postures (in silhouette form) and then superimposed their corresponding query postures (in shaded form). All the retrieved postures were actually very close to the query postures.

Fig. 22. Comparison of posture recognition accuracy for different behavior types.

Fig. 22 lists the recognition rates for different types of behavior when the multiple centroid context approach was adopted. For all movement types, the average retrieval accuracy rate was 88.9%.

In another set of experiments, we compared our multiple centroid context approach with three other approaches, namely: the shape context-based approach [27], the Zernike moment-based approach [29], and the probabilistic projection map-based (PPM) approach [2].

Fig. 23. Posture recognition results using (from left to right) the CC, Shape Context, PPM, and Zernike moment methods, respectively. The shaded region shows the input query, and the silhouettes are their recognized posture types.

Fig. 23 illustrates the recognition results for a walking posture (the shaded region). The silhouettes in the figure, from left to right, represent the database postures retrieved by using the multiple centroid context approach [see (10)], the shape context approach, the PPM approach, and the Zernike approach, respectively. The shape context approach and the PPM approach misclassified the posture type as “running.” The Zernike approach correctly recognized the posture type as the walking mode; however, the recognized posture belonged to the next key posture of the same movement sequence.

TABLE III
FUNCTIONAL COMPARISONS AMONG CENTROID CONTEXT, SHAPE CONTEXT, PPM, AND MOMENTS. K: NUMBER OF MESHES; v: NUMBER OF BODY PARTS; n: NUMBER OF BOUNDARY POINTS; (w, h): DIMENSION OF THE INPUT POSTURE; L: NUMBER OF MOMENT ORDERS. T: TRANSLATION; S: SCALING; R: ROTATION. IT IS NOTICED THAT n ≫ K ≫ v

Table III lists the functional comparisons among these methods. The Zernike approach has several invariance properties, including rotation, translation, and scaling; however, the calculation of moments at different orders is very time consuming. The PPM approach is simple and efficient but not robust to noise. The shape context is robust to noise but has a higher time complexity. Our centroid context is robust to noise and can represent a posture up to a syntactic level; due to its region-based nature, it can classify postures very efficiently. Fig. 24 compares the performance of the four approaches for ten behavior types. In terms of accuracy, our approach's performance was similar to that of the shape context approach. However, our approach is better at dealing with sequences that involve significant movements, such as climbing and walking. Most importantly, our method has a lower time complexity than the shape context method because it is a region-based descriptor, whereas the shape context approach is pixel-based.

Next, we conducted a series of experiments to evaluate movement sequence recognition using the proposed string matching scheme.
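A shell-sector histogram of the kind used by the centroid context can be sketched as follows. The binning and normalization details here (equal-width radial bins up to the farthest pixel, counts divided by the number of pixels) are our assumptions for illustration, not the paper's exact definition.

```python
import math

def centroid_context(points, n_shells=4, n_sectors=30):
    """Shell-sector histogram of a set of posture pixels around their
    centroid: the polar neighborhood is split into n_shells radial bins
    and n_sectors angular bins, and the descriptor is the normalized
    count of pixels falling in each (shell, sector) bin."""
    cx = sum(x for x, y in points) / len(points)
    cy = sum(y for x, y in points) / len(points)
    radii = [math.hypot(x - cx, y - cy) for x, y in points]
    rmax = max(radii) or 1.0          # guard against a degenerate posture
    hist = [[0] * n_sectors for _ in range(n_shells)]
    for (x, y), r in zip(points, radii):
        shell = min(int(n_shells * r / rmax), n_shells - 1)
        theta = math.atan2(y - cy, x - cx) % (2 * math.pi)
        sector = min(int(n_sectors * theta / (2 * math.pi)), n_sectors - 1)
        hist[shell][sector] += 1
    return [[c / len(points) for c in row] for row in hist]

h = centroid_context([(0, 0), (1, 0), (0, 1), (2, 2)])
print(len(h), len(h[0]))  # prints "4 30": the best (shell, sector) setting
```

For the multiple centroid context case, the same routine would simply be applied once per partitioned body component, centered on that component's own gravity center; two such histograms can then be compared with, e.g., an L1 or histogram-intersection distance.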
Fig. 24. Comparison of the performance of different approaches.

TABLE IV
ANALYSIS OF TEN BEHAVIOR TYPES USING OUR PROPOSED STRING MATCHING SCHEME

We collected 450 movement sequences, divided equally among the ten behavior types (i.e., each behavior type had 45 movement sequences). The experiment results, listed in Table IV, show that most of the test sequences were correctly identified and classified into the appropriate categories. However, some of the “squat” sequences tended to be misclassified as “picking up” behavior due to the high degree of similarity between the two actions. In this set of experiments, the average accuracy of behavior analysis was 94.5%.

To test the proposed approach in real-world applications, we analyzed irregular (or suspicious) human movements captured by a surveillance camera. First, we extracted a set of “normal” key postures from video sequences to learn “regular” human movements like walking or running. After a training process, we took unknown surveillance sequences and determined whether the postures in them were regular or not. If irregular postures appear continuously, an alarm message triggers a security warning. For example, in Fig. 25(a), a set of normal key postures was extracted from a walking sequence. From these postures, we can tell that the posture in (b) is a “normal” posture. However, the posture in (c) could be classified as “abnormal,” since the person adopted a bending-down posture. We used a marker to label these abnormal posture sequences until they returned to normal.

Fig. 25. Abnormal activity detection. (a) Five key postures defining several normal human movements. (b) Normal condition is detected. (c) Warning message is triggered when an abnormal posture is detected.

Fig. 26. Abnormal posture detection. (a) and (c): Normal postures were detected. (b) and (d): Abnormal postures were detected due to the unexpected “shooting” and “climbing wall” postures, respectively.

Fig. 26 shows another two cases of abnormal posture detection. The postures in Fig. 26(a) and (c) were recognized as normal, since they were similar to the postures in Fig. 25(a). However, the postures in Fig. 26(b) and (d) were classified as abnormal, since the subjects adopted “shooting” and “climbing wall” postures. The abnormal posture detection function improves a video surveillance system in two ways. First, it saves a great deal of storage space because only abnormal postures need to be stored. Second, it responds immediately when an abnormal posture is identified.

VII. CONCLUSION

We have proposed a method for analyzing human movements directly from videos. Specifically, the method comprises the following.
1) A triangulation-based technique that extracts two important features, the skeleton feature and the centroid context feature, from a posture to derive more semantic meaning. The features form a finer descriptor that can describe a posture from the shape of the whole body or from body parts. Since the features complement each other, all human postures can be compared and classified very accurately.
2) A clustering scheme for key posture selection, which recognizes and codes human movements using a set of symbols.
3) A novel string-based technique for recognizing human movements. Even though events may exhibit different scaling changes, they can still be recognized.

The average accuracies of posture classification and behavior analysis were 95.16% and 94.5%, respectively. The experiment results show that the proposed method is superior to other approaches for analyzing human movements in terms of accuracy, robustness, and stability.

REFERENCES

[1] T. B. Moeslund and E. Granum, “A survey of computer vision-based human motion capture,” Comput. Vis. Image Understand., vol. 81, no. 3, pp. 231–268, Mar. 2001.
[2] R. Cucchiara, C. Grana, A. Prati, and R. Vezzani, “Probabilistic posture classification for human-behavior analysis,” IEEE Trans. Syst., Man, Cybern. A, vol. 35, no. 1, pp. 42–54, Jan. 2005.
[3] I. Haritaoglu, D. Harwood, and L. S. Davis, “W4: Real-time surveillance of people and their activities,” IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 8, pp. 809–830, 2000.
[4] S. Maybank and T. Tan, “Introduction to special section on visual surveillance,” Int. J. Comput. Vis., vol. 37, no. 2, pp. 173–173, 2000.
[5] I. Mikic et al., “Human body model acquisition and tracking using voxel data,” Int. J. Comput. Vis., vol. 53, no. 3, pp. 199–223, 2003.
[6] N. Werghi, “A discriminative 3D wavelet-based descriptors: Application to the recognition of human body postures,” Pattern Recognit. Lett., vol. 26, no. 5, pp. 663–677, 2005.
[7] I.-C. Chang and C.-L. Huang, “The model-based human body motion analysis,” Image Vis. Comput., vol. 18, no. 14, pp. 1067–1083, 2000.
[8] H. Fujiyoshi, A. J. Lipton, and T. Kanade, “Real-time human motion analysis by image skeletonization,” IEICE Trans. Inform. Syst., vol. E87-D, no. 1, pp. 113–120, Jan. 2004.
[9] N. Oliver, B. Rosario, and A. Pentland, “A Bayesian computer vision system for modeling human interactions,” IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 8, pp. 831–843, Aug. 2000.
[10] R. Rosales and S. Sclaroff, “3D trajectory recovery for tracking multiple objects and trajectory guided recognition of actions,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Jun. 1999, vol. 2, pp. 637–663.
[11] Y. Ivanov and A. Bobick, “Recognition of visual activities and interactions by stochastic parsing,” IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 8, pp. 852–872, Aug. 2000.
[12] S. Weik and C. E. Liedtke, “Hierarchical 3D pose estimation for articulated human body models from a sequence of volume data,” in Proc. Int. Workshop on Robot Vision, Auckland, New Zealand, Feb. 2001, pp. 27–34.
[13] C. R. Wren et al., “Pfinder: Real-time tracking of the human body,” IEEE Trans. Pattern Anal. Machine Intell., vol. 19, no. 7, pp. 780–785, Jul. 1997.
[14] J. Shewchuk, “Delaunay refinement algorithms for triangular mesh generation,” Comput. Geometry: Theory Applicat., vol. 22, pp. 21–74, May 2002.
[15] S. Park and J. K. Aggarwal, “Segmentation and tracking of interacting human body parts under occlusion and shadowing,” in Proc. IEEE Workshop on Motion and Video Computing, Orlando, FL, 2002, pp. 105–111.
[16] S. Park and J. K. Aggarwal, “Semantic-level understanding of human actions and interactions using event hierarchy,” in Proc. 2004 Conf. Computer Vision and Pattern Recognition Workshop, Washington, DC, Jun. 2004.
[17] Y. Guo, G. Xu, and S. Tsuji, “Understanding human motion patterns,” in Proc. IEEE Int. Conf. Pattern Recognition, 1994, vol. 2, pp. 325–329.
[18] Y. Guo, G. Xu, and S. Tsuji, “Tracking human body motion based on a stick figure model,” J. Vis. Commun. Image Represent., vol. 5, 1994.
[19] J. Ben-Arie, Z. Wang, P. Pandit, and S. Rajaram, “Human activity recognition using multidimensional indexing,” IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 8, pp. 1091–1104, Aug. 2002.
[20] S. X. Ju, M. J. Black, and Y. Yacoob, “Cardboard people: A parameterized model of articulated image motion,” in Proc. Int. Conf. Automatic Face and Gesture Recognition, Killington, VT, 1996, pp. 38–44.
[21] E. Horowitz, S. Sahni, and S. Anderson-Freed, Fundamentals of Data Structures in C. New York: W. H. Freeman.
[22] G. Borgefors, “Distance transformations in digital images,” Comput. Vis., Graph., Image Process., vol. 34, no. 3, pp. 344–371, Feb. 1986.
[23] M. Bern and D. Eppstein, “Mesh generation and optimal triangulation,” in Computing in Euclidean Geometry, 2nd ed. World Scientific, 1995, pp. 47–123.
[24] C. Stauffer and E. Grimson, “Learning patterns of activity using real-time tracking,” IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 8, pp. 747–757, 2000.
[25] D. Chetverikov and Z. Szabo, “A simple and efficient algorithm for detection of high curvature points in planar curves,” in Proc. 23rd Workshop Austrian Pattern Recognition Group, 1999, pp. 175–184.
[26] L. P. Chew, “Constrained Delaunay triangulations,” Algorithmica, vol. 4, no. 1, pp. 97–108, 1989.
[27] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object recognition using shape contexts,” IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 4, pp. 509–522, Apr. 2002.
[28] Y. Rui, A. C. She, and T. S. Huang, “Modified Fourier descriptors for shape representation—a practical approach,” in Proc. 1st Int. Workshop on Image Databases and Multi-Media Search, Amsterdam, The Netherlands, Aug. 1996.
[29] A. Khotanzad and Y. H. Hong, “Invariant image recognition by Zernike moments,” IEEE Trans. Pattern Anal. Machine Intell., vol. 12, no. 5, pp. 489–497, May 1990.
[30] E. Ukkonen, “On approximate string matching,” in Proc. Int. Conf. Foundations of Computation Theory, 1983, pp. 487–495.
[31] M. J. Swain and D. H. Ballard, “Color indexing,” Int. J. Comput. Vis., vol. 7, no. 1, pp. 11–32, 1991.

Jun-Wei Hsieh (M'06) received the B.S. degree in computer science from TongHai University, Taiwan, R.O.C., in 1990 and the Ph.D. degree in computer engineering from the National Central University, Taiwan, in 1995.
From 1996 to 2000, he was a Research Fellow at the Industrial Technology Research Institute, Hsinchu, Taiwan, where he managed a team that developed video-related technologies. He is presently an Associate Professor with the Department of Electrical Engineering, Yuan Ze University, Chung-Li, Taiwan. His research interests include content-based multimedia databases, video indexing and retrieval, computer vision, and pattern recognition.
Dr. Hsieh received the Best Paper Awards from the Image Processing and Pattern Recognition Society of Taiwan in 2005, 2006, and 2007, and the Distinguished Research Award from Yuan Ze University in 2006. He is a recipient of the Phi Tau Phi Award.

Yung-Tai Hsu was born in Kaohsiung, Taiwan, R.O.C., in 1978. He received the B.S. and M.S. degrees in mechanical engineering from Yuan Ze University, Taiwan, in 2000 and 2002, respectively. He is currently pursuing the Ph.D. degree in electrical engineering at Yuan Ze University, Chung-Li, Taiwan.
His research interests include pattern recognition, computer vision, and machine learning.
Mr. Hsu received the Best Paper Award at the 2006 IPPR Conference on Computer Vision, Graphics, and Image Processing.
Hong-Yuan Mark Liao (SM'01) received the B.S. degree in physics from National Tsing-Hua University, Hsinchu, Taiwan, R.O.C., in 1981, and the M.S. and Ph.D. degrees in electrical engineering from Northwestern University, Evanston, IL, in 1985 and 1990, respectively.
He was a Research Associate with the Computer Vision and Image Processing Laboratory, Northwestern University, in 1990–1991. In July 1991, he joined the Institute of Information Science, Academia Sinica, Taipei, Taiwan, as an Assistant Research Fellow. He was promoted to Associate Research Fellow and then Research Fellow in 1995 and 1998, respectively. From August 1997 to July 2000, he served as the Deputy Director of the institute. From February 2001 to January 2004, he was the Acting Director of the Institute of Applied Science and Engineering Research. He is jointly appointed as a Professor in the Computer Science and Information Engineering Department of National Chiao-Tung University. His current research interests include multimedia signal processing, video surveillance, and content-based multimedia retrieval.
Dr. Liao is the Editor-in-Chief of the Journal of Information Science and Engineering. He is on the Editorial Boards of the International Journal of Visual Communication and Image Representation, the EURASIP Journal on Advances in Signal Processing, and Research Letters in Signal Processing. He was an Associate Editor of the IEEE TRANSACTIONS ON MULTIMEDIA from 1998 to 2001. He received the Young Investigators' Award from Academia Sinica in 1998, the Excellent Paper Award from the Image Processing and Pattern Recognition Society of Taiwan in 1998 and 2000, the Distinguished Research Award from the National Science Council of Taiwan in 2003, and the National Invention Award in 2004. He served as the Program Chair of the International Symposium on Multimedia Information Processing (ISMIP'97), the Program Co-Chair of the Second IEEE Pacific-Rim Conference on Multimedia (2001), the Conference Co-Chair of the Fifth IEEE International Conference on Multimedia and Expo (ICME 2004), and the Technical Co-Chair of the Eighth IEEE ICME (2007).

Chih-Chiang Chen was born in Taipei, Taiwan, R.O.C., in 1982. He received the B.S. degree in electrical engineering from Yuan Ze University, Taiwan, in 2004, where he is currently pursuing the Ph.D. degree in electrical engineering.
His research interests include image processing, behavior analysis, and video surveillance.