
Journal of Computational Information Systems 9: 9 (2013) 3659-3666

Available at http://www.Jofcis.com

Hidden Markov Model-based Human Action Recognition Using Mixed Features ⋆

Xiaofei JI, Ce WANG, Yibo LI, Qianqian WU

School of Automation, Shenyang Aerospace University, Shenyang 110136, China

Abstract
Action recognition is a complex and challenging research topic that has received increasing attention in the computer vision area. Feature representation and the choice of recognition algorithm play key roles in action recognition. Building on a study of the representation and recognition of human actions, and giving full consideration to the advantages and disadvantages of different features and recognition algorithms, a novel Hidden Markov Model (HMM)-based method using mixed features of silhouette and optical flow is proposed in this paper. The effect of feature selection and HMM parameter selection on recognition accuracy is discussed. Finally, the proposed method is tested on the public Weizmann dataset. Experimental results show that the method achieves a 100% correct recognition rate and outperforms most existing methods.
Keywords: Action Recognition; Mixed Feature; Hidden Markov Model

1 Introduction

As one of the most active research areas in computer vision, human action recognition has attracted more and more scholars' attention. It has been widely applied in areas such as smart surveillance, video conferencing and human-computer interaction. Action recognition is a challenging problem. For instance, the same action executed multiple times by the same person, or by different persons, will exhibit variation, because human actions are never absolutely consistent [1, 2]. Therefore, a good action representation and recognition approach is needed, one that can generalize over variations within one class and distinguish between actions of different classes [3].
Hidden Markov Models (HMMs) fully consider the dynamic procedure of human motion and model the small variations of actions on the temporal and spatial scales using probabilistic methods. The models are robust, so they have become one of the mainstream methods in human action recognition [4, 5, 6, 7]. Although this line of work has made progress, many problems remain to be solved. Yamato et al. [4] proposed a framework for action recognition based
⋆ Project supported by the National Nature Science Foundation of China (No. 61103123, No. 60804025).
Corresponding author. Email address: jixiaofei7804@yahoo.com.cn (Xiaofei JI).

1553-9105 / Copyright © 2013 Binary Information Press
DOI: 10.12733/jcis5950
May 1, 2013


on time-sequential images using HMMs. Although they take advantage of the strong temporal modeling ability of HMMs, each feature vector of the sequence is assigned to a symbol by vector quantization when the model is built. As a result, some feature information is lost in the quantization process and the recognition result is affected. Mendoza et al. [5] employed a histogram of silhouettes to build the HMMs. In this model a continuous probability density function is used to generate the observation probability. However, the extraction of the silhouette feature is hard and unreliable in the real world because of illumination change and body self-occlusion, which limits the applications of this method. Ramanan and Forsyth [8] track persons in 2D by learning the appearance of the body parts. These 2D tracks are subsequently lifted to 3D using stored snippets of annotated pose and motion. An HMM is used to infer the action from these labeled codeword motions [6]. The accuracy of this recognition algorithm depends on the correct detection of human body parts. Lv et al. [7] presented a 3D human joint representation using HMMs corresponding to the motion of a single joint or a combination of related joints. The weak classifiers with strong discriminative power are then combined by the Multi-Class AdaBoost algorithm. Since 3D joint position trajectories must be extracted in every frame, the recognition results are affected by the tracking accuracy.
In summary, the performance of an action recognition method depends on the choice of human motion feature representation and recognition algorithm. Based on the above discussion, this paper proposes a novel HMM-based human action recognition algorithm using mixed features of silhouette and optical flow. To improve the robustness of the features and effectively represent the moving target in the videos, the silhouette and the optical flow field of the moving target are first extracted. A radial-bin histogram of the silhouette and the optical flow is then employed to represent the motion feature [9]. In the training stage, the K-means algorithm is used to estimate the initial parameters, which reduces the training time and yields the optimal action model. In the test stage, the best matched action sequence is found using the Viterbi algorithm. HMM-based action recognition using mixed features improves the recognition rate without complicated calculation.

2 Feature Extraction

In our work, features are extracted from 2D images directly [10]. Although the silhouette can directly reflect the characteristics of the image, it does not carry sufficient motion information, and it is easily affected by environmental factors such as lighting and illumination change, body self-occlusion, etc. The optical flow feature contains more motion information than the silhouette and reflects rich information about the physical structure. However, the effectiveness of the optical flow feature depends on the character of the image sequence, the kinematic velocity and noise factors. Owing to the limitations of any single feature, a mixed feature combining the silhouette with optical flow is utilized in this paper. It not only reflects the contour information of the moving target but also embodies its motion characteristics, so the feature is more adaptive and discriminative.

2.1 Image preprocessing

Good image preprocessing is essential for feature extraction, as it makes the features more stable and reliable. First of all, we transform the videos into continuous image sequences. Secondly, we use a background subtraction algorithm to obtain the silhouette and the bounding box. To remove spot noise from the binary image, morphological operators are used to reduce image noise and fill holes inside moving entities. Finally, the region of interest corresponding to the moving target is segmented from the binary image.
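As a concrete illustration of this preprocessing step, the following sketch performs frame differencing against a static background and morphological cleaning with SciPy. This is our own minimal reconstruction, not the authors' exact implementation; the threshold and structuring-element sizes are illustrative choices.

```python
import numpy as np
from scipy import ndimage

def extract_silhouette(frame, background, thresh=30):
    """Background subtraction followed by morphological cleaning.

    frame, background: uint8 grayscale images of the same shape.
    Returns the binary silhouette mask and its bounding box
    (rmin, rmax, cmin, cmax), or None if no foreground is found.
    """
    # Absolute difference against the static background, then threshold
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    mask = diff > thresh

    # Opening removes spot noise; closing fills small holes inside the body
    mask = ndimage.binary_opening(mask, structure=np.ones((3, 3)))
    mask = ndimage.binary_closing(mask, structure=np.ones((5, 5)))

    rows = np.any(mask, axis=1)
    cols = np.any(mask, axis=0)
    if not rows.any():
        return mask, None
    rmin, rmax = np.where(rows)[0][[0, -1]]
    cmin, cmax = np.where(cols)[0][[0, -1]]
    return mask, (rmin, rmax, cmin, cmax)
```

The bounding box is then used to crop the region of interest from both the binary image and the gray-level frame.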

2.2 Grid-based feature extraction

Silhouette images of the moving target can be obtained directly from the regions of interest segmented from the binary images. To reduce the computational complexity, optical flow is calculated between the regions of interest of adjacent gray frames using the Lucas-Kanade algorithm [11]. The optical flow measurements are then split into horizontal and vertical channels, and each channel is smoothed with a median filter. Finally, two real-valued channels Fx and Fy are obtained (see Fig. 1).

Fig. 1: The two channels of optical flow (original gray-level images, channel Fx, channel Fy).
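The flow computation and channel split above can be sketched as follows. This is a minimal dense Lucas-Kanade variant using box-filtered structure tensors; the authors' exact implementation and window size are not specified, so the parameters here are our assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter, median_filter

def lucas_kanade_flow(prev, curr, win=7, eps=1e-6):
    """Dense Lucas-Kanade flow between two gray frames, returning the
    median-smoothed horizontal (Fx) and vertical (Fy) channels."""
    prev = prev.astype(float)
    curr = curr.astype(float)
    Iy, Ix = np.gradient(prev)        # Iy: vertical, Ix: horizontal derivative
    It = curr - prev                  # temporal derivative
    # Windowed sums of the structure-tensor entries via box filtering
    Axx = uniform_filter(Ix * Ix, win)
    Axy = uniform_filter(Ix * Iy, win)
    Ayy = uniform_filter(Iy * Iy, win)
    bx = -uniform_filter(Ix * It, win)
    by = -uniform_filter(Iy * It, win)
    # Solve the 2x2 normal equations per pixel where they are well posed
    det = Axx * Ayy - Axy ** 2
    u = np.where(det > eps, (Ayy * bx - Axy * by) / (det + eps), 0.0)
    v = np.where(det > eps, (Axx * by - Axy * bx) / (det + eps), 0.0)
    # Split into channels and smooth each with a median filter (Section 2.2)
    return median_filter(u, size=3), median_filter(v, size=3)
```

The two returned arrays correspond to the channels Fx and Fy of Fig. 1.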

Considering that the size of the region of interest differs from frame to frame, normalization is needed. Suppose the silhouette and optical flow images corresponding to the region of interest are of size M × N. We scale the larger side of the region to a fixed size M (M = 120 in this paper). Using the silhouette and optical flow images directly for recognition would incur a large calculation cost, and when the feature dimension is too high the feature cannot reflect the image characteristics well. Exploiting the different characteristics within different grid cells of the feature image, a grid-based radial histogram is proposed to improve the feature's robustness. The detailed process is as follows: first, the normalized region of the silhouette and optical flow image is divided into 2×2 sub-windows. Second, each sub-window is divided into 18 pie slices of equal angle (20 degrees in this paper), with the center of the pie at the center of the sub-window. The number of silhouette pixels and the amplitude of the optical flow in each pie slice are accumulated to represent the silhouette and optical flow features. Each frame is thus described by a 2×2×18-dimensional silhouette histogram, a 2×2×18-dimensional horizontal and a 2×2×18-dimensional vertical optical flow histogram. Combining the above feature vectors, each frame is represented by a 216-dimensional mixed feature. The process of grid-based feature extraction of the silhouette is shown in Fig. 2.


Fig. 2: The grid-based feature extraction of the silhouette.
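The grid-based radial histogram might be implemented as follows. This is a sketch under our own assumptions about slice assignment and bin ordering, which the paper leaves unspecified; the dimensions (2×2 grid, 18 slices, 216-dimensional frame feature) follow the text.

```python
import numpy as np

def radial_histogram(image, n_slices=18):
    """Radial-bin histogram of one sub-window: sum the pixel values inside
    each of n_slices equal-angle pie slices centred on the sub-window."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    angles = np.arctan2(ys - h / 2.0, xs - w / 2.0)          # in (-pi, pi]
    bins = ((angles + np.pi) / (2 * np.pi) * n_slices).astype(int) % n_slices
    return np.bincount(bins.ravel(),
                       weights=image.ravel().astype(float),
                       minlength=n_slices)

def grid_feature(image, grid=2, n_slices=18):
    """Split the image into a grid x grid layout of sub-windows and
    concatenate their radial histograms (2*2*18 = 72 dims by default)."""
    h, w = image.shape
    feats = []
    for gy in range(grid):
        for gx in range(grid):
            sub = image[gy * h // grid:(gy + 1) * h // grid,
                        gx * w // grid:(gx + 1) * w // grid]
            feats.append(radial_histogram(sub, n_slices))
    return np.concatenate(feats)

def frame_feature(silhouette, fx, fy):
    """Concatenate silhouette, |Fx| and |Fy| histograms: 3 * 72 = 216 dims."""
    return np.concatenate([grid_feature(silhouette.astype(float)),
                           grid_feature(np.abs(fx)),
                           grid_feature(np.abs(fy))])
```

For the silhouette channel the pixel values are 0/1, so the histogram counts silhouette pixels per slice; for the flow channels the magnitudes are accumulated, as described above.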

3 HMM Algorithm and Recognition

Previous methods of action recognition can be divided into two categories [12]: template-based approaches and state-space approaches. The former compare a static shape pattern converted from an image sequence to prestored action prototypes during recognition, e.g. Template Matching, Dynamic Programming and Dynamic Time Warping. State-space approaches, such as HMMs, Conditional Random Fields and Dynamic Bayesian Networks [13, 14], regard each static posture as a state and use probabilities to model the connections between these states; any motion sequence can then be considered a tour through the states of these static postures. Considering that HMMs have a good capability of modeling the dynamic process of human action, we choose HMMs as our action model. In this paper one HMM is trained for each action class. The likelihood between the test observation sequence and each action HMM is calculated, and the action with the maximum likelihood is selected as the classification result. The detailed recognition process is described in the following subsections.

3.1 Description of HMM

The classical graph model of an HMM is shown in Fig. 3.


Fig. 3: The classical graph model of HMM.

An HMM is characterized by the following parameters:
(1) N: the number of hidden states. We denote the N states as S = {s_1, s_2, ..., s_N} and the hidden state at time t as q_t ∈ S.
(2) T: the number of observation symbols in the sequence. We denote the observation symbol sequence as O = {o_1, o_2, ..., o_T}.
(3) A: the state transition matrix A = (a_ij)_{N×N}, where a_ij = p(q_{t+1} = s_j | q_t = s_i) (1 ≤ i, j ≤ N); a_ij is the probability of reaching state s_j at time t + 1 from state s_i at time t.
(4) B: the observation symbol probability distribution B = {b_i(o_t)}, where b_i(o_t) = p(o_t | s_i) (1 ≤ i ≤ N); b_i(o_t) is the probability of generating observation symbol o_t from state s_i at time t.
(5) π: the initial state distribution π = {π_1, π_2, ..., π_N}, where π_i is the probability of starting in state s_i.
We denote an HMM as λ = {A, B, π}. In this doubly embedded stochastic process, π and A describe the Markov chain, while B describes the relation between states and observation symbols.
The probability of generating an observation symbol from each state is computed by the Gaussian probability density function of Eq. (1):

b_i(o_t) = b_{(u_i, \Sigma_i)}(o_t) = \frac{1}{\sqrt{(2\pi)^d \, |\Sigma_i|}} \exp\left( -\frac{1}{2} (o_t - u_i)^T \Sigma_i^{-1} (o_t - u_i) \right)    (1)

where u_i and Σ_i are respectively the mean and covariance matrix of the observations classified in cluster i; d is the dimension of the observation symbol o_t; (o_t − u_i)^T is the transpose of (o_t − u_i); and Σ_i^{-1} is the inverse of Σ_i.
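In practice Eq. (1) is best evaluated in log space for numerical stability; a minimal sketch (our illustration, not the authors' code):

```python
import numpy as np

def log_gaussian_obs(o, mean, cov):
    """Log of Eq. (1): log N(o; u_i, Sigma_i) for a d-dimensional observation.

    cov must be symmetric positive definite; in practice a small ridge is
    often added to the diagonal to guarantee this.
    """
    d = o.shape[0]
    diff = o - mean
    _, logdet = np.linalg.slogdet(cov)            # log |Sigma_i|
    maha = diff @ np.linalg.solve(cov, diff)      # (o-u)^T Sigma^{-1} (o-u)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)
```

Stacking this quantity over all T frames and N states gives the T × N matrix of log observation probabilities consumed by the forward-backward algorithm [15].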

3.2 Recognition

Suppose λ^(1), λ^(2), ..., λ^(H) are the trained HMMs of the given actions 1, 2, ..., H, and O = {o_1, o_2, ..., o_T} is the query (test) observation sequence. We compute the likelihoods p(O | λ^(1)), p(O | λ^(2)), ..., p(O | λ^(H)) between the test sequence and each trained action HMM using the forward-backward algorithm [15]. The action corresponding to the maximum likelihood is chosen as the recognition result:

h^* = \arg\max_{1 \le h \le H} \, p(O \mid \lambda^{(h)})    (2)
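The likelihood computation and the argmax of Eq. (2) can be sketched in log space as follows. This is our illustrative implementation; working in log space is an alternative to the scaling-factor method of [15] and serves the same numerical purpose.

```python
import numpy as np

def log_forward(log_A, log_pi, log_B):
    """Log-space forward algorithm: returns log p(O | lambda).

    log_A:  (N, N) log state-transition matrix
    log_pi: (N,)   log initial state distribution
    log_B:  (T, N) log observation probabilities log b_i(o_t)
    """
    alpha = log_pi + log_B[0]
    for t in range(1, log_B.shape[0]):
        # log-sum-exp over previous states for each current state
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[t]
    return np.logaddexp.reduce(alpha)

def classify(models, log_B_per_model):
    """Eq. (2): pick the action HMM with maximum likelihood.

    models: list of (log_A, log_pi) pairs, one per action;
    log_B_per_model: the (T, N) log observation matrix of the test
    sequence evaluated under each model's Gaussians.
    """
    scores = [log_forward(A, pi, B)
              for (A, pi), B in zip(models, log_B_per_model)]
    return int(np.argmax(scores))
```

Because all quantities stay in log space, sequences of hundreds of frames can be scored without underflow.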

3.3 Training problem

In this paper the Baum-Welch algorithm [15] is used to train the mixed-feature-based action HMMs; however, it depends on the choice of initial parameters, and improper initial parameters can make the procedure converge to a local maximum. Thus, we take the result of the K-means algorithm as the initial input of the Baum-Welch algorithm. The K-means algorithm uses the distance between each observation and the cluster means to update the parameters:

d_i(o_t) = \sqrt{(o_t - u_i)^T (o_t - u_i)}    (3)

where Eq. (3) is the distance between observation o_t and the mean u_i of cluster i.
The cluster centers calculated by the K-means algorithm are used as the initial parameters of each corresponding action HMM. Multiple training sequences are then used to train each action HMM with the Baum-Welch algorithm. In the training stage, to avoid underflow caused by the cumulative products of probabilities approaching zero, we introduce the scaling factor method [15].
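A minimal K-means initialization using the distance of Eq. (3) might look as follows. This is our sketch; the seeding strategy and the covariance regularization are our own assumptions, as the paper does not specify them.

```python
import numpy as np

def kmeans_init(obs, n_states, n_iter=20, seed=0):
    """K-means over pooled training observations.

    The cluster centres and per-cluster covariances seed the Gaussian
    emission parameters of each HMM state before Baum-Welch refinement.
    obs: (n_samples, d) float array.
    """
    rng = np.random.default_rng(seed)
    centres = obs[rng.choice(len(obs), n_states, replace=False)]
    for _ in range(n_iter):
        # Eq. (3): Euclidean distance of each observation to each centre
        d = np.linalg.norm(obs[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_states):
            if np.any(labels == k):
                centres[k] = obs[labels == k].mean(axis=0)
    # Regularized covariances so every Sigma_i stays positive definite
    dim = obs.shape[1]
    covs = [np.cov(obs[labels == k].T) + 1e-3 * np.eye(dim)
            if np.sum(labels == k) > 1 else np.eye(dim)
            for k in range(n_states)]
    return centres, covs, labels
```

The returned centres and covariances become the initial (u_i, Σ_i) of Eq. (1) for each state.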

4 Experimental Process, Results and Discussion

The data used in this experiment come from the Weizmann dataset. It contains 10 action classes (bend, jack, jump, pjump, run, side, skip, walk, wave1, wave2). Each action is performed by 9 actors. The background and viewpoint are static. Some sample actions are shown in Fig. 4.

Fig. 4: Sample frames from the Weizmann action dataset.


To verify the effectiveness of our algorithm, the experiment is divided into two parts:
1. Comparison between different features. We tested the influence on the recognition rate of HMMs based on the single silhouette feature, the single optical flow feature and the mixed features, respectively.
2. Comparison between different numbers of people used for training. We tested the recognition results using 4, 5, 6, 7 and 8 persons for training the HMMs.
We first selected hidden state numbers from 4 to 9 for training and recognition. The experimental results show that the best recognition result is obtained when the number is 6; increasing the state number further has little effect on recognition. We therefore adopt 6 hidden states in the following experiments. The test results of experiment 1 are shown in Table 1. Note that when one person's actions are used as test sequences, all sequences of the same actor are removed from the training set.
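The leave-one-actor-out protocol just described can be sketched as:

```python
def leave_one_actor_out(actors, sequences):
    """Yield (held_out_actor, train, test) splits in which every sequence
    of one actor is held out for testing, as in experiment 1.

    actors: actor id per sequence; sequences: the parallel sequence list.
    """
    for held in sorted(set(actors)):
        train = [s for a, s in zip(actors, sequences) if a != held]
        test = [s for a, s in zip(actors, sequences) if a == held]
        yield held, train, test
```

Averaging the accuracy over the 9 held-out actors of the Weizmann dataset gives the figures reported in Table 1.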
Table 1: Recognition result comparison between HMMs using different features

Feature for recognition | Silhouette | Optical flow | Mixed features
Accuracy                | 90.00%     | 98.89%       | 100.00%

The results show that the mixed-features-based HMMs achieve better recognition than the single-feature-based HMMs. This is because the single silhouette feature is so easily impacted by outside factors that it is difficult to extract complete silhouette information during silhouette acquisition. Erroneous silhouette information is acquired when the gray levels of the moving target and the background are close. In addition, the silhouette feature cannot discriminate actions with similar shapes. Optical flow images contain the motion information of the moving target and can also detect the moving target without any background information; thus a better result is obtained in the experiment.
The test results of experiment 2 are shown in Table 2. The mixed-features-based HMM was used for testing. The test method is again a rotating test: when M people are used for training, the others are used for testing, and we rotate the serial numbers of the people used for training to ensure the reasonableness of the experiment.
Table 2: Recognition results comparison between different numbers of people for training

Training number | 4      | 5      | 6       | 7       | 8
Accuracy        | 98.89% | 99.72% | 100.00% | 100.00% | 100.00%

The results show that the recognition rate of the mixed-features-based HMM reaches 100% when only 6 samples are employed for training. Even 4 training samples yield a 98.89% recognition rate. Thus our algorithm effectively alleviates the demand for a large number of samples in the HMM training stage; it works even without abundant training samples.
The recognition performance comparison between our algorithm and state-of-the-art approaches on the Weizmann dataset is shown in Table 3.
Table 3: Recognition rates of different mixed-features-based methods

Method            | Mixed features for recognition         | Accuracy
Ahmad et al. [16] | Shape + CLG-motion flow + HMM          | 88.29%
Nikhil et al. [17]| Spatio-temporal features + AdaBoost    | 97.8%
Du Tran et al. [9]| Silhouette + optical flow + NN         | 96.7%
Our approach      | Local silhouette + optical flow + HMM  | 100%

Table 3 demonstrates that our approach gives the best results on the dataset and outperforms most existing methods. The proposed mixed features are highly reliable and easily extracted and represented; therefore, we avoid the complex computation caused by feature extraction based on a human model.

5 Conclusion

In this paper, we have introduced a novel HMM-based human action recognition algorithm using mixed features of silhouette and optical flow, and through extensive experiments we have compared the influence of different numbers of hidden states and training samples on the recognition rate of the proposed algorithm. We have also compared the recognition accuracy of HMMs based on the single silhouette feature, the single optical flow feature and the mixed features. The results verify that our approach achieves a high recognition rate. At present, multi-view, complex-scenario and continuous actions cannot be recognized by this algorithm; solving these problems will be the direction of our next stage of work.

References
[1] X. Ji, H. Liu. Advances in view-invariant human motion analysis: a review, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 40 (2010): 13-24.
[2] Y. Wang, Y. Li, Y. Liu, X. Ji. Action recognition based on local descriptor with dynamic distribution, Journal of Computational Information Systems, 9 (2012): 2527-2536.
[3] R. Poppe. A survey on vision-based human action recognition, Image and Vision Computing, 28 (2010): 976-990.
[4] J. Yamato, J. Ohya, K. Ishii. Recognizing human action in time-sequential images using Hidden Markov Model, Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, Yokosuka, (1992): 379-385.
[5] M. A. Mendoza, N. P. Blanca. HMM-based action recognition using contour histograms, Pattern Recognition and Image Analysis, (2007): 394-401.
[6] D. Ramanan, D. A. Forsyth. Automatic annotation of everyday movements, Advances in Neural Information Processing Systems, 16 (2003): 1-8.
[7] F. Lv, R. Nevatia. Recognition and segmentation of 3-D human action using HMM and Multi-class AdaBoost, Proc. of the European Conference on Computer Vision, (2006): 359-372.
[8] D. Ramanan, D. A. Forsyth. Tracking people by learning their appearance, IEEE Transactions on Pattern Analysis and Machine Intelligence, (2007): 65-81.
[9] D. Tran, A. Sorokin. Human activity recognition with metric learning, Proc. of the European Conference on Computer Vision, (2008): 548-561.
[10] X. Li. HMM based action recognition using oriented histograms of optical flow field, Electronics Letters, 43 (2007): 560-561.
[11] B. D. Lucas, T. Kanade. An iterative image registration technique with an application to stereo vision, Proc. of the International Joint Conference on Artificial Intelligence, Vancouver, (1981): 121-130.
[12] L. Wang, W. Hu, T. Tan. Recent developments in human motion analysis, Pattern Recognition, 36 (2003): 585-601.
[13] F. Rong. Research on Hidden Markov Model-based Activity Recognition Algorithms, Guangzhou: Sun Yat-Sen University (2009).
[14] H. Chai. Research on Video-based Human Action Recognition Algorithms, Changsha: Central South University (2008).
[15] L. R. Rabiner. A tutorial on Hidden Markov Models and selected applications in speech recognition, Proceedings of the IEEE, 77 (1989): 257-286.
[16] M. Ahmad, S. Lee. Human action recognition using shape and CLG-motion flow from multiview image sequences, Pattern Recognition, 41 (2008): 2237-2252.
[17] N. Sawant, K. Biswas. Human action recognition based on spatio-temporal features, Pattern Recognition and Machine Intelligence, Lecture Notes in Computer Science, (2009): 357-362.
