Introduction
Human action recognition is a broad research area in which issues such as viewpoint, scale variation and illumination conditions are tackled. Most research in recent years varies in how the motion features are computed and in how the recognition framework is designed. Some recent algorithms focused on combining various attributes and the motion of spatio-temporal interest points to determine what action is being performed in YouTube videos, Google videos or action movie clips, as illustrated in the HMDB dataset. Such work is geared towards solving the problem of automatic video annotation in wild action datasets for unconstrained video search applications. However, a crucial issue which has not been given due importance is the real-time recognition of a human action or activity from streaming surveillance footage. Tackling this specific issue requires that the learning mechanism learn the underlying continuous temporal structure and be able to associate different temporal scales of the action cycle within a specific temporal window. These variations occur because people perform a particular activity at different speeds, producing variable action cycle lengths in a fixed temporal window. In this manuscript, we propose a novel human action recognition framework (Figure 1) which computes a motion and shape feature set in each frame and learns the non-linear manifold in a time-independent manner. Every action class has a non-linear manifold associated with it.
Nair et al.
(a) Training of action basis for mth action class and subsequent learning of the
GRNN Model.
(b) Testing of a streaming sequence by projection of test action features onto each action
basis and comparison with GRNN estimations.
Fig. 1: Action Recognition Framework
Related Work
By detecting sparse features or interest points on video sequences, a robust representation can be obtained by computing a histogram of these sparse features
in a bag of words model. Following this paradigm, a well-known interest point
detector known as the Spatio-Temporal Interest Point (STIP) was proposed by
Laptev et al. for detecting interesting events in video sequences and was extended
to classify actions [4]. Recent progress in human action and activity recognition
used these STIP points and the Bag of Words model as low-level features in
complex learning frameworks. Wang et al. [13] developed a contextual descriptor for each STIP which described its local neighborhood and the arrangement of points using a probabilistic model. Yuan et al. [16] computed the 3D-R transform to obtain the global arrangement of the STIP points and proposed a contextual SVM kernel for sequence classification. However, these algorithms focused mainly on automatic video annotation and were not necessarily designed for real-time action recognition.

Fig. 2: Extraction of Motion and Shape features between two consecutive frames.
The proposed algorithm follows the paradigm of extracting the temporal structure of the action, i.e., modeling the per-frame features with respect to time.
Chin et al. [1] performed an analysis on modeling the variation of the human silhouettes with respect to time. Shao and Chen [11] explored a subspace learning
methodology known as Spectral Regression Discriminant Analysis to learn the
variation in human body poses described by masked silhouettes. Saghafi et al. [9]
proposed an embedding technique based on spatio-temporal correlation distance
for action sequences. Our earlier works [6, 7] proposed time-invariant approaches
to characterize body posture for classifying human actions but these required a
good segmentation to remove features corresponding to the background. Here,
the learning mechanism follows the GRNN/EOF model paradigm from our previous work. The key differences are the feature modeling computed within a bag
of features and the probabilistic manifold matching scheme.
Feature Extraction
Two kinds of features are computed at each frame: shape features and motion features, which represent the pose and the motion of an individual respectively. The shape features are computed by applying the R-Transform (RT) on the motion history images (MHI). This provides a shape characteristic profile of an action at a specific instant. The motion features at a frame, however, consist of two different feature sets computed from the optical flow field: the Histogram of Flow (HOF) and the Local Binary Flow Patterns (LBFP). No prior segmentation is used to mask out features as done in [6, 7]. Let the optical flow field be represented as $(A(p), \theta(p))$ at each pixel $p$, where $A(p)$ is the flow magnitude and $\theta(p)$ the flow direction. The quantized (discretized) version of the flow direction is
$$\tilde{\theta}(p) = \left\lfloor \frac{\theta(p)}{2\pi / B} \right\rfloor$$
where $B$ is the number of bins spanning the flow direction.
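The direction quantization above can be sketched as follows; this is a minimal illustration in which the function name and the handling of the wrap-around at $2\pi$ are our own assumptions, not details specified in the text.

```python
import numpy as np

def quantize_flow_directions(theta, B=8):
    """Quantize flow directions theta (radians) into B uniform bins over [0, 2*pi).

    Sketch of the binning used for the flow-based features; exact bin
    offsets in the original method are assumed here.
    """
    theta = np.mod(theta, 2 * np.pi)                    # wrap angles into [0, 2*pi)
    bins = np.floor(theta / (2 * np.pi / B)).astype(int)
    return np.clip(bins, 0, B - 1)                      # guard the theta == 2*pi edge
```

For example, with $B = 8$ bins each bin spans $\pi/4$ radians, so a direction of $\pi$ falls into bin 4.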
To compute features at different scales, we divide the region of interest into sub-regions.
A basis for each action class can be computed by considering its corresponding set of per-frame features as a bag-of-words model. Within this bag of features, we can treat the features as time series data from which a suitable time-independent basis can be extracted. Nair et al. [7] analyzed time series data using Empirical Orthogonal Function (EOF) analysis [2], where the data is represented as a linear combination of time-independent orthogonal basis functions. According to EOF analysis, if there is time series data $x(t) \in \mathbb{R}^D$ for $1 \le t \le T$, where $T$ is the number of frames, then
$$x(t) = \sum_{d=1}^{D} a_d(t)\, e_d \qquad (1)$$
where $e_d \in \mathbb{R}^D$ are the time-independent orthogonal basis functions (EOFs) and $a(t) = [a_1(t)\ a_2(t)\ \dots\ a_D(t)]^T \in \mathbb{R}^D$ are the time-dependent orthogonal coefficients. Each action class $m$ has its own time-independent basis functions associated with it, which define the underlying lower-dimensional latent action manifold. To learn the manifold of class $m$, we accumulate the action features from all frames of the training sequences and form $S_x(m) = \{x_n : 1 \le n \le N_m,\ x_n \in \mathbb{R}^D\}$, where $N_m$ is the number of accumulated observations from class $m$. By performing SVD of the covariance matrix $E[XX^T]$, where $X = [x_1\ x_2\ \dots\ x_{N_m}]$, we obtain the EOF basis functions $E_m = [e_1\ e_2\ \dots\ e_{d_m}]$, which can be termed the Eigenaction basis. The projections of the action feature vectors $x$ from the set $S_x(m)$ onto the Eigenaction basis $E_m$ give us the set of coefficients $S_a(m) = \{a_n : 1 \le n \le N_m,\ a_n \in \mathbb{R}^{d_m}\}$, which forms the low-dimensional manifold. The dimensionality $d_m$ of the manifold of each class is selected using the criterion
$$\frac{\sum_{d=1}^{d_m} \lambda_d}{\sum_{d=1}^{D} \lambda_d} > \tau_m$$
where $\lambda_d$ are the eigenvalues of the covariance matrix.
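The basis learning above can be sketched with an eigendecomposition of the covariance matrix; function and variable names are our own, and the centering step is an assumption, since the text does not state whether the features are mean-subtracted.

```python
import numpy as np

def learn_eigenaction_basis(X, tau_m=0.995):
    """Learn a time-independent (EOF) basis for one action class.

    X: (N_m, D) array of accumulated per-frame action features.
    Returns the basis E_m (D, d_m) and the coefficients (N_m, d_m).
    """
    Xc = X - X.mean(axis=0)                   # center the observations (assumed)
    cov = Xc.T @ Xc / len(X)                  # covariance matrix E[X X^T]
    eigvals, eigvecs = np.linalg.eigh(cov)    # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]         # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # d_m: smallest dimension whose cumulative energy ratio exceeds tau_m
    ratio = np.cumsum(eigvals) / eigvals.sum()
    d_m = int(np.searchsorted(ratio, tau_m) + 1)
    E_m = eigvecs[:, :d_m]
    return E_m, Xc @ E_m                      # basis and low-dimensional coefficients
```

The columns of `E_m` are orthonormal, so projection onto the basis is a single matrix product.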
Modeling an action manifold requires characterizing its surface using suitable transition points and finding an approximation. One way is to find clusters along the surface of the manifold. Using the bag-of-features model, we compute code-words or clusters with the kmeans++ algorithm. These code-words not only approximate the manifold but also provide suitable transition points. To learn the surface of the manifold, we need to learn the transition from one code-word to the next, which is possible using Generalized Regression Neural Networks (GRNN) [12]. The main advantages of a GRNN are a fast training
scheme due to a single pass of the training data and guaranteed convergence to
the optimal regression surface. Let the set of code-words of an action model $m$ be $S_x^c(m) = \{x_k : 1 \le k \le K(m),\ x_k \in \mathbb{R}^D\} \subset S_x(m)$ and its corresponding projections onto its basis be $S_a^c(m) = \{a_k : 1 \le k \le K(m),\ a_k \in \mathbb{R}^{d_m}\} \subset S_a(m)$, where $K(m)$ is the number of clusters. The GRNN (Figure 3(a)) learns the mapping $S_x \rightarrow S_a$ by saving the code-words and the corresponding projections as the input and output weights. The estimate $\hat{y}(t) = E[a_{test}(t)] = [\hat{y}_1 \dots \hat{y}_{d_m}]$ of a test action feature $x_{test}(t)$ at an instant $t$ on the action manifold $m$ is given by
$$\hat{y}_d = \frac{\sum_{k=1}^{K(m)} a_{k,d} \exp\!\left(-\dfrac{(x_{test} - x_k)^T (x_{test} - x_k)}{2\sigma_x^2}\right)}{\sum_{k=1}^{K(m)} \exp\!\left(-\dfrac{(x_{test} - x_k)^T (x_{test} - x_k)}{2\sigma_x^2}\right)} \qquad (2)$$
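Eq. (2) is a kernel-weighted average over the stored code-words, which can be sketched directly; the function name and argument layout are our own.

```python
import numpy as np

def grnn_estimate(x_test, codewords, projections, sigma=1.0):
    """GRNN estimate of the manifold coefficients for one test feature (Eq. 2).

    codewords:   (K, D) input weights (kmeans++ centers in feature space).
    projections: (K, d_m) output weights (their coefficients on the basis).
    """
    diff = codewords - x_test                      # (K, D) differences to each center
    sq_dist = np.sum(diff * diff, axis=1)          # squared Euclidean distances
    w = np.exp(-sq_dist / (2.0 * sigma ** 2))      # Gaussian kernel weights
    return (w[:, None] * projections).sum(axis=0) / w.sum()
```

When the test feature coincides with a code-word and the others are far away, the estimate reduces to that code-word's stored projection, which is the single-pass "training" the GRNN relies on.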
Consider a test action sub-sequence with $R$ frames and the action feature set $X_{test} = [x_1^{test} \dots x_R^{test}]^T$. The classification involves the following steps: 1) estimation of the action feature set with respect to each action manifold $m$ by its respective GRNN model to obtain $\hat{Y}_{test}(m) = [\hat{y}_1^{test} \dots \hat{y}_R^{test}]^T$; 2) projection of the action feature set onto the action basis $E_m$ to get $A_{test}(m) = [a_1^{test} \dots a_R^{test}]^T$.
We first estimate the class of the features computed from a single frame. Intuitively, if the action feature $x_r^{test}$ at the $r$th frame belongs to class $m$, the difference between the GRNN estimations $\hat{y}_r^{test}(m')$ and the basis projections $a_r^{test}(m')$ should be minimal for $m' = m$ and large for $m' \ne m$. This measure can be formulated as a likelihood function given by
$$Prob(x_r^{test} \mid C_m) = Prob(x_r^{test} \mid a_r^{test}(m), \sigma_a(m)) \qquad (3)$$
$$C(x_r^{test}) = \arg\max_m \exp\!\left(-(\hat{y}_r^{test} - \bar{a}_r^{test})^T \Sigma_A^{-1} (\hat{y}_r^{test} - \bar{a}_r^{test})\right) \qquad (4)$$
where $\Sigma_A(m)$ is the covariance of the manifold $m$ and $\bar{a}_r^{test}$ the corresponding mean of the local neighborhood. The estimate of the class of the corresponding frame, $C(x_r^{test})$, is the maximum of the likelihood function $Prob(x_r^{test} \mid C_m)$. This is illustrated in Figure 3(b). By considering the class estimate of a frame as a random variable, we can obtain the final class estimate of this partial sequence by computing the probability of per-frame class estimates and finding the mode. This is given by
$$Prob(X_{test} \mid A_{test}(m), \sigma_a(m)) \sim \mathcal{N}(X_{test} \mid A_{test}(m), \sigma_a(m)) \qquad (5)$$
$$C(X_{test}) = \operatorname{mode}\{C(x_r^{test}) : 1 \le r \le R\} \qquad (6)$$
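The per-frame matching and the sequence-level mode can be sketched as below; this is a simplified illustration of the matching scheme in which all names, the dictionary layout, and the use of a precomputed inverse covariance are our own assumptions.

```python
import numpy as np

def classify_frame(y_hat, a_mean, cov_inv_by_class):
    """Per-frame class estimate: pick the class whose GRNN estimate best
    matches the basis projection under the Mahalanobis-style likelihood of Eq. (4).

    y_hat, a_mean:    dicts class -> (d_m,) GRNN estimate / projection mean.
    cov_inv_by_class: dict class -> (d_m, d_m) inverse manifold covariance.
    """
    best, best_like = None, -np.inf
    for m, S_inv in cov_inv_by_class.items():
        d = y_hat[m] - a_mean[m]                # estimation-vs-projection difference
        like = np.exp(-d @ S_inv @ d)           # likelihood, maximal when d is small
        if like > best_like:
            best, best_like = m, like
    return best

def classify_sequence(per_frame_labels):
    """Final class of a partial sequence: the mode of the per-frame estimates."""
    labels, counts = np.unique(per_frame_labels, return_counts=True)
    return labels[np.argmax(counts)]
```

The mode over frames makes the decision robust to a few misclassified frames within the temporal window.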
Experimental Results

The proposed algorithm has been tested on the KTH dataset [10] and the UCF sports dataset [8]. The design of the algorithm requires the setting of three different parameters: the number of code-words $K(m)$, the feature selection threshold $\tau_f$ and the action basis threshold $\tau_m$. The algorithm is evaluated for accuracy at different sequence lengths on these datasets.

UCF Sports Dataset
This high-resolution dataset contains 182 video sequences covering 9 different actions, namely diving, golf swinging, lifting, kicking, horseback riding, running, skating, swinging and walking. Here, we set the number of code-words per action class to $K(m) = 250$, with the thresholds set as $\tau_f = 0.1$ and $\tau_m = 0.995$. This is a very challenging dataset, mainly because it contains large viewpoint variations and is collected from the web. Its end purpose is to test action recognition algorithms suited for unconstrained video search applications, and it thus helps to evaluate the proposed algorithm under conditions of large viewpoint variations. In Table 2, we summarize the accuracy results obtained for
different lengths of the sub-sequences for each type of action. Although accuracies of 88% have been reported in the literature, no method analyzed or classified sub-sequences of length 15-25 frames. In Table 3, we compare our proposed learning scheme against the well-known combination of the bag-of-words model and kernel SVM used extensively in unconstrained video search applications [4].

Table 2: Accuracy (%) for each action for specific combinations of window size and overlap, with and without feature selection (FS)

Features (win size, overlap)  dive golf kick lift ride run skate swing walk | overall
Proposed(8,6)                  93   67   21   87   71  58   49   100   49  |   66
Proposed(8,6) + FS             94   63   18   86   68  58   45    99   46  |   64
Proposed(10,5)                 94   74   18   92   72  62   51   100   50  |   68
Proposed(10,5) + FS            94   66   16   86   67  60   48   100   48  |   65
Proposed(12,6)                 94   69   21   89   69  61   49   100   51  |   67
Proposed(12,6) + FS            93   65   15   89   71  60   49   100   49  |   66
Proposed(14,7)                 96   73   20   92   75  60   52   100   52  |   69
Proposed(14,7) + FS            96   66   15   89   73  60   48   100   51  |   67

The proposed learning scheme comes close to the accuracy obtained with the Bag of Words model, but its key advantage lies in the flexibility
of having different sequence lengths at test time without any prior learning. For
BOW-SVM learning mechanisms, prior learning of the features corresponding
to different sequence lengths is required before test time.
Conclusions
Acknowledgments. This work is independent research done as a continuation of work supported initially by the US Department of Defense (US Army Medical Research and Materiel Command - USAMRMC) under the program Bioelectrics Research for Casualty Care and Management.
References
1. Chin, T.J., Wang, L., Schindler, K., Suter, D.: Extrapolating learned manifolds for human activity recognition. In: IEEE International Conference on Image Processing, ICIP 2007. vol. 1, pp. 381-384 (October 2007)
2. Holmstrom, I.: Analysis of time series by means of empirical orthogonal functions. Tellus 22(6), 638-647 (1970)
3. Jiang, Z., Lin, Z., Davis, L.: Recognizing human actions by learning and matching shape-motion prototype trees. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(3), 533-547 (March 2012)
4. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2008. pp. 1-8 (2008)
5. Liao, S., Chung, A.: Face recognition with salient local gradient orientation binary patterns. In: 16th IEEE International Conference on Image Processing, ICIP 2009. pp. 3317-3320 (November 2009)
6. Nair, B., Asari, V.: Time invariant gesture recognition by modelling body posture space. In: Advanced Research in Applied Artificial Intelligence, Lecture Notes in Computer Science, vol. 7345, pp. 124-133. Springer Berlin Heidelberg (2012)
7. Nair, B., Asari, V.: Regression based learning of human actions from video using HOF-LBP flow patterns. In: IEEE International Conference on Systems, Man, and Cybernetics, SMC 2013. pp. 4342-4347 (October 2013)
8. Rodriguez, M., Ahmed, J., Shah, M.: Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2008. pp. 1-8 (June 2008)
9. Saghafi, B., Rajan, D.: Human action recognition using pose-based discriminant embedding. Signal Processing: Image Communication 27(1), 96-111 (2012)
10. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004. vol. 3, pp. 32-36 (August 2004)
11. Shao, L., Chen, X.: Histogram of body poses and spectral regression discriminant analysis for human action categorization. In: Proc. BMVC. pp. 88.1-88.11 (2010)
12. Specht, D.: A general regression neural network. IEEE Transactions on Neural Networks 2(6), 568-576 (November 1991)
13. Wang, J., Chen, Z., Wu, Y.: Action recognition with multiscale spatio-temporal contexts. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011. pp. 3185-3192 (2011)
14. Yeffet, L., Wolf, L.: Local trinary patterns for human action recognition. In: 12th IEEE International Conference on Computer Vision, ICCV 2009. pp. 492-497 (September 2009)
15. Yu, L., Liu, H.: Feature selection for high-dimensional data: a fast correlation-based filter solution. In: Proceedings of the Twentieth International Conference on Machine Learning, ICML 2003. pp. 856-863 (2003)
16. Yuan, C., Li, X., Hu, W., Ling, H., Maybank, S.: 3D R transform on spatio-temporal interest points for action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2013. pp. 724-730 (2013)