
Learning and Association of Features for Action Recognition in Streaming Video

Binu M Nair and Vijayan K Asari
ECE Department, University of Dayton, Dayton, OH, USA
{nairb1,vasari1}@udayton.edu

Abstract. We propose a novel framework which learns and associates local motion pattern manifolds in streaming videos using generalized regression neural networks (GRNN) to facilitate real-time human action recognition. The motivation is to determine an individual's action even when the action cycle has not yet been completed. The GRNNs are trained to model the regression function of patterns in the latent action space on the input local motion-shape patterns. This manifold learning makes the framework invariant to different sequence lengths and varying action states. The latent action basis is computed using EOF analysis, and the association of local temporal patterns to an action class at runtime follows a probabilistic formulation: a test sequence is assigned to the action class whose GRNN estimate is closest to the corresponding action basis projections. Experimental results on two datasets, KTH and UCF Sports, show accuracies above 90% obtained from 15 to 25 frames.

1 Introduction

Human action recognition is a very broad research area in which issues such as viewpoint, scale variation and illumination conditions are tackled. Most of the research done in recent years varies in how the motion features are computed and in how the recognition framework is designed. Some recent algorithms have focused on combining various attributes and motion of spatio-temporal interest points to determine what action is being performed in YouTube videos, Google videos or action movie clips, as illustrated in the HMDB dataset. This work is geared towards solving the problem of automatic video annotation in wild action datasets for unconstrained video search applications. However, a crucial issue which has not been given due importance is the real-time recognition of a human action or activity from streaming surveillance footage. Tackling this specific issue requires that the learning mechanism learn the underlying continuous temporal structure and be able to associate different temporal scales of the action cycle within a specific temporal window. These variations occur due to the different speeds at which a person performs a particular activity, resulting in variable action cycle lengths within a fixed temporal window. In this manuscript, we propose a novel human action recognition framework (Figure 1) which computes a motion and shape feature set in each frame and learns the non-linear manifold in a time-independent manner.


Fig. 1: Action Recognition Framework. (a) Training of the action basis for the mth action class and subsequent learning of the GRNN model. (b) Testing of a streaming sequence by projection of the test action features onto each action basis and comparison with the GRNN estimations.

Every action class has a non-linear manifold embedded in the feature space, which can be represented by a time-independent orthogonal action basis and a learned generalized regression neural network (GRNN). A probabilistic formulation is used to associate a streaming sequence with one of the trained manifolds by computing the per-frame discrepancy between the GRNN estimations and the action basis feature projections.

2 Related Work

By detecting sparse features or interest points on video sequences, a robust representation can be obtained by computing a histogram of these sparse features
in a bag of words model. Following this paradigm, a well-known interest point
detector known as the Spatio-Temporal Interest Point (STIP) was proposed by
Laptev et al. for detecting interesting events in video sequences and was extended
to classify actions [4]. Recent progress in human action and activity recognition
used these STIP points and the Bag of Words model as low-level features in
complex learning frameworks. Wang et al [13] developed a contextual descriptor
for each STIP which described its local neighborhood and the arrangement of
points using a probabilistic model. Yuan et al [16] computed the 3D-R transform
to get the global arrangement of the STIP points and proposed a contextual SVM
kernel for sequence classification. However, these algorithms focused mainly on


Fig. 2: Extraction of motion and shape features between two consecutive frames. (a) Hierarchical division for L = 2. (b) Illustration of the LBP operator on the directional pattern.

automatic video annotation and were not necessarily designed for real-time action recognition.
The proposed algorithm follows the paradigm of extracting the temporal structure of the action, i.e. modeling the per-frame features with respect to time. Chin et al. [1] analyzed the modeling of the variation of human silhouettes with respect to time. Shao and Chen [11] explored a subspace learning methodology known as Spectral Regression Discriminant Analysis to learn the variation in human body poses described by masked silhouettes. Saghafi and Rajan [9] proposed an embedding technique based on spatio-temporal correlation distance for action sequences. Our earlier works [6, 7] proposed time-invariant approaches to characterize body posture for classifying human actions, but these required a good segmentation to remove features corresponding to the background. Here, the learning mechanism follows the GRNN/EOF model paradigm from our previous work. The key differences are that the features are modeled within a bag of features and that manifold matching follows a probabilistic scheme.

3 Feature Extraction

Two kinds of features are computed at each frame: shape features and motion features, which represent the pose and the motion of an individual respectively. The shape features are computed by applying the R-Transform (RT) on the motion history image (MHI); this provides a shape characteristic profile of an action at a specific instant. The motion features at a frame consist of two different feature sets computed from the optical flow field: the Histogram of Flow (HOF) and the Local Binary Flow Patterns (LBFP). No prior segmentation is used to mask out features as done in [6, 7]. Let the optical flow field be represented as $(A(p), \theta(p))$ at each pixel $p$, where $A(p)$ is the flow magnitude and $\theta(p)$ the flow direction. $\bar{\theta}(p)$ denotes the quantized (discretized) version of the flow direction $\theta(p)$, obtained by dividing the range of flow directions into $B$ bins.
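For concreteness, a minimal sketch of this flow-direction quantization is given below, assuming OpenCV's Farneback dense optical flow (the paper does not specify which flow estimator is used) and B = 8 direction bins; the function name and parameter values are illustrative only.

```python
import cv2
import numpy as np


def quantized_flow(prev_gray, curr_gray, num_bins=8):
    """Dense optical flow between two consecutive grayscale frames, with the
    flow direction discretized into num_bins bins.

    The Farneback estimator and its parameters are assumptions; the paper
    only states that the direction is quantized into B bins.
    """
    # Farneback parameters: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])  # angle in [0, 2*pi]
    bin_width = 2.0 * np.pi / num_bins
    quantized_dir = np.minimum((angle / bin_width).astype(np.int32), num_bins - 1)
    return magnitude, quantized_dir  # A(p) and the discretized direction
```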
To compute features at different scales, we divide the region of interest into different sub-regions in a pyramidal fashion, as shown in Figure 2(a). Each local region gives a coarse representation of the motion associated with a body part, where each level $l$ gives the extent of the division. As we divide the region further, the effective number of bins at level $l$ is computed as $B(l) = B/2^l$ so that only coarse variations of flow in local regions are considered. At a single level $l$ and for each sub-region, the feature vectors HOF ($h_l$), LBFP ($lb_l$) and RT ($r_l$) are computed.
With regard to the motion features, the HOF represents only the distribution of the first-order motion variation and does not provide any indication of the local arrangement of the flow vectors. Thus, we propose a motion descriptor known as Local Binary Flow Patterns (LBFP), which encodes the flow direction in a way which brings out the flow texture. This textural information can also be interpreted as the contextual information that the neighboring flow pixels provide to the center pixel at local regions of the body. Flow texture can then be defined as the second-order variation of the optical flow between two instants in a local neighborhood. We apply a variant of the LBP operator $g(p_c)$ [5] on the flow direction to characterize this flow texture, where $P$ is the number of neighbors around pixel $p_c$ and $R$ is the neighborhood size.
$$g(p_c) = \sum_{i=0}^{P-1} 2^i\, s\big(\bar{\theta}(p_c) - \bar{\theta}(p_i)\big), \qquad s(z) = \begin{cases} 1 & z = 0 \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
The LBFP ($lb_l$) motion feature is the histogram of the LBP-encoded ($g_l$) directional flow image. The action feature vectors computed at frame $t$ are therefore $H(t) \in \mathcal{H} = [h_1\, h_2^1\, h_2^2\, h_2^3\, h_2^4\, h_3 \ldots h_L]$, $LB(t) \in \mathcal{LB} = [lb_1\, lb_2^1\, lb_2^2\, lb_2^3\, lb_2^4\, lb_3 \ldots lb_L]$ and $R(t) \in \mathcal{R} = [r_1\, r_2^1\, r_2^2\, r_2^3\, r_2^4\, r_3 \ldots r_L]$, where $L$ is the number of levels in the hierarchy. Assuming independence between the selected feature sets, we fuse these features to form a single action feature given as $\bar{X} = H \oplus LB \oplus R$, where $\bar{X} \in \mathbb{R}^{\bar{D}}$.
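The per-frame feature assembly can be sketched as follows: an LBFP code image following Eq. (1) with P = 8 and R = 1 (a 3x3 neighborhood, assumed here), per-sub-region histograms over the spatial pyramid, and concatenation into the fused feature. The pyramid division scheme (2^(l-1) cells per axis at level l), the histogram settings, and the omission of the level-dependent bin reduction B(l) are simplifying assumptions; the HOF and R-Transform vectors would be built and appended analogously.

```python
import numpy as np


def lbfp_code_image(quantized_dir):
    """LBFP code per Eq. (1): for each pixel, compare the quantized flow
    direction of its 8 neighbors (P = 8, R = 1) with the center pixel and set
    the corresponding bit when both fall in the same direction bin."""
    h, w = quantized_dir.shape
    center = quantized_dir[1:-1, 1:-1]
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = quantized_dir[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbor == center).astype(np.uint8) << bit
    return codes


def pyramid_histograms(img, levels, num_bins, value_range):
    """Normalized histogram of img values for every sub-region of a spatial
    pyramid: level l splits the region into 2**(l-1) cells per axis (assumed)."""
    h, w = img.shape
    feats = []
    for level in range(1, levels + 1):
        cells = 2 ** (level - 1)
        for i in range(cells):
            for j in range(cells):
                patch = img[i * h // cells:(i + 1) * h // cells,
                            j * w // cells:(j + 1) * w // cells]
                hist, _ = np.histogram(patch, bins=num_bins, range=value_range)
                feats.append(hist / max(hist.sum(), 1))
    return np.concatenate(feats)


# Fused per-frame action feature X = H (+) LB (+) R by concatenation; the HOF
# and R-Transform vectors (h_t, r_t) are assumed to be computed elsewhere.
# lb_t = pyramid_histograms(lbfp_code_image(quantized_dir), levels=2,
#                           num_bins=256, value_range=(0, 256))
# x_t = np.concatenate([h_t, lb_t, r_t])
```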
3.1 Feature Selection using Symmetrical Uncertainty

Due to the high dimensionality of the per-frame feature vector $\bar{X}$, we identify the relevant and redundant subsets of the features using a modified feature selection technique based on symmetrical uncertainty [15]. The symmetrical uncertainty between random variables $(Z_1, Z_2)$ is given by
$$SU(Z_1, Z_2) = 2\,\frac{IG(Z_1|Z_2)}{H(Z_1) + H(Z_2)}$$
where $IG(Z_1|Z_2) = H(Z_1) - H(Z_1|Z_2)$ is the information gain and $H(Z_1)$, $H(Z_2)$ are the corresponding entropy values. The higher the symmetrical uncertainty, the higher the information gain and the more correlated $Z_2$ is to $Z_1$. Consider the action feature set $S = \{z_d : 1 \le d \le \bar{D},\; z_d \in \mathbb{R}^{N \times 1}\}$ with $N$ observations, where $N_m$ is the number of observations of class $m$ such that $\sum_m N_m = N$, $\bar{D}$ is the number of features, and the observations are features accumulated across the action classes. The objective is to select a subset of the features $S' = \{z_{d'} : z_{d'} \in S,\; z_{d'} \in \mathbb{R}^{N \times 1},\; 1 \le d' \le D,\; D \le \bar{D}\}$ which contains relevant and non-redundant features accumulated over all action classes. This feature selection is applied to the action feature $\bar{X} \in \mathbb{R}^{\bar{D}}$ to get $X \in \mathbb{R}^{D}$. The selection proceeds as follows (a code sketch is given after the list):

1. Compute $SU(z_d, C)$ of each feature $z_d$ with respect to the set of class ids $C = [\omega_1 \ldots \omega_m \ldots \omega_M]^T$, where $\omega_m \in \mathbb{R}^{N_m \times 1}$.
2. Form the set $S' = \{z_d : SU(z_d, C) \ge \tau_f,\; z_d \in \mathbb{R}^{N \times 1}\}$, which contains the relevant features.
3. Split the subset $S'$ with respect to each class, i.e. form $S'_m = \{z_d : z_d \in S',\; z_d \in \mathbb{R}^{N_m \times 1}\}$, where $\bigcup_m S'_m = S'$.
4. Compute $SU(z_d, z_{d'})$ between each pair of features within each class, where $d \ne d'$.
5. Form the set $S'_{m+} = \{z_d : z_d \in S'_m,\; SU(z_d, z_{d'}) \ge SU(z_d, C) \text{ for some } d' \ne d\}$, which contains the redundant intra-class features.
6. Form the set $S'_{m-} = (S'_{m+})^c$, which contains the non-redundant features with respect to each class $m$. Combine these to form the final action feature set, i.e. $\bigcup_m S'_{m-} = S'$.
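The sketch below illustrates steps 1 to 6 on a feature matrix. It assumes that continuous feature values are first discretized into a small number of equal-width bins so that entropies can be estimated from counts (the discretization scheme is not specified in the paper), and it interprets the redundancy test of step 5 in the FCBF style of [15], i.e. a feature is redundant within a class if some other relevant feature is at least as predictive of it as the class labels are.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of a discrete sequence."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def symmetrical_uncertainty(z1, z2):
    """SU(Z1, Z2) = 2 * IG(Z1|Z2) / (H(Z1) + H(Z2)) for discrete variables."""
    h1, h2 = entropy(z1), entropy(z2)
    h1_given_2 = 0.0
    for v in np.unique(z2):                      # conditional entropy H(Z1|Z2)
        mask = (z2 == v)
        h1_given_2 += mask.mean() * entropy(z1[mask])
    ig = h1 - h1_given_2                         # information gain
    return 2.0 * ig / (h1 + h2) if (h1 + h2) > 0 else 0.0


def select_features(Z, y, tau_f, n_bins=10):
    """Steps 1-6 above. Z: (N, D_bar) features, y: (N,) class ids.
    Equal-width discretization into n_bins is an assumed preprocessing step."""
    N, D_bar = Z.shape
    Zq = np.empty_like(Z, dtype=np.int64)
    for d in range(D_bar):                       # discretize each feature column
        edges = np.histogram_bin_edges(Z[:, d], bins=n_bins)
        Zq[:, d] = np.digitize(Z[:, d], edges[1:-1])
    su_class = np.array([symmetrical_uncertainty(Zq[:, d], y) for d in range(D_bar)])
    relevant = np.where(su_class >= tau_f)[0]    # step 2: relevance filter
    selected = set()
    for m in np.unique(y):                       # steps 3-6: per-class redundancy
        rows = (y == m)
        for d in relevant:
            redundant = any(
                d2 != d and
                symmetrical_uncertainty(Zq[rows, d], Zq[rows, d2]) >= su_class[d]
                for d2 in relevant)
            if not redundant:
                selected.add(int(d))             # union over classes of non-redundant sets
    return sorted(selected)
```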

4 Learning: Computation and Modeling of the Action Manifold

A basis for each action class can be computed by considering its corresponding set of per-frame features as a bag-of-words model. Within this bag of features, we can treat the features as time-series data from which a suitable time-independent basis can be extracted. Nair et al. [7] analyzed time-series data using Empirical Orthogonal Function (EOF) analysis [2], where the data is represented as a linear combination of time-independent orthogonal basis functions. According to EOF analysis, if there is a time series $x(t) \in \mathbb{R}^D$ for $1 \le t \le T$, where $T$ is the number of frames, then
$$x(t) = \sum_{d=1}^{D} a_d(t)\, e_d$$
where $e_d \in \mathbb{R}^D$ are the time-independent orthogonal basis functions (EOFs) and $a(t) = [a_1(t)\; a_2(t) \ldots a_D(t)] \in \mathbb{R}^D$ are the time-dependent coefficients. Each action class $m$ has its own time-independent basis functions associated with it, which define the underlying lower-dimensional latent action manifold. To learn this manifold for class $m$, we accumulate the action features from all frames of the training sequences and form $S_x(m) = \{x_n : 1 \le n \le N_m,\; x_n \in \mathbb{R}^D\}$, where $N_m$ is the number of accumulated observations from class $m$. By performing an SVD decomposition of the covariance matrix $E[XX^T]$, where $X = [x_1\, x_2 \ldots x_{N_m}]$, we get the EOF basis functions $E_m = [e_1\, e_2 \ldots e_{d_m}]$, which can be termed the eigenaction basis. The projections of the action feature vectors $x$ from the set $S_x(m)$ onto the eigenaction basis $E_m$ give the set of coefficients $S_a(m) = \{a_n : 1 \le n \le N_m,\; a_n \in \mathbb{R}^{d_m}\}$, which forms the low-dimensional manifold. The dimensionality $d_m$ of the manifold of each class is selected using the criterion $\sum_{d=1}^{d_m} \lambda_d / \sum_{d=1}^{D} \lambda_d > \tau_m$.
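A minimal sketch of how the eigenaction basis and the manifold coefficients could be computed with NumPy is shown below; the mean-centering before computing the covariance and the exact handling of the eigenvalue-ratio criterion are assumptions made for illustration.

```python
import numpy as np


def eigenaction_basis(X, tau_m=0.99):
    """X: (N_m, D) per-frame action features of one class (rows = observations).

    Returns the EOF basis E_m of shape (D, d_m) and the coefficients (N_m, d_m).
    The retained dimensionality d_m is the smallest value whose cumulative
    eigenvalue ratio exceeds the threshold tau_m.
    """
    Xc = X - X.mean(axis=0)                       # mean-centering (assumed)
    cov = Xc.T @ Xc / max(len(X) - 1, 1)          # D x D covariance E[X X^T]
    U, S, _ = np.linalg.svd(cov)                  # columns of U = EOFs, S = eigenvalues
    ratio = np.cumsum(S) / np.sum(S)
    d_m = int(np.searchsorted(ratio, tau_m) + 1)  # smallest d_m satisfying the criterion
    E_m = U[:, :d_m]                              # eigenaction basis
    A = Xc @ E_m                                  # time-dependent coefficients a(t)
    return E_m, A
```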
Modeling an action manifold requires characterizing its surface using suitable transition points and finding an approximation. One way is to find clusters along the surface of the manifold. Using the bag-of-features model, we compute code-words (clusters) with the kmeans++ algorithm. These code-words not only approximate the manifold but also provide suitable transition points. To learn the surface of the manifold, we need to learn the transition from one code-word to the next, and this is possible using Generalized Regression Neural Networks (GRNN) [12].


Fig. 3: GRNN network and classification of a streaming test sequence. (a) Illustration of the GRNN network. (b) Illustration of test sequence association.

The main advantages of using a GRNN are a fast training scheme, due to a single pass over the training data, and guaranteed convergence to the optimal regression surface. Let the set of code-words of an action model $m$ be $S_x^c(m) = \{x_k : 1 \le k \le K(m),\; x_k \in \mathbb{R}^D\} \subset S_x(m)$ and its corresponding projections onto its basis be $S_a^c(m) = \{a_k : 1 \le k \le K(m),\; a_k \in \mathbb{R}^{d_m}\} \subset S_a(m)$, with $K(m)$ the number of clusters. The GRNN (Figure 3(a)) learns the mapping $S_x \rightarrow S_a$ by storing the code-words and the corresponding projections as the input and output weights. The estimate $\hat{y}(t) = E[a_{test}(t)] = [\hat{y}_1 \ldots \hat{y}_{d_m}]$ of a test action feature $x_{test}(t)$ at an instant $t$ on the action manifold $m$ is given by

$$\hat{y}_d = \frac{\displaystyle\sum_{k=1}^{K(m)} a_{k,d}\, \exp\!\left(-\frac{(x_{test} - x_k)^T (x_{test} - x_k)}{2\sigma_x^2}\right)}{\displaystyle\sum_{k=1}^{K(m)} \exp\!\left(-\frac{(x_{test} - x_k)^T (x_{test} - x_k)}{2\sigma_x^2}\right)} \qquad (2)$$
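The estimate of Eq. (2) is a Nadaraya-Watson style kernel regression over the stored code-words. The sketch below builds the code-words with scikit-learn's KMeans (whose default initialization is kmeans++) and evaluates Eq. (2) directly; the class structure, the bandwidth sigma_x, and the centering of the code-words before projection are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans


class GRNNActionModel:
    """One model per action class m: the code-words act as the input weights
    and their basis projections as the output weights of the GRNN."""

    def __init__(self, n_codewords=250, sigma_x=1.0):
        self.n_codewords = n_codewords
        self.sigma_x = sigma_x

    def fit(self, X, E_m):
        """X: (N_m, D) class features; E_m: (D, d_m) eigenaction basis."""
        km = KMeans(n_clusters=self.n_codewords, init='k-means++', n_init=10).fit(X)
        self.mean_ = X.mean(axis=0)                               # centering assumed
        self.codewords_ = km.cluster_centers_                     # x_k
        self.projections_ = (self.codewords_ - self.mean_) @ E_m  # a_k
        return self

    def estimate(self, x_test):
        """GRNN estimate of Eq. (2): Gaussian-weighted average of the a_k."""
        d2 = np.sum((self.codewords_ - x_test) ** 2, axis=1)      # squared distances
        w = np.exp(-d2 / (2.0 * self.sigma_x ** 2))
        return (w[:, None] * self.projections_).sum(axis=0) / (w.sum() + 1e-12)
```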

5 Inference: Association of Test Sequences and Classification

Consider a test action sub-sequence of $R$ frames with the action feature set $X_{test} = [x_1^{test} \ldots x_R^{test}]^T$. The classification involves the following steps: 1) estimation of the action feature set with respect to each action manifold $m$ by its respective GRNN model to obtain $\hat{Y}_{test}(m) = [\hat{y}_1^{test} \ldots \hat{y}_R^{test}]^T$; 2) projection of the action feature set onto the action basis $E_m$ to get $A_{test}(m) = [a_1^{test} \ldots a_R^{test}]^T$. We first estimate the class of the features computed from a single frame. Intuitively, if the action feature $x_r^{test}$ at the $r$th frame belongs to class $m$, the difference between the GRNN estimation $\hat{y}_r^{test}(m')$ and the basis projection $a_r^{test}(m')$ should be minimal for $m' = m$ and large for $m' \ne m$. This measure can be formulated as a likelihood function given by

$$Prob(x_r^{test} \mid C_m) = Prob(x_r^{test} \mid \bar{a}_r^{test}(m), \Sigma_A(m)) \qquad (3)$$


Fig. 4: Accuracy obtained with the proposed algorithm. (a) Variation with window sizes. (b) Effectiveness of feature selection.

$$C(x_r^{test}) = \arg\max_m\; \exp\!\left(-\,(x_r^{test} - \bar{a}_r^{test})^T\, \Sigma_A^{-1}(m)\, (x_r^{test} - \bar{a}_r^{test})\right) \qquad (4)$$

where $\Sigma_A(m)$ is the covariance of the manifold $m$ and $\bar{a}_r^{test}$ is the corresponding mean of the local neighborhood. The class estimate of the corresponding frame, $C(x_r^{test})$, is the maximum of the likelihood function $Prob(x_r^{test} \mid C_m)$. This is illustrated in Figure 3(b). By considering the class estimate of a frame as a random variable, we can obtain the final class estimate of the partial sequence by computing the probability of the per-frame class estimates and finding the mode. This is given by

$$Prob(X_{test} \mid A_{test}(m), \Sigma_a(m)) \sim \mathcal{N}(X_{test} \mid A_{test}(m), \Sigma_a(m)) \qquad (5)$$

$$C(X_{test}) = \arg\max_m\; Prob(C(x^{test}) = m) \qquad (6)$$
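Under this formulation, a minimal sketch of the association step is given below: each frame is scored against every trained class by the discrepancy between its basis projection and the GRNN estimate, the per-frame class is the arg max of the exponentiated distance, and the window label is the mode of the per-frame estimates. The diagonal covariance and the reuse of the hypothetical GRNNActionModel from the earlier sketch are assumptions, not the paper's exact implementation.

```python
import numpy as np
from collections import Counter


def classify_window(X_test, models, bases, covs):
    """Associate a streaming sub-sequence X_test (R, D) with an action class.

    models[m] : trained GRNNActionModel for class m (hypothetical, from the
                earlier sketch), giving the per-frame estimate y_hat_r(m)
    bases[m]  : (mean_m, E_m) used to project a frame onto the eigenaction basis
    covs[m]   : (d_m,) diagonal manifold covariance; Eq. (4) uses a full
                covariance Sigma_A(m), so the diagonal form is a simplification
    """
    frame_labels = []
    for x in X_test:
        scores = {}
        for m, model in models.items():
            mean_m, E_m = bases[m]
            a = (x - mean_m) @ E_m               # basis projection a_r(m)
            y_hat = model.estimate(x)            # GRNN estimate (local-neighborhood mean)
            diff = a - y_hat                     # per-frame discrepancy
            scores[m] = np.exp(-np.sum(diff ** 2 / covs[m]))  # Eq. (3)-(4) likelihood
        frame_labels.append(max(scores, key=scores.get))      # per-frame arg max
    # Eqs. (5)-(6): the window label is the mode of the per-frame class estimates
    return Counter(frame_labels).most_common(1)[0][0]
```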

6 Experimental Results and Evaluations

The proposed algorithm has been tested on the KTH dataset [10] and the UCF Sports dataset [8]. The design of the algorithm requires setting three different parameters: the number of code-words $K(m)$, the feature selection threshold $\tau_f$ and the action basis threshold $\tau_m$. The algorithm is evaluated for accuracy at different sequence lengths on both datasets.
6.1 KTH Dataset

KTH is a low-resolution dataset consisting of 2400 sequences containing 6 actions (boxing, hand clapping, hand waving, jogging, running and walking) performed by 25 subjects in 4 different conditions labeled as sets 1-4. After empirical evaluation, we set the number of code-words per action class to $K(m) = 250$, with the feature selection threshold $\tau_f = 0.225$ and the action basis threshold $\tau_m = 0.99975$ for each set.


Table 1: Comparison with the state of the art.

Method              (Window Size, Overlap)  Percentage Accuracy
Proposed (Set 1)    (20, 15)                92.23%
Proposed (Set 2)    (20, 15)                84.5%
Proposed (Set 3)    (20, 15)                87.9%
Proposed (Set 4)    (20, 15)                95.75%
Yeffet et al. [14]  Full                    90.1%
Wang et al. [13]    Full                    93.8%
Yuan et al. [16]    Full                    95.49%

To compute the motion-shape features, we use the annotations provided by Jiang et al. [3]. In Figure 4(a), we plot the overall accuracy achieved for each set for a particular length of test sub-sequence. The proposed algorithm obtains a high accuracy of 92% or more for sets 1 and 4 and a moderate accuracy of 84% or more for sets 2 and 3. The drop in accuracy for the latter sets is due to the challenging conditions present in the scene, where the features are more susceptible to noisy artifacts introduced by camera shake and continuous scale change (set 2) and by variation of clothing (set 3). In Figure 4(b), we see that in spite of noisy annotations and noisy foreground segmentation (MHI), and without the mean-shift tracking used in [3], the proposed framework achieves an overall accuracy of around 90% or higher. For each action, the feature selection technique boosts the accuracy by 12% while using a reduced set of features, except for the jog action. This illustrates the effectiveness of the feature selection in capturing only the features relevant for action recognition. The proposed algorithm is also compared with some recent techniques published in the literature in Table 1. Our algorithm achieves accuracy close to the state of the art while using only 20 frames to identify the action; thus the proposed framework does not require the complete sequence for classification.
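As a usage note, the (window size, overlap) setting in Table 1 corresponds to sliding a fixed-length window over the stream and classifying each window independently; a sketch of this evaluation loop, reusing the hypothetical classify_window from the previous section, might look as follows.

```python
def classify_stream(frame_features, models, bases, covs, win_size=20, overlap=15):
    """Slide a window of win_size frames with the given overlap over the
    per-frame features of a stream and label each window independently.
    classify_window is the hypothetical association routine sketched earlier."""
    step = win_size - overlap
    labels = []
    for start in range(0, len(frame_features) - win_size + 1, step):
        window = frame_features[start:start + win_size]
        labels.append(classify_window(window, models, bases, covs))
    return labels
```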
6.2 UCF Sports Dataset

This high-resolution dataset contains 182 video sequences and consists of 9 different actions, namely diving, golf swinging, lifting, kicking, horseback riding, running, skating, swinging and walking. Here, we set the number of code-words per action class to $K(m) = 250$, with the thresholds set to $\tau_f = 0.1$ and $\tau_m = 0.995$. This is a very challenging dataset, mainly because it contains many viewpoint variations and is collected from the web. The end purpose of this dataset is to test action recognition algorithms suited for unconstrained video search applications; it therefore helps to evaluate the proposed algorithm under conditions of large viewpoint variation. In Table 2, we summarize the accuracy results obtained for different sub-sequence lengths for each type of action. Although accuracies of 88% have been reported in the literature, no previous method analyzed or classified sub-sequences of length 15-25 frames. In Table 3, we compare our proposed learning scheme against the well-known combination of the bag-of-words model and kernel SVM used extensively in unconstrained video search applications [4].


Table 2: Accuracy (%) for each action for specific combinations of window size and use of feature selection (FS).

Features (win size, overlap)  dive golf kick lift ride run skate swing walk  overall
Proposed (8,6)                 93   67   21   87   71   58   49   100   49     66
Proposed (8,6) + FS            94   63   18   86   68   58   45    99   46     64
Proposed (10,5)                94   74   18   92   72   62   51   100   50     68
Proposed (10,5) + FS           94   66   16   86   67   60   48   100   48     65
Proposed (12,6)                94   69   21   89   69   61   49   100   51     67
Proposed (12,6) + FS           93   65   15   89   71   60   49   100   49     66
Proposed (14,7)                96   73   20   92   75   60   52   100   52     69
Proposed (14,7) + FS           96   66   15   89   73   60   48   100   51     67

Table 3: Comparison on UCF Sports with state-of-the-art learning mechanisms.

Feature Set    Learning Mechanism                Overall Accuracy
HOF+LBFP+RT    BoW + Multi-Channel Kernel SVM    63.8%
HOF+LBFP+RT    BoW + Linear SVM                  69.48%
HOF+LBFP+RT    BoW + Gaussian Kernel SVM         68%
HOF+LBFP+RT    PCA + GRNN + Prob. Association    69%

The proposed learning scheme comes close to the accuracy obtained with the bag-of-words model, but its key advantage lies in the flexibility of handling different sequence lengths at test time without any prior learning. For BoW-SVM learning mechanisms, prior learning of the features corresponding to different sequence lengths is required before test time.

7 Conclusions

A novel algorithm is proposed which computes per-frame motion-shape features and models their temporal variation, thereby facilitating real-time human action recognition from streaming surveillance footage. Due to the time-series modeling by the GRNN and the probabilistic matching scheme, we can determine the action taking place within a fixed short temporal window of 20-25 frames, making the approach suitable for streaming video applications. Experimental results validate this model through extensive testing over different sub-sequence lengths on two datasets: one of low resolution with artifacts associated with surveillance cameras, and the other of high resolution with large scale and viewpoint variations.

Acknowledgments. This work is independent research done as a continuation of work initially supported by the US Department of Defense (US Army Medical Research and Materiel Command - USAMRMC) under the program Bioelectrics Research for Casualty Care and Management.



References
1. Chin, T.J., Wang, L., Schindler, K., Suter, D.: Extrapolating learned manifolds for human activity recognition. In: IEEE International Conference on Image Processing (ICIP 2007), vol. 1, pp. 381-384 (October 2007)
2. Holmstrom, I.: Analysis of time series by means of empirical orthogonal functions. Tellus 22(6), 638-647 (1970)
3. Jiang, Z., Lin, Z., Davis, L.: Recognizing human actions by learning and matching shape-motion prototype trees. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(3), 533-547 (March 2012)
4. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), pp. 1-8 (2008)
5. Liao, S., Chung, A.: Face recognition with salient local gradient orientation binary patterns. In: 16th IEEE International Conference on Image Processing (ICIP 2009), pp. 3317-3320 (November 2009)
6. Nair, B., Asari, V.: Time invariant gesture recognition by modelling body posture space. In: Advanced Research in Applied Artificial Intelligence, Lecture Notes in Computer Science, vol. 7345, pp. 124-133. Springer Berlin Heidelberg (2012)
7. Nair, B., Asari, V.: Regression based learning of human actions from video using HOF-LBP flow patterns. In: IEEE International Conference on Systems, Man, and Cybernetics (SMC 2013), pp. 4342-4347 (October 2013)
8. Rodriguez, M., Ahmed, J., Shah, M.: Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), pp. 1-8 (June 2008)
9. Saghafi, B., Rajan, D.: Human action recognition using pose-based discriminant embedding. Signal Processing: Image Communication 27(1), 96-111 (2012)
10. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), vol. 3, pp. 32-36 (August 2004)
11. Shao, L., Chen, X.: Histogram of body poses and spectral regression discriminant analysis for human action categorization. In: Proc. BMVC, pp. 88.1-88.11 (2010)
12. Specht, D.: A general regression neural network. IEEE Transactions on Neural Networks 2(6), 568-576 (November 1991)
13. Wang, J., Chen, Z., Wu, Y.: Action recognition with multiscale spatio-temporal contexts. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), pp. 3185-3192 (2011)
14. Yeffet, L., Wolf, L.: Local trinary patterns for human action recognition. In: IEEE 12th International Conference on Computer Vision (ICCV 2009), pp. 492-497 (September 2009)
15. Yu, L., Liu, H.: Feature selection for high-dimensional data: a fast correlation-based filter solution. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML-03), pp. 856-863 (2003)
16. Yuan, C., Li, X., Hu, W., Ling, H., Maybank, S.: 3D R transform on spatio-temporal interest points for action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013), pp. 724-730 (2013)
