Ling Shao
I. INTRODUCTION
In the past few years, along with the explosion of online image and video data, computer vision based applications in image/video retrieval, human-computer interaction, sports events analysis, etc. have been receiving significant attention. Also, as can be anticipated, future products such as Google Glass, which can essentially revolutionize traditional human-computer interaction, will bring more requirements and challenges to computer vision algorithms. As an important topic in computer vision, human action recognition plays a key role in a wide range of applications. Many approaches [21], [19], [31], [27], [7], [24], [8], [28], [20], [30] have been proposed; however, some challenges still remain in real-world scenarios due to cluttered backgrounds, viewpoint changes, occlusion and geometric variations of the target.
Recently, novel strategies have been proposed to represent human actions more discriminatively. These representations include optical flow patterns [5], [2], 2D shape matching [15], [26], [12], spatio-temporal interest points [4], [14], trajectory-based representations [17], etc. Many state-of-the-art action recognition systems [18], [22], [11] are based on the bag-of-features (BoF) model, which represents an action video as a histogram of its local features. When cooperating with infor-
Fig. 1. The flowchart of our framework. Low-level dense trajectories are first coded with LLC to derive a set of coding descriptors. By pooling the peak values
of each dimension of all local coding descriptors, a histogram that captures the local structure of each action is obtained. Dictionary learning is conducted
utilizing randomly selected actions from both views, then source view training actions and target view testing actions are coded with the learned dictionary
pair to obtain the cross-view sparse representations.
Sparse coding represents each sample x_i over a dictionary D by minimizing the reconstruction loss

\ell(x_i, D) = \min_{\alpha_i} \frac{1}{2}\|x_i - D\alpha_i\|_2^2, \quad s.t.\ \|\alpha_i\|_0 \le T, \qquad (1)

and the dictionary is learned by minimizing the empirical cost over all N training samples:

f_N(D) = \frac{1}{N}\sum_{i=1}^{N} \ell(x_i, D). \qquad (2)

For each local descriptor v_i, the LLC code c_i over the codebook B is obtained by solving

F_{LLC}(c_i) = \min_{c_i} \|v_i - B c_i\|_2^2 + \lambda\|d_i \odot c_i\|_2^2, \quad s.t.\ \mathbf{1}^T c_i = 1, \qquad (3)

where the locality adaptor is

d_i = \exp\!\left(\frac{\mathrm{dist}(v_i, B)}{\sigma}\right), \qquad (4)

where dist(v_i, B) denotes the set of Euclidean distances between v_i and all the codebook atoms b_j, and \sigma adjusts the weight decay speed of the locality adaptor. By subtracting max(dist(v_i, B)) from dist(v_i, B), each entry of d_i is normalized to lie in (0, 1]. Applying max pooling on all the local codes along each dimension of the codebook, a global representation x_i is obtained.
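As a concrete illustration, the LLC coding and max-pooling steps above can be sketched in NumPy. This is a minimal sketch following the analytical LLC solution of Wang et al. [23]; the regularization weight `lam`, the bandwidth `sigma`, and the random codebook in the usage below are illustrative placeholders, not the paper's settings:

```python
import numpy as np

def llc_code(v, B, lam=1e-4, sigma=1.0):
    """Locality-constrained linear coding of one feature v over codebook B (M x D)."""
    dist = np.linalg.norm(B - v, axis=1)          # Euclidean distances dist(v, B)
    d = np.exp((dist - dist.max()) / sigma)       # locality adaptor, normalized to (0, 1]
    z = B - v                                     # shift codebook atoms to the origin
    C = z @ z.T                                   # local covariance
    c = np.linalg.solve(C + lam * np.diag(d ** 2), np.ones(len(B)))
    return c / c.sum()                            # enforce the constraint 1^T c = 1

def pool_video(features, B):
    """Max-pool the peak value of each codebook dimension over all local codes."""
    codes = np.stack([llc_code(v, B) for v in features])
    return codes.max(axis=0)                      # global representation x_i
```

Max pooling keeps, for each codebook atom, the strongest response across all local descriptors of a video, which matches the "peak value" pooling described above.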
D. Cross-view dictionary learning
When considering action samples from both the source view and the target view, two corresponding dictionaries are learned simultaneously. In addition to finding a good representation for each view, for each training sample x_i we have to minimize the divergence between X_s and X_t:

F(D_S, D_T) = \min_{\alpha_s, \alpha_t} \frac{1}{2}\|X_s - D_s\alpha_s\|_2^2 + \frac{1}{2}\|X_t - D_t\alpha_t\|_2^2 + \lambda\,\Phi([\alpha_s, \alpha_t]), \quad s.t.\ \|\alpha_s^i\|_0 \le T,\ \|\alpha_t^j\|_0 \le T, \qquad (6)

where \Phi(\cdot) penalizes the divergence between the source-view and target-view codes. The cross-view correspondence is captured by the matrix

Q = \begin{pmatrix} \phi(x_t^1, x_s^1) & \cdots & \phi(x_t^1, x_s^n) \\ \vdots & \ddots & \vdots \\ \phi(x_t^m, x_s^1) & \cdots & \phi(x_t^m, x_s^n) \end{pmatrix}, \qquad (7)

where the similarity between a target-view sample and a source-view sample is measured by a Gaussian kernel:

\phi(x_t^i, x_s^j) = \begin{cases} \exp\!\left(-\dfrac{\|x_t^i - x_s^j\|_2^2}{2\sigma^2}\right), & \text{if } x_t^i \text{ and } x_s^j \text{ are a corresponding pair}, \\ 0, & \text{otherwise}. \end{cases} \qquad (8)

Thus, Equation (6) can be expanded as:

F(D_S, D_T) = \min_{\alpha_s, \alpha_t} \frac{1}{2}\|X_s - D_s\alpha_s\|_2^2 + \frac{1}{2}\|X_t - D_t\alpha_t\|_2^2 + \lambda\|\alpha_t - Q\alpha_s\|_2^2, \quad s.t.\ \|\alpha_s^i\|_0 \le T,\ \|\alpha_t^j\|_0 \le T. \qquad (10)
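A sketch of how the correspondence matrix Q and the coupling term of Equation (10) could be computed in NumPy. The Gaussian bandwidth `sigma`, the optional 0/1 correspondence mask `corr`, and the row-wise code layout (one code per row) are assumptions of this sketch, not details given in the text:

```python
import numpy as np

def build_Q(Xt, Xs, sigma=1.0, corr=None):
    """Correspondence matrix Q (m x n) of Gaussian similarities phi(x_t^i, x_s^j)."""
    sq = ((Xt[:, None, :] - Xs[None, :, :]) ** 2).sum(axis=2)
    Q = np.exp(-sq / (2.0 * sigma ** 2))
    if corr is not None:          # phi is zero for non-corresponding pairs
        Q = Q * corr
    return Q

def coupling_term(At, As, Q, lam=1.0):
    """lam * ||alpha_t - Q alpha_s||^2 with codes stored as rows (m x k and n x k)."""
    return lam * np.sum((At - Q @ As) ** 2)
```

The coupling term drives each target-view code toward the Q-weighted combination of its corresponding source-view codes, which is what ties the two dictionaries together during learning.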
E. Optimization
The problem of minimizing the empirical cost is not convex with respect to D_S and D_T. We further define \tilde{X} = (X_s^T, (X_t Q^T)^T)^T and \tilde{D} = (D_s^T, D_t^T)^T. Thus, optimizing Equation (11) is equivalent to optimizing the following equation:

f_n(\tilde{D}, \alpha) = \min_{\alpha} \frac{1}{2}\|\tilde{X} - \tilde{D}\alpha\|_2^2, \quad s.t.\ \|\alpha\|_0 \le T. \qquad (12)

Equation (12) is optimized by alternating between the two sets of variables: with \tilde{D} fixed, the sparse codes \alpha are computed; with \alpha fixed, each dictionary atom d_k is updated in turn (Equation (13)).
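The sparse-coding step of Equation (12) can be sketched with plain orthogonal matching pursuit (OMP), a standard greedy solver for the L0-constrained problem. This is a minimal sketch; the paper itself may rely on a different solver, e.g. K-SVD [1] or online dictionary learning [16]:

```python
import numpy as np

def omp(x, D, T):
    """Greedy OMP: approximately minimize ||x - D a||_2 with at most T nonzeros."""
    residual = x.astype(float).copy()
    support, coef = [], np.zeros(0)
    for _ in range(T):
        j = int(np.argmax(np.abs(D.T @ residual)))   # atom most correlated with residual
        if j in support:
            break
        support.append(j)
        # re-fit coefficients on the current support by least squares
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    a = np.zeros(D.shape[1])
    a[support] = coef
    return a
```

Each column of \tilde{X} is coded independently this way; the stacked dictionary \tilde{D} is then updated with the codes held fixed, and the two steps alternate until convergence.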
G. Classification
We use the multivariate ridge regression model to train a linear classifier \tilde{W}:

\tilde{W} = \arg\min_W \|H - W\alpha\|_2^2 + \lambda\|W\|_2^2, \qquad (14)

where H is the label matrix of the training samples. This yields the closed-form solution

\tilde{W} = H\alpha^T(\alpha\alpha^T + \lambda I)^{-1}. \qquad (15)
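Under the closed-form solution (15), the classifier can be sketched in NumPy. The one-hot label matrix H and the code matrix alpha with samples stored as columns are the assumed layout of this sketch:

```python
import numpy as np

def train_ridge_classifier(H, A, lam=1.0):
    """Closed-form multivariate ridge regression: W = H A^T (A A^T + lam I)^{-1}."""
    k = A.shape[0]
    return H @ A.T @ np.linalg.inv(A @ A.T + lam * np.eye(k))

def predict(W, A):
    """Assign each column of A to the class with the largest regression response."""
    return np.argmax(W @ A, axis=0)
```

At test time, each cross-view sparse code is multiplied by W and assigned to the class with the largest response.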
Fig. 2. Exemplar frames from the IXMAS multi-view action recognition dataset. The columns show 5 action categories, including check watch, cross arms,
scratch head, sit down, wave, and the rows show all the 5 camera views for each action category.
TABLE I
PERFORMANCE COMPARISON OF ACTION RECOGNITION WITH AND WITHOUT KNOWLEDGE TRANSFER.

%          |    Camera 0     |    Camera 1     |    Camera 2     |    Camera 3     |    Camera 4
           | woTran   wTran  | woTran   wTran  | woTran   wTran  | woTran   wTran  | woTran   wTran
Camera 0   |   -        -    | 23.03    92.42  | 23.94    89.09  | 26.67    91.52  | 30.61    90.00
Camera 1   | 25.76    92.42  |   -        -    | 35.15    90.61  | 33.33    92.42  | 30.30    90.30
Camera 2   | 16.06    92.73  |  7.27    92.42  |   -        -    | 29.39    92.12  | 34.55    90.91
Camera 3   | 12.42    94.24  |  9.39    93.33  | 26.36    90.91  |   -        -    | 30.91    90.30
Camera 4   | 12.42    93.94  | 10.91    93.03  | 10.3     92.12  | 17.27    95.15  |   -        -
REFERENCES
[1] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006.
[2] J. L. Barron, D. J. Fleet, and S. S. Beauchemin. Performance of optical flow techniques. International Journal of Computer Vision, 12(1):43–77, 1994.
[3] N. Dalal, B. Triggs, and C. Schmid. Human detection using
oriented histograms of flow and appearance. In European
Conference on Computer Vision (ECCV), 2006.
[4] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior
recognition via sparse spatio-temporal features. In IEEE International Workshop on Visual Surveillance and Performance
Evaluation of Tracking and Surveillance, 2005.
[5] A. A. Efros, A. C. Berg, G. Mori, and J. Malik. Recognizing
action at a distance. In IEEE International Conference on
Computer Vision (ICCV), 2003.
[6] D. Gavrila and L. Davis. 3-d model-based tracking of humans
in action: a multi-view approach. In IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 1996.
[7] A. Gilbert, J. Illingworth, and R. Bowden. Action recognition using mined hierarchical compound features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):883–897, 2011.
[8] I. Junejo, E. Dexter, I. Laptev, and P. Perez. View-independent action recognition from temporal self-similarities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):172–185, 2011.
[9] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning
realistic human actions from movies. In IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2008.
[10] R. Li and T. Zickler. Discriminative virtual views for cross-view
action recognition. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2012.
[11] Y.-C. Lin, M.-C. Hu, W.-H. Cheng, Y.-H. Hsieh, and H.-M.
Chen. Human action recognition and retrieval using sole depth
information. In ACM International Conference on Multimedia
(ACM MM), 2012.
[12] Z. Lin, Z. Jiang, and L. S. Davis. Recognizing actions by shape-motion prototype trees. In IEEE International Conference on Computer Vision (ICCV), 2009.
[13] J. Liu, M. Shah, B. Kuipers, and S. Savarese. Cross-view action
recognition via view knowledge transfer. In IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), 2011.
[14] J. Liu, Y. Yang, and M. Shah. Learning semantic visual
vocabularies using diffusion distance. In IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2009.
[15] F. Lv and R. Nevatia. Single view human action recognition using key pose matching and Viterbi path searching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
[16] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary
learning for sparse coding. In International Conference on
Machine Learning (ICML), 2009.
[17] M. Raptis and S. Soatto. Tracklet descriptors for action
modeling and video analysis. In European Conference on
Computer Vision (ECCV). 2010.
[18] P. Scovanner, S. Ali, and M. Shah. A 3-dimensional SIFT descriptor and its application to action recognition. In ACM International Conference on Multimedia (ACM MM), 2007.
[19] L. Shao, S. Jones, and X. Li. Efficient search and localisation of human actions in video databases. IEEE Transactions on Circuits and Systems for Video Technology, 24(3):504–512, 2014.
[20] G. Sharma, F. Jurie, C. Schmid, et al. Expanded parts model
for human attribute and action recognition in still images. In
IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2013.
[21] Y. Song, L.-P. Morency, and R. Davis. Action recognition by
hierarchical sequence summarization. In IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2013.
[22] H. Wang, A. Klaser, C. Schmid, and C. Liu. Action recognition
by dense trajectories. In IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2011.
[23] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong.
Locality-constrained linear coding for image classification. In
IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2010.
[24] Y. Wang and G. Mori. Hidden part models for human action recognition: Probabilistic versus max margin. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(7):1310–1323, 2011.
[25] D. Weinland, E. Boyer, and R. Ronfard. Action recognition from
arbitrary views using 3d exemplars. In International Conference
on Computer Vision (ICCV), 2007.
[26] S. Xiang, F. Nie, Y. Song, and C. Zhang. Contour graph based human tracking and action sequence recognition. Pattern Recognition, 41(12):3653–3664, 2008.
[27] A. Yao, J. Gall, and L. Van Gool. Coupled action recognition and pose estimation from multiple views. International Journal of Computer Vision, 2012.
[28] Z. Zhang, C. Wang, B. Xiao, W. Zhou, S. Liu, and C. Shi.
Cross-view action recognition via a continuous virtual path. In
IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2012.
[29] J. Zheng, Z. Jiang, P. Phillips, and R. Chellappa. Cross-view
action recognition via a transferable dictionary pair. In British
Machine Vision Conference (BMVC), 2012.
[30] F. Zhu and L. Shao. Weakly-supervised cross-domain dictionary learning for visual recognition. International Journal of
Computer Vision, 2014.
[31] F. Zhu, L. Shao, and M. Lin. Multi-view action recognition using local similarity random forests and sensor fusion. Pattern Recognition Letters, 34(1):20–24, 2013.