DOI 10.1007/s00138-010-0298-4
ORIGINAL PAPER
Motion history image: its variants and applications
Md. Atiqur Rahman Ahad · J. K. Tan · H. Kim ·
S. Ishikawa
Received: 4 January 2010 / Revised: 21 May 2010 / Accepted: 10 September 2010 / Published online: 22 October 2010
© Springer-Verlag 2010
Abstract The motion history image (MHI) approach is a
view-based temporal template method which is simple but
robust in representing movements and is widely employed
by various research groups for action recognition, motion
analysis and other related applications. In this paper, we pro-
vide an overview of MHI-based human motion recognition
techniques and applications. Since the inception of the MHI
template for motion representation, various approaches have
been adopted to improve this basic MHI technique. We present
all important variants of the MHI method. This paper also
points out some areas for further research based on the MHI
method and its variants.
Keywords MHI · MEI · Motion recognition ·
Action analysis · Computer vision
1 Introduction
There are excellent surveys on human motion recognition and
analysis [1–3, 7, 13, 37, 61, 67, 69, 85, 102–104, 111, 114, 118,
131, 144, 145, 157, 169]. These papers cover many detailed
approaches and issues, and most of these have cited the
motion history image (MHI) method [31] as one of the impor-
tant methods. This paper surveys human motion and behavior
analysis based on the MHI and its variants for various appli-
cations. Action recognition approaches can be categorized
into one of three groups: (i) template matching, (ii) state-
space approaches and (iii) semantic description of human
behaviors [2, 144]. The MHI method is a template matching
Md. A. R. Ahad (✉) · J. K. Tan · H. Kim · S. Ishikawa
Faculty of Engineering, Kyushu Institute of Technology,
1-1, Sensui-cho, Tobata, Kitakyushu, Fukuoka 804-0012, Japan
e-mail: atiqahad@yahoo.com
approach. Approaches based on template matching first con-
vert an image sequence into a static shape pattern (e.g., MHI,
MEI), and then compare it to pre-stored action prototypes
during recognition [144]. Template matching approaches
are easy to implement and require less computational load,
though they are more prone to noise and more suscepti-
ble to the variations of the time interval of the movements.
Some template matching approaches are presented in Refs.
[6, 31, 96, 117, 123]. Moreover, recognition approaches can
be divided into (i) appearance- or view-based approaches, (ii)
generic human model recovery, and (iii) direct motion-based
recognition approaches [31]. Appearance-based motion rec-
ognition is one of the most practical recognition methods for
recognizing a gesture without any incorporation of sensors
on the human body or its neighborhoods. The MHI is a view-
based or appearance-based template-matching approach.
In the MHI, the silhouette sequence is condensed into a
gray-scale image, while dominant motion information is
preserved. Therefore, it can represent a motion sequence in a
compact manner. The MHI template is also not very sensitive
to silhouette noise, such as holes, shadows, and missing parts.
These advantages make these templates a suitable candidate
for motion and gait analysis [89]. The MHI keeps a history of temporal
changes at each pixel location, which then decays over
time [159]. The MHI expresses the motion flow or sequence
using the intensity of every pixel in a temporal manner.
The motion history recognizes general patterns of movement;
thus, it can be implemented with cheap cameras and
lower-powered CPUs [33, 34]. It can also be implemented in
low-light areas where structure cannot be easily detected.
The paper is organized as follows: Sect. 2 introduces
the basic MHI approach, and then we sum up the variants
of the MHI method in Sect. 3. Section 4 presents various
applications based on these approaches. Section 5 discusses
some issues related to the MHI approach and its variants for
Fig. 1 Development of the MHI images for two different actions. The produced MHI images are shown under the actions sequentially
future research perspectives. Finally, Sect. 6 concludes the
paper.
2 Overview of the motion history image method
This section presents an overview of the MHI method. The importance
of various parameters is analyzed. Finally, several limitations
of the basic MHI method are pointed out.
2.1 MHI and MEI templates
Bobick and Davis [30] first propose a representation and
recognition theory that decomposes motion-based recognition
by first describing where there is motion (the spatial
pattern) and then describing how the object is moving. They
[30, 31] present the construction of a binary motion energy
image (MEI) or binary motion region (BMR) [47], which
represents where motion has occurred in an image sequence.
The MEI describes the motion-shape and spatial distribution
of a motion. Next, an MHI is generated. Intensity of each
pixel in the MHI is a function of motion density at that loca-
tion. One of the advantages of the MHI representation is that
a range of times may be encoded in a single frame, and in this
way, the MHI spans the time scale of human gestures [33].
Taken together, the MEI and the MHI can be considered
as a two-component version of a temporal template, a vec-
tor-valued image where each component of each pixel is
some function of the motion at that pixel position [31]. These
view-specific templates are matched against the stored models
of views of known movements. Incorporation of both the
MHI and the MEI templates constitutes the MHI method.
The MHI H_τ(x, y, t) is computed from an update function Ψ(x, y, t) as

    H_τ(x, y, t) = τ                              if Ψ(x, y, t) = 1
                 = max(0, H_τ(x, y, t − 1) − δ)   otherwise

where τ is the temporal duration and δ is the decay parameter. Now
we will define the MEI. The MEI is the cumulative binary
motion image that can describe where a motion occurs in the
video sequence, computed from the start frame to the final
frame. The moving object's sequence sweeps out a particular
region of the image, and the shape of that region (where
there is motion, instead of how as in the MHI concept) can
be used to suggest the region where the movement occurs [57]. As
the update function Ψ(x, y, t) represents a binary image
sequence indicating regions of motion, the MEI E_τ(x, y, t)
can be defined as:

    E_τ(x, y, t) = ⋃_{i=0}^{τ−1} Ψ(x, y, t − i)
The MEI can be deduced from the MHI (by thresholding the
MHI above zero [31]):

    E_τ(x, y, t) = 1 if H_τ(x, y, t) ≥ 1
                   0 otherwise
A benefit of using the gray-scale MHI is that it is sensitive
to the direction of motion, unlike the MEI; hence the
MHI is better suited for discriminating between actions of
opposite directions (e.g., sitting down versus standing up)
[47]. However, both the MHI and the MEI images are impor-
tant for representing motion information. The two images
together provide better discrimination than either alone [31].
Figure 2 shows typical MHIs and MEIs for two mirror-actions:
one-hand waving and sideways body-bending.
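The MHI update and MEI thresholding above can be sketched in a few lines of NumPy. This is only a minimal illustration of the two rules; the function names and the toy 1×5 motion mask are ours, not from [31]:

```python
import numpy as np

def update_mhi(mhi, psi, tau, delta=1.0):
    """One MHI time step: pixels with motion (psi == 1) are set to tau;
    all other pixels decay toward zero by delta."""
    return np.where(psi == 1, float(tau), np.maximum(0.0, mhi - delta))

def mei_from_mhi(mhi):
    """The MEI is the MHI thresholded above zero."""
    return (mhi >= 1).astype(np.uint8)

# Toy example: a single moving blob sweeping right across a 1x5 strip.
tau = 3
mhi = np.zeros((1, 5))
for col in range(3):
    psi = np.zeros((1, 5), dtype=np.uint8)
    psi[0, col] = 1                 # motion detected at this column
    mhi = update_mhi(mhi, psi, tau)

# Older motion has decayed, so recent columns are brighter.
print(mhi)                # [[1. 2. 3. 0. 0.]]
print(mei_from_mhi(mhi))  # [[1 1 1 0 0]]
```

The recency gradient in the printed MHI is exactly what makes it direction-sensitive, while the binary MEI only records where motion occurred.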
2.2 Dependence on τ and δ
Figure 3 shows the dependence on τ in producing the MHI.
In this action of waving up the left hand (with 26 frames),
we produce different MHIs with different τ values. If the τ
value is smaller than the number of frames, then we lose
prior information of the action in its MHI. For example, when
τ = 15 for an action having 26 frames, we lose the motion
information of the first frame after 15 frames if the value of
the decay parameter (δ) is 1. On the other hand, if the temporal
duration value is set to a very high value compared to the
number of frames (e.g., 250 in this case for an action with 26
frames), then the changes of pixel values in the MHI template
are less significant. Therefore, this point should be considered
while producing MHIs.
Figure 4 shows the dependence on the decay parameter (δ)
while calculating the MHI image. In the basic MHI method
[31], δ is set to 1. While loading the frames, if there
is no change (or no presence) of motion in a specific pixel
where earlier there was motion, the pixel value is reduced
by δ. However, different δ values may provide slightly
different information; hence the δ value can be chosen empirically.
Researchers need to consider this parameter while
working with the MHI. The top row of Fig. 4 shows final
MHI images for the same action (as shown in Fig. 1 (top
row)) with different δ values (i.e., 1, 3, 5 and 10). We notice
that higher values of δ remove the earlier trail of the motion
sequence. The second row presents a running action. The
first two images are for δ = 1, and the latter two for δ = 3,
while the 1st and 3rd images are taken mid-way and the 2nd
and 4th images are taken at the end of the sequence. We note
that when δ = 3, part of the earlier motion information is
Fig. 3 Dependence on τ to
develop MHI images
Fig. 4 Dependence on δ in
calculating the MHI template
missing. Similarly, the 3rd row shows the MHIs for a walking
action. The bottom row presents MHIs (1st and 3rd) and MEIs
(2nd and 4th) for a walking action when τ is set to 250 instead
of its number of frames, 100. The first two images considered
δ = 3, while the last two images considered δ = 5. This
information is important: based on the demands and action
sets, we can modulate the values of τ and δ while producing
the MHI and the MEI.
Regarding the parameters, a question may arise: under
what circumstances does one want a faster versus slower
decay? From the above discussion, it is clear that the
values of τ and δ combine to determine how long it takes
for a motion to decay to 0, thus determining the temporal window
size. However, different settings can lead to the same
temporal window (e.g., τ = 10 and δ = 1 leads to the
same temporal window as τ = 100 and δ = 10). The joint
effect of τ and δ is to determine how many levels of quantization
the MHI will have; thus the combination of a large τ
and a small δ yields a slowly changing, continuous gradient,
whereas a large τ and a large δ provide a more step-like, discrete
quantization of motion. This provides insight into not
only what parameters and design choices one has, but also
into the impact of choosing different parameters or designs.
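This joint effect can be checked with a little arithmetic. The sketch below assumes the linear decay rule of Sect. 2.1; the function name `decay_steps` is ours:

```python
import math

def decay_steps(tau, delta):
    """Update steps needed for a pixel value to fall from tau to 0
    under the linear decay rule H <- max(0, H - delta)."""
    return math.ceil(tau / delta)

# Different (tau, delta) settings can share one temporal window:
assert decay_steps(10, 1) == decay_steps(100, 10) == 10

# With the same tau, a small delta gives a long, finely quantized
# gradient, while a large delta gives a short, step-like one:
print(decay_steps(100, 1))   # 100 intensity levels across the window
print(decay_steps(100, 25))  # 4 coarse steps
```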
2.3 Selection of the update function Ψ(x, y, t) for motion
segmentation
Many vision-based human motion analysis systems start
with human detection [144]. Human detection aims at seg-
menting regions of interest corresponding to people from
the rest of an image. It is a significant issue in a human
motion analysis system since the subsequent processes such
as tracking and action recognition are greatly dependent on
the performance and proper segmentation of the region of
interest. Background subtraction, frame differencing, optical
flow, and statistical methods for subtraction are renowned
approaches for motion segmentation. Depending on whether the
background is static or dynamic, the performance and method for
background subtraction vary. For a static background (i.e.,
with no background motion), it is trivial to subtract the
background when other factors like outdoor or cluttered
scenes are absent. A few dominant methods are listed
in Refs. [22, 56, 65, 74, 95, 101, 135, 136, 158, 160]. Some of
these methods employ various statistical approaches, adap-
tive background models (e.g., [74, 135]) and incorporation of
other features (e.g., color and gradient information in [95])
with an adaptive model, in order to negotiate dynamic back-
ground or other complex issues related to background sub-
traction. For the MHI generation, background subtraction is
employed initially by [31, 47].
Frame-to-frame differencing methods are also widely
used for motion segmentation [21, 28, 43, 76, 88, 146]. These
temporal differencing methods, employing two [21, 43,
88, 146] or three consecutive frames [28, 76], are adaptive to
dynamic environments, though poor extraction
of the relevant feature pixels is often noted. Unless the thresholds
are defined properly, the generation of holes (see Fig. 6) inside
moving objects is a major concern. To generate the MHI and
the MEI, temporal differencing methods are employed (e.g.,
by [8]) as well.
Optical flow methods [27, 29, 66, 91, 94, 113, 138, 149, 153]
can be used for the generation of the MHI and motion
segmentation for various purposes. Ahad et al. [6, 8, 10]
employed optical flow in their variants of the MHI for motion
segmentation to extract the moving object. Computing quality
optical flow from consecutive image frames is a challenging
task. To produce better results on a motion's presence and
its directions from optical flow, the RANSAC (RANdom Sample
Consensus) method [60] can be employed to reduce outliers.
Based on these refined optical flow vectors, the MHI can
be constructed, thus providing better direction and a clearer
picture of the motion's presence. Ahad et al. [6] employed
the optical flow's four channels to compute the MHIs. In this
case, instead of background or frame subtraction, a gradient-based
optical flow vector [42] is computed
between two consecutive frames and split into four channels
(as depicted in Fig. 5). It is based on the concept of
motion descriptors built on smoothed and aggregated optical flow
measurements [55].
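The four-channel split itself is a simple half-wave rectification of the flow field. The sketch below only illustrates that decomposition, assuming a dense flow field (u, v) has already been computed; the actual method in [6] uses a gradient-based flow [42] with RANSAC refinement:

```python
import numpy as np

def split_flow_channels(u, v):
    """Split a dense flow field (u, v) into four non-negative
    directional channels: right (+x), left (-x), down (+y), up (-y).
    Each channel can then drive its own directional MHI."""
    return {
        "right": np.maximum(u, 0.0),
        "left":  np.maximum(-u, 0.0),
        "down":  np.maximum(v, 0.0),
        "up":    np.maximum(-v, 0.0),
    }

# Two pixels: one moving right-and-down, one moving left-and-up.
u = np.array([[2.0, -1.0]])
v = np.array([[0.5, -0.5]])
ch = split_flow_channels(u, v)
print(ch["right"])  # [[2. 0.]]
print(ch["up"])     # [[0.  0.5]]
```

Each channel is non-negative, so it can be thresholded into a per-direction update function for a directional MHI.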
Though optical flow can produce good results even in the
presence of a bit of camera motion, it is computationally
complex and very sensitive to noise and the presence of texture.
Moreover, from a real-time perspective, special optical flow
methods can be tried to ascertain whether they can achieve
better results without the incorporation of special hardware.
Fig. 5 Optical flow is split into four different channels, which are used
to calculate directional MHIs
Beauchemin and Barron [27] and McCane et al. [94] have
presented various methods for optical flow. Seven different
methods are tested in [94] for
benchmarking optical flow methods. Several real-time optical
flow methods [29, 138, 149] are developed for various motion
segmentation and computation purposes.
Changes in weather, illumination variation, repetitive
motion, and the presence of camera motion or a cluttered environment
hinder the performance of motion segmentation
approaches. Therefore, a proper approach is crucial based
on the dataset or environment, especially for outdoor environments.
Detecting shadows and removing them from the
motion part is another concern in computer vision, and most
importantly in generating the MHI template.
As pointed out above, one important concern is the selection
of the update function Ψ(x, y, t) for motion segmentation
and its threshold value (ξ). Figure 6 demonstrates typical
examples of the selection of threshold values for the frame
subtraction method. The top row presents MHIs for an action
with different threshold values (i.e., 30, 50, 75 and 150 from
left to right). We note the presence of a noisy background when
the threshold is set at 30 (first image of the 1st row).
However, if we increase ξ, we also miss some part of the
motion information (note the presence of a hole in the rightmost
image in the top row). In another example (as shown in
the bottom-row images), we use a walking motion in a different
environment and depth. The noisy MHI and MEI images
(1st two images) for a walking action employed ξ = 12,
whereas the next two images (without any noisy background
but with missing information) used ξ = 150. Therefore, the
selection of the update function and its ξ are very crucial for
calculating motion history/energy templates.
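A frame-differencing update function and the effect of its threshold can be sketched as follows. This is a minimal illustration; the variable names and the toy intensity values are ours:

```python
import numpy as np

def update_function(frame_t, frame_prev, xi):
    """Psi(x, y, t): 1 where the absolute inter-frame difference
    exceeds the threshold xi, else 0."""
    diff = np.abs(frame_t.astype(np.int32) - frame_prev.astype(np.int32))
    return (diff > xi).astype(np.uint8)

prev = np.array([[100, 100, 100]], dtype=np.uint8)
curr = np.array([[100, 140, 250]], dtype=np.uint8)

# A low threshold keeps weak motion (but also keeps noise); a very
# high threshold suppresses noise and the motion along with it.
print(update_function(curr, prev, xi=30))   # [[0 1 1]]
print(update_function(curr, prev, xi=150))  # [[0 0 0]] -- motion missed
```

Casting to a signed type before subtracting avoids unsigned wrap-around in the difference image.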
Fig. 6 Importance of the
selection of the threshold value (ξ)
for the update function
Ψ(x, y, t). Note the presence of
noise or holes in various images
Fig. 7 Changes in the standing
position of a person (top row)
make the MHIs (1st and 3rd
images) and MEIs (2nd and 4th
images) wider, as shown in the
bottom row (the 1st two images are
computed at 45 frames and the
remaining two images are at the final
frame)
Another issue is the change of the standing position of a
person while executing an action that is supposed to be in one
specific location. For example, Fig. 7 depicts a person moving
from the standing position; hence the final MHI becomes wider.
Therefore, if an action does not incorporate movement from
its initial position, then tracking of the central point of the
moving body is required for this kind of position change.
Another useful option is to normalize the size of the entire
moving body and then create the MHI based on the normalized
moving portion; or to normalize the MHI and the MEI for
further processing. This is crucial for recognition purposes
employing the MHI images, because the MHI method
takes into account the global calculation of the image, and
hence changing position makes the final MHI wider than
the object of interest and incorporates some unwanted region
of interest.
2.4 Feature vector analysis and classification
Figure 8 shows the system flow of the basic MHI approach
for motion classification and recognition. According to the
basic MHI method [31], feature vectors are calculated using
the seven Hu moments [68] from the MHI and MEI images. Hu
moments are widely used for shape representation [6, 8, 10,
11, 15, 31, 33, 34, 47, 48, 122, 123, 170]. There are other
approaches to get the shape for calculating feature vectors
from the templates. Figure 9 shows various other options
for shape representation based on [68, 80, 81, 168]. Though
Fig. 8 Typical system flow diagram of the MHI method for action
recognition
Hu invariants are widely employed for the MHI or related
methods, other approaches (e.g., Zernike moments [15, 16],
global geometric shape descriptors [15], the Fourier transform
[143]) are also utilized for creating feature vectors. Several
researchers [15, 16, 38, 39, 122] employ PCA to reduce the
dimensions of the feature vectors.
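As an illustration of such moment-based features, the first two Hu invariants can be computed directly from a template with NumPy. This is a minimal sketch; [31] uses all seven Hu moments [68], typically via a library routine:

```python
import numpy as np

def hu_first_two(img):
    """First two Hu invariants of a gray-scale template (e.g., an MHI),
    via normalized central moments eta_pq."""
    img = img.astype(np.float64)
    y, x = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    m00 = img.sum()
    xbar, ybar = (x * img).sum() / m00, (y * img).sum() / m00

    def eta(p, q):  # normalized central moment
        mu = (((x - xbar) ** p) * ((y - ybar) ** q) * img).sum()
        return mu / m00 ** (1 + (p + q) / 2)

    h1 = eta(2, 0) + eta(0, 2)
    h2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return h1, h2

# Translation invariance: shifting the template leaves both unchanged.
a = np.zeros((8, 8)); a[2:4, 2:6] = 1.0
b = np.zeros((8, 8)); b[4:6, 1:5] = 1.0   # same shape, shifted
print(np.allclose(hu_first_two(a), hu_first_two(b)))  # True
```

Such invariance to translation (and, for the full set, to rotation and scale) is exactly what makes these moments attractive as template features.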
After feature vectors are developed, classification is done
and unknown motions are recognized. These steps are shown
in the system flow diagram of the MHI method (Fig. 8).
For classification, the support vector machine (SVM) [15,
16, 39, 96–99], K-nearest neighbor (KNN) [6, 11, 23, 24, 112,
122, 141], multi-class nearest neighbor [15, 16], Mahalanobis
distance [31, 33, 34, 47] and maximum likelihood (ML) [110]
are employed.
One could employ (i) the re-substitution method (training
and test sets are the same); (ii) the holdout method (half the
data is used for training and the rest is used for testing);
(iii) the leave-one-out method; (iv) the rotation method or
Fig. 9 Numerous approaches
for shape representation.
Region-based global Hu
moments [68] are considered by
many researchers [6, 8, 10, 11, 15,
33, 34, 47, 48, 122, 123, 170],
including Bobick and Davis [31]
N-fold cross validation (a compromise between the leave-one-out
method and the holdout method, which divides the samples
into P disjoint subsets, 1 ≤ P ≤ N, and uses (P − 1) subsets for
training and the remaining subset for testing); and (v) the bootstrap
method as the partitioning scheme [70]. In most
cases, the leave-one-out cross validation scheme is used as the
partitioning scheme (e.g., [6, 11, 110]). This means that out
of N samples from each of the c classes per database, N − 1 of
them are used to train (design) the classifier and the remaining
one to test it [81]. This process is repeated N times, each
time leaving a different sample out. Therefore, all of the samples
are ultimately used for testing. This process is repeated
and the resultant recognition rate is averaged. Usually, this
estimate is unbiased.
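The leave-one-out scheme with a nearest-neighbor classifier can be sketched as follows. This is a toy illustration; the feature vectors below are synthetic, not drawn from any of the cited datasets:

```python
import numpy as np

def loocv_accuracy(features, labels):
    """Leave-one-out: classify each sample with a 1-NN classifier
    trained on all the others, then average the outcomes."""
    features, labels = np.asarray(features, float), np.asarray(labels)
    correct = 0
    for i in range(len(features)):
        d = np.linalg.norm(features - features[i], axis=1)
        d[i] = np.inf                      # leave sample i out
        correct += labels[d.argmin()] == labels[i]
    return correct / len(features)

# Toy moment-feature vectors for two well-separated action classes:
X = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
y = [0, 0, 1, 1]
print(loocv_accuracy(X, y))  # 1.0
```

Every sample is tested exactly once, which is why the averaged rate uses all of the data.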
2.5 Limitations of the basic MHI method
Though successful in constrained situations, there are a few
limitations of the basic MHI method. The MHI method is
not suitable for dynamic backgrounds with its basic representation
(which is based on background subtraction or
image differencing approaches) [129]. However, by employing
approaches that can segment motion information from
a dynamic background, the MHI method can be useful in
dynamic cases too. Occlusion of the body, or improper
implementation of the update function Ψ(x, y, t), results in
serious recognition failures [2, 31].
The MHI method does not need trajectory analysis [46].
However, its non-trajectory nature can be a problem in
cases where tracking could be necessary to analyze a moving
car or a person [8]. The MHI representation is exploited with
tracking information for some applications (e.g., by [123]).
It is also limited to label-based (token) recognition, where it
cannot yield any information other than specific identity
matches (e.g., it cannot report that upward motion is
occurring at a particular image location) [24, 49, 50]. This is
due to the fact that the moment features are generated (and
matched) holistically, being computed from the entire
template [49].
Another limitation of this method is the requirement of
having stationary objects, and the insufficiency of the representation
to discriminate among similar motions [123].
The MHI method is an appearance-based method. However,
by employing several cameras from different directions and
by combining moment features from these directions, action
recognition can be achieved. Nevertheless, due to similar
representations of different actions (but from different camera
views), it may produce false recognition for an action.
Another key problem of this method is its failure to
separate the motion information when there is motion
self-occlusion or overwriting [8, 19, 73, 96, 112, 141]. In this
problem, if an action has opposite directions (e.g., from
sitting down to a standing position) in its atomic actions,
then the previous motion information (e.g., sitting down) is
deleted or overwritten by the latter motion information (e.g.,
standing) (Fig. 10). Therefore, if a person sits down and then
stands up, the final MHI image will contain brighter pixels
in the upper part of the image, representing the stand-up
motion only. It cannot vividly distinguish the direction of the
motion. This self-occlusion of the moving object or person
overwrites the prior information. Like any template matching
approach, the MHI also has the drawback that it is sensitive
to the variance of movement duration [145].
3 Motion history image-based approaches
In this section, MHI-based approaches are presented. We start
with direct implementations of the MHI method for numerous
applications, and afterwards present some modifications.
Fig. 10 Motion overwriting problem (due to self-occlusion) of the
MHI method
We also categorize and analyze important developments of
the MHI in 2D and 3D domains.
3.1 Various approaches employing the MHI method
3.1.1 Direct implementation of the MHI
Due to its simple representation of an action, the MHI method
is employed by different researchers without any modification
for their respective demonstrations. Rosales [122] and
Rosales and Sclaroff [123] employ the MHI method with
seven Hu moments, and Rosales [122] uses principal components
analysis (PCA) to reduce the dimensionality of this
representation. The system is trained using different subjects
performing a set of examples of every action to be recognized.
Given these samples, K-nearest neighbor, Gaussian,
and Gaussian mixture classifiers are used to recognize new
actions. Experiments are conducted using instances of eight
human actions performed by seven different subjects, and
good recognition results are achieved. Rosales and Sclaroff
[123] propose a trajectory-guided recognition method. It
tracks an action by employing an extended Kalman filter and
then uses the MHI for action recognition via a mixture-of-Gaussians
classifier. They test the system by recognizing different
dynamic outdoor activities.
Jan [71] hypothesizes that a suspicious person in a
restricted parking lot would display erratic patterns in his/her
walking trajectories (to inspect vehicles and their belongings
for possible malicious attempts). To this aim, trajectory information
is collected and its MHI (based on profiles of changes
in velocity and in acceleration) is computed. The highest,
average and median MHI values are profiled for each individual
on the scene. Though it is a simple hypothesis, collecting
real-time data from surveillance devices seems
challenging. Nonetheless, it is an initial attempt
to analyze the information, which can be exploited
to decide possible suspicious behaviors. Apparently, there can
be far more features than just trajectories (and their velocities and
accelerations).
Alahari and Jawahar [18] model action characteristics by
MHIs for some hand gesture recognition tasks and four different
actions (i.e., jumping, squatting, limping and walking). They
introduce discriminative actions, which describe the usefulness
of the fundamental units in distinguishing between
events. They achieve an average 30.29% reduction in error for
some event pairs.
Shan et al. [129] employ the MHI for hand gesture recognition
considering the trajectories of the motion. They employ
the mean shift embedded particle filter, which enables a robot
to robustly track natural hand motion in real-time. Then, an
MHI for a hand gesture is created based on the hand tracking
results. In this manner, spatial trajectories are retained
in a static image, and the trajectories are called temporal
template-based trajectories (TTBT). Hand gestures are recognized
based on statistical shape and orientation analysis of
the TTBT. By applying this hand tracking algorithm and gesture
recognition approach in a wheelchair, they have realized a
real-time hand control interface for the robot. Meng et al.
[100] developed a simple system based on an SVM classifier
and MHI representations, which is implemented on a reconfigurable
embedded computer vision architecture for real-time
gesture recognition. In another work, by Vafadar and
Behrad [140], the MHI is employed for gesture recognition
for interaction with handicapped people. In this approach,
after constructing the MHI for each gesture, a motion orientation
histogram vector is extracted. These vectors are then
used for the training of a hidden Markov model (HMM) and
hand gesture recognition.
Yau et al. [162, 163] decompose MHIs into wavelet sub-images
using the stationary wavelet transform (SWT). The motivation
for using the MHI in visual speech recognition is
the ability of the MHI to remove static elements from the
sequence of images and preserve the short-duration, complex
mouth movements. The MHI is also invariant to the skin color
of the speakers due to the frame differencing and image subtraction
process involved in the generation of the MHI. Here,
the SWT is used to denoise and to minimize the variations
between the different MHIs of the same consonant. Three
moment-based features are extracted from the SWT sub-images
to classify three consonants only.
The MHI is used to produce input images for the line
fitter, which is a system for fitting lines to a video sequence
to describe its motion [58]. It uses the MHI method for
summarizing the motion depicted in video clips; however,
it fails with rotational motion. Rotations are not encoded in
the MHIs because the moving objects occupy the same pixel
locations from frame to frame, and new information overwrites
old information. Another failure example is that of
curved motion. Obviously, the straight-line model is inadequate
here. In order to improve performance, a more flexible
model is needed.
Orrite et al. [110] propose silhouette-based action modeling
for recognition, where they employ the MHI directly
as the input feature of the actions. Then these 2D templates
are projected into a new subspace by means of the Kohonen
self-organizing feature map (SOM). Action recognition is
accomplished by a maximum likelihood (ML) classifier. In
another experiment, Tan and Ishikawa [139] employ the
MHI method and their proposed method to compare six different
actions. Their results produce a poor recognition rate.
After analyzing the datasets, it seems that actions inside the
dataset have motion overwriting; hence, it is understandable
that the MHI method may have a poor recognition rate for this
type of dataset. Also, Meng et al. [96–99] and Ahad et al. [8]
compare the recognition performance of the MHI method
with their HMHH and DMHI methods, respectively, for several
different datasets (one with a radio-aerobics dataset and
another with the KTH dataset). These datasets have motion overwriting
due to self-occlusion, and therefore, their approaches
outshine the MHI method in terms of the average recognition
rates.
3.1.2 Implementation of the MHI with some modifications
In this sub-section, several methods and applications are presented
where the MHI method is exploited with little modification,
or which follow an almost similar route in developing
the motion cues. The MHI method, or the MHI and/or MEI
templates, are implemented with some modifications by several
researchers in different applications. To start with, Han
and Bhanu [63, 64] proposed the gait energy image (GEI),
which targets a specific, normal human walking representation,
based on the concept of the MEI. The GEI is implemented as
a gait template for individual gait recognition. As compared
to the MEI and the MHI, the GEI targets specifically normal
human walking representation [63]. Given the preprocessed
binary gait silhouette images B_t(x, y) at time t in a video sequence,
the gray-level GEI is defined as
    G(x, y) = (1/N) Σ_{t=1}^{N} B_t(x, y)
where N is the number of frames in the complete cycle(s) of
a silhouette sequence, t is the frame number in the sequence
(moment of time) [63]. Therefore, this GEI becomes a time-normalized
accumulative energy image of human walking
over the complete cycle(s). Though it performs very well for
gait recognition, it seems from the construction of the equation
that for human activity recognition, this approach
might not perform as smartly as the MHI method does. In
a similar fashion, Zou and Bhanu [170] employ the GEI
and co-evolutionary genetic programming (CGP) for human
activity classification. They extract Hu moments and normalized
histogram bins from the original GEIs as input features.
The CGP is employed to reduce the feature dimensionality
and learn the classifiers. Bashir et al. [25, 26] and Yang et al.
[161] implement the GEI directly for human identification
with different feature analyses.
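The GEI computation reduces to a pixel-wise average over one gait cycle; a minimal sketch (the toy 2×2 silhouettes are illustrative):

```python
import numpy as np

def gei(silhouettes):
    """Gait energy image: pixel-wise average of the N binary
    silhouettes B_t over a complete gait cycle."""
    stack = np.stack([s.astype(np.float64) for s in silhouettes])
    return stack.mean(axis=0)

# Two toy binary silhouettes of a walker at different phases:
b1 = np.array([[1, 0], [1, 1]])
b2 = np.array([[1, 1], [0, 1]])
print(gei([b1, b2]))  # [[1.  0.5] [0.5 1. ]]
```

Pixels equal to 1 belong to the body in every frame; fractional values mark regions that move during the cycle, which is why the GEI encodes energy but no temporal ordering.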
Similar to the development of the GEI, an action energy
image (AEI) is proposed for activity classification by
Chandrashekhar and Venkatesh [38]. They use the eigen decomposition
of an AEI in the eigen activity space obtained by PCA,
which best represents the AEI data in the least-squares sense.
AEIs are computed by averaging silhouettes, and unlike the
MEI, which captures only where the motion occurred, the AEI
captures where and how much the motion occurred. The
MEI carries less structural information, since it is computed
by accumulating motion images obtained by image differencing,
while the AEI incorporates information about
both structure and motion. They experiment with their AEI
concept for walking and running motions and achieve good
results. On the other hand, Liu and Zheng [89] propose a
method called the gait history image (GHI) for gait representation
and recognition. The GHI inherits the idea of the MHI
in the sense that temporal information and spatial information
can be recorded in both cases. The GHI preserves
the temporal information besides the spatial information. It
overcomes the shortcoming of no temporal variation in the
GEI. However, each cycle only yields one GEI or GHI template,
which easily leads to the problem of insufficient training
cycles [41].
Moreover, the gait moment image (GMI) method is developed
by Ma et al. [92] based on the GEI. The GMI is the
gait probability image at each key moment of all gait cycles.
In this approach, the corresponding gait images at a key
moment are averaged as the GEI of this key moment. They
introduce the moment deviation image (MDI) by using silhouette
images and GMIs. As a good complement to the GEI, the
MDI provides more motion features than the GEI. Both MDI
and GEI are utilized to represent a subject. However, it is not
easy for the GMI to select key moments from cycles with different
periods. Therefore, to compensate for this problem, Chen
et al. [41] propose a cluster-based GEI approach. In this
case, the GEIs are computed from several clusters and the
dominant energy image (DEI) is obtained by denoising the
averaged image of each cluster. The frieze and wavelet features
are adopted and an HMM is employed for recognition.
This approach performs better than the GEI, the GHI and the
GMI representations, as it is superior (due to its clustering
concept) when the silhouette has incompleteness or noise.
Wang and Suter [147] directly convert an associated sequence of human silhouettes derived from videos into two types of computationally efficient representations, namely, the average motion energy (AME) and the mean motion shape (MMS), to characterize actions. These representations are used for recognition. The MMS is proposed in a similar manner to the AME, but is based on shapes rather than silhouettes. The process of generating the AME is computationally inexpensive and can be employed in real-time applications [166]. The AME is computed in exactly the same manner as the GEI, though the former is exploited for action recognition whereas the latter is used for gait recognition. In calculating the AME, Wang and Suter [147] employ the sum of absolute differences (SAD) for action recognition purposes and obtain adequate recognition results. However, for large image sizes or databases, computing the SAD is inefficient and computationally expensive. This constraint is addressed by Yu et al. [166], who propose a histogram-based approach that can efficiently compute the similarity among patterns. As an initial step, an AME image is converted to the motion energy histogram (MEH).
From a histogram point of view, we can regard the AME as a two-dimensional histogram whose bin value represents the frequency of motion at each position during the time interval. Thus, we can reform the AME into the MEH by using:

MEH(x, y) = AME(x, y) / Σ_{(x,y)} AME(x, y)

Then, a multi-resolution structure is adopted to construct the multi-resolution motion energy histogram (MRMEH).
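As a minimal sketch of these two steps (assuming binary silhouette frames stored as NumPy arrays; the function names are illustrative, not from [147] or [166]):

```python
import numpy as np

def average_motion_energy(silhouettes):
    """AME: pixel-wise average of a stack of binary silhouette frames."""
    return np.mean(np.asarray(silhouettes, dtype=float), axis=0)

def motion_energy_histogram(ame):
    """MEH: normalize the AME so that all bins sum to one."""
    total = ame.sum()
    return ame / total if total > 0 else ame

# Two toy 3x3 silhouette frames.
frames = [np.array([[0, 1, 0], [0, 1, 0], [0, 0, 0]]),
          np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]])]
ame = average_motion_energy(frames)  # values in {0.0, 0.5, 1.0}
meh = motion_energy_histogram(ame)
```

Multi-resolution versions (the MRMEH) would then be obtained by repeatedly down-sampling this histogram.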
A high-speed human motion recognition technique is proposed based on a modified-MHI and a superposed motion image (SMI) [108]. Using a multi-valued differential image f_i(x, y, t) to extract information about human posture, they propose a modified-MHI that can be defined as

H(x, y, t) = max(f_i(x, y, t), H(x, y, t − 1))

A related edge-based variant is the edge motion history image (EMHI), in which the template EMHI_t(x, y, t) is computed from the EMHI of the previous frame, EMHI_{t−1}(x, y, t), and a binary edge image B_t(x, y, t) as:

EMHI_t(x, y, t) = { τ   if B_t(x, y, t) = 1
                    max(0, EMHI_{t−1}(x, y, t) − 1)   otherwise

In this equation, the basic feature of the EMHI is edges. Later, the authors manage scale adaptation and noise (as existing edge detection algorithms are sensitive to noise). The motion history concept can help to smooth noise and provide historical motion clues that help a human vision system build correspondences on edge points [40]. They develop a layered Gaussian mixture model (LGMM) to exploit these features for classifying various shots in video.
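A sketch of this edge-driven update, assuming B_t is a binary edge map per frame (the names and the value τ = 255 are illustrative; this is not the code of [40]):

```python
import numpy as np

def update_emhi(emhi_prev, edge_map, tau=255):
    """EMHI update: stamp tau on edge pixels, decay all others by 1."""
    decayed = np.maximum(0, emhi_prev - 1)
    return np.where(edge_map == 1, tau, decayed)

emhi = np.zeros((3, 3), dtype=int)
edges = np.array([[0, 1, 0], [0, 0, 0], [0, 0, 0]])
emhi = update_emhi(emhi, edges)                  # edge pixel stamped to 255
emhi = update_emhi(emhi, np.zeros_like(edges))   # one decay step
```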
Another conceptually similar work to the MHI method [31] is proposed by Masoud and Papanikolopoulos [93]. This method extracts motion directly from the image sequence. At each frame, motion information is represented by a feature image, which is calculated efficiently using an infinite impulse response (IIR) filter. In particular, they use the response of the IIR filter as a measure of motion in the image. The idea is to represent motion by its recentness: recent motion is represented as brighter than older motion, just like [31]. This technique, also called recursive filtering, is simple and time-efficient. Unlike the MHI method [31], an action is represented by several feature images [93] rather than just two images (namely, the MHI and the MEI) [31].
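A first-order recursive filter of this kind can be sketched as follows (the exact IIR filter of [93] differs; α is an illustrative smoothing coefficient):

```python
import numpy as np

def recursive_motion_feature(frames, alpha=0.5):
    """IIR response to frame differences: recent motion stays brighter,
    older motion fades geometrically."""
    frames = np.asarray(frames, dtype=float)
    feature = np.zeros_like(frames[0])
    for prev, curr in zip(frames[:-1], frames[1:]):
        feature = alpha * np.abs(curr - prev) + (1.0 - alpha) * feature
    return feature

# A blob appears at t=1 and disappears again: its trace decays afterwards.
blank = np.zeros((2, 2))
blob = np.array([[1.0, 0.0], [0.0, 0.0]])
feature = recursive_motion_feature([blank, blob, blank, blank])
```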
3.2 Variants of the MHI method in 2D
3.2.1 Solutions to motion self-occlusion problem
One of the key limitations of the MHI method is its inability to perform well in the presence of motion overwriting due to self-occlusion. Several attempts have been made to mitigate this issue, so that multi-directional activities can be represented by the MHI concept. One initial approach is the multiple-level MHI (MMHI) method [112, 141, 142]. It aims at overcoming the problem of motion self-occlusion by recording motion history at multiple time intervals (i.e., multi-level MHIs). It constrains all MHIs to have a fixed number of history levels n, so each image sequence is sampled to (n + 1) frames. The MMHI is computed as follows:
MMHI_t(x, y, t) = { s · t   if Ψ(x, y, t) = 1
                    MMHI_t(x, y, t − 1)   otherwise

where Ψ(x, y, t) is the binary motion mask (update function) and s = (255/n) is the intensity step between two history levels; MMHI_t(x, y, t) = 0 for t ≤ 0. The final template is found by iteratively computing the above equation for t = 1, . . . , n + 1. This method encodes motion occurrences at different time instances at the same pixel location in such a manner that they can be uniquely decoded afterwards. For this purpose, it uses a simple bit-wise coding scheme. If a motion occurs at time t at pixel location (x, y), it adds 2^(t−1) to the old motion value of the MMHI as follows:

MMHI(x, y, t) = MMHI(x, y, t − 1) + Ψ(x, y, t) · 2^(t−1)
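The bit-wise coding can be sketched as follows (illustrative names; the mask arrays play the role of Ψ):

```python
import numpy as np

def update_mmhi(mmhi_prev, motion_mask, t):
    """Add 2^(t-1) wherever motion occurs at time instant t."""
    return mmhi_prev + motion_mask * (2 ** (t - 1))

def decode_mmhi(mmhi, x, y, n):
    """Recover the set of time instants at which pixel (x, y) moved."""
    value = int(mmhi[x, y])
    return [t for t in range(1, n + 1) if value & (1 << (t - 1))]

mmhi = np.zeros((2, 2), dtype=int)
mmhi = update_mmhi(mmhi, np.array([[1, 0], [0, 0]]), 1)  # motion at t = 1
mmhi = update_mmhi(mmhi, np.array([[1, 0], [0, 1]]), 3)  # motion at t = 3
```

Here decode_mmhi(mmhi, 0, 0, 3) recovers [1, 3], i.e., two separate motion events at the same pixel location.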
Due to this bitwise coding scheme, one can separate multiple actions occurring at the same position [141]. This work focuses on the automatic detection of the facial action units that compose expressions. It requires a sophisticated registration system, because all employed image sequences must have the faces at the same position and scale. The results do not clearly demonstrate the superiority of the MMHI over the basic MHI [20, 141]. Even in their reports, the MMHI produces lower recognition results than the MHI [142]. However, they point out that the self-occlusion due to motion overwriting problem might be solved using this MMHI. Ahad et al. [11] implement the MMHI with an aerobics dataset and another action dataset, but find that the MMHI method shows poor recognition results.
The motion overwriting (self-occlusion) problem of the MHI method is addressed more robustly by the directional motion history image (DMHI) method [8]. In this approach, instead of background or frame subtraction, gradient-based optical flow is
calculated between two consecutive frames and split into four channels (see Fig. 5). Based on this strategy, one can get four directional motion templates for the left, right, up and down directions. The corresponding four history images are calculated as:

DMHI_d(x, y, t) = { τ   if Ψ_d(x, y, t) > ξ
                    max(0, DMHI_d(x, y, t − 1) − δ)   otherwise

where d indexes the four directions, Ψ_d(x, y, t) is the corresponding optical-flow channel, ξ is a threshold and δ is the decay parameter. Median-filtered versions medH_d(x, y, t) of the four templates are then computed:

medH_d(x, y, t) = med(DMHI_d(x, y, t))

where med(·) is the function for the median filter. We compute four MEIs after thresholding these templates above zero:

DMEI_d(x, y, t) = { 1   if medH_d(x, y, t) ≥ 1
                    0   otherwise
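A minimal sketch of the four-channel construction, assuming a dense optical-flow field with components (u, v) is already available (τ, ξ, δ and the channel names are illustrative):

```python
import numpy as np

def split_flow(u, v):
    """Split a flow field into four non-negative directional channels."""
    return {"right": np.maximum(u, 0.0), "left": np.maximum(-u, 0.0),
            "down": np.maximum(v, 0.0), "up": np.maximum(-v, 0.0)}

def update_dmhi(dmhi_prev, flow_channel, tau=255, xi=0.5, delta=1):
    """Stamp tau where the channel exceeds xi, otherwise decay by delta."""
    decayed = np.maximum(0, dmhi_prev - delta)
    return np.where(flow_channel > xi, tau, decayed)

u = np.array([[2.0, -2.0]])          # rightward and leftward motion
v = np.zeros_like(u)
channels = split_flow(u, v)
dmhi_right = update_dmhi(np.zeros_like(u), channels["right"])
```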
This method solves the overwriting problem significantly. Several complex actions and aerobics exercises (which involve more than one direction of motion) are tested. More than 94% recognition is achieved with the DMHI method, whereas the MHI shows only around 50% recognition. The DMHI method requires four history templates and four energy templates for the four directions; hence the feature vector becomes large, and the method becomes computationally somewhat more expensive than the MHI. In more recent work based on this approach, various reduced-size feature vectors have been proposed that can recognize motions faster with almost the same recognition result [4]. Moreover, combining the cues from the DMHI and MEI representations for each action (with an outdoor action dataset), the achieved result is also satisfactory [17]. The DMHI is also employed for low-resolution action recognition, because it keeps information on the motion components even when the resolution is poor. With low-resolution video sequences (from 320×240 down to 64×48 pixels), the recognition results are very promising [5]. However, with very low resolution, due to the lack of pixel information, it becomes difficult to obtain significant information for recognition; if there is no motion information in the final history or energy templates, then feature vectors cannot be computed from these templates. Another improvement, called the timed-DMHI, is proposed [12] to cover similar actions performed at different speeds. This concept is simple but not robust.
Earlier, Meng et al. [96] propose an SVM-based system called the hierarchical motion history histogram (HMHH). In [97–100], they compare other methods (i.e., modified-MHI, MHI) to demonstrate the robustness of the HMHH in recognizing several actions. This representation retains more motion information than the MHI, and also remains inexpensive to compute [100]. In this approach, to solve the overwriting problem, they define some patterns P_i in the motion mask sequence D(x, y, :), based on the number of connected 1s, e.g.,

P_1 = 010, P_2 = 0110, P_3 = 01110, . . ., P_M = 01...10 (with M consecutive 1s)

Now define a subsequence C_i = b_{n1}, b_{n2}, . . ., b_{ni} and denote the set of all subsequences of D(x, y, :) as {D(x, y, :)}. Then, for each pixel, count the number of occurrences of each specific pattern P_i in the sequence D(x, y, :):

HMHH(x, y, P_i) = Σ_j 1{C_j = P_i | C_j ∈ {D(x, y, :)}}

Here, 1 is the indicator function. Hence, from each pattern P_i, we construct one gray-scale image (called the motion history histogram, MHH) and, in aggregation, we call all MHH images the hierarchical MHH (HMHH). They use these final feature images for classification and recognition using an SVM.
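For a single pixel, the pattern counting amounts to counting runs of exactly i consecutive 1s in D(x, y, :). An illustrative sketch (not the implementation of [96]; sequence borders are treated as motion-free):

```python
def hmhh_counts(mask_sequence, max_pattern=4):
    """Count occurrences of P_i = 0 1..1 0 (i consecutive 1s) in a
    pixel's binary motion-mask sequence D(x, y, :)."""
    padded = [0] + list(mask_sequence) + [0]  # bound border runs with 0s
    counts = [0] * (max_pattern + 1)
    run = 0
    for b in padded:
        if b == 1:
            run += 1
        else:
            if 1 <= run <= max_pattern:
                counts[run] += 1
            run = 0
    return counts[1:]  # [count(P_1), ..., count(P_max)]

# Runs of length 1, 2, 3, and 1 again:
counts = hmhh_counts([0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
```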
These solutions are compared [11] to show their respective robustness in solving the overwriting problem of the MHI method [31]. The employed dataset has some activities that are complex in nature and involve motion overwriting. For the HMHH method, four patterns are considered, as more than four patterns do not provide significant additional information. For every activity, the recognition result with the DMHI representation is very satisfactory (about 94% recognition). Though the HMHH representation achieved better results than the MHI and MMHI representations, the performance of the HMHH is unacceptable, as it achieved only about a 67% recognition rate.
Kellokumpu et al. [73] extract spatially enhanced local binary pattern (LBP) histograms from the MHI and MEI temporal templates and model their temporal behavior with HMMs. They select a fixed frame number. The computed MHI is divided into four sub-regions through the centroid of the silhouette. All MHI and MEI LBP features are concatenated into one histogram and normalized so that the histogram sums to one. In this case, the temporal modeling is done using HMMs. This texture-based description of movements can handle the overwriting problem of the MHI. One concern with this approach is the choice of the sub-region division scheme for every action.
3.2.2 Solutions to some issues of the MHI in 2D
To overcome several constraints of the MHI method [31], various developments are proposed in both the 2D and 3D domains. This sub-sub-section covers some other variants of the MHI method in 2D; 3D extensions are covered in a later subsection. Davis [51] presents a method for recognizing movement that relies on localized regions of motion, which are derived from the MHI. He offers a real-time solution for recognizing some movements by gathering and matching multiple overlapping histograms of the motion orientations from the MHI. In this extension of the original work [31], Davis explains a method to handle variable-length movements as well as the occlusion issue. The directional histogram for each body region has twelve bins (30 degrees each), and the feature vector is a concatenation of the histograms of the different body regions.
In another update, the MHI is generalized by directly encoding the actual time in a floating-point format, which is called the timed motion history image (tMHI) [33, 34]. In the tMHI, new silhouette values are copied in with a floating-point time stamp. This MHI representation is updated as follows, using not the frame number but the time stamp of the video sequence [34]:

tMHI_δ(x, y) = { τ   if the current silhouette is at (x, y)
                 0   else if tMHI_δ(x, y) < (τ − δ)

where τ is the current time stamp and δ is the maximum time duration constant (typically a few seconds) associated with the template. This method makes the representation independent of the system speed or frame rate (within limits), so that a given gesture covers the same MHI area at different
a given gesture can cover the same MHI area at different
capture rates. They also present a method of motion seg-
mentation based on segmenting layered motion regions that
are meaningfully connected to movements of the object of
interest. The segmented regions are not motion blobs, but
motion regions that are naturally connected to parts of mov-
ing objects. This is motivated by the fact that segmentation
by collecting blobs of similar directional motion does not
guarantee the correspondence of the motion over time. This
motion segmentation, together with silhouette pose recogni-
tion, provides a very general and useful tool for gesture and
motion recognition [34]. This approach is later employed by
Senior and Tosunoglu [128] for tracking objects in real-time.
They use the tMHI for motion segmentation.
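The tMHI update itself is a pair of masked assignments; a minimal sketch with illustrative names (timestamps in seconds):

```python
import numpy as np

def update_tmhi(tmhi, silhouette, timestamp, duration):
    """Stamp the current time on silhouette pixels; clear pixels whose
    stamp is older than (timestamp - duration)."""
    tmhi = np.where(silhouette == 1, timestamp, tmhi)
    return np.where(tmhi < (timestamp - duration), 0.0, tmhi)

tmhi = np.zeros((1, 3))
tmhi = update_tmhi(tmhi, np.array([[1, 0, 0]]), 1.0, 1.0)
tmhi = update_tmhi(tmhi, np.array([[0, 1, 0]]), 1.5, 1.0)
tmhi = update_tmhi(tmhi, np.array([[0, 0, 1]]), 2.5, 1.0)
# The pixel stamped at 1.0 is now older than 2.5 - 1.0 and is cleared.
```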
The motion gradient orientation (MGO) is also computed
by Bradski and Davis [34] from the interior silhouette pix-
els of the tMHI. These orientation gradients are employed
for recognition. Wong and Cipolla [154, 155] exploit MGO
images to form motion features for gesture recognition.
Pixels in the MGO image encode the change in orientation between the nearest moving edges shown in the MHI, and the region of interest is defined as the largest rectangle covering all bright pixels in the MEI. Therefore, the MGO contains information about where and how a motion has occurred [155].
The MHI's limitation relating to global image feature calculations can be overcome by computing a dense local motion vector field directly from the MHI to describe the movement [49]. Davis [49] extends the original MHI representation into a hierarchical image pyramid format to provide a means of addressing the gradient calculation at multiple image speeds. An image pyramid is constructed by recursively low-pass filtering and sub-sampling an image (i.e., power-of-2 reduction with anti-aliasing) until a desired spatial reduction is reached. The result is a hierarchy of motion fields, where the computed motion in each level is tuned to a particular speed (i.e., with faster speeds residing at higher levels). The hierarchical MHI (HMHI) is not created directly from the original MHI, but through the pyramid representation of the silhouette images. Afterwards, based on the orientations of the motion flow (computed from the MHI pyramid), a motion orientation histogram (MOH) is produced. The resulting motion is characterized by a polar histogram. The HMHI approach remains a computationally inexpensive algorithm to represent, characterize and recognize human motion in video [100].
3.2.3 Motion separation and identification approach
Based on the DMHI template [8], a temporal segmentation (separation) scheme that decomposes complex motions into their primitives is proposed [6]. This temporal motion segmentation method can produce an intermediate interpretation of a complex motion in four directions, namely, right, left, up and down. After obtaining the motion templates for a complex action or activity, it calculates the volume of pixel values (Ω) by summing up the brightness levels of the motion templates. For consecutive frames, it is

Ω_t = Σ_{x=1}^{M} Σ_{y=1}^{N} DMHI(x, y, t)
One can decide the label Γ ∈ {up, down, left, right} of the segmented motion based on threshold values (which determine the starting point of a motion) from the volume difference

ΔΩ_{t+k} = Ω_{t+k} − Ω_t

Here, ΔΩ_t is the difference between two volumes of pixel values. When ΔΩ_t exceeds a starting threshold value ε, we can decide the label of the segmented motion. However,
when ΔΩ_t falls back below ε, we can say that the scene is static or that an earlier motion is no longer present. Therefore, based on this mechanism applied to the motion history templates, they easily segment a complex motion sequence into four directions. This is very useful for an intelligent robot that must decide the directions of a human movement. Thus an action can be understood as a combination of consecutive left–right–up–down primitives [6].
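The decision step can be sketched as follows (the single-winner rule and all names are an illustrative simplification of [6]):

```python
import numpy as np

def template_volume(dmhi):
    """Omega_t: sum of the brightness levels of one directional template."""
    return float(np.sum(dmhi))

def label_motion(volumes_t, volumes_tk, epsilon):
    """Return the direction whose volume grew the most between t and t+k,
    or None if no growth exceeds the starting threshold epsilon."""
    deltas = {d: volumes_tk[d] - volumes_t[d] for d in volumes_t}
    direction = max(deltas, key=deltas.get)
    return direction if deltas[direction] > epsilon else None

vol_t = {"left": 100.0, "right": 100.0, "up": 100.0, "down": 100.0}
vol_tk = {"left": 100.0, "right": 400.0, "up": 110.0, "down": 90.0}
label = label_motion(vol_t, vol_tk, epsilon=50.0)
```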
3.2.4 Other 2D developments
An advantage of the MHI is that, although it is a representation of the history of pixel-level changes, only one previous frame needs to be stored. However, at each pixel location, explicit information about its past is also lost in the MHI when current changes are updated to the model, with the corresponding MHI values jumping to the maximal value [159]. To overcome this problem, Ng and Gong [105] propose the pixel signal energy (PSE) in order to measure the mean magnitude of pixel-level temporal energy over a period of time. It is defined by a backward window; the size of the window determines the number of frames (the history) to be stored [106].
Another recent development of the MHI representation is the pixel change history (PCH) [159]. This can measure the multi-scale temporal changes at each pixel. The PCH of a pixel, P_{ς,τ}(x, y, t), can be defined by

P_{ς,τ}(x, y, t) = { min(P_{ς,τ}(x, y, t − 1) + 255/ς, 255)   if D(x, y, t) = 1
                     max(P_{ς,τ}(x, y, t − 1) − 255/τ, 0)   otherwise

where D(x, y, t) is the binary foreground image, ς is an accumulation factor and τ is a decay parameter. When D(x, y, t) = 1, the value of a PCH increases gradually according to the accumulation factor, instead of jumping to the maximum value. When no significant pixel-level visual change is detected at location (x, y) in the current frame, the pixel (x, y) is treated as part of the background and the corresponding PCH starts to decay. The speed of decay is controlled by the decay parameter. In fact, the MHI is a special case of the PCH: a PCH image is equivalent to an MHI image when the accumulation factor ς is set to 1. Compared to the PCH, the MHI has weaker discriminative power to distinguish different types of visual changes. Moreover, similar to the PSE [105], a PCH can also capture a zero-order pixel-level change, i.e., the mean magnitude of change over time [159].
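A sketch of the PCH update, with the accumulation factor ς and decay parameter τ written as accum and decay (illustrative values):

```python
import numpy as np

def update_pch(pch_prev, foreground, accum=3.0, decay=6.0):
    """Ramp up by 255/accum on foreground pixels (capped at 255);
    decay by 255/decay elsewhere (floored at 0)."""
    ramp = np.minimum(pch_prev + 255.0 / accum, 255.0)
    fade = np.maximum(pch_prev - 255.0 / decay, 0.0)
    return np.where(foreground == 1, ramp, fade)

pch = np.zeros((1, 1))
pch = update_pch(pch, np.array([[1]]))  # rises to 85, not straight to 255
pch = update_pch(pch, np.array([[0]]))  # then decays by 42.5
```

With accum = 1 the ramp jumps straight to 255, recovering the MHI as the special case noted above.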
MHIs can also be used to detect and interpret actions in compressed video data. Compressed-domain human motion is recognized on top of the MHI approach by the introduction of the motion flow history (MFH) [23, 24]. The MFH quantifies the motion in the compressed video domain. Motion vectors are extracted from the compressed MPEG stream by partial decoding. Then noise is reduced, and the coarse MHI and the corresponding MFH are constructed at macro-block resolution instead of pixel resolution. By this approach, they reduce the computation by 16 times. The MFH can be computed according to the following equations:

MFH(x, y, t) = { v(x, y, t)   if E(v(x, y, t)) < ε
                 M(v(x, y, t))   otherwise

where

E(v(x, y, t)) = [v(x, y, t) − med_N(v(x, y, t))]²
M(v(x, y, t)) = med_N(v(x, y, t))

Here, med_N(·) denotes the median filter over a local neighborhood N of motion vectors, and v denotes the horizontal (v_x) or vertical (v_y) component of the motion vector.
In the 3D domain, the motion history volume (MHV) applies the same update rule to voxels:

MHV(x, y, z, t) = { τ   if D(x, y, z, t) = 1
                    max(0, MHV(x, y, z, t − 1) − 1)   otherwise

(Fig. 12: the templates for the running motion in b show more ripple-shaped information than those of the walking motion in a.) Separating such similar motions is difficult with the present manifestation of the MHI. Similar to the MHI method, other variants show almost identical motion templates for both walking and running, and hence demonstrate poor recognition results. Though the AEI is presented in [38] with the claim that walking and running motions can be easily recognized, the action datasets in that work are limited. One intuitive way to achieve better features for separating walking and running motions is to employ the DMHI method with a higher value of the decay parameter (δ), so that ripples appear at the top of the template (notice the more evident ripples at the top of the white patches in column H/E_x of Fig. 12 [9]), and to use these features for recognition.
When multiple moving persons/objects are present in the scene, these approaches cannot solve the problem of multiple object identification [37]. Image depth analysis can help to solve this problem. Researchers may also consider camera movement and its effect. Usually, camera motion compensation is difficult, and the combined effect of camera movement and the employment of the MHI is not solved, though Davis et al. [52] apply the MHI with a PTZ camera.
Another important issue is whether the MHI and MEI representations are still required when there are several other approaches in different directions. In the last decade, spatio-temporal interest points (STIP), histograms of oriented gradients (HOG) [45], histograms of oriented flow (HOF) [44] and a few other methods have become prominent for action representation and recognition apart from MHI-based approaches. But among these approaches, the MHI (and its variants) attains notable attention in the computer vision arena according to our analyses. Interest point detection in static images is a well-studied topic in computer vision. Laptev and Lindeberg [83] are the first to propose a spatio-temporal extension, building on the Harris–Laplace detector. Several spatio-temporal interest/feature point (STIP) detectors have recently been exploited in video analysis for action recognition. Feature points are detected using a number of measures, namely, entropy-based saliency [75, 79, 109, 152], global texture [156], cornerness [83, 84], periodicity [53, 79] and volumetric optical flow [77]. These are mainly based on intensity [53], texture, color and motion information [119]. In the spatio-temporal domain, however, it is unclear which features indicate useful interest points [79]. Most STIP detectors are computationally expensive (compared to the straightforward computation of the MHI) and are therefore restricted to the processing of short or low-resolution videos (e.g., [53, 83, 84, 109]). Detection of a reduced number of features is a prerequisite to keep the computational cost under control [152]. Furthermore, in some cases, all input videos need to be preprocessed [156].
Though these approaches are proven in recognizing various actions, they carry additional theoretical and computational complexity compared to the MHI method. The MHI method is very simple and computationally inexpensive. Moreover, it covers every motion detail, and these segmented motion regions are employed for various applications. We notice that the MHI and MHI-based approaches are employed, exploited and modulated for a good number of applications in various domains and dimensions (see above). Therefore, we strongly feel that the MHI method is still useful and that the limitations that remain unsolved can be managed in the future. Moreover, the MHI method in its basic form is very easy to understand and implement. This is a key beneficial feature of the MHI. From the MHI and MEI images, using Hu moments or other shape representation approaches, we can easily get the feature vectors for recognition. However, the MHI is a global approach; hence, motions from objects that are not the target of interest will degrade the performance (STIP-based methods are better in this context). The MHI is a representation of choice for action recognition when temporal segmentation is available, when actors are fully visible and can be separated from each other, and when
they do not move along the z-axis of the camera. In other cases, other representations are probably needed, including bag-of-features (BOF) methods based on STIP, HOF and HOG, which have been shown to overcome those limitations.
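The Hu-moment feature extraction mentioned above can be sketched as follows (only the first two of the seven Hu invariants, computed directly with NumPy; all names are illustrative):

```python
import numpy as np

def central_moment(img, p, q):
    """Central moment mu_pq of a gray-scale template."""
    y, x = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()
    xbar, ybar = (x * img).sum() / m00, (y * img).sum() / m00
    return (((x - xbar) ** p) * ((y - ybar) ** q) * img).sum()

def hu_features(template):
    """First two Hu invariants of an MHI/MEI template: a translation-
    and scale-invariant feature vector for recognition."""
    mu00 = central_moment(template, 0, 0)
    def eta(p, q):
        return central_moment(template, p, q) / mu00 ** (1 + (p + q) / 2)
    h1 = eta(2, 0) + eta(0, 2)
    h2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return np.array([h1, h2])

# The same blob at two different positions yields identical features.
a = np.zeros((10, 10)); a[2:4, 2:5] = 1.0
b = np.zeros((10, 10)); b[5:7, 4:7] = 1.0
```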
The STIP-based approaches and HOG/HOF-based developments can be incorporated along with the MHI/MEI representations in future research. Integration of multiple cues (e.g., motion, shape, edge information (e.g., [39]), color or texture), or a fusion of information, will produce better results [134]. The presence of multiple moving subjects, a moving camera, view-invariance issues, image depth analysis and, overall, a better and more robust image segmentation technique (for producing the update function in outdoor and cluttered environments) are the major challenges ahead for the MHI method. We feel that the above discussions will open some doors for further research to improve the method for real-life applications.
6 Conclusions
Human motion analysis is a challenging problem due to large variations in human motion and appearance, camera viewpoint and environment settings [118]. The field of action and activity representation and recognition is relatively old, yet not well understood [104]. Some important but common motion recognition problems are still not properly solved by the computer vision community. However, in the last decade, a number of good approaches have been proposed and subsequently evaluated by many researchers. Among these, one method, the MHI, receives significant attention from many researchers in the computer vision field. Therefore, though there are various approaches for motion analysis and recognition, this paper analyzes the MHI method. It is one of the key methods, and a number of variants have been developed from this concept. The MHI is simple to understand and implement; hence many researchers employ this method or its variants for various action/gesture recognition and motion analysis tasks, with different datasets.

We present a tutorial that covers the important issues of this representation and method. Afterwards, several key limitations are mentioned. In this work, we categorize and present various implementations of the MHI and its developments. This paper also discusses several issues to be solved in the future. The motion self-occlusion problem of the MHI has been addressed and solved with a satisfactory recognition rate. Though 3D approaches are proposed as view-invariant methods on top of the 2D MHI, these are computationally expensive. Nevertheless, several essential concerns of the MHI, related to self-occlusion due to motion, motion overlapping or multiple repetitions, significant occlusion from multiple moving persons, and object motion towards the optical axis of the camera, should be investigated rigorously in the future so that this simple approach can be extended to various real-life applications with better performance. We hope that this paper will be beneficial to various researchers (and especially will inspire new researchers) in understanding the MHI method, its variants and applications.
Acknowledgments The authors are grateful to the anonymous
reviewers for their excellent reviews and constructive comments that
helped to improve the manuscript. The work is supported by the Japan
Society for the Promotion of Science (JSPS), Japan.
References
1. Aggarwal, J., Cai, Q.: Human motion analysis: a review. In: Proc. Nonrigid and Articulated Motion Workshop, pp. 90–102 (1997)
2. Aggarwal, J.K., Cai, Q.: Human motion analysis: a review. Comput. Vis. Image Underst. 73, 428–440 (1999)
3. Aggarwal, J.K., Park, S.: Human motion: modeling and recognition of actions and interactions. In: Proc. Int. Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT '04), p. 8 (2004)
4. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: Lower-dimensional feature sets for template-based motion recognition approaches. J. Comput. Sci. 6(8), 920–927 (2010)
5. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: A simple approach for low-resolution activity recognition. Int. J. Comput. Vis. Biomech. 3(1) (2010)
6. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: Temporal motion recognition and segmentation approach. Int. J. Imaging Syst. Technol. 19, 91–99 (2009)
7. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: Human activity recognition: various paradigms. In: Proc. Int. Conf. on Control, Automation and Systems, pp. 1896–1901, October 2008
8. Ahad, Md.A.R., Ogata, T., Tan, J.K., Kim, H., Ishikawa, S.: A complex motion recognition technique employing directional motion templates. Int. J. Innov. Comput. Inf. Control 4(8), 1943–1954 (2008)
9. Ahad, Md.A.R., Ogata, T., Tan, J.K., Kim, H., Ishikawa, S.: Moment-based human motion recognition from the representation of DMHI templates. In: SICE Annual Conference, pp. 578–583, August 2008
10. Ahad, Md.A.R., Ogata, T., Tan, J.K., Kim, H., Ishikawa, S.: A smart automated complex motion recognition technique. In: Proc. Workshop on Multi-dimensional and Multi-view Image Processing (with ACCV), pp. 142–149 (2007)
11. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: Analysis of motion self-occlusion problem due to motion overwriting for human activity recognition. J. Multimedia 5(1), 36–46 (2009)
12. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: Action recognition with various speeds and timed-DMHI feature vectors. In: Proc. Int. Conf. on Computer and Information Technology, pp. 213–218, December 2008
13. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: Human activity analysis: concentrating on motion history image and its variants. In: SICE-ICASE Joint Annual Conf., pp. 5401–5406 (2009)
14. Ahmad, M., Parvin, I., Lee, S.-W.: Silhouette history and energy image information for human movement recognition. J. Multimedia 5(1), 12–21 (2010)
15. Ahmad, M., Lee, S.-W.: Recognizing human actions based on silhouette energy image and global motion description. In: Proc. IEEE Automatic Face and Gesture Recognition, pp. 523–588 (2008)
16. Ahmad, M., Hossain, M.Z.: SEI and SHI representations for
human movement recognition. In: Proc. Int. Conf. on Computer
and Information Technology (ICCIT), pp. 521526 (2008)
17. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: Action rec-
ognition by employing combined directional motion history and
energy images. In: IEEE Computer Society Conf. on Computer
Vision and Pattern Recognitions Workshop on CVCG, p. 6 (2010)
18. Alahari, K., Jawahar, C.V.: Discriminative actions for recogniz-
ing events. In: Indian Conf. on Computer Vision, Graphics and
Image Processing (ICVGIP06), LNCS, vol. 4338, pp. 552563
(2006)
19. Albu, A.B., Beugeling, T.: A three-dimensional spatiotempo-
ral template for interactive human motion analysis. J. Multi-
media 2(4), 4554 (2007)
20. Albu, A., Trevor, B., Naznin, V., Beach, C.: Analysis of irregular-
ities in human actions with volumetric motion history images. In:
Proc. IEEE Workshop on Motion and Video Computing, Texas,
USA, p. 16, February 2007
21. Anderson, C., Bert, P., Wal, G.V.: Change detection and track-
ing using pyramids transformation techniques. In: Proc. SPIE-
Intelligent Robots and Computer Vision, vol. 579, pp. 7278
(1985)
22. Arseneau, S., Cooperstock, J.R.: Real-time image segmentation for action recognition. In: Proc. IEEE Pacific Rim Conf. on Communications, Computers and Signal Processing, pp. 86–89 (1999)
23. Babu, R., Ramakrishnan, K.: Compressed domain human motion recognition using motion history information. In: Proc. ICIP, vol. 2, pp. 321–324 (2003)
24. Babu, R., Ramakrishnan, K.: Recognition of human actions using motion history information extracted from the compressed video. Image Vis. Comput. 22, 597–607 (2004)
25. Bashir, K., Xiang, T., Gong, S.: Feature selection for gait recognition without subject cooperation. In: British Machine Vision Conference, p. 10 (2008)
26. Bashir, K., Xiang, T., Gong, S.: Feature selection on gait energy image for human identification. In: IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 985–988 (2008)
27. Beauchemin, S.S., Barron, J.L.: The computation of optical flow. ACM Comput. Surv. 27(3), 433–467 (1995)
28. Bergen, J.R., Burt, P., Hingorani, R., Peleg, S.: A three-frame algorithm for estimating two-component image motion. IEEE Trans. PAMI 14(9), 886–896 (1992)
29. Bimbo, A.D., Nesi, P.: Real-time optical flow estimation. In: Proc. Int. Conf. on Systems Engineering in the Service of Humans, Systems, Man and Cybernetics, vol. 3, pp. 13–19 (1993)
30. Bobick, A., Davis, J.: An appearance-based representation of action. In: Intl. Conf. on Pattern Recognition, pp. 307–312 (1996)
31. Bobick, A., Davis, J.: The recognition of human movement using temporal templates. IEEE Trans. PAMI 23(3), 257–267 (2001)
32. Bobick, A., Intille, S., Davis, J., Baird, F., Pinhanez, C., Campbell, L., Ivanov, Y., Schutte, A., Wilson, A.: The KidsRoom: a perceptually-based interactive and immersive story environment. Presence: Teleoperators Virtual Environ. 8(4), 367–391 (1999)
33. Bradski, G., Davis, J.: Motion segmentation and pose recognition with motion history gradients. In: Proc. IEEE Workshop on Applications of Computer Vision, pp. 174–184, December 2000
34. Bradski, G., Davis, J.: Motion segmentation and pose recognition with motion history gradients. Mach. Vis. Appl. 13(3), 174–184 (2002)
35. Canton-Ferrer, C., Casas, J.R., Pardas, M.: Human model and motion based 3D action recognition in multiple view scenarios. In: Proc. Conf. European Signal Process, Italy, pp. 1–5, September 2006
36. Canton-Ferrer, C., Casas, J.R., Pardàs, M., Sargin, M.E., Tekalp, A.M.: 3D human action recognition in multiple view scenarios. In: Proc. Jornades de Recerca en Automàtica, Visió i Robòtica, Barcelona (Spain), p. 5, 4–6 July 2006
37. Cedras, C., Shah, M.: A survey of motion analysis from moving light displays. In: Proc. IEEE CVPR, pp. 214–221 (1994)
38. Chandrashekhar, V., Venkatesh, K.S.: Action energy images for reliable human action recognition. In: Proc. of Asian Symposium on Information Display (ASID), pp. 484–487 (2006)
39. Chen, D., Yang, J.: Exploiting high dimensional video features using layered Gaussian mixture models. In: Proc. IEEE ICPR, p. 4 (2006)
40. Chen, D., Yan, R., Yang, J.: Activity analysis in privacy-protected video, p. 11 (2007). http://www.informedia.cs.cmu.edu/documents/T-MM_Privacy_J2c.pdf
41. Chen, C., Liang, J., Zhao, H., Hu, H., Tian, J.: Frame difference energy image for gait recognition with incomplete silhouettes. Pattern Recognit. Lett. 30(11), 977–984 (2003)
42. Christmas, W.J.: Spatial filtering requirements for gradient-based optical flow measurement. In: 9th British Machine Vision Conference, pp. 185–194 (1998)
43. Collins, R.T., Lipton, A., Kanade, T., Fujiyoshi, H., Duggins, D., Tsin, Y., Tolliver, D., Enomoto, N., Hasegawa, O., Burt, P., Wixson, L.: A system for video surveillance and monitoring. VSAM final report, CMU-RI-TR-00-12, Technical Report, Carnegie Mellon University, p. 69 (2000)
44. Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of flow and appearance. In: European Conference on Computer Vision, pp. 428–441 (2006)
45. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Intl. Conf. on Computer Vision and Pattern Recognition, pp. 886–893 (2005)
46. Davis, J.: Sequential reliable-inference for rapid detection of human actions. In: Proc. IEEE Workshop on Detection and Recognition of Events in Video, pp. 1–9, July 2004
47. Davis, J.W.: Appearance-based motion recognition of human actions. M.I.T. Media Lab Perceptual Computing Group Tech. Report No. 387, p. 51 (1996)
48. Davis, J., Bradski, G.: Real-time motion template gradients using Intel CVLib. In: Proc. ICCV Workshop on Frame-Rate Vision, pp. 1–20, September 1999
49. Davis, J.: Hierarchical motion history images for recognizing human motion. In: Proc. IEEE Workshop on Detection and Recognition of Events in Video, pp. 39–46 (2001)
50. Davis, J., Bobick, A.: Virtual PAT: a virtual personal aerobics trainer. In: Proc. Perceptual User Interfaces, pp. 13–18, November 1998
51. Davis, J.: Recognizing movement using motion histograms. MIT Media Lab. Perceptual Computing Section Tech. Report No. 487 (1998)
52. Davis, J.W., Morison, A.M., Woods, D.D.: Building adaptive camera models for video surveillance. In: Proc. IEEE Workshop on Applications of Computer Vision (WACV'07), p. 6 (2007)
53. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatiotemporal features. In: Intl. Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72, October 2005
54. Digital Imaging Research Centre, K.U.L.: Virtual Human Action Silhouette (ViHASi) Database. http://dipersec.king.ac.uk/VIHASI/
55. Efros, A.A., Berg, A.C., Mori, G., Malik, J.: Recognizing action at a distance. In: Proc. ICCV, pp. 726–733 (2003)
56. Elgammal, A., Harwood, D., Davis, L.S.: Non-parametric background model for background subtraction. In: Proc. European Conference on Computer Vision, p. 17 (2000)
57. Essa, I., Pentland, S.: Facial expression recognition using a dynamic model and motion energy. In: Proc. IEEE CVPR, p. 8, June 1995
278 Md. A. R. Ahad et al.
58. Forbes, K.: Summarizing motion in video sequences, pp. 1–7. http://thekrf.com/projects/motionsummary/MotionSummary.pdf. Accessed 9 May 2004
59. Full-body Gesture Database, Korea University. http://gesturedb.korea.ac.kr/
60. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981)
61. Gavrilla, D.: The visual analysis of human movement: a survey. Comput. Vis. Image Underst. 73, 82–98 (1999)
62. Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. IEEE Trans. PAMI 29(12), 2247–2253 (2007)
63. Han, J., Bhanu, B.: Individual recognition using gait energy image. IEEE Trans. PAMI 28(2), 316–322 (2006)
64. Han, J., Bhanu, B.: Gait energy image representation: comparative performance evaluation on USF HumanID database. In: Proc. Joint Intl. Workshop VS-PETS, pp. 133–140 (2003)
65. Haritaoglu, I., Harwood, D., Davis, L.S.: W4: real-time surveillance of people and their activities. IEEE Trans. PAMI 22(8), 809–830 (2000)
66. Horn, B., Schunck, B.G.: Determining optical flow. Artif. Intell. 17, 185–203 (1981)
67. Hu, W., Tan, T., Wang, L., Maybank, S.: A survey on visual surveillance of object motion and behaviors. IEEE Trans. SMC-Part C. 34(3), 334–352 (2004)
68. Hu, M.K.: Visual pattern recognition by moment invariants. IRE Trans. Info. Theory 8, 179–187 (1962)
69. Jaimes, A., Sebe, N.: Multimodal human-computer interaction: a survey. Comput. Vis. Image Underst. 108(1–2), 116–134 (2007)
70. Jain, A., Duin, R., Mao, J.: Statistical pattern recognition: a review. IEEE Trans. PAMI 22(1), 4–37 (2000)
71. Jan, T.: Neural network based threat assessment for automated visual surveillance. In: Proc. IEEE Joint Conf. on Neural Networks, vol. 2, pp. 1309–1312, July 2004
72. Jin, T., Leung, M.K.H., Li, L.: Temporal human body segmentation. In: Villanieva, J.J. (ed.) IASTED Int. Conf. Visualization, Imaging, and Image Processing (VIIP'04). Acta Press, Marbella. ISSN: 1482-7921, 6–8 September 2004
73. Kellokumpu, V., Zhao, G., Pietikäinen, M.: Texture based description of movements for activity analysis. In: Proc. Conf. Computer Vision Theory and Applications (VISAPP'08), vol. 2, pp. 368–374, Portugal (2008)
74. Kilger, M.: A shadow handler in a video-based real-time traffic monitoring system. In: Proc. IEEE Workshop on Applications of Computer Vision, pp. 1060–1066 (1992)
75. Kadir, T., Brady, M.: Scale, saliency and image description. IJCV 45(2), 83–105 (2001)
76. Kameda, Y., Minoh, M.: A human motion estimation method using 3-successive video frames. In: Proc. Int. Conf. on Virtual Systems and Multimedia, p. 6 (1996)
77. Ke, Y., Sukthankar, R., Hebert, M.: Efficient visual event detection using volumetric features. In: ICCV, vol. 1, pp. 166–173 (2005)
78. Kellokumpu, V., Pietikäinen, M., Heikkilä, J.: Human activity recognition using sequences of postures. Mach. Vis. Appl., pp. 570–573 (2005)
79. Kienzle, W., Scholkopf, B., Wichmann, F.A., Franz, M.O.: How to find interesting locations in video: a spatiotemporal interest point detector learned from human eye movements. In: 29th DAGM Symposium, pp. 405–414, September 2007
80. Kindratenko, V.: Development and application of image analysis techniques for identification and classification of microscopic particles. Ph.D. thesis, University of Antwerp, Belgium (1997). http://www.ncsa.uiuc.edu/~kindr/phd/index.pdf
81. Khotanzad, A., Hong, Y.H.: Invariant image recognition by Zernike moments. IEEE Trans. PAMI 12(5), 489–497 (1990)
82. Kumar, S., Kumar, D., Sharma, A., McLachlan, N.: Classification of hand movements using motion templates and geometrical based moments. In: Proc. Intl Conf. on Intelligent Sensing and Information Processing, pp. 299–304 (2003)
83. Laptev, I., Lindeberg, T.: Space-time interest points. In: ICCV, vol. 1, p. 432 (2003)
84. Laptev, I.: On space-time interest points. IJCV 64(2), 107–123 (2005)
85. LaViola, J.: A survey of hand posture and gesture recognition techniques and technology. Tech. Report CS-99-11, Brown University, p. 80, June 1999
86. Leman, K., Ankit, G., Tan, T.: PDA-based human motion recognition system. Int. J. Softw. Eng. Knowl. 2(15), 199–205 (2005)
87. Li, L., Zeng, Q., Jiang, Y., Xia, H.: Spatio-temporal motion segmentation and tracking under realistic condition. In: Proc. Intl Symposium on Systems and Control in Aerospace and Astronautics, pp. 229–232 (2003)
88. Lipton, A.J., Fujiyoshi, H., Patil, R.S.: Moving target classification and tracking from real-time video. In: Proc. IEEE Workshop on Applications of Computer Vision, pp. 8–14 (1998)
89. Liu, J., Zhang, N.: Gait history image: a novel temporal template for gait recognition. In: Proc. IEEE Int. Conf. Multimedia and Expo, pp. 663–666 (2007)
90. Lo, C., Don, H.: 3-D moment forms: their construction and application to object identification and positioning. IEEE Trans. PAMI 11(10), 1053–1063 (1989)
91. Lucas, B., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proc. Int. Joint Conf. on Artificial Intelligence, pp. 674–679 (1981)
92. Ma, Q., Wang, S., Nie, D., Qiu, J.: Recognizing humans based on gait moment image. In: 8th ACIS Intl. Conf. on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, pp. 606–610 (2007)
93. Masoud, O., Papanikolopoulos, N.: A method for human action recognition. Image Vis. Comput. 21, 729–743 (2003)
94. McCane, B., Novins, K., Crannitch, D., Galvin, B.: On benchmarking optical flow. Comput. Vis. Image Underst. 84, 126–143 (2001)
95. McKenna, S.J., Jabri, S., Duric, Z., Wechsler, H., Rosenfeld, A.: Tracking groups of people. Comput. Vis. Image Underst. 80(1), 42–56 (2000)
96. Meng, H., Pears, N., Bailey, C.: A human action recognition system for embedded computer vision application. In: Proc. Workshop on Embedded Computer Vision (with CVPR), pp. 1–6 (2007)
97. Meng, H., Pears, N., Bailey, C.: Human action classification using SVM_2K classifier on motion features. In: LNCS: Multimedia Content Representation, Classification and Security, vol. 4105/2006, pp. 458–465 (2006)
98. Meng, H., Pears, N., Bailey, C.: Motion information combination for fast human action recognition. In: Proc. Conf. Computer Vision Theory and Applications (VISAPP'07), Spain, March 2007
99. Meng, H., Pears, N., Bailey, C.: Recognizing human actions based on motion information and SVM. In: Proc. IEE Int. Conf. Intelligent Environments, pp. 239–245 (2006)
100. Meng, H., Pears, N., Freeman, M., Bailey, C.: Motion history histograms for human action recognition. In: Embedded Computer Vision (Advances in Pattern Recognition), part II, pp. 139–162. Springer, London (2009)
101. Mittal, A., Paragios, N.: Motion-based background subtraction using adaptive kernel density estimation. In: Proc. IEEE CVPR, p. 8 (2004)
102. Moeslund, T.B.: Summaries of 107 computer vision-based human motion capture papers. Tech. Report: LIA 99-01, University of Aalborg, p. 83, March 1999
103. Moeslund, T.B., Granum, E.: A survey of computer vision-based human motion capture. Comput. Vis. Image Underst. 81, 231–268 (2001)
104. Moeslund, T.B., Hilton, A., Kruger, V.: A survey of advances in vision-based human motion capture and analysis. Comput. Vis. Image Underst. 104, 90–126 (2006)
105. Ng, J., Gong, S.: Learning pixel-wise signal energy for understanding semantics. In: Proc. BMVC, pp. 695–704 (2001)
106. Ng, J., Gong, S.: Learning pixel-wise signal energy for understanding semantics. Image Vis. Comput. 21, 1183–1189 (2003)
107. Nguyen, Q., Novakowski, S., Boyd, J.E., Jacob, C., Hushlak, G.: Motion swarms: video interaction for art in complex environments. In: Proc. ACM Int. Conf. Multimedia, CA, pp. 461–469 (2006)
108. Ogata, T., Tan, J.K., Ishikawa, S.: High-speed human motion recognition based on a motion history image and an Eigenspace. IEICE Trans. Inf. Syst. E89-D(1), 281–289 (2006)
109. Oikonomopoulos, A., Patras, I., Pantic, M.: Spatiotemporal salient points for visual recognition of human actions. IEEE Trans. Syst. Man Cybern. B: Cybern. 36(3), 710–719 (2006)
110. Orrite, C., Martínez, F., Herrero, E., Ragheb, H., Velastin, S.: Independent viewpoint silhouette-based human action modelling and recognition. In: Proc. Int. Workshop on Machine Learning for Vision-based Motion Analysis (MLVMA'08) with ECCV, pp. 1–12 (2008)
111. Pantic, M., Pentland, A., Nijholt, A., Huang, T.S.: Human computing and machine understanding of human behavior: a survey. In: Proc. Int. Conf. on Multimodal Interfaces, pp. 239–248 (2006)
112. Pantic, M., Patras, I., Valstar, M.F.: Learning spatio-temporal models of facial expressions. In: Proc. Int. Conf. on Measuring Behaviour, pp. 7–10, September 2005
113. Papenberg, N., Bruhn, A., Brox, T., Didas, S., Weickert, J.: Highly accurate optic flow computation with theoretically justified warping. Int. J. Comput. Vis. 67(2), 141–158 (2006)
114. Pavlovic, V., Sharma, R., Huang, T.: Visual interpretation of hand gestures for human-computer interaction: a review. IEEE Trans. PAMI 19(7), 677–695 (1997)
115. Piater, J., Crowley, J.: Multi-modal tracking of interacting targets using Gaussian approximations. In: Proc. IEEE Workshop on Performance Evaluation of Tracking and Surveillance at CVPR, pp. 141–147 (2001)
116. Petrás, I., Beleznai, C., Dedeoğlu, Y., Pardàs, M., et al.: Flexible test-bed for unusual behavior detection. In: Proc. ACM Conf. Image and Video Retrieval, pp. 105–108 (2007)
117. Polana, R., Nelson, R.: Low level recognition of human motion. In: Proc. IEEE Workshop on Motion of Non-rigid and Articulated Objects, pp. 77–82 (1994)
118. Poppe, R.: Vision-based human motion analysis: an overview. Comput. Vis. Image Underst. 108(1–2), 4–18 (2007)
119. Rapantzikos, K., Avrithis, Y., Kollias, S.: Dense saliency-based spatiotemporal feature points for action recognition. In: Intl. Conf. on Computer Vision and Pattern Recognition, pp. 1–8 (2009)
120. Rhône-Alpes, I.: The Inria XMAS (IXMAS) motion acquisition sequences. http://charibdis.inrialpes.fr
121. Roh, M.-C., Shin, H.-K., Lee, S.-W., Lee, S.-W.: Volume motion template for view-invariant gesture recognition. In: Proc. ICPR, vol. 2, pp. 1229–1232 (2006)
122. Rosales, R.: Recognition of human action using moment-based features. Boston University Computer Science Tech. Report, BU 98-020, pp. 1–19, November 1998
123. Rosales, R., Sclaroff, S.: 3D trajectory recovery for tracking multiple objects and trajectory guided recognition of actions. In: Proc. CVPR, vol. 2, pp. 117–123 (1999)
124. Ryu, W., Kim, D., Lee, H.-S., Sung, J., Kim, D.: Gesture recognition using temporal templates. In: Proc. ICPR, Demo Program, Hong Kong, August 2006
125. Ruiz-del-Solar, J., Vallejos, P.A.: Motion detection and tracking for an AIBO robot using camera motion compensation and Kalman filtering. In: Proc. RoboCup Int. Symposium 2004, Lisbon, LNCS, vol. 3276, pp. 619–627 (2005)
126. Sarkar, S., Phillips, P.J., Liu, Z., Vega, I.R., Grother, P., Bowyer, K.W.: The HumanID gait challenge problem: data sets, performance, and analysis. IEEE Trans. PAMI 27(2), 162–177 (2005)
127. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: Proc. ICPR, vol. 3, pp. 32–36 (2004)
128. Senior, A., Tosunoglu, S.: Hybrid machine vision control. In: Florida Conf. on Recent Advances in Robotics, pp. 1–6, May 2005
129. Shan, C., Wei, Y., Qiu, X., Tan, T.: Gesture recognition using temporal template based trajectories. In: Proc. ICPR, vol. 3, pp. 954–957 (2004)
130. Shin, H.-K., Lee, S.-W., Lee, S.-W.: Real-time gesture recognition using 3D motion history model. In: Proc. Conf. on Intelligent Computing, Part I, LNCS, vol. 3644, pp. 888–898, China, August 2005
131. Sigal, L., Black, M.J.: HumanEva: synchronized video and motion capture dataset for evaluation of articulated human motion. Department of Computer Science, Brown University, Tech. Report CS-06-08, p. 18, September 2006
132. Singh, R., Seth, B., Desai, U.: A real-time framework for vision based human robot interaction. In: Proc. IEEE/RSJ Conf. on Intelligent Robots and Systems, pp. 5831–5836 (2006)
133. Son, D., Dinh, T., Nam, V., Hanh, T., Lam, H.: Detection and localization of road area in traffic video sequences using motion information and fuzzy-shadowed sets. In: Proc. IEEE Intl Symp. Multimedia, pp. 725–732, December 2005
134. Spengler, M., Schiele, B.: Towards robust multi-cue integration for visual tracking. Mach. Vis. Appl. 14, 50–58 (2003)
135. Stauffer, C., Grimson, W.: Adaptive background mixture models for real-time tracking. In: Proc. IEEE CVPR, vol. 2, pp. 246–252 (1999)
136. Sun, H.Z., Feng, T., Tan, T.N.: Robust extraction of moving objects from image sequences. In: Proc. Asian Conference on Computer Vision, pp. 961–964 (2000)
137. Sziranyi, T., with other partners UPC, SZTAKI, Bilkent and ACV: Real-time detector for unusual behavior. http://www.muscle-noe.org/content/view/147/64/
138. Talukder, A., Goldberg, S., Matthies, L., Ansar, A.: Real-time detection of moving objects in a dynamic scene from moving robotic vehicles. In: Proc. IEEE/RSJ Intl Conference on Intelligent Robots and Systems, pp. 1308–1313 (2003)
139. Tan, J.K., Ishikawa, S.: High accuracy and real-time recognition of human activities. In: 33rd Annual Conf. of IEEE Industrial Electronics Society (IECON), pp. 2377–2382 (2007)
140. Vafadar, M., Behrad, A.: Human hand gesture recognition using motion orientation histogram for interaction of handicapped persons with computer. In: Elmoataz, A., et al. (eds.) ICISP 2008, LNCS, vol. 5099, pp. 378–385 (2008)
141. Valstar, M., Pantic, M., Patras, I.: Motion history for facial action detection in video. In: Proc. IEEE Int. Conf. SMC, vol. 1, pp. 635–640 (2004)
142. Valstar, M., Patras, I., Pantic, M.: Facial action recognition using temporal templates. In: Proc. IEEE Workshop on Robot and Human Interactive Communication, pp. 253–258 (2004)
143. Vitaladevuni, S.N., Kellokumpu, V., Davis, L.S.: Action recognition using ballistic dynamics. In: Proc. CVPR, p. 8 (2008)
144. Wang, L., Hu, W., Tan, T.: Recent developments in human motion analysis. Pattern Recognit. 36, 585–601 (2003)
145. Wang, J.J.L., Singh, S.: Video analysis of human dynamics – a survey. Real-Time Imaging 9(5), 321–346 (2006)
146. Wang, C., Brandstein, M.S.: A hybrid real-time face tracking system. In: Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, p. 4 (1998)
147. Wang, L., Suter, D.: Informative shape representations for human action recognition. Intl. Conf. Pattern Recognit. 2, 1266–1269 (2006)
148. Watanabe, K., Kurita, T.: Motion recognition by higher order local auto correlation features of motion history images. In: Proc. Bio-inspired, Learning and Intelligent Systems for Security, pp. 51–55 (2008)
149. Wei, J., Harle, N.: Use of temporal redundancy of motion vectors for the increase of optical flow calculation speed as a contribution to real-time robot vision. In: Proc. IEEE TENCON – Speech and Image Technologies for Computing and Telecommunications, pp. 677–680 (1997)
150. Weinland, D., Ronfard, R., Boyer, E.: Automatic discovery of action taxonomies from multiple views. In: Proc. CVPR, pp. 1639–1645 (2006)
151. Weinland, D., Ronfard, R., Boyer, E.: Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst. 104(2), 249–257 (2006)
152. Willems, G., Tuytelaars, T., Gool, L.V.: An efficient dense and scale-invariant spatio-temporal interest point detector. In: 10th European Conference on Computer Vision, pp. 650–663 (2008)
153. Wixson, L.: Detecting salient motion by accumulating directionally-consistent flow. IEEE Trans. PAMI 22(8), 774–780 (2000)
154. Wong, S.F., Cipolla, R.: Continuous gesture recognition using a sparse Bayesian classifier. In: Intl. Conf. on Pattern Recognition, vol. 1, pp. 1084–1087 (2006)
155. Wong, S.F., Cipolla, R.: Real-time adaptive hand motion recognition using a sparse Bayesian classifier. In: Intl. Conf. on Computer Vision Workshop, pp. 170–179 (2005)
156. Wong, S.F., Cipolla, R.: Extracting spatiotemporal interest points using global information. In: ICCV, pp. 1–8 (2007)
157. Wren, R., Clarkson, B.P., Pentland, A.P.: Understanding purposeful human motion. In: Proc. Intl Conf. on Automatic Face and Gesture Recognition, pp. 19–25 (1999)
158. Wren, C.R., Azarbayejani, A., Darrell, T., Pentland, A.P.: Pfinder: real-time tracking of the human body. IEEE Trans. PAMI 19(7), 780–785 (1997)
159. Xiang, T., Gong, S.: Beyond tracking: modelling activity and understanding behaviour. Int. J. Comput. Vis. 67(1), 21–51 (2006)
160. Yang, Y.H., Levine, M.D.: The background primal sketch: an approach for tracking moving objects. Mach. Vis. Appl. 5, 17–34 (1992)
161. Yang, X., Zhang, T., Zhou, Y., Yang, J.: Gabor phase embedding of gait energy image for identity recognition. In: 8th IEEE Intl. Conf. on Computer and Information Technology, pp. 361–366, July 2008
162. Yau, W., Kumar, D., Arjunan, S., Kumar, S.: Visual speech recognition using image moments and multiresolution wavelet. In: Proc. Conf. on Computer Graphics, Imaging and Visualization, pp. 194–199 (2006)
163. Yau, W., Kumar, D., Arjunan, S.: Voiceless speech recognition using dynamic visual speech features. In: Proc. HCSNet Workshop on the Use of Vision in HCI, Australia (2006)
164. Yilmaz, A., Shah, M.: Actions sketch: a novel action representation. In: IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 984–989 (2005)
165. Yin, Z., Collins, R.: Moving object localization in thermal imagery by forward-backward MHI. In: Proc. IEEE Workshop on Object Tracking and Classification in and Beyond the Visible Spectrum, NY, pp. 133–140, June 2006
166. Yu, C.-C., Cheng, H.-Y., Cheng, C.-H., Fan, K.-C.: Efficient human action and gait analysis using multiresolution motion energy histogram. EURASIP J. Adv. Signal Process. 2010, 1–13 (2010)
167. Yu, S., Tan, D., Tan, T.: A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. In: Intl. Conf. on Pattern Recognition, pp. 441–444 (2006)
168. Zhang, D., Lu, G.: Review of shape representation and description techniques. Pattern Recognit. 37, 1–19 (2004)
169. Zhou, H., Hu, H.: A survey – human movement tracking and stroke rehabilitation. Tech. Report: CSM-420, Department of Computer Sciences, University of Essex, p. 33, December 2004
170. Zou, X., Bhanu, B.: Human activity classification based on gait energy image and co-evolutionary genetic programming. In: Proc. ICPR, vol. 3, pp. 555–559 (2006)
Author Biographies
Md. Atiqur Rahman Ahad was born in Bangladesh and obtained his B.Sc. (Hons) and Master's degrees from the Department of Applied Physics, Electronics and Communication Engineering, University of Dhaka, Bangladesh. He later received a Master's degree from the School of Computer Science and Engineering, University of New South Wales, Australia. He obtained his Ph.D. degree from the Faculty of Engineering, Kyushu Institute of Technology, Japan. Since 2000, he has taught in different universities, and he has been working at the University of Dhaka, Bangladesh, since 2001 (currently on leave). He also served as a Casual Academic at the University of New South Wales during three sessions from 2002 to 2004. He is currently working as a JSPS Postdoctoral Research Fellow at Kyushu Institute of Technology, Japan. Mr. Ahad is a student member of IEEE, IEEE IES and the Society of Instrument and Control Engineers (SICE). He won the Best Student Paper Award at the International Workshop on Combinatorial Image Analysis (IWCIA), Buffalo, NY, in April 2008, and was awarded the Biomedical Fuzzy Systems Association's Best Paper Award (Journal) in 2008. His present research includes human motion recognition and analysis, motion segmentation, and motion tracking.
Joo Kooi Tan obtained her Ph.D. from Kyushu Institute of Technology in 2000. She is presently an assistant professor with the Faculty of Mechanical and Control Engineering at the same university. Her current main research interests include three-dimensional shape and motion recovery, human motion analysis, human activity recognition and understanding, and applications of computer vision. She received the SICE Kyushu Branch Young Authors Award in 1999, the AROB 10th Young Authors Award in 2004, the Young Authors Award from IPSJ of Kyushu Branch in 2004, the Japanese Journal Best Paper Award from BMFSA in 2008, and the Best Paper Award from ISII in 2009; she has also won the Excellent Paper Award from the Biomedical Fuzzy System Association in 2010. She is a member of IEEE, The Society of Instrument and Control Engineers, and The Information Processing Society of Japan.
Hyoungseop Kim received his B.A. degree in electrical engineering from Kyushu Institute of Technology in 1994, and the Master's and Ph.D. degrees from Kyushu Institute of Technology in 1996 and 2001, respectively. He is an associate professor in the Department of Control Engineering at Kyushu Institute of Technology. His research interests are focused on medical applications of image analysis. He is currently working on automatic segmentation of multiple organs in abdominal CT images, and on temporal subtraction of thoracic MDCT image sets.
Seiji Ishikawa obtained his B.E., M.E., and D.E. degrees from The University of Tokyo, where he majored in Mathematical Engineering and Instrumentation Physics. He joined Kyushu Institute of Technology and is currently Professor in the Department of Control & Mechanical Engineering, KIT. Professor Ishikawa was a visiting research fellow at Sheffield University, U.K., from 1983 to 1984, and a visiting professor at Utrecht University, The Netherlands, in 1996. He was awarded the BMFSA Best Paper Award in 2008 and 2010. His research interests include three-dimensional shape/motion recovery, and human detection and motion analysis from car videos. He is a member of IEEE, The Society of Instrument and Control Engineers, The Institute of Electronics, Information and Communication Engineers, and The Institute of Image Electronics Engineers of Japan.