
Machine Vision and Applications (2012) 23:255–281

DOI 10.1007/s00138-010-0298-4
ORIGINAL PAPER
Motion history image: its variants and applications
Md. Atiqur Rahman Ahad · J. K. Tan · H. Kim ·
S. Ishikawa
Received: 4 January 2010 / Revised: 21 May 2010 / Accepted: 10 September 2010 / Published online: 22 October 2010
Springer-Verlag 2010
Abstract The motion history image (MHI) approach is a
view-based temporal template method which is simple but
robust in representing movements and is widely employed
by various research groups for action recognition, motion
analysis and other related applications. In this paper, we pro-
vide an overview of MHI-based human motion recognition
techniques and applications. Since the inception of the MHI
template for motion representation, various approaches have
been adopted to improve this basic MHI technique. We pres-
ent all important variants of the MHI method. This paper
also points out some areas for further research based on the
MHI method and its variants.
Keywords MHI · MEI · Motion recognition ·
Action analysis · Computer vision
1 Introduction
There are excellent surveys on human motion recognition and
analysis [1–3, 7, 13, 37, 61, 67, 69, 85, 102–104, 111, 114, 118,
131, 144, 145, 157, 169]. These papers cover many detailed
approaches and issues, and most of these have cited the
motion history image (MHI) method [31] as one of the impor-
tant methods. This paper surveys human motion and behavior
analysis based on the MHI and its variants for various appli-
cations. Action recognition approaches can be categorized
into one of the three groups: (i) template matching, (ii) state-
space approaches and (iii) semantic description of human
behaviors [2, 144]. The MHI method is a template matching
Md. A. R. Ahad (✉) · J. K. Tan · H. Kim · S. Ishikawa
Faculty of Engineering, Kyushu Institute of Technology,
1-1, Sensui-cho, Tobata, Kitakyushu, Fukuoka 804-0012, Japan
e-mail: atiqahad@yahoo.com
approach. Approaches based on template matching first con-
vert an image sequence into a static shape pattern (e.g., MHI,
MEI), and then compare it to pre-stored action prototypes
during recognition [144]. Template matching approaches
are easy to implement and require less computational load,
though they are more prone to noise and more suscepti-
ble to the variations of the time interval of the movements.
Some template matching approaches are presented in Refs.
[6, 31, 96, 117, 123]. Moreover, recognition approaches can
be divided into (i) appearance- or view-based approaches, (ii)
generic human model recovery, and (iii) direct motion-based
recognition approaches [31]. Appearance-based motion rec-
ognition is one of the most practical recognition methods for
recognizing a gesture without any incorporation of sensors
on the human body or its neighborhoods. The MHI is a view-
based or appearance-based template-matching approach.
In the MHI, the silhouette sequence is condensed into a
gray-scale image, while dominant motion information is
preserved. Therefore, it can represent a motion sequence in a
compact manner. The MHI template is also not very sensitive
to silhouette noise, such as holes, shadows, and missing parts.
These advantages make these templates suitable candidates
for motion and gait analysis [89]. The MHI keeps a history of
temporal changes at each pixel location, which then decays
over time [159]. It expresses the motion flow or sequence
by using the intensity of every pixel in a temporal manner.
The motion history recognizes general patterns of movement;
thus, it can be implemented with cheap cameras and
lower-powered CPUs [33, 34]. It can also be used in
low-light areas where structure cannot be easily detected.
The paper is organized as follows: Sect. 2 introduces
the basic MHI approach, and then we sum up the variants
of the MHI method in Sect. 3. Section 4 presents various
applications based on these approaches. Section 5 discusses
some issues related to the MHI approach and its variants for
Fig. 1 Development of the MHI images for two different actions. The produced MHI images are shown under the actions sequentially
future research perspectives. Finally, Sect. 6 concludes the
paper.
2 Overview of the motion history image method
This section presents an overview of the MHI method. The
importance of various parameters is analyzed. Finally, several
limitations of the basic MHI method are pointed out.
2.1 MHI and MEI templates
Bobick and Davis [30] first proposed a representation and
recognition theory that decomposed motion-based recognition
by first describing where there is motion (the spatial
pattern) and then describing how the object is moving. They
[30, 31] present the construction of a binary motion energy
image (MEI), or binary motion region (BMR) [47], which
represents where motion has occurred in an image sequence.
The MEI describes the motion-shape and spatial distribution
of a motion. Next, an MHI is generated. The intensity of each
pixel in the MHI is a function of the motion density at that
location. One advantage of the MHI representation is that
a range of times may be encoded in a single frame; in this
way, the MHI spans the time scale of human gestures [33].
Taken together, the MEI and the MHI can be considered
as a two-component version of a temporal template, a
vector-valued image where each component of each pixel is
some function of the motion at that pixel position [31]. These
view-specific templates are matched against stored models
of views of known movements. The incorporation of both the
MHI and the MEI templates constitutes the MHI method. The
MHI H_τ(x, y, t) can be computed from an update function
Ψ(x, y, t):

    H_τ(x, y, t) = τ                                if Ψ(x, y, t) = 1
                 = max(0, H_τ(x, y, t − 1) − δ)     otherwise

Here, (x, y) and t show the position and time, Ψ(x, y, t) signals
the object's presence (or motion) in the current video image,
the duration τ decides the temporal extent of the movement
(e.g., in terms of frames), and δ is the decay parameter. This
update function is called for every new video frame analyzed
in the sequence. The result of this computation is a
scalar-valued image where more recently moving pixels are
brighter, and vice versa [31, 93].
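As a concrete illustration, a minimal sketch of this update rule in Python/NumPy is given below; the function and variable names are ours (not from [31]), and Ψ is assumed to be supplied as a binary motion mask per frame.

    import numpy as np

    def update_mhi(mhi, psi, tau=255.0, delta=1.0):
        # One step of the MHI update: H_tau at time t from H_tau at t - 1.
        # mhi: float array (H x W); psi: binary motion mask Psi(x, y, t).
        decayed = np.maximum(0.0, mhi - delta)   # decay where no motion is present
        return np.where(psi > 0, tau, decayed)   # stamp tau where motion occurs

Applying update_mhi once per frame over a sequence yields the final H_τ, in which brighter pixels mark more recent motion.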
Figure 1 presents the development of the MHI images for
two different actions sequentially. These illustrate clearly that
a final MHI image records the temporal history of the motion.
Some possible image processing techniques for defining
the update function Ψ(x, y, t) are background subtraction,
image differencing and optical flow [165]. More details on
this issue are presented below. Usually, the MHI is generated
from a binarized image, obtained from frame subtraction
[162], using a threshold ξ:
Fig. 2 Example of the MHI and the MEI: the first four columns are some
sequential frames; images in the 5th column are the corresponding MHIs;
and images in the right-most column show the MEI images for both
actions (hand-waving and body-bending: upwards using the left hand (top
row) and downwards using the right hand (bottom row))
    Ψ(x, y, t) = 1    if D(x, y, t) ≥ ξ
               = 0    otherwise

where D(x, y, t) is defined with a difference distance Δ as:

    D(x, y, t) = |I(x, y, t) − I(x, y, t − Δ)|

Here, I(x, y, t) is the intensity value of the pixel at coordinate
(x, y) in the t-th frame of the image sequence.
We can get the final MHI template as H_τ(x, y, t). Now
we will define the MEI. The MEI is the cumulative binary
motion image that can describe where a motion occurs in the
video sequence, computed from the start frame to the final
frame. The moving object's sequence sweeps out a particular
region of the image, and the shape of that region ("where"
there is motion, instead of "how" as in the MHI concept) can
be used to suggest the region where the movement occurs [57]. As
the update function Ψ(x, y, t) represents a binary image
sequence indicating regions of motion, the MEI E_τ(x, y, t)
can be defined as:

    E_τ(x, y, t) = ∪_{i=0}^{τ−1} D(x, y, t − i)

The MEI can be deduced from the MHI (by thresholding the
MHI above zero [31]),
    E_τ(x, y, t) = 1    if H_τ(x, y, t) ≥ 1
                 = 0    otherwise
A benefit of using the gray-scale MHI is that it is sensitive
to the direction of motion, unlike the MEI; hence the
MHI is better suited for discriminating between actions of
opposite directions (e.g., sitting down versus standing up)
[47]. However, both the MHI and the MEI images are important
for representing motion information. The two images
together provide better discrimination than either alone [31].
Figure 2 shows typical MHIs and MEIs for two mirrored actions
of one-hand waving and sideways body-bending.
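Putting the pieces of this section together, the following sketch (ours) computes Ψ by thresholded frame differencing with Δ = 1 and derives the MEI by thresholding the MHI above zero:

    import numpy as np

    def psi_from_frames(prev_gray, curr_gray, xi=30):
        # Psi(x, y, t) = 1 where |I_t - I_{t-1}| >= xi (difference distance of 1 frame)
        d = np.abs(curr_gray.astype(np.int16) - prev_gray.astype(np.int16))
        return (d >= xi).astype(np.uint8)

    def mei_from_mhi(mhi):
        # MEI: the MHI thresholded above zero, as in [31]
        return (mhi >= 1).astype(np.uint8)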
2.2 Dependence on τ and δ
Figure 3 shows the dependence on τ in producing the MHI.
For this action of waving up the left hand (with 26 frames),
we produce different MHIs with different τ values. If the τ
value is smaller than the number of frames, then we lose
prior information of the action in its MHI. For example, when
τ = 15 for an action having 26 frames, we lose the motion
information of the first frame after 15 frames if the value of
the decay parameter (δ) is 1. On the other hand, if the temporal
duration τ is set at a very high value compared to the
number of frames (e.g., 250 in this case for an action with 26
frames), then the changes of pixel values in the MHI template
are less significant. Therefore, this point should be considered
while producing MHIs.
Figure 4 shows the dependence on the decay parameter (δ)
while calculating the MHI image. In the basic MHI method
[31], δ is set to 1. While loading the frames, if there
is no change (or no presence) of motion at a specific pixel
where earlier there was motion, the pixel value is reduced
by δ. However, different δ values may provide slightly
different information; hence the δ value can be chosen
empirically. Researchers need to consider this parameter while
working with the MHI. The top row of Fig. 4 shows final
MHI images for the same action (as shown in Fig. 1 (top
row)) with different δ values (i.e., 1, 3, 5 and 10). We notice
that higher values of δ remove the earlier trail of the motion
sequence. The second row presents a running action. The
first two images are for δ = 1, and the latter two for δ = 3,
while the 1st and 3rd images are taken mid-way and the 2nd
and 4th images are taken at the end of the sequence. We note
that when δ = 3, part of the earlier motion information is
Fig. 3 Dependence on τ to
develop MHI images
Fig. 4 Dependence on δ in
calculating the MHI template
missing. Similarly, the 3rd row shows the MHIs for a walking
action. The bottom row presents MHIs (1st and 3rd) and MEIs
(2nd and 4th) for a walking action when τ is set to 250 instead
of its number of frames, 100. The first two images consider
δ = 3, while the last two images consider δ = 5. This
information is important: based on the demands and action
sets, we can modulate the values of τ and δ while producing
the MHI and the MEI.
Regarding the parameters, a question may arise: under
what circumstances does one want a faster versus a slower
decay? From the above discussion, it is clear that the values
of τ and δ combine to determine how long it takes
for a motion to decay to 0, thus determining the temporal
window size. However, different settings can lead to the same
temporal window (e.g., τ = 10 and δ = 1 leads to the
same temporal window as τ = 100 and δ = 10). The joint
effect of τ and δ also determines how many levels of
quantization the MHI will have; thus the combination of a large τ
and a small δ yields a slowly-changing continuous gradient,
whereas a large τ and a large δ provide a more step-like,
discrete quantization of motion. This provides insight not
only into what parameters and design choices one has, but also
into the impact of choosing different parameters or designs.
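To make this concrete (the numbers are our own illustration, not from the original study): the MHI decays in intensity steps of size δ, so a pixel stamped with τ fades to zero after about τ/δ frames and passes through about τ/δ gray levels on the way. For example, τ = 255 and δ = 1 give a 255-frame window rendered as a near-continuous 8-bit gradient, while τ = 255 and δ = 51 give a 5-frame window with only five coarse intensity steps.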
2.3 Selection of the update function Ψ(x, y, t) for motion
segmentation
Many vision-based human motion analysis systems start
with human detection [144]. Human detection aims at seg-
menting regions of interest corresponding to people from
the rest of an image. It is a significant issue in a human
motion analysis system since the subsequent processes such
as tracking and action recognition are greatly dependent on
the performance and proper segmentation of the region of
interest. Background subtraction, frame differencing, opti-
cal ow, statistical methods for subtraction are renowned
approaches for motion segmentation. Based on the static
or dynamic background, the performance and method for
background subtraction vary. For static background (i.e.,
with no background motion), it is trivial to subtract the
background, when other factors like outdoor or cluttered
scenes are absent. Few dominant methods are enlisted
in Refs. [22, 56, 65, 74, 95, 101, 135, 136, 158, 160]. Some of
these methods employ various statistical approaches, adap-
tive background models (e.g., [74, 135]) and incorporation of
other features (e.g., color and gradient information in [95])
with an adaptive model, in order to handle dynamic backgrounds
or other complex issues related to background subtraction.
For MHI generation, background subtraction was
employed initially by [31, 47].
Frame-to-frame differencing methods are also widely
used for motion segmentation [21, 28, 43, 76, 88, 146]. These
temporal differencing methods employed among two [21, 43,
88, 146] or three consecutive frames [28, 76] are adaptive to
dynamic environments, though we can note poor extraction
of the relevant feature pixels in general. Unless the thresholds
are dened properly, a generation of holes (see Fig. 6) inside
moving objects is a major concern. To generate the MHI and
the MEI, temporal differencing methods are employed (e.g.,
by [8]) as well.
Optical flow methods [27, 29, 66, 91, 94, 113, 138, 149, 153]
can be used for the generation of the MHI and motion
segmentation for various purposes. Ahad et al. [6, 8, 10]
employed optical flow in their variants of the MHI for motion
segmentation to extract the moving object. Computing quality
optical flow from consecutive image frames is a challenging
task. To produce better results on a motion's presence and
its directions from optical flow, the RANSAC (RANdom
SAmple Consensus) method [60] can be employed to reduce
outliers. Based on these refined optical flow vectors, the MHI can
be constructed, thus providing better direction and a clearer
picture of the motion's presence. Ahad et al. [6] employed
optical flow's four channels to compute the MHIs. In this
case, instead of background or frame subtraction, a gradient-based
optical flow vector [42] (Ψ(x, y, t)) is computed
between two consecutive frames and split into four channels
(as depicted in Fig. 5). It is based on the concept of
motion descriptors on smoothed and aggregated optical flow
measurements [55].
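A minimal sketch of this channel splitting (our own; OpenCV's Farnebäck dense flow stands in here for the gradient-based flow of [42]):

    import cv2
    import numpy as np

    def four_channel_flow(prev_gray, curr_gray):
        # Split dense optical flow into left/right/up/down magnitude channels.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        fx, fy = flow[..., 0], flow[..., 1]
        right = np.maximum(fx, 0.0)    # Psi_{+x}
        left  = np.maximum(-fx, 0.0)   # Psi_{-x}
        down  = np.maximum(fy, 0.0)    # Psi_{+y} (image y grows downwards)
        up    = np.maximum(-fy, 0.0)   # Psi_{-y}
        return left, right, up, down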
Though optical flow can produce good results even in the
presence of a small amount of camera motion, it is computationally
complex and very sensitive to noise and to the presence of
texture. Moreover, from a real-time perspective, specialized optical flow
methods can be tried to ascertain whether better results
can be achieved without incorporating special hardware.
Fig. 5 Optical flow is split into four different channels, which are used
to calculate directional MHIs
Beauchemin and Barron [27] and McCane et al. [94] have
presented various methods for optical flow. Seven different
methods are tested in [94] for benchmarking optical
flow methods. Several real-time optical flow methods
[29, 138, 149] have been developed for various motion
segmentation and computation purposes.
Changes in weather, illumination variation, repetitive
motion, and the presence of camera motion or cluttered
environments hinder the performance of motion segmentation
approaches. Therefore, a proper approach is crucial based
on the dataset or environment, especially for outdoor
environments. Extraction of shadows and their removal from the
motion part is another concern in computer vision, and most
importantly in generating the MHI template.
As pointed out above, one important concern is the selection
of the update function Ψ(x, y, t) for motion segmentation
and its threshold value (ξ). Figure 6 demonstrates typical
examples of the selection of threshold values for the frame
subtraction method. The top row presents MHIs for an action
with different threshold values (i.e., 30, 50, 75 and 150 from
left to right). We note the presence of a noisy background when
the threshold is set at 30 (first image of the 1st row).
However, if we increase ξ, we also miss some part of the
motion information (note the presence of a hole in the right-most
image of the top row). In another example (as shown in
the bottom-row images), we use a walking motion in a different
environment and depth. The noisy MHI and MEI images
(the first two images) for a walking action employed ξ = 12,
whereas the next two images (without any noisy background
but with missing information) used ξ = 150. Therefore, the
selection of the update function and its ξ is very crucial for
calculating motion history/energy templates.
Fig. 6 Importance of the
selection of the threshold value (ξ)
for the update function
Ψ(x, y, t). Note the presence of
noise or holes in various images
Fig. 7 Changes in the standing
position of a person (top row)
make the MHIs (1st and 3rd
images) and MEIs (2nd and 4th
images) wider, as shown in the
bottom row (the first two images are
computed at 45 frames and the
remaining two images are at the final
frame)
Another issue is the change of the standing position of a
person while executing an action that is supposed to occur in one
specific location. For example, Fig. 7 depicts a person shifting
their standing position; hence the final MHI becomes wider.
Therefore, if an action does not incorporate movement from
its initial position, then tracking the central point of the
moving body is required to handle this kind of position change.
Another useful option is to normalize the size of the entire
moving body and then create the MHI based on the normalized
moving portion; or to normalize the MHI and the MEI for
further processing. This is crucial for recognition purposes
employing the MHI images, because the MHI method
takes into account the global calculation of the image, and
hence a change of position makes the final MHI wider than
the object of interest and incorporates some unwanted region
of interest.
2.4 Feature vector analysis and classification
Figure 8 shows the system flow of the basic MHI approach
for motion classification and recognition. According to the
basic MHI method [31], feature vectors are calculated using
the seven Hu moments [68] of the MHI and MEI images. Hu
moments are widely used for shape representation [6, 8, 10,
11, 15, 31, 33, 34, 47, 48, 122, 123, 170]. There are other
approaches to obtaining the shape for calculating feature vectors
from the templates. Figure 9 shows various other options
for shape representation based on [68, 80, 81, 168]. Though
Fig. 8 Typical system flow diagram of the MHI method for action
recognition
Hu invariants are widely employed for the MHI and related
methods, other approaches (e.g., Zernike moments [15, 16],
global geometric shape descriptors [15], the Fourier transform
[143]) are also utilized for creating feature vectors. Several
researchers [15, 16, 38, 39, 122] employ PCA to reduce the
dimensions of the feature vectors.
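For instance, the seven Hu moments of an MHI/MEI template can be obtained directly with OpenCV (a sketch of one common practice, ours, not the exact pipeline of [31]; log-scaling is often added because the raw values span many orders of magnitude):

    import cv2
    import numpy as np

    def hu_features(template):
        # Seven Hu invariants of a gray-scale MHI or binary MEI template.
        m = cv2.moments(template.astype(np.float32))
        hu = cv2.HuMoments(m).flatten()                      # 7 invariants
        return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)   # common log-scaling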
After the feature vectors are developed, classification is done
and unknown motions are recognized. These steps are shown
in the system flow diagram of the MHI method (Fig. 8).
For classification, the support vector machine (SVM) [15,
16, 39, 96–99], K-nearest neighbor (KNN) [6, 11, 23, 24, 112,
122, 141], multi-class nearest neighbor [15, 16], Mahalanobis
distance [31, 33, 34, 47] and maximum likelihood (ML) [110]
have been employed.
One could employ (i) the re-substitution method (training
and test sets are the same); (ii) the holdout method (half the
data is used for training and the rest is used for testing);
(iii) the leave-one-out method; (iv) the rotation method or
Fig. 9 Numerous approaches
for shape representations.
Region-based global Hu
moments [68] are considered by
many researchers [6, 8, 10, 11, 15,
33, 34, 47, 48, 122, 123, 170],
including Bobick and Davis [31]
N-fold cross validation (a compromise between the leave-one-out
and holdout methods, which divides the samples
into P disjoint subsets, 1 ≤ P ≤ N, using (P − 1) subsets for
training and the remaining subset for testing); and (v) the
bootstrap method as the partitioning scheme [70]. In most
cases, the leave-one-out cross validation scheme is used as the
partitioning scheme (e.g., [6, 11, 110]). This means that out
of N samples from each of the c classes per database, N − 1 of
them are used to train (design) the classifier and the remaining
one to test it [81]. This process is repeated N times, each
time leaving a different sample out. Therefore, all of the samples
are ultimately used for testing. The process is repeated
and the resulting recognition rates are averaged. Usually, this
estimate is unbiased.
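For scheme (iii), a leave-one-out loop with a KNN classifier might look as follows (a generic scikit-learn sketch of ours, not tied to any specific paper above; X is a NumPy array holding one feature vector per action sample and y the class labels):

    from sklearn.model_selection import LeaveOneOut
    from sklearn.neighbors import KNeighborsClassifier

    def loocv_accuracy(X, y, k=3):
        # Average leave-one-out recognition rate for a KNN classifier.
        hits = 0
        for train_idx, test_idx in LeaveOneOut().split(X):
            clf = KNeighborsClassifier(n_neighbors=k).fit(X[train_idx], y[train_idx])
            hits += int(clf.predict(X[test_idx])[0] == y[test_idx][0])
        return hits / len(X)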
2.5 Limitations of the basic MHI method
Though successful in constrained situations, there are a few
limitations of the basic MHI method. The MHI method, in
its basic representation (which is based on background
subtraction or image differencing approaches), is not suitable
for dynamic backgrounds [129]. However, by employing
approaches that can segment motion information from
a dynamic background, the MHI method can be useful in
dynamic cases too. Occlusions of the body, or an improper
implementation of the update function Ψ(x, y, t), result in
serious recognition failures [2, 31].
The MHI method does not need trajectory analysis [46].
However, its non-trajectory nature can be a problem in
cases where tracking is necessary to analyze a moving
car or person [8]. The MHI representation has been combined with
tracking information for some applications (e.g., by [123]).
It is also limited to label-based (token) recognition, where it
cannot yield any information other than specific identity
matches (e.g., it cannot report that upward motion is
occurring at a particular image location) [24, 49, 50]. This is
because the moment features are computed holistically (and
matched) over the entire template [49].
Another limitation of this method is the requirement of
having stationary objects, and the insufficiency of the
representation to discriminate among similar motions [123].
The MHI method is an appearance-based method. However,
by employing several cameras from different directions and
by combining moment features from these directions, action
recognition can still be achieved. Even so, because different
actions can have similar representations (when seen from
different camera views), this may produce false recognitions
for an action.
Another key problem of this method is its failure to
separate the motion information when there is motion
self-occlusion or overwriting [8, 19, 73, 96, 112, 141]. In this
problem, if an action contains opposite directions (e.g., from
sitting down to a standing position) among its atomic actions,
then the previous motion information (e.g., sitting down) is
deleted or overwritten by the later motion information (e.g.,
standing) (Fig. 10). Therefore, if a person sits down and then
stands up, the final MHI image will contain brighter pixels
in the upper part of the image, representing the stand-up
motion only. It cannot vividly distinguish the direction of the
motion. This self-occlusion of the moving object or person
overwrites the prior information. Like any template matching
approach, the MHI also has the drawback that it is sensitive
to the variance of movement duration [145].
3 Motion history image-based approaches
In this section, MHI-based approaches are presented. We start
with direct implementations of the MHI method for numerous
applications, and afterwards move to versions with some modifications.
Fig. 10 Motion overwriting problem (due to self-occlusion) of the
MHI method
We also categorize and analyze important developments of
the MHI in 2D and 3D domains.
3.1 Various approaches employing the MHI method
3.1.1 Direct implementation of the MHI
Due to its simple representation of an action, the MHI method
is employed by different researchers without any modification
for their respective demonstrations. Rosales [122] and
Rosales and Sclaroff [123] employ the MHI method with
the seven Hu moments, and Rosales [122] uses principal
components analysis (PCA) to reduce the dimensionality of this
representation. The system is trained using different subjects
performing a set of examples of every action to be recognized.
Given these samples, K-nearest neighbor, Gaussian,
and Gaussian mixture classifiers are used to recognize new
actions. Experiments are conducted using instances of eight
human actions performed by seven different subjects, and
good recognition results are achieved. Rosales and Sclaroff
[123] propose a trajectory-guided recognition method. It
tracks an action by employing an extended Kalman filter and
then uses the MHI for action recognition via a mixture-of-Gaussians
classifier. They test the system on recognizing different
dynamic outdoor activities.
Jan [71] hypothesizes that a suspicious person in a
restricted parking lot would display erratic patterns in his/her
walking trajectories (to inspect vehicles and their belongings
for possible malicious attempts). To this aim, trajectory
information is collected and its MHI (based on profiles of changes
in velocity and in acceleration) is computed. The highest,
average and median MHI values are profiled for each individual
in the scene. Though it is a simple hypothesis, collecting
real-time data from surveillance devices seems
challenging. Nonetheless, it is an initial attempt to
analyze such information, which can be exploited
to decide on possibly suspicious behaviors. Apparently there can
be far more features than just trajectories (and their velocities and
accelerations).
Alahari and Jawahar [18] model action characteristics by
MHIs for some hand gesture recognition and four different
actions (i.e., jumping, squatting, limping and walking). They
introduce "discriminative actions", which describe the usefulness
of the fundamental units in distinguishing between
events. They achieve an average 30.29% reduction in error for
some event pairs.
Shan et al. [129] employ the MHI for hand gesture recognition
considering the trajectories of the motion. They employ
a mean shift embedded particle filter, which enables a robot
to robustly track natural hand motion in real-time. Then, an
MHI for a hand gesture is created based on the hand tracking
results. In this manner, spatial trajectories are retained
in a static image, and the trajectories are called temporal
template-based trajectories (TTBT). Hand gestures are
recognized based on statistical shape and orientation analysis of
the TTBT. By applying this hand tracking algorithm and gesture
recognition approach to a wheelchair, they realize a
real-time hand control interface for the robot. Meng et al.
[100] developed a simple system based on an SVM classifier
and MHI representations, which is implemented on a
reconfigurable embedded computer vision architecture for
real-time gesture recognition. In another work, by Vafadar and
Behrad [140], the MHI is employed for gesture recognition
for interaction with handicapped people. In this approach,
after constructing the MHI for each gesture, a motion
orientation histogram vector is extracted. These vectors are then
used for training a hidden Markov model (HMM) and
hand gesture recognition.
Yau et al. [162, 163] decompose MHIs into wavelet sub-images
using the stationary wavelet transform (SWT). The motivation
for using the MHI in visual speech recognition is
the ability of the MHI to remove static elements from the
sequence of images and preserve the short-duration, complex
mouth movements. The MHI is also invariant to the skin color
of the speakers due to the frame differencing and image
subtraction process involved in the generation of the MHI. Here,
the SWT is used to denoise and to minimize the variations
between the different MHIs of the same consonant. Three
moment-based features are extracted from SWT sub-images
to classify three consonants only.
The MHI is used to produce input images for the line
fitter, which is a system for fitting lines to a video sequence
that describe its motion [58]. It uses the MHI method for
summarizing the motion depicted in video clips; however,
it fails with rotational motion. Rotations are not encoded in
the MHIs because the moving objects occupy the same pixel
locations from frame to frame, and new information
overwrites old information. Another failure case is that of
curved motion. Obviously, the straight-line model is inadequate
here. In order to improve performance, a more flexible
model is needed.
Orrite et al. [110] propose a silhouette-based action modeling
for recognition where they employ the MHI directly
as the input feature of the actions. These 2D templates
are then projected into a new subspace by means of the Kohonen
self-organizing feature map (SOM). Action recognition is
accomplished by a maximum likelihood (ML) classifier. In
another experiment, Tan and Ishikawa [139] employ the
MHI method and their proposed method to compare six different
actions. Their results show a poor recognition rate for the MHI.
After analyzing the datasets, it appears that actions in the
dataset involve motion overwriting; hence, it is understandable
that the MHI method may have a poor recognition rate for this
type of dataset. Also, Meng et al. [96–99] and Ahad et al. [8]
compare the recognition performance of the MHI method
with their HMHH and DMHI methods, respectively, on several
different datasets (one with a radio-aerobics dataset and
another with the KTH dataset). These datasets have motion
overwriting due to self-occlusion, and therefore their approaches
outperform the MHI method in terms of average recognition
rates.
3.1.2 Implementation of the MHI with some modifications
In this sub-section, several methods and applications are presented
where the MHI method is exploited with little modification,
or where an almost similar route is taken in developing
the motion cues. The MHI method, or the MHI and/or MEI
templates, are implemented with some modifications by several
researchers in different applications. To start with, Han
and Bhanu [63, 64] proposed the gait energy image (GEI),
which targets the representation of normal human walking,
based on the concept of the MEI. The GEI is implemented as
a gait template for individual gait recognition. As compared
to the MEI and the MHI, the GEI specifically targets normal
human walking representation [63]. Given the preprocessed
binary gait silhouette images B_t(x, y) at time t in a video sequence,
the gray-level GEI is defined as,
    G(x, y) = (1/N) Σ_{t=1}^{N} B_t(x, y)
where N is the number of frames in the complete cycle(s) of
a silhouette sequence and t is the frame number in the sequence
(moment of time) [63]. Therefore, the GEI is a time-normalized
accumulative energy image of human walking
over the complete cycle(s). Though it performs very well for
gait recognition, it seems from the construction of the equation
that for human activity recognition this approach
might not perform as well as the MHI method does. In
a similar fashion, Zou and Bhanu [170] employ the GEI
and co-evolutionary genetic programming (CGP) for human
activity classification. They extract Hu moments and normalized
histogram bins from the original GEIs as input features.
The CGP is employed to reduce the feature dimensionality
and learn the classifiers. Bashir et al. [25, 26] and Yang et al.
[161] implement the GEI directly for human identification
with different feature analyses.
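The GEI computation reduces to an average of binary silhouettes; a minimal sketch (ours), where silhouettes stacks the N preprocessed binary frames B_t of one full gait cycle:

    import numpy as np

    def gait_energy_image(silhouettes):
        # silhouettes: array of shape (N, H, W) with values in {0, 1};
        # G(x, y) = (1/N) * sum over t of B_t(x, y).
        return silhouettes.astype(np.float32).mean(axis=0)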
Similar to the development of the GEI, an action energy
image (AEI) is proposed for activity classification by
Chandrashekhar and Venkatesh [38]. They use the eigen-decomposition
of an AEI in an eigen-activity space obtained by PCA,
which best represents the AEI data in the least-squares sense.
AEIs are computed by averaging silhouettes; unlike the
MEI, which captures only where the motion occurred, the AEI
captures where and how much the motion occurred. The
MEI carries less structural information since it is computed
by accumulating motion images obtained by image differencing,
while the AEI incorporates information about
both structure and motion. They experiment with their AEI
concept on walking and running motions and achieve good
results. On the other hand, Liu and Zheng [89] propose a
method called the gait history image (GHI) for gait representation
and recognition. The GHI inherits the idea of the MHI
in the sense that temporal information and spatial information
are recorded in both cases. The GHI preserves
the temporal information besides the spatial information. It
overcomes the GEI's shortcoming of having no temporal variation.
However, each cycle yields only one GEI or GHI template,
which easily leads to the problem of insufficient training
cycles [41].
Moreover, the gait moment energy (GMI) method is developed
by Ma et al. [92] based on the GEI. The GMI is the
gait probability image at each key moment of all gait cycles.
In this approach, the corresponding gait images at a key
moment are averaged as the GEI of that key moment. They
introduce the moment deviation image (MDI) by using silhouette
images and GMIs. As a good complement to the GEI, the
MDI provides more motion features than the GEI. Both the MDI
and the GEI are utilized to represent a subject. However, it is not
easy for the GMI to select key moments from cycles with
different periods. Therefore, to compensate for this problem, Chen
et al. [41] propose a cluster-based GEI approach. In this
case, the GEIs are computed from several clusters and the
dominant energy image (DEI) is obtained by denoising the
averaged image of each cluster. Frieze and wavelet features
are adopted and an HMM is employed for recognition.
This approach performs better than the GEI, GHI and
GMI representations, as it is superior (due to its clustered
concept) when the silhouette suffers from incompleteness or noise.
Wang and Suter [147] directly convert an associated
sequence of human silhouettes derived from videos into two
types of computationally efficient representations, namely,
average motion energy (AME) and mean motion shape
(MMS), to characterize actions. These representations are
used for recognition. The MMS is based on shapes,
not silhouettes (in a manner similar to the AME). The process
of generating the AME is computationally inexpensive and
can be employed in real-time applications [166]. The AME
is computed in exactly the same manner as the GEI,
though the former is exploited for action recognition
whereas the latter is used for gait recognition. In
calculating the AME, Wang and Suter [147] employ the sum
of absolute differences (SAD) for action recognition
and obtain adequate recognition results. However, for large
image sizes or databases, the computation of the SAD is inefficient
and computationally expensive. This constraint is addressed
by Yu et al. [166], who propose a histogram-based approach
which can efficiently compute the similarity among patterns.
As an initial step, an AME image is converted to the
motion energy histogram (MEH).
From a histogram point of view, we can regard the AME as
a two-dimensional histogram whose bin value represents the
frequency at a position during a time interval. Thus, we can
reform the AME into the MEH by using:

    MEH(x, y) = AME(x, y) / Σ_{(x,y)} AME(x, y)

Then, a multi-resolution structure is adopted to construct the
multi-resolution motion energy histogram (MRMEH).
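The conversion from AME to MEH is a simple normalization, e.g. (our sketch):

    import numpy as np

    def motion_energy_histogram(ame):
        # Treat the AME image as a 2D histogram and normalize its bins to sum to 1.
        total = ame.sum()
        return ame / total if total > 0 else ame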
A high-speed human motion recognition technique is proposed
based on a modified-MHI and a superposed motion
image (SMI) [108]. Using a multi-valued differential image
f_i(x, y, t) to extract information about human posture,
they propose a modified-MHI that can be defined as

    H_δ(x, y, t) = max(f_i(x, y, t), δ · H_δ(x, y, t − 1)),

where H_δ(x, y, t) is the modified-MHI and the parameter δ is a
vanishing rate, set at 0 < δ ≤ 1. An SMI is the maximum-value
image generated from summing the past
successive images with an equal weight, and we can get the
SMI by putting δ = 1 in the modified-MHI. Employing these
images, a motion is described in an eigenspace as a set of
points, and each SMI plays the role of a reference point. By
calculating the correlation between reference SMIs and the MHI
generated from an unknown motion sequence, a match is
found with images described in the eigenspace to recognize
the unknown motion. Experimental results show good
performance [99, 108] on different datasets. This method,
however, is highly dependent on the parameter δ and is
database-specific.
An approach to generate motion-based patterns, where
moving objects are first segmented by employing adaptive
threshold-based change detection, is proposed by [72]. They
use scalar-valued rear-MHI and front-MHI templates to represent how
the motion evolves. These are used to segment and measure the
motion. After that, motion vectors with orientation and
magnitude are generated from the chamfer distance. Finally, an
approach is derived to generate an intra-MHI for inner moving
parts.
Singh et al. [132] use the MHI and the MEI, and develop
a motion color image (MCI) by combining motion and color
cues. The MCI is constructed by a bit-wise OR from the MEIs
of the four previous levels and color localization data. They
dynamically control the frame differencing. Later they divide
the MCI into nine boxes and compute the motion pixels of the motion
data in the MHI in each of the nine boxes. Feature vectors
are calculated as the sum of motion pixels in each box for
classification. Its recognition rate is highly dependent on the
training data.
A scene-specific, adaptive camera navigation model is
constructed for video surveillance by automatically learning
locations of high activity [52]. It measures activity from the
MHI at each view across the full viewable field of a PTZ camera.
For each MHI blob of the scene, it determines whether
the blob is a potential candidate for human activity. Later, the
intensity fade of each MHI blob is examined against noise.
Using this iterative candidacy-classification-reduction process,
one can produce an "activity map", where brighter areas
correspond to locations with more activity.
Vitaladevuni et al. [143] present a Bayesian framework for
recognizing actions through ballistic dynamics. It temporally
segments videos into their atomic movements and enhances the
performance of the popular MHI feature. This ballistic segmentation
with the MHI improves the recognition rate (over
that obtained by using only the MHI).
Ahmad et al. [14–16] propose spatio-temporal silhouette
representations, called the silhouette energy image (SEI) and
silhouette history image (SHI), to characterize motion and shape
properties for recognition of human movements. The SEI
and the SHI are constructed by using the silhouette image
sequence of an action. They employ the Korea University
gesture database and the KTH database [127] for recognition.
The computations of the SHI and the SEI follow exactly the same
concepts as the MHI and the GEI, respectively. They are computed
from the silhouette images, rather than from direct motion
images of the actions (though the MHI can also be computed from
silhouette images). From the SEI and the SHI, they compute a
human shape variability model to approximate the anthropometric
variability of different actions [14].
Watanabe and Kurita [148] propose new features for
motion recognition, called higher-order local autocorrelation
(HLAC) features. These are extracted from MHIs and
have good properties for motion recognition. The features
are tested using image sequences of pitching in baseball
games, achieving good recognition results on their action
datasets.
An edge motion history image (EMHI) method [39, 40]
is computed by combining edge detection and the MHI technique.
It is extracted as a temporally compressed feature vector
from a short video sequence. Usually, the background is not
easy to extract in news and sports videos with complex
background scenes. Moreover, stereo depth information
is usually not available in video data either. Therefore,
instead of using the MHI directly, they propose to use edge
information detected in each frame, instead of silhouettes,
to compute an EMHI. Let B_t(x, y) be a binary value
indicating whether a pixel is located on an edge at time t. An EMHI
(EMHI_t(x, y)) is computed from the EMHI of the previous
frame, EMHI_{t−1}(x, y), as:
    EMHI_t(x, y) = τ                                 if B_t(x, y) = 1
                 = max(0, EMHI_{t−1}(x, y) − 1)      otherwise
In this equation, the basic features of the EMHI are edges.
Later, they manage scale adaptation and noise (as existing
edge detection algorithms are sensitive to noise). The
motion history concept can help to smooth noise and provide
historical motion clues to help a human vision system
build correspondences on edge points [40]. They develop a
layered Gaussian mixture model (LGMM) to exploit these
features for classifying various shots in video.
Another work conceptually similar to the MHI method
[31] is proposed by Masoud and Papanikolopoulos [93]. This
method extracts motion directly from the image
sequence. At each frame, motion information is represented
by a feature image, which is calculated efficiently using an
infinite impulse response (IIR) filter. In particular, they use
the response of the IIR filter as a measure of motion in
the image. The idea is to represent motion by its recentness:
recent motion is represented as brighter than older
motion, just like [31]. This technique, also called recursive
filtering, is simple and time-efficient. Unlike the MHI method
[31], an action is represented by several feature images [93]
rather than just two images (namely, the MHI and the MEI
images) [31].
3.2 Variants of the MHI method in 2D
3.2.1 Solutions to motion self-occlusion problem
One of the key limitations of the MHI method is its inability
to perform well in the presence of motion overwriting due to
self-occlusion. Several attempts have targeted this
issue, so that multi-directional activities can be
represented with the MHI concept. One
initial approach is the multiple-level MHI (MMHI) method
[112, 141, 142]. It aims at overcoming the problem of motion
self-occlusion by recording motion history at multiple time
intervals (i.e., multi-level MHIs). It creates all MHIs with
a fixed number of history levels n, so each image sequence
is sampled to (n + 1) frames. The MMHI is computed as
follows:
    MMHI(x, y, t) = s · t                   if Ψ(x, y, t) = 1
                  = MMHI(x, y, t − 1)       otherwise

where s = (255/n) is the intensity step between two history
levels, and MMHI(x, y, t) = 0 for t ≤ 0. The final template
is found by iteratively computing the above equation for
t = 1, . . . , n + 1. This method encodes motion occurrences
at different time instances at the same pixel location in such
a manner that they can be uniquely decoded afterwards. For
this purpose, it uses a simple bit-wise coding scheme. If a
motion occurs at time t at pixel location (x, y), it adds 2^(t−1)
to the old motion value of the MMHI as follows:

    MMHI(x, y, t) = MMHI(x, y, t − 1) + Ψ(x, y, t) · 2^(t−1)
Due to this bitwise coding scheme, one can separate multiple
actions occurring at the same position [141]. The work focuses
on automatic detection of the facial action units that compose
expressions. It requires a sophisticated registration system,
because all employed image sequences must have the faces at
the same position and at the same scale. The results do not
clearly demonstrate the superiority of the MMHI with respect
to the basic MHI [20, 141]. Even in their reports, the MMHI
produces lower recognition results than the MHI [142]. However,
they point out that the self-occlusion (motion overwriting)
problem might be solved using this MMHI. Ahad
et al. [11] implement the MMHI with an aerobics dataset and
another action dataset, but find that the MMHI method
shows poor recognition results.
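A minimal sketch of this bit-wise coding (ours): with n history levels, a sampled time step t contributes bit 2^(t−1), so the per-pixel motion history remains uniquely decodable.

    import numpy as np

    def update_mmhi(mmhi, psi, t):
        # Add bit 2^(t-1) where motion Psi occurs at sampled time step t (1-based).
        return mmhi + psi.astype(np.uint32) * (1 << (t - 1))

    def decode_motion_times(value, n_levels):
        # Recover which of the n sampled time steps saw motion at one pixel.
        return [t for t in range(1, n_levels + 1) if value & (1 << (t - 1))]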
The motion overwriting or self-occlusion problem of the MHI
method is robustly solved by the directional motion history
image (DMHI) method [8]. In this approach, instead of background
or frame subtraction, gradient-based optical flow is
calculated between two consecutive frames and split into
four channels (see Fig. 5, shown above). Based on this
strategy, one can get four directional motion templates for the
left, right, up and down directions. The corresponding four
history images are calculated as:
    DMHI_α(x, y, t) = τ                                  if Ψ_α(x, y, t) > ξ
                    = max(0, DMHI_α(x, y, t − 1) − δ)    otherwise,

where α denotes the four different directions,
α ∈ {up (Ψ_{+y}), down (Ψ_{−y}), right (Ψ_{+x}), left (Ψ_{−x})}. For the positive
and negative horizontal directions, the DMHI_{+x}(x, y, t) and
DMHI_{−x}(x, y, t) image templates are obtained as motion
history templates. Also, DMHI_{−y}(x, y, t) and DMHI_{+y}(x, y, t)
represent the negative and positive vertical directions,
respectively. These four motion history templates
resemble the directions of the motion vectors. Each DMHI
template is passed through a median filter to smooth noisy
patterns, and hence smoothed DMHI images medH_α(x, y, t) are
computed:

    medH_α(x, y, t) = med(DMHI_α(x, y, t))

where med(·) is the median filter function. We compute
four MEIs after thresholding these templates above zero:

    DMEI_α(x, y, t) = 1    if medH_α(x, y, t) ≥ 1
                    = 0    otherwise
This method solves the overwriting problem significantly.
Several complex actions and aerobics exercises (which involve more than
one direction) are tested. More than 94% recognition
is achieved with the DMHI method, whereas
the MHI shows around 50% recognition. The DMHI
method requires four history templates and four energy templates
for the four directions; hence the size of the feature vector
becomes large, and the method becomes computationally somewhat
more expensive than the MHI. In recent work based on this
approach, various reduced-size feature vectors have been
proposed which can recognize motions faster with almost
the same recognition result [4]. Moreover, combining
cues from the DMHI and the MEI representations for
each action (with an outdoor action dataset) also gives
satisfactory results [17]. The DMHI is also employed
for low-resolution action recognition, because it keeps information
on the motion components even when the resolution
is poor. With low-resolution video sequences (from 320×240
down to 64×48 pixels), the recognition results are very promising
[5]. However, at very low resolutions, due to the lack of pixel
information, it becomes difficult to obtain significant information
for recognition. If there is no motion information in
the final history or energy templates, then feature vectors cannot
be computed from these templates. Another improvement,
called the timed-DMHI, is proposed [12] to cover similar actions
having different speeds. This concept is simple but not robust.
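A minimal sketch of the DMHI update for one direction (ours; the same routine is run on each of the four flow channels of Fig. 5):

    import numpy as np

    def update_dmhi(dmhi, psi_dir, tau=255.0, delta=1.0, xi=1.0):
        # psi_dir: non-negative flow-magnitude channel Psi_alpha at time t.
        decayed = np.maximum(0.0, dmhi - delta)
        return np.where(psi_dir > xi, tau, decayed)

    def dmei_from_dmhi(dmhi_smoothed):
        # Directional MEI: threshold the (median-filtered) DMHI above zero.
        return (dmhi_smoothed >= 1).astype(np.uint8)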
Earlier, Meng et al. [96] proposed an SVM-based system
called the hierarchical motion history histogram (HMHH).
In [97–100], they compare other methods (i.e., the modified-MHI
and the MHI) to demonstrate the robustness of the HMHH in
recognizing several actions. This representation retains more
motion information than the MHI, while remaining inexpensive
to compute [100]. In this approach, to solve the overwriting
problem, they define some patterns P_i in the motion mask
(D(x, y, :)) sequences, based on the number of connected
1s, e.g.,
    P_1 = 010, P_2 = 0110, P_3 = 01110, . . . , P_M = 01···10 (with M 1s)
Now define a subsequence C_i = b_{n1}, b_{n2}, . . . , b_{ni} and
denote the set of all sub-sequences of D(x, y, :) as
{D(x, y, :)}. Then, for each pixel, count the number of
occurrences of each specific pattern P_i in the sequence
D(x, y, :) as shown:

    HMHH(x, y, P_i) = Σ_j 1{C_j = P_i | C_j ∈ {D(x, y, :)}}
Here, 1{·} is the indicator function. Hence, from each pattern
P_i, one gray-scale image (called a motion history
histogram, MHH) is constructed, and in aggregation, all the MHH
images are called the hierarchical MHH (HMHH). They use these final
feature images for classification and then recognition using
an SVM.
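A naive per-pixel sketch of this pattern counting (ours; it treats runs touching the sequence boundary as bounded, a simplification):

    import numpy as np
    from itertools import groupby

    def hmhh(masks, max_pattern=4):
        # masks: binary motion-mask sequence D of shape (T, H, W).
        # Returns one MHH image per pattern P_i (a run of exactly i consecutive 1s).
        T, H, W = masks.shape
        out = np.zeros((max_pattern, H, W), dtype=np.float32)
        for yy in range(H):
            for xx in range(W):
                for val, run in groupby(masks[:, yy, xx]):
                    n = len(list(run))
                    if val == 1 and n <= max_pattern:
                        out[n - 1, yy, xx] += 1
        return out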
These solutions are compared in [11] to show their respective
robustness in solving the overwriting problem of the
MHI method [31]. The employed dataset has some activities
that are complex in nature and involve motion overwriting.
For the HMHH method, four patterns are considered, as more
than four patterns do not provide significant information. For
every activity, the recognition result with the DMHI representation
is very satisfactory (about 94% recognition). Though
the HMHH representation achieves better results than the
MHI and the MMHI representations, the performance of the
HMHH is unsatisfactory, as it achieves only about 67% recognition.
Kellokumpu et al. [73] extract spatially enhanced local
binary pattern (LBP) histograms from the MHI and MEI
temporal templates and model their temporal behavior with
HMMs. They select a fixed frame number. The computed
MHI is divided into four sub-regions through the centroid of
the silhouette. All MHI and MEI LBP features are concatenated
into one histogram and normalized so that the histogram
sums to one. In this case, the temporal modeling
is done using HMMs. This texture-based description
of movements can handle the overwriting problem of the MHI.
One concern of this approach is the choice of the sub-region
division scheme for every action.
3.2.2 Solutions to some issues of the MHI in 2D
To overcome several constraints of the MHI method [31],
various developments have been proposed in both the 2D and 3D
domains. This sub-sub-section covers some other variants
of the MHI method in 2D; 3D extensions are covered
in the sub-section afterwards. Davis [51] presents a method
for recognizing movement that relies on localized regions of
motion, which are derived from the MHI. He offers a real-time
solution for recognizing some movements by gathering
and matching multiple overlapping histograms of the motion
orientations from the MHI. In this extension of the original
work [31], Davis explains a method to handle variable-length
movements as well as the occlusion issue. The directional histogram
for each body region has twelve bins (30 degrees each),
and the feature vector is a concatenation of the histograms of
the different body regions.
In another update, the MHI is generalized by directly
encoding the actual time in a floating-point format, which is
called the timed-motion history image (tMHI) [33, 34]. In the tMHI,
new silhouette values are copied in with a floating-point
timestamp. This MHI representation is updated as follows,
using not frame numbers but the timestamp of the
video sequence [34]:
    tMHI_δ(x, y) = τ    if the current silhouette is at (x, y)
                 = 0    else if tMHI_δ(x, y) < (τ − δ),

where τ is the current timestamp and δ is the maximum time
duration constant (typically a few seconds) associated with
the template. This method makes the representation independent
of the system speed or frame rate (within limits), so that
a given gesture covers the same MHI area at different
capture rates. They also present a method of motion segmentation
based on segmenting layered motion regions that
are meaningfully connected to movements of the object of
interest. The segmented regions are not motion blobs, but
motion regions that are naturally connected to parts of moving
objects. This is motivated by the fact that segmentation
by collecting blobs of similar directional motion does not
guarantee the correspondence of the motion over time. This
motion segmentation, together with silhouette pose recognition,
provides a very general and useful tool for gesture and
motion recognition [34]. This approach is later employed by
Senior and Tosunoglu [128] for tracking objects in real-time;
they use the tMHI for motion segmentation.
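OpenCV's extra modules provide this timestamp-based update as cv2.motempl.updateMotionHistory (available in opencv-contrib builds); a usage sketch with parameter values of our own choosing:

    import time
    import cv2
    import numpy as np

    MHI_DURATION = 1.0  # delta: keep roughly one second of motion history

    tmhi = np.zeros((240, 320), np.float32)   # float32 history image

    def update_tmhi(tmhi, silhouette):
        # silhouette: 8-bit binary mask of the current frame
        timestamp = time.monotonic()           # tau: current time in seconds
        cv2.motempl.updateMotionHistory(silhouette, tmhi, timestamp, MHI_DURATION)
        return tmhi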
The motion gradient orientation (MGO) is also computed
by Bradski and Davis [34] from the interior silhouette pixels
of the tMHI. These orientation gradients are employed
for recognition. Wong and Cipolla [154, 155] exploit MGO
images to form motion features for gesture recognition.
Pixels in the MGO image encode the change in orientation
between the nearest moving edges shown in the MHI, and the
region of interest is defined as the largest rectangle covering
all bright pixels in the MEI. Therefore, the MGO
contains information about where and how a motion has
occurred [155].
The MHI's limitation relating to global image feature
calculations can be overcome by computing a dense local
motion vector field directly from the MHI to describe
the movement [49]. Davis [49] extends the original MHI
representation into a hierarchical image pyramid format to
provide a means of addressing the gradient calculation
at multiple image speeds. An image pyramid is constructed
by recursively low-pass filtering and sub-sampling an image
(i.e., power-of-2 reduction with anti-aliasing) until reaching
a desired amount of spatial reduction. The result is a hierarchy of
motion fields, where the computed motion in each
level is tuned to a particular speed (i.e., faster speeds
reside at higher levels). The hierarchical MHI (HMHI) is
not created directly from the original MHI, but through the
pyramid representation of the silhouette images. Afterwards,
based on the orientations of the motion flow (computed from
the MHI pyramid), a motion orientation histogram (MOH)
is produced. The resulting motion is characterized by a polar
histogram. The HMHI approach remains a computationally
inexpensive algorithm to represent, characterize and recognize
human motion in video [100].
3.2.3 Motion separation and identification approach
Based on the DMHI template [8], a temporal segmentation
(separation) scheme that decomposes a complex motion into its
primitives is proposed [6]. This temporal motion segmentation
method can produce an intermediate interpretation of a complex
motion in terms of four directions, namely, right, left, up and down.
After obtaining the motion templates for a complex action or
activity, it calculates the volume of pixel values (Φ_α) by summing
up the brightness levels of the motion templates. For
consecutive frames, it is

    Φ_α^t = Σ_{x=1}^{M} Σ_{y=1}^{N} DMHI_α(x, y, t)
One can decide the label α ∈ {up, down, left, right} of
the segmented motion based on threshold values that determine
the starting point of a motion (ε_s) and the ending point of
that motion (ε_e), using the difference

    ΔΦ_t = Φ^{t+k} − Φ^t

Here, ΔΦ_t is the difference between two volumes of pixel values
(Φ) for two frames, and the variable k is the frame offset. When
the difference ΔΦ_t is more than the starting threshold value ε_s,
we can decide the label of the segmented motion. However,
when ΔΦ_t falls below the ending threshold (ε_e > ΔΦ_t), we can say that the scene
is static or that an earlier motion is no longer present. Therefore,
based on this mechanism applied to the motion history templates,
a complex motion sequence is easily segmented into
four directions. This is very useful for an intelligent robot when
deciding the directions of a human movement. Thus an action
can be understood based on some consecutive left-right-up-down
combination [6].
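As a rough illustration of this mechanism (our own sketch; the thresholds eps_start and eps_end and all names are hypothetical and would be tuned empirically):

    import numpy as np

    def direction_volumes(dmhi_by_dir):
        # Phi_alpha: sum of brightness levels of each directional history template.
        return {alpha: float(tpl.sum()) for alpha, tpl in dmhi_by_dir.items()}

    def label_segment(phi_prev, phi_curr, eps_start=1e4, eps_end=1e3):
        # Pick the direction whose volume grew the most between the two frames.
        diffs = {a: phi_curr[a] - phi_prev[a] for a in phi_curr}
        alpha = max(diffs, key=diffs.get)
        if diffs[alpha] > eps_start:      # a new motion in direction alpha starts
            return alpha
        if all(abs(d) < eps_end for d in diffs.values()):
            return 'static'               # earlier motion no longer present
        return None                       # no decision at this frame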
3.2.4 Other 2D developments
An advantage of the MHI is that, although it is a representation
of the history of pixel-level changes, only one previous
frame needs to be stored. However, explicit information
about the past of each pixel location is also lost in the MHI when
current changes are updated to the model, with their corresponding
MHI values jumping to the maximal value [159].
To overcome this problem, Ng and Gong [105] propose a
pixel signal energy (PSE) in order to measure the mean magnitude
of pixel-level temporal energy over a period of time.
It is defined by a backward window. The size of the window
determines the number of frames (history) to be stored [106].
Another recent development of the MHI representation
is the pixel change history (PCH) [159]. This can measure
multi-scale temporal changes at each pixel. The PCH of a
pixel, P_{ς,τ}(x, y, t), can be defined by
    P_{ς,τ}(x, y, t) = min(P_{ς,τ}(x, y, t − 1) + 255/ς, 255)    if D(x, y, t) = 1
                     = max(P_{ς,τ}(x, y, t − 1) − 255/τ, 0)      otherwise,
where $D(x, y, t)$ is the binary foreground image, $\varsigma$ is an accumulation factor and $\tau$ is a decay parameter. When $D(x, y, t) = 1$, the value of a PCH increases gradually according to the accumulation factor, instead of jumping to the maximum value. When no significant pixel-level visual change is detected at location $(x, y)$ in the current frame, the pixel $(x, y)$ is treated as part of the background and the corresponding PCH starts to decay, at a speed controlled by the decay parameter. In fact, the MHI is a special case of the PCH: a PCH image is equivalent to an MHI image when the accumulation factor $\varsigma$ is set to 1. Compared to the PCH, the MHI has weaker discriminative power to distinguish different types of visual changes. Moreover, similar to the PSE [105], a PCH can also capture a zero-order pixel-level change, i.e., the mean magnitude of change over time [159].
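A sketch of one PCH update step under the equation above (the parameter values are arbitrary assumptions for illustration):

    import numpy as np

    def update_pch(pch, foreground, accum=5.0, decay=20.0):
        # pch: float array in [0, 255]; foreground: binary mask D(x, y, t).
        # Moving pixels ramp up by 255/accum per frame; background pixels
        # fade by 255/decay. With accum = 1 the ramp-up is immediate and
        # the PCH degenerates to an ordinary MHI, as noted in the text.
        up = np.minimum(pch + 255.0 / accum, 255.0)
        down = np.maximum(pch - 255.0 / decay, 0.0)
        return np.where(foreground == 1, up, down)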
MHIs can also be used to detect and interpret actions in compressed video data. Compressed-domain human motion is recognized on top of the MHI approach by introducing the motion flow history (MFH) [23, 24]. The MFH quantifies the motion in the compressed video domain. Motion vectors are extracted from the compressed MPEG stream by partial decoding. Then noise is reduced, and the coarse MHI and the corresponding MFH are constructed at macro-block resolution instead of at pixel resolution, which reduces the computation by a factor of 16. The MFH can be computed according to the following equations:
$$\mathrm{MFH}_x(x,y,t) = \begin{cases} v_x(x,y,t) & \text{if } E(v_x(x,y,t)) < T \\ M(v_x(x,y,t)) & \text{otherwise} \end{cases}$$

where

$$E(v_x(x,y,t)) = \left[v_x(x,y,t) - \mathrm{med}\left(v_x(x,y,t-1), \ldots, v_x(x,y,t-w)\right)\right]^2$$

$$M(v_x(x,y,t)) = \mathrm{med}\left(v_x(x,y,t-1), \ldots, v_x(x,y,t-w)\right)$$
Here $\mathrm{med}(\cdot)$ denotes the median filter, $v_x(x,y,t)$ can be the horizontal ($v_x$) or vertical ($v_y$) component of the motion vector located at $(x, y)$ in frame $t$, and $w$ indicates the number of previous frames considered for median filtering. The function $E(\cdot)$ checks the reliability of the current motion vector with respect to the former non-zero motion vectors at the same location against a predefined threshold $T$. The MFH gives information about the extent of the motion at each macro-block (where and how much motion has occurred). The MHI, which has spatio-temporal information but no motion vector information, is complemented by the MFH. The features extracted from the MHI and MFH are used to train classifiers for recognizing a set of seven human actions. However, self-occlusion or overlapping of motion on the image plane may result in the loss of a part of the motion information.
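The reliability check can be sketched as follows (a hypothetical helper; the threshold value and window length must be tuned, as [23, 24] do for their MPEG motion vectors):

    import numpy as np

    def mfh_component(v_history, threshold):
        # v_history: [v(t-w), ..., v(t-1), v(t)], one component
        # (horizontal or vertical) of the motion vector at a macro-block.
        # If the squared deviation of the current vector from the median
        # of its w predecessors exceeds the threshold, the median
        # replaces it, suppressing unreliable vectors.
        med = np.median(v_history[:-1])
        current = v_history[-1]
        return current if (current - med) ** 2 < threshold else med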
Yilmaz and Shah [164] propose two modifications of the MHI representation. In their representation, motion regions are represented by contours rather than by the entire silhouettes. Contours in multiple frames are not compressed into one image but are directly represented as a spatio-temporal volume (STV) by computing the correspondences of contour points across frames.
3.3 Extensions of the MHI method to view-invariant methods
All the above developments are based on the 2D MHI and hence are not view-invariant. Several 3D extensions of the basic MHI method are proposed for view-invariant 3D motion recognition [20, 121, 130, 151]. Also, approaches by Davis [46, 49] have looked at the problem of combining MHIs from multiple views (e.g., eight different views [46]) to perform view-invariant recognition. The motion history volume (MHV) is introduced in 3D in place of the 2D MHI. For feature extraction, 3D moments are employed [90] as an extension of the 2D Hu invariants [68]. Shin et al. [130] present a novel method for real-time gesture recognition with a 3D motion history model (MHM). Utilizing this 3D-MHM with disparity information, not only is the camera-view problem solved, but the reliability of recognition and the scalability of the system are also improved. Apart from the view-invariance issue, Shin et al. [130] also propose a dynamic history buffering (DHB) scheme to address the gesture duration problem that arises from the variation of gesture velocity across performances. The DHB mitigates the problem by using the magnitude of motion. In their work, the system using the 3D-MHM achieves better recognition results than one using only 2D motion information. Another view-invariant 3D recognition method (similar to [130]), called the volume motion template (VMT), is proposed [121]. It extracts silhouette images using background subtraction and disparity maps, and then computes the object volume in 3D space to construct a VMT. With 10 gestures, it achieves good recognition results.
Weinland et al. [150, 151] develop a 3D extension of the MHI method, called the motion history volume (MHV), based on the visual hull for viewpoint-independent action recognition. The proposed transition from 2D to 3D is straightforward: pixels are replaced with voxels, and the standard image differencing function $D(x, y, t)$ is substituted with the space occupancy function $D(x, y, z, t)$, which is estimated using silhouettes and thus corresponds to a visual hull. Voxel values in the MHV at time $t$ are defined as:

$$\mathrm{MHV}_\tau(x,y,z,t) = \begin{cases} \tau & \text{if } D(x,y,z,t) = 1 \\ \max\left(0,\ \mathrm{MHV}_\tau(x,y,z,t-1) - 1\right) & \text{otherwise} \end{cases}$$
They automatically segment action sequences into primitive actions, each of which can be represented by a single MHV. They then cluster the resulting MHVs into a hierarchy of action classes, which allows recognizing multiple occurrences of repeating actions [150]. The MHV demonstrates that temporal segmentation is a much easier process in 3D than in 2D, so the temporal scale and parameters can be set automatically. The MHV is used for both supervised and unsupervised learning of action primitives, and it offers an interesting alternative for action recognition with multiple cameras. However, the additional computational complexity introduced by calibration, synchronization of multiple cameras and parallel background subtraction is not discussed [19] in these works.
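Since the MHV update mirrors the 2D MHI update voxel by voxel, it can be sketched in a few lines (the occupancy grid and the value of tau here are illustrative assumptions):

    import numpy as np

    def update_mhv(mhv, occupancy, tau=30.0):
        # mhv: float array over the (x, y, z) voxel grid; occupancy: the
        # binary space-occupancy function D(x, y, z, t) estimated from
        # silhouettes (the visual hull). Occupied voxels are stamped
        # with tau; all others decay by one step, as in the 2D MHI.
        return np.where(occupancy == 1, tau, np.maximum(0.0, mhv - 1.0))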
Similar to the MHV [151], Canton-Ferrer et al. [35, 36] propose another 3D version of the MHI by adding information on the position of the human body limbs, employing multiple calibrated cameras. An ellipsoid body model is fit to the incoming 3D data to capture the body part in which the gesture occurs. In their temporal analysis module, they first compute a motion energy volume (MEV) from the binary data set, which indicates the region of motion. This measure captures the 3D locations where there has been motion in the last few frames; the selection of this temporal window parameter is a crucial factor in defining the temporal extent of a gesture. To represent the temporal evolution of the motion, they also define a motion history volume (MHV), where the intensity of each voxel is a function of the temporal history of the motion at that 3D location. They exploit 3D invariant statistical moments [90] for shape analysis and classification. This method is implemented by Petras et al. [116], who develop a flexible test bed for unusual behavior detection.
Albu et al. [19, 20] present a new 3D motion representation, called the volumetric motion history image (VMHI), to be used for the analysis of irregularities in human actions. Such irregularities may occur either in speed or in orientation, and are strong indicators of the balance abilities and of the confidence level of the subject performing the activity. The VMHI can be computed by

$$\mathrm{VMHI}(x,y,k) = \begin{cases} S(x,y,k)\ \triangle\ S(x,y,k+1) & \text{if } S(x,y,k) \neq S(x,y,k+1) \\ 1 & \text{otherwise} \end{cases}$$

where $S(x, y, k)$ is the one-pixel-thick contour of the binary silhouette in frame $k$ and $\triangle$ stands for the symmetric difference operator. The VMHI attempts to overcome the limitations of the basic MHI related to motion self-occlusion, speed variability and variable-length motion sequences. This 3D representation differs from the other 3D variants of the MHI in that it concentrates on analyzing motion rather than recognizing it, and it does not need to evaluate the temporal duration of the MHI.
4 Application realms of MHI-based methods
The MHI method and the concept of the MHI/MEI represen-
tations are widely employed and analyzed by various com-
puter vision communities. Though most of these methods are
presented in the above sections, the objective of this section
is to elaborate their applications and categorize them. We cat-
egorize numerous applications (which are based on the MHI
method and its variants) into three broad groups: (1) gesture
or action recognition; (2) motion analysis; and (3) interactive
systems. Table 1 summarizes these applications.
4.1 The MHI for action or gesture recognition
To begin with, the MHI method is used to recognize different actions in [31, 47]. Later on, this method is used for the recognition of human movements and for moving object tracking by various groups (e.g., Refs. [8, 10, 18, 23, 24, 33-35, 48, 49, 51, 63, 73, 82, 86, 89, 96-99, 108, 115, 121-124, 129, 130, 132, 139, 143, 150, 151, 159]). Xiang and Gong [159] introduce a similar concept for recognizing indoor shopping activities and outdoor aircraft cargo activities. Leman et al. [86] develop a PDA-based recognition system based on the MHI method. Rosales [122] performs experiments using instances of eight human actions and achieves satisfactory recognition results. Singh et al. [132] also use the MHI and the MEI, along with a new MCI template, for a real-time recognition system.
Table 1 Various applications employing the MHI and its variants

A. Recognition
[Ref.] (Year) | Employed databases (DB)/Applications | Results (RR)/Features (F)/Classifier (C)/Comments
[122] (1998) | DB: 8 actions by 7 subjects | RR: good; F: Hu moments, PCA, Gaussian & Gaussian mixture; C: KNN
[31] (2001) | DB: 18 aerobic exercises by 1 instructor, taken several times | RR: good; F: Hu moments; C: Mahalanobis distance
[23] (2003), [24] (2004) | DB: 7 actions by 5 subjects with 10 repetitions | RR: 98%; F: compressed-domain motion; C: KNN, neural network, SVM & Bayes classifier
[82] (2004) | DB: 5 hand gestures by 5 subjects with 10 repetitions | RR: 96%; F: Hu moments; C: back-propagation-based multilayer perceptron ANN
[46] (2004) | DB: walk, run, stand by 3 subjects, 8 different viewpoints, from a thermal camera | RR: 77%; F: sequential reliable-inference likelihood, Hu moments; C: Bayesian Information Criterion
[129] (2004) | DB: 7 hand gestures recognized by a robot in real-time | RR: high; F: Hu moments, tracking by mean-shift embedded PF; C: Mahalanobis distance
[130] (2005) | DB: 4 gestures (walking, sitting, arm-up, bowing) from a calibrated stereo camera | RR: 90%; F: 3D global gradient orientations; likelihood by least-squares method; duration issue is considered
[86] (2005) | PDA-based recognition system | Employs a PDA; limited scope
[132] (2006) | DB: 11 gestures for robot behavior in real-time | RR: 90%; F: motion color cue with MHI, MEI
[121] (2006) | DB: 10 gestures from 7 viewpoints with 10 repetitions | RR: 90%; 3D method using the VMT
[108] (2006) | DB: 6 actions by 9 subjects with 3 repetitions | RR: 79.9%; eigenspace & reference points calculated for all actions; recognition by mapping onto the eigenspace
[35, 36] (2006) | DB: 8 actions from 5 calibrated wide-lens cameras in a SmartRoom, multiple people | RR: 98%; F: ellipsoid body model, 3D moments, PCA; C: Bayesian classifier
[151] (2006), [150] (2006) | DB: INRIA IXMAS action dataset [120]: 11 actions by 10 subjects (5 males, 5 females), with 3 repetitions, from 5 cameras | RR: 93.3%; F: Fourier transform in cylindrical coordinates; C: Mahalanobis distance, Linear Discriminant Analysis (LDA); visual hull, MHV, 3D approach
[39] (2006) | DB: TRECVID05, TRECVID03: 6 types of actions/shots from video (total 100 shots) | RR: 63%; F: layered Gaussian mixture model; C: PCA, SVM; EMHI
[124] (2006) | DB: 10 different gestures from several subjects | RR: 90%; indoor environment, stereo camera for a robot
[18] (2006) | DB: (i) Marcel's dynamic hand gesture database: 15 video sequences with 4 gestures (click, no, stop-grasp-ok, rotate); (ii) 4 actions (jump, squat, limp, walk) by 20 subjects | RR: (i) 92% on some action pairs, (ii) 90% on pairs of the 4 actions; F: discriminant vectors on the MHI; C: Fisher discriminant analysis
[63, 170] (2006), [64] (2003) | DB: USF HumanID gait DB [126] | RR: overall 71% (rank 5); F: frequency & phase estimation; C: PCA + MDA (multiple discriminant analysis)
[38] (2006) | DB: 9 activities by 9 subjects [62] (one action less than [62]) | RR: 93.8%; F: AEI from a GMM background model; C: PCA
[147] (2006) | DB: 10 activities by 9 subjects [62] | RR: 100%; F: AME, mean motion shape (MSS); C: KNN, NN, NN with class exemplar (ENN), sum of absolute differences, Mahalanobis distance
[139] (2007) | DB: 6 actions by 9 subjects from 4 cameras [108] | RR: poor, due to the presence of motion overwriting in some actions; the MHI method is used
[96] (2007) | DB: 6 actions by 25 subjects [127] | RR: 80.3%; F: motion geometric distribution from MHI + HMHH; C: SVM-light
[92] (2007) | DB: USF HumanID gait DB [126] | RR: overall 66% (rank 5); F: gait period estimation, gait moment image, moment deviation image + GEI; C: nearest neighbor
[10] (2007), [8] (2008), [17] (2010) | DB: (i) 5 actions by 9 subjects, indoor, from multiple cameras; (ii) 10 aerobics by 8 subjects, indoor; (iii) 10 actions by 8 subjects, outdoor | RR: (i) 93%, (ii) 94%, (iii) 90%; F: optical-flow-based DMHI + DMEI/MEI, Hu moments; C: KNN
[161] (2008) | DB: USF HumanID gait DB [126] | RR: slightly better than GEI [63]; F: Gabor phase spectrum of the GEI; C: low-dimensional manifold
[143] (2008) | DB: (i) 14 gestures by 5 subjects with 5 repetitions; (ii) 7 video sequences by 6 subjects; (iii) INRIA IXMAS action dataset [120] | RR: (i) 92%, (ii) 85.5%, (iii) 87%; F: Fourier-based MHI on ballistic segments; C: dynamic time warping, dynamic programming
[26, 25] (2008) | DB: CASIA gait database, all [167] | RR: overall 90.5%; F: GEI with a feature selection mask; C: adaptive component and discriminant analysis
[73] (2008) | DB: (i) 15 gestures by 5 subjects [78]; (ii) 10 actions by 9 subjects [62] | RR: (i) 95%, (ii) 97.8%; F: LBP histograms from MHI + MEI; C: HMM
[15, 16] (2008), [14] (2010) | DB: (i) full-body gesture DB: 14 actions by 20 subjects, ages ranging 60-80 years [59]; (ii) 6 actions by 25 subjects [127] | RR: (i) 89.4%, (ii) 87.5%; F: Hu and Zernike moments, global geometric shape descriptors; C: multi-class SVM
[110] (2008) | DB: (i) Virtual Human Action Silhouette DB [54]: 20 different actions by 9 actors; (ii) INRIA IXMAS action dataset [120] | RR: (i) 98.5%, (ii) 77.27%; F: Kohonen self-organizing feature map; C: maximum likelihood (ML)
[148] (2008) | Sequences of pitching in baseball games | RR: 100% (90x90-pixel images), 96.7% (25x25-pixel images); F: higher-order local auto-correlation (HLAC) features from the MHI; C: PCA, dynamic programming
[41] (2009) | DB: (i) CMU MoBo gait database: 25 subjects; (ii) CASIA gait database (DB B): 124 subjects (93 males, 31 females) [167] | RR: (i) 82% (better than [89, 63, 92]), (ii) 93.9%; F: frieze and wavelet features from the dominant energy image; C: HMM
[166] (2010) | DB: (i) 7 actions (from 9 subjects [62] (3 actions less than [62]) plus 10 additional subjects of their own); (ii) CASIA gait database (DB B) [167] | RR: (i) 98.5% at normal resolution for action recognition, (ii) 96.4%; F: multi-resolution structure on the motion energy histogram (HRMEH), quad-tree decomposition; C: histogram-matching algorithm

B. Motion analysis
[Ref.] (Year) | Applications/Scenes | Employed approaches/Comments
[51] (1998) | Discriminating different movements | Histograms of motion orientations of the MHI; Mahalanobis distance
[123] (1999) | Dynamic outdoor activities from a single camera | Trajectory-guided tracking using an extended Kalman filter based on the MHI
[48] (1999) | Action analysis and understanding in real-time | Motion gradients and the MHI
[115] (2001) | Tracking the first 3 sequences of PETS dataset-1 | MHI tracker, Gaussian weighting, Kalman filter, Mahalanobis distance
[125] (2004) | Motion tracking for a moving robot in real-time | Camera motion compensation, Kalman filtering
[71] (2004) | Threat assessment for surveillance in car parking | Tracking, NN-based; results not promising
[58] (2004) | Line fitter | Fails on rotational motion
[128] (2005) | Tracking with the CAMSHIFT algorithm | A neural network is employed
[133] (2005) | Detection and localization of the road area in traffic video | Fuzzy-shadowed sets are used
[159] (2006) | Behavior understanding in indoor and outdoor scenes | EM, BIC, DPN, dynamically multi-linked HMM
[165] (2006) | Moving object localization from thermal imagery | RANSAC
[87] (2006) | Moving object tracking | MHI; the overall approach is weak
[52] (2007) | Adaptive camera models for video surveillance using a PTZ camera, outdoors in various places | Needs more features (e.g., texture, color) for improvement
[20] (2007), [19] (2007) | Analysis of sway- and speed-related abnormalities of human actions; 5 different actions | Solves motion self-occlusion and action-length variability
[116] (2007), [137] (2007) | Real-time detector for unusual behavior; 4 major partners (ACV, BILKENT, UPC and SZTAKI) achieved this task | Employs the 3D-MHI of [35], a web-server-based real-life detection module, tracking, outdoors
[6] (2009) | Temporal motion segmentation and action analysis, both indoors and outdoors | Needs implementation on an intelligent robot; management of outliers in flow vectors is required

C. Interactive systems
[Ref.] (Year) | Applications/Scenes | Comments
[50] (1998) | Virtual aerobics trainer | Watches and responds to the user as he/she performs the workout
[48] (1999) | Interactive art demonstration | Maps different colors to the various timestamps within the MHI for fun
[32] (1999) | The KidsRoom: an environment for kids to play | An interactive, narrative play space for children, with virtual monsters
[141] (2004), [142] (2004), [112] (2005) | 21 or fewer facial Action Unit classes; MMI face DB; Cohn-Kanade face DB | Poor recognition rate; higher pre-processing load
[107] (2006) | Interactive art demonstration | In a complex environment
[162, 163] (2006) | Speech recognition: 3 consonants | SWT; very limited results
For recognizing a set of seven human actions, Refs. [23, 24] upgrade the MHI. Davis [49, 51] and Bradski and Davis [33, 34, 48] improve the MHI in different ways for recognizing various motions and gestures. In one of Davis's recent works [46], a rapid-and-reliable action (run, walk and stand) recognition approach is proposed using the MHI method.

To recognize complex activities and to solve the motion overwriting problem of the MHI, Ahad et al. [8, 10] develop the DMHI method for recognizing various aerobics and other actions. Similarly, to solve the overwriting problem, Meng et al. in their sequence of works [96-100] propose the HMHH approach to recognize various actions using SVM. In another solution to the overwriting problem of the MHI, Kellokumpu et al. [73] extract spatially enhanced local binary pattern (LBP) histograms from the MHI and the MEI to classify various actions; this approach demonstrates robustness against irregularities in the data and partial occlusion. Vitaladevuni et al. [143] enhance the performance of the MHI feature for recognizing actions through ballistic dynamics, temporally segmenting videos into their atomic movements. Using an eigenspace, Ogata et al. [108] employ different MHIs and the SMI for recognizing several actions.
Kumar et al. [82] develop a system for hand gesture classification by employing the MHI. In another hand gesture recognition approach, Shan et al. [129] employ MHIs by considering various trajectories of the motion in real-time. Alahari and Jawahar [18] model action characteristics by MHIs for hand gesture recognition and four different actions. Ryu et al. [124] demonstrate a gesture recognition system employing MHIs and MEIs, whereas Refs. [35, 121, 150, 151] improve the MHI for view-invariant motion recognition. Similarly, another view-invariant approach is developed for real-time human gesture recognition by [130]. Gait recognition and person identification are targeted by various researchers by modifying the MHI and MEI concepts [26, 38, 41, 63, 64, 89, 92, 147, 161, 166, 170].

We can note that the HMHH method [96-100] attempts to solve the motion overwriting problem of the MHI, but it employs a database that is mainly one-dimensional, so the overwriting issue there is insignificant. Therefore, to judge its ability to solve the motion overwriting problem, different databases having complex actions [11] should be challenged. Similarly, most of the methods employ their own databases for recognition. Apart from the various approaches to motion representation (whether the MHI, MEI, GHI, GEI, DMHI, MMHI, HMHH, SMI, tMHI, HMHI, MFH, PCH, MHV, VMT or MHM), the development strategies for the feature vectors and the subsequent classification methods also vary from method to method. Even some comparative analyses (e.g., in [11, 96, 100, 166]) do not follow the same strategies as the methods they compare against. For example, Ahad et al. [11] compare the MHI, HMHH and MMHI methods with the DMHI method using seven Hu invariants [68] as feature vectors and KNN for classification, even though the MHI method [31] uses the Mahalanobis distance and the HMHH method [96-99] uses SVM. Therefore, not only the choice of databases but also the selection of the classification and feature analysis approaches is imperative when evaluating different methods.
4.2 The MHI for motion analysis
Apart from human action and activity recognition, the MHI method is also employed for various motion detection and localization tasks, for automatic video surveillance and for other purposes. Automatically localizing and tracking a moving person or vehicle for an automatic visual surveillance system is demonstrated in [87, 165]. Li et al. [87] employ the MHI method before exploiting an extended mean-shift approach. Yin and Collins [165] combine a forward MHI and a backward MHI to obtain a contour shape for the moving object in the current frame; however, this is not a complete tracking system but rather a localization approach based on the fading trail of the MHI. Rosales and Sclaroff [123] use the MHI for tracking several outdoor activities. In another approach, a multi-modal adaptive tracking system is developed in which the MHI is calculated to find the moving part [115]. Jan [71] develops a surveillance system for threat assessment in a car park; it employs MHIs to display any erratic pattern of a suspicious person in a restricted parking place. Using the MHI, a video surveillance system with a PTZ camera is developed for automatically learning locations of high activity [52]. Jin et al. [72] temporally segment a human body and measure its motion by employing the MHI. Son et al. [133] calculate the MHI and then combine it with a background model to detect the candidate road image.
Albu et al. [19, 20] develop the MHI method for the analysis of irregularities in human actions, which may occur either in speed or in orientation, as discussed in Sect. 3.3. Petras et al. [116] devise a flexible test-bed for unusual behavior detection and automatic event analysis using the MHI. In the human-model- and motion-based unusual event detection (UPC) module, the concept of the MHI/MEI is introduced to realize a simple motion representation [137]. It extends this formulation to represent view-independent 3D motion (following the concept of Canton-Ferrer et al. [35]): a simple ellipsoid body model is fitted to the incoming 3D data to capture the body part where the gesture occurs, which improves the recognition ratio and generates a more informative classification. In another approach [125], an AIBO robot detects motion by calculating MHIs before tracking it using a Kalman filter.
Based on four directional motion history templates, a complex motion segmentation scheme is proposed [6]. This temporal motion segmentation (TMS) method can split a complex motion into left-right-up-down combinations, so that a smart intelligent system can understand the semantics of actions promptly in real-time. The MHI is also used to produce input images for a line fitter, a system for fitting lines to a video sequence that describe its motion [58]; in this approach, MHI templates are utilized to summarize the motion in video clips.
4.3 The MHI for interactive systems
Various interactive systems have been successfully constructed using the motion history template as a primary sensing mechanism. For example, using the MHI method, Davis et al. [50] develop a virtual aerobics trainer that watches and responds to the user as he/she performs the workout. An interactive and narrative play space for children, called the KidsRoom [32], is also developed successfully using the MHI method. This is a perceptually based environment in which children can interact with monsters while playing in a story-telling scenario.
Fig. 11 Walking away from the camera (first two columns) and towards the camera (last two columns); the bottom row shows the corresponding energy images

Nguyen et al. [107] introduce the concept of a motion swarm, a swarm of particles that moves in response to the field representing an MHI. A structure imposed on the behavior of the swarm forces a response to MHIs. To create interactive art, the art responds to the motion swarm and not to the motion directly. Since they desire to have the swarm particles respond in a natural and predictable manner, they smooth the MHI in space by convolving it with a Gaussian kernel. Since the brightest pixels in the MHI are where motion has most recently occurred, the particles tend to follow the motion. Following this strategy, they create interactive art that can be enjoyed by groups, such as audiences at public events. Another interactive art demonstration is constructed from the motion templates by [48]. Yau et al. [162, 163] develop a method for visual speech recognition by employing MHIs, in which the video data of the speaker's mouth are represented by MHIs. Valstar et al. [141, 142] and Pantic et al. [112] focus on the automatic detection of the facial action units (AUs) that compose facial expressions.
Table 1 presents the three application areas that have been discussed in this section. The year of publication of the referred papers is shown in parentheses after the references. All of these methods employ the MHI or its variants; nevertheless, the databases, features and classification approaches vary considerably.
5 Discussions
We present an overview of the MHI method, its various applications and its important modifications in this paper (we extensively searched related works based on the MHI method, and we cover all the key variants and applications). We find that the MHI and MEI representations and their variants can be employed for action representation, understanding and recognition in various applications. This section discusses some issues that remain unsolved. Moreover, will the MHI stand the test of time? The answer to this question is discussed here by illustrating its key features and their future implications for the computer vision community.

In a sub-section above, we present a few methods that are developed to solve the view-invariance problem of the MHI. Though these methods have shown good performance on some datasets, they depend on multiple-camera systems and consequently carry extra computational overhead. Besides, like the basic MHI method, most of the other 2D variants suffer from the same view-invariance problem. By using multiple cameras, recognition improves, but at the same time, for some actions, one camera mirrors another action from another angle. These issues are not easy to solve, and more exploration is required. Incorporation of image depth analysis will be one cue for tackling this issue.
We also find that these 2D methods (e.g., MHI, DMHI, MMHI) cannot perform well or face difficulties in some activities. For example, when more than one person is in the scene, these methods cannot recognize the actions properly, especially when the people are moving in different directions. Nor can they recognize whether a person is walking along the camera's optical axis or moving in a roughly diagonal direction. Figure 11 shows the case for two actions: (i) a person walking away from the camera, almost in line with the optical axis; and (ii) a person walking towards the camera from far away, again almost in line with the optical axis [9]. It is evident from the energy images that the system cannot reliably separate these two actions, so this issue should be addressed with some semantics or depth analysis. The energy of the moving regions can be analyzed intermittently, and this information may be exploited to resolve the problem.
Another important pair of activities is running and walking across the optical axis. Recognizing or distinguishing walking from running motion for video surveillance is very difficult with the present manifestation of the MHI. Similar to the MHI method, the other variants produce almost identical motion templates for both walking and running, and hence demonstrate poor recognition results. Though the AEI is presented in [38] with the claim that walking and running motion can easily be recognized, their action datasets are limited. One intuitive way to obtain better features for separating walking and running motion is to employ the DMHI method with a higher value of the decay parameter, so that ripples appear at the top of the template (notice the more evident ripples at the top of the white patches in the H/E_x column of Fig. 12 [9]), and to use these features for recognition.

Fig. 12 Motion and its corresponding DMEI (top row) and DMHI (bottom row) images: a for walking; and b for running. The H/E_x images in the top row of b show more ripple-shaped information than those of the walking motion in a
When multiple moving persons/objects are present in the scene, these approaches cannot solve the problem of multiple object identification [37]; image depth analysis can help solve this problem. Researchers may also think about camera movement and its effect. Usually, camera motion compensation is difficult, and the combined effect of camera movement and the employment of the MHI remains unsolved, though Davis et al. [52] apply the MHI with a PTZ camera.
Another important issue is whether the MHI and MEI representations are still required when there are several other approaches in different directions. In the last decade, spatio-temporal interest points (STIP), histograms of oriented gradients (HOG) [45], histograms of oriented flow (HOF) [44] and a few other methods have become prominent for action representation and recognition apart from the MHI-based approaches. But among these approaches, the MHI (and its variants) has attracted notable attention in the computer vision arena, according to our analyses. We know that interest point detection in static images is a well-studied topic in computer vision. Laptev and Lindeberg [83] pioneered a spatio-temporal extension, building on the Harris-Laplace detector. Several spatio-temporal interest/feature point (STIP) detectors have recently been exploited in video analysis for action recognition. Feature points are detected using a number of measures, namely entropy-based saliency [75, 79, 109, 152], global texture [156], cornerness [83, 84], periodicity [53, 79] and volumetric optical flow [77]. These are mainly based on intensity [53], texture, color and motion information [119]. In the spatio-temporal domain, however, it is unclear which features indicate useful interest points [79]. Most of the STIP detectors are computationally expensive (compared to the straightforward computation of the MHI) and are therefore restricted to the processing of short or low-resolution videos (e.g., [53, 83, 84, 109]). Detection of a reduced number of features is a prerequisite to keeping the computational cost under control [152]. Furthermore, in some cases, all input videos need to be preprocessed [156].
Though these approaches are proven in recognizing various actions, they carry more theoretical and computational complexity than the MHI method. The MHI method is very simple and computationally inexpensive. Moreover, it covers every motion detail, and the segmented motion regions are employed for various applications. We notice that the MHI and MHI-based approaches are employed, exploited and adapted for a good number of applications in various domains and dimensions (see above). Therefore, we strongly feel that the MHI method is still useful and that the limitations that remain unsolved can be managed in the future. Moreover, the MHI method in its basic form is very easy to understand and implement; this is a key beneficial feature of the MHI. From the MHI and MEI images, using Hu moments or other shape representation approaches, we can easily obtain the feature vectors for recognition. However, the MHI is a global approach; hence, motions from objects that are not the target of interest will degrade the performance (STIP-based methods are better in this context). The MHI is a representation of choice for action recognition when temporal segmentation is available, when actors are fully visible and can be separated from each other, and when they do not move along the z-axis of the camera. In other cases, other representations are probably needed, including bag-of-features (BOF) methods based on STIP, HOF and HOG, which have been shown to overcome those limitations.
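As an illustration of this simplicity, a basic MHI update and its Hu moment feature vector can be sketched in a few lines (the update rule is the standard temporal-template formulation; the log-scaling of the moments is a common normalization we assume here, not a step prescribed by [31]):

    import cv2
    import numpy as np

    def update_mhi(mhi, motion_mask, tau=255.0, delta=32.0):
        # Standard temporal-template update: moving pixels are stamped
        # with tau, all other pixels decay by delta down to zero.
        return np.where(motion_mask == 1, tau,
                        np.maximum(0.0, mhi - delta)).astype(np.float32)

    def mhi_features(mhi):
        # Seven Hu moment invariants of the MHI, a typical feature vector
        # for template matching with KNN or Mahalanobis distance.
        hu = cv2.HuMoments(cv2.moments(mhi)).flatten()
        return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)  # log scaling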
The STIP-based approaches and the HOG/HOF-based developments can be incorporated along with the MHI/MEI representations in future research. Integration of multiple cues (e.g., motion, shape, edge information (e.g., [39]), color or texture), or a fusion of information, will produce better results [134]. The presence of multiple moving subjects, moving cameras, view-invariance issues, image depth analysis and, above all, a better and more robust image segmentation technique (for producing the update function in outdoor and cluttered environments) are the major challenges ahead for the MHI method. We feel that the above discussion will open some doors for further research to improve the method for real-life applications.
6 Conclusions
Human motion analysis is a challenging problem due to large variations in human motion and appearance, camera viewpoint and environment settings [118]. The field of action and activity representation and recognition is relatively old, yet not well understood [104]. Some important but common motion recognition problems remain improperly solved by the computer vision community. However, in the last decade, a number of good approaches have been proposed and subsequently evaluated by many researchers. Among those methods, one method has received significant attention from many researchers in the computer vision field. Therefore, though there are various approaches to motion analysis and recognition, this paper analyzes the MHI method. It is one of the key methods, and a number of variants have been developed from this concept. The MHI is simple to understand and implement; hence many researchers employ this method or its variants for various action/gesture recognition and motion analysis tasks, with different datasets.

We present a tutorial that covers the important issues for this representation and method, and we mention several key limitations. In this work, we categorize and present various implementations of the MHI and its developments. This paper also discusses several issues to be solved in the future. The motion self-occlusion problem of the MHI has been addressed and solved with a satisfactory recognition rate. Though 3D approaches are proposed as view-invariant methods on top of the 2D MHI, these are computationally expensive. Nevertheless, several essential concerns of the MHI, related to self-occlusion due to motion, motion overlapping or multiple repetitions, significant occlusion from multiple moving persons, and object motion towards the optical axis of the camera, should be investigated rigorously in the future so that this simple approach can be extended to various real-life applications with better performance. We hope that this paper will be beneficial to various researchers (and especially inspiring to new researchers) in understanding the MHI method, its variants and applications.
Acknowledgments The authors are grateful to the anonymous
reviewers for their excellent reviews and constructive comments that
helped to improve the manuscript. The work is supported by the Japan
Society for the Promotion of Science (JSPS), Japan.
References
1. Aggarwal, J., Cai, Q.: Human motion analysis: a review. In: Proc. Nonrigid and Articulated Motion Workshop, pp. 90-102 (1997)
2. Aggarwal, J.K., Cai, Q.: Human motion analysis: a review. Comput. Vis. Image Underst. 73, 428-440 (1999)
3. Aggarwal, J.K., Park, S.: Human motion: modeling and recognition of actions and interactions. In: Proc. Int. Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT'04), p. 8 (2004)
4. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: Lower-dimensional feature sets for template-based motion recognition approaches. J. Comput. Sci. 6(8), 920-927 (2010)
5. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: A simple approach for low-resolution activity recognition. Int. J. Comput. Vis. Biomech. 3(1) (2010)
6. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: Temporal motion recognition and segmentation approach. Int. J. Imaging Syst. Technol. 19, 91-99 (2009)
7. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: Human activity recognition: various paradigms. In: Proc. Int. Conf. on Control, Automation and Systems, pp. 1896-1901, October 2008
8. Ahad, Md.A.R., Ogata, T., Tan, J.K., Kim, H., Ishikawa, S.: A complex motion recognition technique employing directional motion templates. Int. J. Innov. Comput. Inf. Control 4(8), 1943-1954 (2008)
9. Ahad, Md.A.R., Ogata, T., Tan, J.K., Kim, H., Ishikawa, S.: Moment-based human motion recognition from the representation of DMHI templates. In: SICE Annual Conference, pp. 578-583, August 2008
10. Ahad, Md.A.R., Ogata, T., Tan, J.K., Kim, H., Ishikawa, S.: A smart automated complex motion recognition technique. In: Proc. Workshop on Multi-dimensional and Multi-view Image Processing (with ACCV), pp. 142-149 (2007)
11. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: Analysis of motion self-occlusion problem due to motion overwriting for human activity recognition. J. Multimed. 5(1), 36-46 (2009)
12. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: Action recognition with various speeds and timed-DMHI feature vectors. In: Proc. Int. Conf. on Computer and Info. Tech., pp. 213-218, December 2008
13. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: Human activity analysis: concentrating on motion history image and its variants. In: SICE-ICASE Joint Annual Conf., pp. 5401-5406 (2009)
14. Ahmad, M., Parvin, I., Lee, S.-W.: Silhouette history and energy image information for human movement recognition. J. Multimedia 5(1), 12-21 (2010)
15. Ahmad, M., Lee, S.-W.: Recognizing human actions based on silhouette energy image and global motion description. In: Proc. IEEE Automatic Face and Gesture Recognition, pp. 523-588 (2008)
16. Ahmad, M., Hossain, M.Z.: SEI and SHI representations for human movement recognition. In: Proc. Int. Conf. on Computer and Information Technology (ICCIT), pp. 521-526 (2008)
17. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: Action recognition by employing combined directional motion history and energy images. In: IEEE Computer Society Conf. on Computer Vision and Pattern Recognition Workshop on CVCG, p. 6 (2010)
18. Alahari, K., Jawahar, C.V.: Discriminative actions for recognizing events. In: Indian Conf. on Computer Vision, Graphics and Image Processing (ICVGIP'06), LNCS, vol. 4338, pp. 552-563 (2006)
19. Albu, A.B., Beugeling, T.: A three-dimensional spatiotemporal template for interactive human motion analysis. J. Multimedia 2(4), 45-54 (2007)
20. Albu, A., Trevor, B., Naznin, V., Beach, C.: Analysis of irregularities in human actions with volumetric motion history images. In: Proc. IEEE Workshop on Motion and Video Computing, Texas, USA, p. 16, February 2007
21. Anderson, C., Bert, P., Wal, G.V.: Change detection and tracking using pyramids transformation techniques. In: Proc. SPIE-Intelligent Robots and Computer Vision, vol. 579, pp. 72-78 (1985)
22. Arseneau, S., Cooperstock, J.R.: Real-time image segmentation for action recognition. In: Proc. IEEE Pacific Rim Conf. on Communications, Computers and Signal Processing, pp. 86-89 (1999)
23. Babu, R., Ramakrishnan, K.: Compressed domain human motion recognition using motion history information. In: Proc. ICIP, vol. 2, pp. 321-324 (2003)
24. Babu, R., Ramakrishnan, K.: Recognition of human actions using motion history information extracted from the compressed video. Image Vis. Comput. 22, 597-607 (2004)
25. Bashir, K., Xiang, T., Gong, S.: Feature selection for gait recognition without subject cooperation. In: British Machine Vision Conference, p. 10 (2008)
26. Bashir, K., Xiang, T., Gong, S.: Feature selection on gait energy image for human identification. In: IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 985-988 (2008)
27. Beauchemin, S.S., Barron, J.L.: The computation of optical flow. ACM Comput. Surv. 27(3), 433-467 (1995)
28. Bergen, J.R., Burt, P., Hingorani, R., Peleg, S.: A three-frame algorithm for estimating two-component image motion. IEEE Trans. PAMI 14(9), 886-896 (1992)
29. Bimbo, A.D., Nesi, P.: Real-time optical flow estimation. In: Proc. Int. Conf. on Systems Engineering in the Service of Humans, Systems, Man and Cybernetics, vol. 3, pp. 13-19 (1993)
30. Bobick, A., Davis, J.: An appearance-based representation of action. In: Int. Conf. on Pattern Recognition, pp. 307-312 (1996)
31. Bobick, A., Davis, J.: The recognition of human movement using temporal templates. IEEE Trans. PAMI 23(3), 257-267 (2001)
32. Bobick, A., Intille, S., Davis, J., Baird, F., Pinhanez, C., Campbell, L., Ivanov, Y., Schutte, A., Wilson, A.: The KidsRoom: a perceptually-based interactive and immersive story environment. Presence: Teleoperators Virtual Environ. 8(4), 367-391 (1999)
33. Bradski, G., Davis, J.: Motion segmentation and pose recognition with motion history gradients. In: Proc. IEEE Workshop on Applications of Computer Vision, pp. 174-184, December 2000
34. Bradski, G., Davis, J.: Motion segmentation and pose recognition with motion history gradients. Mach. Vis. Appl. 13(3), 174-184 (2002)
35. Canton-Ferrer, C., Casas, J.R., Pardàs, M.: Human model and motion based 3D action recognition in multiple view scenarios. In: Proc. Conf. European Signal Process., Italy, pp. 1-5, September 2006
36. Canton-Ferrer, C., Casas, J.R., Pardàs, M., Sargin, M.E., Tekalp, A.M.: 3D human action recognition in multiple view scenarios. In: Proc. Jornades de Recerca en Automàtica, Visió i Robòtica, Barcelona (Spain), p. 5, 4-6 July 2006
37. Cedras, C., Shah, M.: A survey of motion analysis from moving light displays. In: Proc. IEEE CVPR, pp. 214-221 (1994)
38. Chandrashekhar, V., Venkatesh, K.S.: Action energy images for reliable human action recognition. In: Proc. of Asian Symposium on Information Display (ASID), pp. 484-487 (2006)
39. Chen, D., Yang, J.: Exploiting high dimensional video features using layered Gaussian mixture models. In: Proc. IEEE ICPR, p. 4 (2006)
40. Chen, D., Yan, R., Yang, J.: Activity analysis in privacy-protected video, p. 11 (2007). http://www.informedia.cs.cmu.edu/documents/T-MM_Privacy_J2c.pdf
41. Chen, C., Liang, J., Zhao, H., Hu, H., Tian, J.: Frame difference energy image for gait recognition with incomplete silhouettes. Pattern Recognit. Lett. 30(11), 977-984 (2009)
42. Christmas, W.J.: Spatial filtering requirements for gradient-based optical flow measurement. In: 9th British Machine Vision Conference, pp. 185-194 (1998)
43. Collins, R.T., Lipton, A., Kanade, T., Fujiyoshi, H., Duggins, D., Tsin, Y., Tolliver, D., Enomoto, N., Hasegawa, O., Burt, P., Wixson, L.: A system for video surveillance and monitoring. VSAM final report, CMU-RI-TR-00-12, Technical Report, Carnegie Mellon University, p. 69 (2000)
44. Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of flow and appearance. In: European Conference on Computer Vision, pp. 428-441 (2006)
45. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Int. Conf. on Computer Vision and Pattern Recognition, pp. 886-893 (2005)
46. Davis, J.: Sequential reliable-inference for rapid detection of human actions. In: Proc. IEEE Workshop on Detection and Recognition of Events in Video, pp. 1-9, July 2004
47. Davis, J.W.: Appearance-based motion recognition of human actions. M.I.T. Media Lab Perceptual Computing Group Tech. Report No. 387, p. 51 (1996)
48. Davis, J., Bradski, G.: Real-time motion template gradients using Intel CVLib. In: Proc. ICCV Workshop on Frame-Rate Vision, pp. 1-20, September 1999
49. Davis, J.: Hierarchical motion history images for recognizing human motion. In: Proc. IEEE Workshop on Detection and Recognition of Events in Video, pp. 39-46 (2001)
50. Davis, J., Bobick, A.: Virtual PAT: a virtual personal aerobics trainer. In: Proc. Perceptual User Interfaces, pp. 13-18, November 1998
51. Davis, J.: Recognizing movement using motion histograms. MIT Media Lab. Perceptual Computing Section Tech. Report No. 487 (1998)
52. Davis, J.W., Morison, A.M., Woods, D.D.: Building adaptive camera models for video surveillance. In: Proc. IEEE Workshop on Applications of Computer Vision (WACV'07), p. 6 (2007)
53. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatiotemporal features. In: Int. Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65-72, October 2005
54. Digital Imaging Research Centre, K.U.L.: Virtual Human Action Silhouette (ViHASi) Database. http://dipersec.king.ac.uk/VIHASI/
55. Efros, A.A., Berg, A.C., Mori, G., Malik, J.: Recognizing action at a distance. In: Proc. ICCV, pp. 726-733 (2003)
56. Elgammal, A., Harwood, D., David, L.S.: Nonparametric background model for background subtraction. In: Proc. European Conference on Computer Vision, p. 17 (2000)
57. Essa, I., Pentland, S.: Facial expression recognition using a dynamic model and motion energy. In: Proc. IEEE CVPR, p. 8, June 1995
123
278 Md. A. R. Ahad et al.
58. Forbes, K.: Summarizing motion in video sequences, pp. 17.
http://thekrf.com/projects/motionsummary/MotionSummary.pdf.
Accessed 9 May 2004
59. Full-body Gesture Database, Korea University. http://gesturedb.
korea.ac.kr/
60. Fischler, M.A., Bolles, R.C.: Random sample consensus: a par-
adigm for model tting with applications to image analysis and
automated cartography. Commun. ACM 24(6), 381395 (1981)
61. Gavrilla, D.: The visual analysis of human movement: a sur-
vey. Comput. Vis. Image Underst. 73, 8298 (1999)
62. Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R.:
Actions as space-time shapes. IEEE Trans. PAMI 29(12),
22472253 (2007)
63. Han, J., Bhanu, B.: Individual recognition using gait energy
image. IEEE Trans. PAMI 28(2), 316322 (2006)
64. Han, J., Bhanu, B.: Gait energy image representation: compara-
tive performance evaluation on USF HumanIDdatabase. In: Proc.
Joint Intl. Workshop VS-PETS, pp. 133140 (2003)
65. Haritaoglu, I., Harwood, D., Davis, L.S.: W
4
: real-time surveil-
lance of people and their activities. IEEE Trans. PAMI 22(8),
809830 (2000)
66. Horn, B., Schunck, B.G.: Determining optical ow. Artif. Intell.
17, 185203 (1981)
67. Hu, W., Tan, T., Wang, L., Maybank, S.: A survey on visual sur-
veillance of object motion and behaviors. IEEE Trans. SMC-Part
C. 34(3), 334352 (2004)
68. Hu, M.K.: Visual pattern recognition by moment invariants. IRE
Trans. Info. Theory 8, 179187 (1962)
69. Jaimes, A., Sebe, N.: Multimodal human-computer interaction: a
survey. Comput. Vis. Image Underst. 108(12), 116134 (2007)
70. Jain, A., Duin, R., Mao, J.: Statistical pattern recognition: a
review. IEEE Trans. PAMI 2(1), 437 (2000)
71. Jan, T.: Neural network based threat assessment for automated
visual surveillance. In: Proc. IEEE Joint Conf. on Neural Net-
works, vol. 2, pp. 13091312, July 2004
72. Jin, T., Leung, M.K.H., Li, L.: Temporal human body segmen-
tation. In: Villanieva, J.J. (ed.) IASTED Int. Conf. Visualization,
Imaging, and Image Processing (VIIP04). Acta Press, Marbella.
ISSN: 1482-7921, 68 September 2004
73. Kellokumpu, V., Zhao, G., Pietikinen, M.: Texture based descrip-
tion of movements for activity analysis. In: Proc. Conf. Com-
puter Vision Theory and Applications (VISAPP08), vol. 2, pp.
368374, Portugal (2008)
74. Kilger, M.: A shadow handler in a video-based real-time trafc
monitoring system. In: Proc. IEEE Workshop on Applications of
Computer Vision, pp. 10601066 (1992)
75. Kadir, T., Brady, M.: Scale, saliency and image description.
IJCV 45(2), 83105 (2001)
76. Kameda, Y., Minoh, M.: A human motion estimation method
using 3-successive video frames. In: Proc. Int. Conf. on Virtual
Systems and Multimedia, p. 6 (1996)
77. Ke, Y., Sukthankar, R., Hebert, M.: Efcient visual event detection
using volumetric features. In: ICCV, vol. 1, pp. 166173 (2005)
78. Kellokumpu, V., Pietikinen, M., Heikkil, J.: Human activity
recognition using sequences of postures. Mach. Vis. Appl., pp.
570573 (2005)
79. Kienzle, W., Scholkopf, B., Wichmann, F.A., Franz, M.O.: Howto
nd interesting locations in video: a spatiotemporal interest point
detector learned from human eye movements. In: 29th DAGM
Symposium, pp. 405414, September 2007
80. Kindratenko, V.: Development and application of image analysis
techniques for identication and classication of microscopic par-
ticles. PhDthesis, University of Antwerp, Belgium(1997). http://
www.ncsa.uiuc.edu/~kindr/phd/index.pdf
81. Khotanzad, A., Hong, Y.H.: Invariant image recognition by
Zernike moments. IEEE Trans. PAMI 12(5), 489497 (1990)
82. Kumar, S., Kumar, D., Sharma, A., McLachlan, N.: Classica-
tion of hand movements using motion templates and geometrical
based moments. In: Proc. Intl Conf. on Intelligent Sensing and
Information Processing, pp. 299304 (2003)
83. Laptev, I., Lindeberg, T.: Space-time interest points. In: ICCV,
vol. 1, p. 432 (2003)
84. Laptev, I.: On space-time interest points. IJCV 64(2), 107123
(2005)
85. LaViola, J.: A survey of hand posture and gesture recog-
nition techniques and technology. Tech. Report CS-99-11,
Brown University, p. 80, June 1999
86. Leman, K., Ankit, G., Tan, T.: PDA-based human motion rec-
ognition system. Int. J. Softw. Eng. Knowl. 2(15), 199205
(2005)
87. Li, L., Zeng, Q., Jiang, Y., Xia, H.: Spatio-temporal motion seg-
mentation and tracking under realistic condition. In: Proc. Intl
Symposium on Systems and Control in Aerospace and Astronau-
tics, pp. 229232 (2003)
88. Lipton, A.J., Fujiyoshi, H., Patil, R.S.: Moving target classica-
tion and tracking from real-time video. In: Proc. IEEE Workshop
on Applications of Computer Vision, pp. 814 (1998)
89. Liu, J., Zhang, N.: Gait history image: a novel temporal template
for gait recognition. In: Proc. IEEE Int. Conf. Multimedia and
Expo, pp. 663666 (2007)
90. Lo, C., Don, H.: 3-D moment forms: their construction and
application to object identication and positioning. IEEE Trans.
PAMI 11(10), 10531063 (1989)
91. Lucas, B., Kanade, T.: An iterative image registration technique
with an application to stereo vision. In: Proc. Int. Joint Conf. on
Articial Intelligence, pp. 674679 (1981)
92. Ma, Q., Wang, S., Nie, D., Qiu, J.: Recognizing humans based on
gait moment image. In: 8th ACIS Intl. Conf. on Software Engi-
neering, Articial Intelligence, Networking, and Parallel/Distrib-
uted Computing, pp. 606610 (2007)
93. Masoud, O., Papanikolopoulos, N.: A method for human action
recognition. Image Vis. Comput. 21, 729743 (2003)
94. McCane, B., Novins, K., Crannitch, D., Galvin, B.: On bench-
marking optical ow. Comput. Vis. Image Underst. 84, 126
143 (2001)
95. McKenna, S.J., Jabri, S., Duric, Z., Wechsler, H., Rosenfeld,
A.: Tracking groups of people. Comput. Vis. Image Underst.
80(1), 4256 (2000)
96. Meng, H., Pears, N., Bailey, C.: A human action recognition sys-
tem for embedded computer vision application. In: Proc. Work-
shoponEmbeddedComputer Vision(withCVPR), pp. 16(2007)
97. Meng, H., Pears, N., Bailey, C.: Human action classication
using SVM_2K classier on motion features. In: LNCS: Mul-
timedia Content Representation, Classication and Security,
vol. 4105/2006, pp. 458465 (2006)
98. Meng, H., Pears, N., Bailey, C.: Motion information combina-
tion for fast human action recognition. In: Proc. Conf. Computer
Vision Theory and Applications (VIASAPP07), Spain, March
2007
99. Meng, H., Pears, N., Bailey, C.: Recognizing human actions based
on motion information and SVM. In: Proc. IEE Int. Conf. Intelli-
gent Environments, pp. 239245 (2006)
100. Meng, H., Pears, N., Freeman, M., Bailey, C.: Motion history his-
tograms for human action recognition. In: Embedded Computer
Vision (Advances in Pattern Recognition), part II, pp. 139162.
Springer, London (2009)
101. Mittal, A., Paragois, N.: Motion-based background subtraction
using adaptive kernel density estimation. In: Proc. IEEE CVPR,
p. 8 (2004)
102. Moeslund, T.B.: Summaries of 107 computer vision-based human
motion capture papers. Tech. Report: LIA 99-01, University of
Aalborg, p. 83, March 1999
123
Motion history image: its variants and applications 279
103. Moeslund, T.B., Granum, E.: A survey of computer vision-based
human motion capture. Comput. Vis. Image Underst. 81, 231
268 (2001)
104. Moeslund, T.B., Hilton, A., Kruger, V.: A survey of advances in
vision-based human motion capture and analysis. Comput. Vis.
Image Underst. 104, 90126 (2006)
105. Ng, J., Gong, S.: Learning pixel-wise signal energy for under-
standing semantics. In: Proc. BMVC, pp. 695704 (2001)
106. Ng, J., Gong, S.: Learning pixel-wise signal energy for under-
standing semantics. Image Vis. Comput. 21, 11831189 (2003)
107. Nguyen, Q., Novakowski, S., Boyd, J.E., Jacob, C., Hushlak, G.:
Motion swarms: video interaction for art in complex environ-
ments. In: Proc. ACM Int. Conf. Multimedia, CA, pp. 461469
(2006)
108. Ogata, T., Tan, J.K., Ishikawa, S.: High-speed human motion rec-
ognition based on a motion history image and an Eigenspace.
IEICE Trans. Inf. Syst. E89-D(1), 281289 (2006)
109. Oikonomopoulos, A., Patras, I., Pantic, M.: Spatiotemporal salient
points for visual recognition of human actions. IEEE Trans. Syst.
Man Cybern. B: Cybern. 36(3), 710719 (2006)
110. Orrite, C., Martnez, F., Herrero, E., Ragheb, H., Velastin, S.:
Independent viewpoint silhouette-based human action modelling
and recognition. In: Proc. Int. Workshop on Machine Learning
for Vision-based Motion Analysis (MLVMA08) with ECCV,
pp. 112 (2008)
111. Pantic, M., Pentland, A., Nijholt, A., Hunag, T.S.: Human com-
puting and machine understanding of human behavior: a sur-
vey. In: Proc. Int. Conf. on Multimodal Interfaces, pp. 239248
(2006)
112. Pantic, M., Patras, I., Valstar, M.F.: Learning spatio-temporal
models of facial expressions. In: Proc. Int. Conf. on Measuring
Behaviour, pp. 710, September 2005
113. Papenberg, N., Bruhn, A., Brox, T., Didas, S., Weickert, J.: Highly
accurate optic ow computation with theoretically justied warp-
ing. Int. J. Comput. Vis. 67(2), 141158 (2006)
114. Pavlovic, V., Sharma, R., Huang, T.: Visual interpretation of hand
gestures for human-computer interaction: a review. IEEE Trans.
PAMI 19(7), 677695 (1997)
115. Piater, J., Crowley, J.: Multi-modal tracking of interacting tar-
gets using Gaussian approximations. In: Proc. IEEEWorkshop on
Performance Evaluation of Tracking and Surveillance at CVPR,
pp. 141147 (2001)
116. Petrs, I., Beleznai, C., Dedeo glu, Y., Pards, M., et al.: Flexi-
ble test-bed for unusual behavior detection. In: Proc. ACM Conf.
Image and Video Retrieval, pp. 105108 (2007)
117. Polana, R., Nelson, R.: Low level recognition of human motion. In: Proc. IEEE Workshop on Motion of Non-rigid and Articulated Objects, pp. 77–82 (1994)
118. Poppe, R.: Vision-based human motion analysis: an overview. Comput. Vis. Image Underst. 108(1–2), 4–18 (2007)
119. Rapantzikos, K., Avrithis, Y., Kollias, S.: Dense saliency-based spatiotemporal feature points for action recognition. In: Intl. Conf. on Computer Vision and Pattern Recognition, pp. 1–8 (2009)
120. Rhône-Alpes, I.: The Inria XMAS (IXMAS) motion acquisition sequences. http://charibdis.inrialpes.fr
121. Roh, M.-C., Shin, H.-K., Lee, S.-W., Lee, S.-W.: Volume motion template for view-invariant gesture recognition. In: Proc. ICPR, vol. 2, pp. 1229–1232 (2006)
122. Rosales, R.: Recognition of human action using moment-based features. Boston University Computer Science Tech. Report, BU 98-020, pp. 1–19, November 1998
123. Rosales, R., Sclaroff, S.: 3D trajectory recovery for tracking multiple objects and trajectory guided recognition of actions. In: Proc. CVPR, vol. 2, pp. 117–123 (1999)
124. Ryu, W., Kim, D., Lee, H.-S., Sung, J., Kim, D.: Gesture recognition using temporal templates. In: Proc. ICPR, Demo Program, Hong Kong, August 2006
125. Ruiz-del-Solar, J., Vallejos, P.A.: Motion detection and tracking for an AIBO robot using camera motion compensation and Kalman filtering. In: Proc. RoboCup Int. Symposium 2004, Lisbon, LNCS, vol. 3276, pp. 619–627 (2005)
126. Sarkar, S., Phillips, P.J., Liu, Z., Vega, I.R., Grother, P., Bowyer, K.W.: The HumanID gait challenge problem: data sets, performance, and analysis. IEEE Trans. PAMI 27(2), 162–177 (2005)
127. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: Proc. ICPR, vol. 3, pp. 32–36 (2004)
128. Senior, A., Tosunoglu, S.: Hybrid machine vision control. In: Florida Conf. on Recent Advances in Robotics, pp. 1–6, May 2005
129. Shan, C., Wei, Y., Qiu, X., Tan, T.: Gesture recognition using temporal template based trajectories. In: Proc. ICPR, vol. 3, pp. 954–957 (2004)
130. Shin, H.-K., Lee, S.-W., Lee, S.-W.: Real-time gesture recognition using 3D motion history model. In: Proc. Conf. on Intelligent Computing, Part I, LNCS, vol. 3644, pp. 888–898, China, August 2005
131. Sigal, L., Black, M.J.: HumanEva: synchronized video and motion capture dataset for evaluation of articulated human motion. Department of Computer Science, Brown University, Tech. Report CS-06-08, p. 18, September 2006
132. Singh, R., Seth, B., Desai, U.: A real-time framework for vision based human robot interaction. In: Proc. IEEE/RSJ Conf. on Intelligent Robots and Systems, pp. 5831–5836 (2006)
133. Son, D., Dinh, T., Nam, V., Hanh, T., Lam, H.: Detection and localization of road area in traffic video sequences using motion information and fuzzy-shadowed sets. In: Proc. IEEE Intl Symp. Multimedia, pp. 725–732, December 2005
134. Spengler, M., Schiele, B.: Towards robust multi-cue integration for visual tracking. Mach. Vis. Appl. 14, 50–58 (2003)
135. Stauffer, C., Grimson, W.: Adaptive background mixture models for real-time tracking. In: Proc. IEEE CVPR, vol. 2, pp. 246–252 (1999)
136. Sun, H.Z., Feng, T., Tan, T.N.: Robust extraction of moving objects from image sequences. In: Proc. Asian Conference on Computer Vision, pp. 961–964 (2000)
137. Sziranyi, T., with partners UPC, SZTAKI, Bilkent and ACV: Real-time detector for unusual behavior. http://www.muscle-noe.org/content/view/147/64/
138. Talukder, A., Goldberg, S., Matthies, L., Ansar, A.: Real-time detection of moving objects in a dynamic scene from moving robotic vehicles. In: Proc. IEEE/RSJ Intl Conference on Intelligent Robots and Systems, pp. 1308–1313 (2003)
139. Tan, J.K., Ishikawa, S.: High accuracy and real-time recognition of human activities. In: 33rd Annual Conf. of IEEE Industrial Electronics Society (IECON), pp. 2377–2382 (2007)
140. Vafadar, M., Behrad, A.: Human hand gesture recognition using motion orientation histogram for interaction of handicapped persons with computer. In: Elmoataz, A., et al. (eds.) ICISP 2008, LNCS, vol. 5099, pp. 378–385 (2008)
141. Valstar, M., Pantic, M., Patras, I.: Motion history for facial action detection in video. In: Proc. IEEE Int. Conf. SMC, vol. 1, pp. 635–640 (2004)
142. Valstar, M., Patras, I., Pantic, M.: Facial action recognition using temporal templates. In: Proc. IEEE Workshop on Robot and Human Interactive Communication, pp. 253–258 (2004)
143. Vitaladevuni, S.N., Kellokumpu, V., Davis, L.S.: Action recognition using ballistic dynamics. In: Proc. CVPR, p. 8 (2008)
144. Wang, L., Hu, W., Tan, T.: Recent developments in human motion analysis. Pattern Recognit. 36, 585–601 (2003)
145. Wang, J.J.L., Singh, S.: Video analysis of human dynamics: a survey. Real-Time Imaging 9(5), 321–346 (2003)
146. Wang, C., Brandstein, M.S.: A hybrid real-time face tracking system. In: Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, p. 4 (1998)
147. Wang, L., Suter, D.: Informative shape representations for human action recognition. Intl. Conf. Pattern Recognit. 2, 1266–1269 (2006)
148. Watanabe, K., Kurita, T.: Motion recognition by higher order local auto correlation features of motion history images. In: Proc. Bio-inspired, Learning and Intelligent Systems for Security, pp. 51–55 (2008)
149. Wei, J., Harle, N.: Use of temporal redundancy of motion vectors for the increase of optical flow calculation speed as a contribution to real-time robot vision. In: Proc. IEEE TENCON – Speech and Image Technologies for Computing and Telecommunications, pp. 677–680 (1997)
150. Weinland, D., Ronfard, R., Boyer, E.: Automatic discovery of action taxonomies from multiple views. In: Proc. CVPR, pp. 1639–1645 (2006)
151. Weinland, D., Ronfard, R., Boyer, E.: Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst. 104(2), 249–257 (2006)
152. Willems, G., Tuytelaars, T., Gool, L.V.: An efficient dense and scale-invariant spatio-temporal interest point detector. In: 10th European Conference on Computer Vision, pp. 650–663 (2008)
153. Wixson, L.: Detecting salient motion by accumulating directionally-consistent flow. IEEE Trans. PAMI 22(8), 774–780 (2000)
154. Wong, S.F., Cipolla, R.: Continuous gesture recognition using a sparse Bayesian classifier. In: Intl. Conf. on Pattern Recognition, vol. 1, pp. 1084–1087 (2006)
155. Wong, S.F., Cipolla, R.: Real-time adaptive hand motion recognition using a sparse Bayesian classifier. In: Intl. Conf. on Computer Vision Workshop, pp. 170–179 (2005)
156. Wong, S.F., Cipolla, R.: Extracting spatiotemporal interest points using global information. In: ICCV, pp. 1–8 (2007)
157. Wren, C.R., Clarkson, B.P., Pentland, A.P.: Understanding purposeful human motion. In: Proc. Intl Conf. on Automatic Face and Gesture Recognition, pp. 19–25 (1999)
158. Wren, C.R., Azarbayejani, A., Darrell, T., Pentland, A.P.: Pfinder: real-time tracking of the human body. IEEE Trans. PAMI 19(7), 780–785 (1997)
159. Xiang, T., Gong, S.: Beyond tracking: modelling activity and understanding behaviour. Int. J. Comput. Vis. 67(1), 21–51 (2006)
160. Yang, Y.H., Levine, M.D.: The background primal sketch: an approach for tracking moving objects. Mach. Vis. Appl. 5, 17–34 (1992)
161. Yang, X., Zhang, T., Zhou, Y., Yang, J.: Gabor phase embedding of gait energy image for identity recognition. In: 8th IEEE Intl. Conf. on Computer and Information Technology, pp. 361–366, July 2008
162. Yau, W., Kumar, D., Arjunan, S., Kumar, S.: Visual speech recognition using image moments and multiresolution wavelet. In: Proc. Conf. on Computer Graphics, Imaging and Visualization, pp. 194–199 (2006)
163. Yau, W., Kumar, D., Arjunan, S.: Voiceless speech recognition using dynamic visual speech features. In: Proc. HCSNet Workshop on the Use of Vision in HCI, Australia (2006)
164. Yilmaz, A., Shah, M.: Actions sketch: a novel action representation. In: IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 984–989 (2005)
165. Yin, Z., Collins, R.: Moving object localization in thermal imagery by forward-backward MHI. In: Proc. IEEE Workshop on Object Tracking and Classification in and Beyond the Visible Spectrum, NY, pp. 133–140, June 2006
166. Yu, C.-C., Cheng, H.-Y., Cheng, C.-H., Fan, K.-C.: Efficient human action and gait analysis using multiresolution motion energy histogram. EURASIP J. Adv. Signal Process. 2010, 1–13 (2010)
167. Yu, S., Tan, D., Tan, T.: A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. In: Intl. Conf. on Pattern Recognition, pp. 441–444 (2006)
168. Zhang, D., Lu, G.: Review of shape representation and description techniques. Pattern Recognit. 37, 1–19 (2004)
169. Zhou, H., Hu, H.: A survey: human movement tracking and stroke rehabilitation. Tech. Report: CSM-420, Department of Computer Sciences, University of Essex, p. 33, December 2004
170. Zou, X., Bhanu, B.: Human activity classification based on gait energy image and co-evolutionary genetic programming. In: Proc. ICPR, vol. 3, pp. 555–559 (2006)
Author Biographies
Md. Atiqur Rahman Ahad was born in Bangladesh and obtained his B.Sc. (Hons) and Master's degrees from the Department of Applied Physics, Electronics and Communication Engineering, University of Dhaka, Bangladesh. He later received a Master's degree from the School of Computer Science and Engineering, University of New South Wales, Australia. He obtained his Ph.D. degree from the Faculty of Engineering, Kyushu Institute of Technology, Japan. Since 2000 he has taught at several universities, and he has been with the University of Dhaka, Bangladesh, since 2001 (currently on leave). He also served as a Casual Academic at the University of New South Wales during three sessions from 2002 to 2004. He is currently a JSPS Postdoctoral Research Fellow at Kyushu Institute of Technology, Japan. Mr. Ahad is a student member of IEEE, IEEE IES and the Society of Instrument and Control Engineers (SICE). He won the Best Student Paper Award at the International Workshop on Combinatorial Image Analysis (IWCIA), Buffalo, NY, in April 2008. He has also been awarded the Biomedical Fuzzy Systems Association's Best Paper Award (Journal) in 2008. His present research includes human motion recognition and analysis, motion segmentation, and motion tracking.
Joo Kooi Tan obtained her Ph.D. from Kyushu Institute of Technology in 2000. She is presently an assistant professor with the Faculty of Mechanical and Control Engineering at the same university. Her current main research interests include three-dimensional shape and motion recovery, human motion analysis, human activity recognition and understanding, and applications of computer vision. She received the SICE Kyushu Branch Young Authors Award in 1999, the AROB 10th Young Authors Award in 2004, the Young Authors Award from IPSJ of Kyushu Branch in 2004, the Japanese Journal Best Paper Award from BMFSA in 2008, and the Best Paper Award from ISII in 2009, and she has also won the Excellent Paper Award from the Biomedical Fuzzy System Association in 2010. She is a member of IEEE, The Society of Instrument and Control Engineers, and The Information Processing Society of Japan.
Hyoungseop Kim received his B.A. degree in electrical engineering from Kyushu Institute of Technology in 1994, and the Master's and Ph.D. degrees from Kyushu Institute of Technology in 1996 and 2001, respectively. He is an associate professor in the Department of Control Engineering at Kyushu Institute of Technology. His research interests are focused on medical applications of image analysis. He is currently working on automatic segmentation of multiple organs in abdominal CT images, and on temporal subtraction of thoracic MDCT image sets.
Seiji Ishikawa obtained his B.E., M.E., and D.E. degrees from The University of Tokyo, where he majored in Mathematical Engineering and Instrumentation Physics. He joined Kyushu Institute of Technology and is currently Professor in the Department of Control & Mechanical Engineering, KIT. Professor Ishikawa was a visiting research fellow at Sheffield University, U.K., from 1983 to 1984, and a visiting professor at Utrecht University, The Netherlands, in 1996. He was awarded the BMFSA Best Paper Award in 2008 and 2010. His research interests include three-dimensional shape/motion recovery, and human detection and motion analysis from car videos. He is a member of IEEE, The Society of Instrument and Control Engineers, The Institute of Electronics, Information and Communication Engineers, and The Institute of Image Electronics Engineers of Japan.