Meanshif Delphi

Adaptation of the Mean Shift Tracking Algorithm to
Monochrome Vision Systems for Pedestrian Tracking

Based on HoG-Features
2014-01-0170
Published 04/01/2014
Daniel Schugk and Anton Kummert

Univ. of Wuppertal
Christian Nunn
Delphi Automotive
CITATION: Schugk, D., Kummert, A., and Nunn, C., "Adaptation of the Mean Shift Tracking Algorithm to Monochrome
Vision Systems for Pedestrian Tracking Based on HoG-Features," SAE Technical Paper 2014-01-0170, 2014,
doi:10.4271/2014-01-0170.
Copyright 2014 SAE International
Abstract
The mean shift tracking algorithm has become a standard in
the field of visual object tracking, caused by its real time
capability and robustness to object changes in pose, size, or
illumination. The standard mean shift tracking approach is an
iterative procedure that is based on kernel weighted color
histograms for object modelling and the Bhattacharyyan
coefficient as a similarity measure between target and
candidate histogram model. The benefits of the approach could
not been transferred to monochrome vision systems yet,
because the loss of information from color to grey-scale
histogram object models is too high and the system
performance drops seriously. We propose a new framework
that solves this problem by using histograms of HoG-features
as object model and the SOAMST approach by Ning et al. for
track estimation. Mean shift tracking requires a histogram for
object modelling. In the proposed framework a set of high
dimensional HoG-features is clustered via K-means and
features inside the object area are matched to the clustercenters via a nearest neighbor search. This procedure is
comparable to a Bag of Words algorithm. The proposed system
is evaluated for advanced driver assistance systems and it is
shown that the framework can be used as a reliable visual
tracking system for a pedestrian recognition module.
Introduction
Many modern advanced driver assistance systems rely on low
cost monochrome vision systems, for analyzing the vehicle
surrounding. Especially for Pedestrian warning or autobreaking systems, vision sensors deliver highly required
information that make these additional functionalities possible.
Therefore a pedestrian object needs to be recognized and
followed over several consecutive image-frames. The so called
tracking is necessary for the calculation of several attributes of
the object like distance, velocity and acceleration. Tracking in

general is the calculation and estimation of an object path.
Therefore visual tracking algorithms can determine the motion
of an object between two video frames and so associate
detections in two frames to the same object.
In the field of computer vision, visual tracking still is an
important and relevant research area [1]. Different methods like
KLT [2], Mean Shift (MS) [3] or Template Matching [4] belong to
this algorithm class and have become standard algorithms.
The main challenges for visual tracking algorithms are realtime capability, accuracy and robustness to changes in the
object appearance, due to motion or illumination. Especially
the mean shift algorithm, also commonly known as kernel
based tracking, fulfils these criteria adequately.
MS was first introduced by Fukunaga [5] for data analysis in
the field of pattern recognition, as gradient ascent local mode
search algorithm for probability density distributions. Comaniciu
evolved the MS algorithm as a kernel based visual object
tracking method in 1999 [6], [7], using color histograms of
image regions for target and candidate modelling and the
Bhattacharyyan coefficient, a similarity measurement for
probability density distributions. This approach is limited to
spatial tracking and the object scale adaptation is estimated by
a brute force search, with different kernel bandwidths. Later
different approaches were introduced to handle the scale
estimation problem, like mean shift based blob tracking [8] or
multi kernel tracker [9], which divide the target into several
blocks, track each block separately and have the advantage of
additional spatial information of the separated object blocks.
In the field of face tracking, CAMSHIFT [10], [11] and EM-shift
[12] have been introduced since 1998, to estimate the scale
and orientation changes adaptively, by calculating moment
functions on weight values, which are estimated during the MSprocedure. Ning et al. have transferred this calculation to
SOAMST [13] as a pedestrian tracking module.
The performance of all MS algorithms is strongly depending on
the modelling of the target and the separability against the
background. Different approaches try to handle these
problems, like CBWH-MS by Ning et al. [14], who corrected the
BWH-MS approach of Comaniciu et al. [3] by reducing the
influence of prominent background features in the target model
calculation.
Object modelling is commonly done with color histograms, so
they have become a standard in combination with the MS
tracking algorithms and achieved reliable and robust tracking
results. For grey-scale image sensors, like they are used in
state of the art driver assistance systems, the object modelling
based on pixel intensity histograms is not discriminant enough
to reach a comparable performance [9].
Many different pedestrian detection and classification systems
are already based on HoG-features (histogram of oriented
Gradient), a standard feature in this research area [15], [16],
[17]. Introducing the HoG-features for object modelling inside
the MS algorithms requires a method to build up object model
histograms. We propose to use a Bag of Words (BoW)
approach that clusters similar HoG-features and summarizes
the object features inside a histogram based on a nearest
neighbor matching to the cluster centers. The BoW-histogram
of HoG-features fits perfectly to the requirements of the MS
algorithms.
The proposed algorithm adapts state of the art MS-algorithms,
described in the following part, to a monochrome image sensor
system for pedestrian tracking. Like in SOAMST [13], the target
is modelled with a weighted kernel function, but prominent
background features are suppressed additionally like in CBWH
[14]. In the next part the Introduction of HoG features to the
modelling process inside MS is described. The system will be
evaluated on selected video sequences of a driver assistance
system in the Experimental Results part. The paper is closed
by a Conclusion and a Future Work description.
Mean Shift Tracking

MS tracking is commonly used for color vision systems. The
basic algorithm is illustrated in the block-diagram in Figure 1.
At its first appearance the object is detected with a detector
module and defined by a rectangular bounding box. Based on
the color information of the n features f inside the object area,
the target object is modelled as a weighted histogram. The
weight of a feature is obtained by a kernel profile function k,
anti-proportional to the distance to the object center.
Figure 1. Generalized block-diagram of standard mean shift tracking

algorithm. The MS-tracking-algorithm requires a histogram object
model. Standardly color intensities are used as input for the histogram
modelling, but also more powerful features can be used. The feature
extraction block can also be combined with the histogram model
blocks, to reduce the number of features that are calculated.
The object model histogram is equal to the probability

distribution of the object features weighted by a kernel function.
The target object model q is defined as
(1)
with
wherein qu is the kernel weighted probability of the u-th

feature-bin of the m sized histogram, x is the object center, xi is
the position of the feature fi in pixel coordinates and r is the
kernel bandwidth describing the object size. The function b
matches the feature fi to a histogram bin u and c is a
normalization constant defined as
The candidate model is built up the same way, at position yi

with candidate size correlated bandwidth ry and the features xi
inside the candidate region. It is defined as
(2)
The key step for the tracking algorithm is the computation of a

displacement vector y based on the mean shift procedure to
obtain the object movement from the initial candidate position
to the target object center in the new frame:
(3)
(4)
The kernel function g(x) is defined as the negative derivative of

the kernel profile function g(x) = k(x). Comaniciu et al. [7]
derivate equation (4) based on the Bhattacharyya coefficient
(5)
a similarity measurement between the target and candidate

model.
MST is an iterative process over the equations (2), (3), (4) by
updating the candidate position every step with yj+1 = yj + y
until convergence or a constant number of iterations is
reached. The initial candidate model is calculated at y0 =x.
Scale and Orientation Estimation (Ning et al.)

Up to now only the new object position is estimated, without
any scale adaptation. The first algorithm containing an adaptive
scale and orientation estimation was introduced by Bradski [10]
as the face-tracking algorithm CAMSHIFT. Bradski made the
assumption, that the target will fit inside the candidate area, if
its size is enlarged before the MST. This assumption is valid for
most moving objects, if the frame rate of the camera system is
high enough, so that object shrinking or enlarging between
consecutive frames is a gradual process and scale changes
can be assumed to be smooth. The problem now is to estimate
the real object area and orientation of the candidate object.
Here CAMSHIFT uses the pixel value probability distribution
p(I(x, y)) of the image or the target model q inside
mathematical moment functions, which can estimate the shape
of an ellipsoidal point cloud.
Ning et al. [13] examined the weight values from eq. (3) with
different object sizes, compare Figure 2, and pointed out that
the mean shift weight values (MSWV) depend on the candidate
scale. Furthermore MSVW converge to 1, if the estimated
object size fits to the current object appearance. In contrast to
MSVW, the values of p(I(x, y)) are constant and independent
from candidate object size. Ning et al. [13] use this adaptive
character of the MSWV inside the moment calculation.
Figure 2. (a) Image of the target object. (b) - (d) Images of the
candidate bounding box (dashed lines) with the corresponding MSWV.
For scale estimation the object area is enlarged before the object
movement is tracked. This introduces background features to the
candidate region and results in lower weight values, what can be
clearly seen in the images above. If the candidate region fits to the real
object size the weights converge to the value 1. This observation leads
to the assumption, that the object scaling can be estimated out of the
MSWV [13].
By enlarging the object bounding box, background features are

introduced to the candidate model automatically. The
probability of the target model features is reduced, which leads
to a higher MSWV and lowers the influence of background
features in the area estimation of eq. (6) that is based on the
0th-moment of eq. (7), which is defined as the sum of all
MSVW of the features inside the bounding box.
(6)
(7)
The Bhatacharryan coefficient of eq. (5) is an indicator for the

quality of the area estimation by the 0th-moment and can be
used to adjust this estimate of eq. (6), wherein is an adaptive
variable, which should be chosen relatively to the object size.
To estimate the height, width and orientation of the object, first
(M01, M10) and second (M02, M11, M20) order moments need to
be calculated
(8)
together with the centroid of the distribution of MSWVs
(9)
Furthermore the second order central moments
(10)
can be combined to a covariance matrix
(11)
which eigenvalues 1 and 2 are used to estimate the object

parameters width and height and are calculated by the singular
value decomposition,
tracking. Comaniciu et al. [3] tried to reduce the influence of

prominent background features, by estimating a backgroundhistogram-model and introducing background-weights vu to
gain the background-weighted-histograms (BWH), by
multiplying target and candidate model with vu. Ning et al. [14]
proved that the application to both models has no effect on the
mean shift tracking algorithm and corrected the calculation by
applying the background weights just to the target model
(CBWH).
The background model v is calculated on the n-features around
the target object in an area that is three times bigger than the
object, without any kernel weighting
(17)
(12)
together with the eigenvectors (u11, u21)T and (u12, u22)T that
represent the two orientation vectors of the targets main axes
in the candidate region. Due to the usage of a kernel function,
the rectangular object-area is reduced to an ellipsoidal area,
which is a necessary condition for shape calculation based on
moment functions. The ellipsoid is defined by the two main
axes z1 and z2, which can be approximated by 1 and 2 and
lead to a scale factor = 1z1 = 2z2. Together with the
definition of the ellipse area a = z1z2 it can be derived that
(13)
so that the CBWH-target model results in
(18)
The relationship between the standard mean shift feature

weights wi in equation (3) and the CBWH feature weights wi
can be calculated as
(19)
(14)
After calculating the width and height of the estimated

candidate area, the covariance matrix needs to be updated
(15)
to be reused in the next frame, since it contains the orientation

estimation of the current candidate object.
The features inside the candidate region can be calculated
using the ellipsoid equation
(16)
Background Adaptation
The modelling of the target and candidate will contain
background features, since the detected bounding box may not
be positioned at the real object center, neither fit the exact
object shape or the candidate region is enlarged before
Figure 3. (a) Standard mean shift target model. (b) Background

histogram model. (c) Background weights calculated by eq. (18).
(d) CBWH histogram model. Comparing (a) and (b) it is obvious that
the target cannot be separated from the background easily based on
the pixel intensity model. After the background adaptation of CBWH
the model becomes more discriminative, since the influence of
prominent background features (bins 30 - 60) is reduced.
Introducing Histograms of Oriented

Gradients to Mean Shift
The state of the art mean shift tracking algorithms, like
SOAMST, show good performances and reliable results, when
they are used with color sensor systems. As already
mentioned, the object modelling is not discriminative enough
for monochrome vision systems and stronger features are
needed. Therefore we introduce the HoG-features, developed
by Dalal and Triggs [15], to the MST algorithm, which have the
advantage of describing an image patch instead of a single
pixel, by summarizing the gradient information of the image
patch in a histogram. Furthermore HoG-features are less noise
sensitive and have a higher independency to illumination
changes due to a low orientation-quantization
To calculate HoG-features, the image derivatives in x and y
direction are computed pixel-wise by the standard filters gx =
[1,0, 1] and gy = [1,0, 1]T. Based on gx and gy pixel-gradients
can be calculated and the so gained pixel-orientation is
quantized into 9 directions. A HoG-feature is defined by its
position (fx, fy) and size (sx, sy), describing an image-patch with
size of sx sy pixels, wherein the orientation of all pixels is
summarized into a histogram.
For MST highly discriminative features are needed, which we
obtain by sub-dividing the image patch into four rectangular
regions and calculating another four oriented histograms, like it
is visualized in Figure 4. Therefore, the minimum patch size
results in 4 4. Together with a 15-dim Haarlike feature vector
[18], calculated on the intensity values, the HoG-feature results
in a highly discriminant 60 dimensional feature f.
Figure 4. HoG-features are calculated on the oriented gradients inside

an image patch (red). All pixel gradients are quantized into nine
directions and summarized in a histogram and normalized, the dot
stands for no direction, which is the case, if the magnitude of the
gradient is below a threshold. To generate more discriminative
features, the patch is subdivided into four regions (green). Again the
gradients are summarized in histograms and normalized. Together with
a 15 - dimensional Haarlike feature, the HoG-feature results in a 60
dimensional vector.
With these high dimensional features an easy histogram

quantization modelling is not suitable anymore and other
methods need to be found. As a first step we propose to cluster
the features, to achieve a quantization similar to the histogram
modelling in color feature space. We use K-means clustering
so that every cluster center corresponds to one histogram bin
in the model representation. As a second step every feature of
the object area is matched to one cluster by a nearest neighbor
matching algorithm and weighted with a kernel function and the
modelling process is finished. This procedure is comparable to
the bag of words idea, wherein features are clustered and then
summarized inside a histogram, based on their cluster
matching, to obtain a visual word. In our case, all features
inside the surrounding area of the pedestrian are used for
clustering, so a more general, but still discriminative and
individual clustering is reached. The combination of all cluster
centers is often referenced as dictionary. We propose to use
this dictionary with a nearest neighbor search algorithm inside
the mean shift tracking algorithm as substitution of the feature
matching function b in eq. (1) and (2). The so gained target
model is comparable to a visual word. Calculating a dictionary
for every single object can be very inefficient and may corrupt
the real-time capability of the algorithm. To overcome this
problem it is possible to train a global dictionary based on a
feature set extracted from a training database with positive and
negative pedestrian examples. In this case it is also possible to
evaluate the optimal number of cluster centers used for the
modelling in an offline stage.
The proposed algorithm uses a constant grid of HoG-features
with a distance of df = 2 pixel. Because of the change in feature
sampling, the scale estimation needs to be adapted to the
feature distance. In the previous scale estimation the pixel
features, used in the moment calculation, were sampled on the
same dense grid as the image coordinate system used for
position and scale estimation. So the object area relative to the
coordinate system could be calculated, since every pixel is a
feature, a dense grid of feature weights can be calculated. If
the feature grid is subsampled, not every coordinate can be
matched to a feature weight directly. The moment calculation is
based on a dense feature space, and a missing feature-weight
would equal a zero weighting of the feature position. This
problem can be solved by interpolating the feature weights to
the coordinate-system. The interpolation can easily be done by
a nearest neighbor matching, which can be approximated by
multiplying the relative feature position with the feature
distance. The new moments result in:
(20)
(21)
(22)
The orientation calculation is not influenced by different feature

sampling and since we are not expecting the object to rotate
inside the image-plane, it is neglected anyway.
90px). The other sequences only contain one walking PED on

the right walkway each, while the vehicle is approaching. The
background differs from very dark (Seq. 3) to daylight (Seq. 4)
or contains structured background (Seq. 5), see Figure 6(b).
Furthermore, the density of the feature grid influences the

accuracy of the position and scale estimation and the
calculation time. Obviously less features result in a reduced
calculation time, but also reduce the accuracy of the system.
Scale Estimation and Background Adaptation

Obviously the benefits of a combined SOAMST-CBWH
approach could be scale adaptive and robust position
estimations. However the results of the MST evaluation in the
next section show that the benefit of the background
adaptation can be neglected in most cases. Furthermore, the
manipulated target model influences the scale estimation
based on the moment features, if the target model is mainly
built up on a group of features that are also prominent in the
background. The CBWH is used to suppress the influence of
these background features during the position estimation, by
reducing the proportion of these features inside the target
model. This results in a smaller MSWV, comparing the
equations (3) and (19).
(a). Sequence 1, frame 6
The scale-estimation of SOAMST relies on the area estimation

based on the sum of the MSWV inside the object area. If the
weights are reduced, not based on the relation of the target
and the candidate model, but by the background, the
assumption of a relation between the real object area and the
estimated area fails and the estimation results in a smaller
candidate size.
A solution of this problem can be, to use the different target
models for the different tasks, they have been designed for.
CBHW target model for position estimation and the standard
MST target model for scale estimation.
Experimental Results
The proposed MST system is evaluated in an advanced driver
assistance system for pedestrian tracking. Therefore a set of 5
videos is recorded, containing typical pedestrian situations
during daytime, like cross-walking pedestrians, approaching
pedestrians, groups of pedestrians and overlapping
pedestrians. Summed up 9 pedestrians (PED) are tracked. For
a comparison to the real object, the position and scale of all
pedestrians are hand-labeled in every 5-th frame, and
interpolated in between. The labeled data is used as a groundtruth. The initialization and modelling of the pedestrian-object is
done in the first frame of its appearance.
Sequence (Seq.) 1 contains three PEDs, see Figure 5(a). The
vehicle is standing in front of a cross walk, one big PED (height
340 px) is cross-walking in the foreground, one PED in a group
is approaching to the vehicle (100 - 200px) and the last one is
standing in the background (40 px). In Seq. 2 the vehicle is
driving and again, three PEDs are walking on the right
walkway, see Figure 6(a), one large (70 - 130px) and a group
of two, which are initialized with a height of just 30px (30 -
(b). Sequence 1, frame 60

Figure 5. Tracking results of sequence 1. SOAMST using the grey
intensity value target model loses all objects almost directly. (a) Shows
the results up to frame 6. (b) Shows the superior performance of the
HoG-feature modelling up to frame 60 in combination with the
SOAMST position and scale estimation.
The proposed algorithm is evaluated in two different

implementations as SOAMST based on Hog-features with and
without background adaptation using the CBWH target model.
Furthermore the results are compared to the state of the art
MST algorithms, SOAMST [13] and CBHW [14] and
additionally the standard MST [7], adapted to monochrome
vision systems. A target is defined lost, if the quotient between
the estimated scale and the ground truth scale is greater than
1.5 or smaller than 0.5, or the relative position error is greater
than 0.5 times the ground truth width.
Due to the random character of the K-means-clustering in the
modelling process, the HoG-feature based results are not static
for every initialization of the same pedestrian. Therefore the
system is initialized multiple times and the results are
averaged, see Table 1.
Figure 5 shows the results of all tracking algorithms on Seq. 1.

Like in Table 1, it points out clearly that the tracking algorithms
based on intensity values cannot track the targets properly.
The SOAMST (yellow) loses the cross-walking directly after 3
frames and the other two PEDs due to completely wrong scale
estimation after a few frames. The MST (red) and the
CBWHMST (magenta) can track the targets longer, but dont
estimate the scale-change.
(a). Sequence 2, frame 18
With a moving vehicle, the performance of all intensity based

algorithms drops significantly and a tracking over several
frames is not possible in most sequences. Only the CBWHalgorithm can track the object if the background is not
structured. These sequences show the advantage of the
background adaptation, see Table 1.
The HoG-features can improve this tracking a lot. SOAMSTHoG tracks 7 of the 9 PEDs almost over the complete distance.
The remaining 2 tracks fail, caused by wrong scale estimation.
In the sequences 2-5 all PEDs approach towards the vehicle.
This causes a higher position error, due to the initialization of
the target model in the first frame and the missing model
update strategy, as can be seen in Figure 6.
(b). Sequence 5, frame 32

Figure 6. Tracking results of the sequences 2 and 5. (a) Even when the
position error becomes larger for some consecutive frames, the
HoG-feature based tracking can recover the object (big PED). This is
caused by the object modelling based on one single frame and the
pose of the pedestrian in this frame. If the PED returns to this pose, the
model matching can be done more accurate. (b) The proposed
algorithm is also able to handle vehicle pitching
Slightly better position estimations can be achieved using the

CBWH-target-model inside the HoG algorithm. The difference
of the CBWH and SOAMST becomes clear looking at the scale
estimation error. SOAMST often results in a slightly larger
scale estimate, especially when the object does not change its
size, compare Seq. 1 PED 1. The average scale estimation of
CBWH instead is always too small, due to the reduced
influence of feature, which prominent in the background.
The presented results are based on a single target initialization
and do not contain any model update strategies. This results in
two different effects. On the one side, approaching PED will get
lost if the scale change gets to large compared to the initial
Table 1. Statistical results of the mean shift algorithms for the PED sequences. Evaluation criteria are: Track Length(TL), Position Error(PE)
and Scale Error (SE). The ground truth track length is given in column GT.
size, due to the fact that the HoG-features are calculated with a
fixed size and for example, during target modelling, one feature
contains all gradients describing a head (a full circle), which
later may be described by a handful features (circle parts) that
are matched to different cluster centers and histogram bins in
the candidate model. The other effect is a positive one, due to
the effect that the PED is initialized in one special pose
(walking with spread legs) that will return after some frames,
the PED may be recovered with high accuracy, if it returns to
this initial pose, compare Figure 6(a).
Conclusions
In this paper we present a method that uses high dimensional
HoG-features for target modelling inside a mean shift tracking
algorithm, so the MST is adapted to monochrome vision
systems, commonly used in driver assistance systems. The
HoG based object modelling is needed, because the reduced
information content of the grey-scale image pixels compared to
colored is not discriminant enough to guarantee a robust and
accurate tracking. The HoG-features need to be preprocessed
to fit to the required histogram object models. Therefore we
propose to use a Bag of Words algorithm that clusters all
extracted features with K-means and summarizes the object
features based on a nearest neighbor cluster matching into a
histogram.
The system is evaluated on typical pedestrian appearances in
various situations on monochrome sequences, in two different
implementations of the adapted SOAMST by Ning et al. [13]
with (HoG-CBWH) and without (HoG-SOAMST) a background
adaptation. It is shown that the proposed MST based on
HoG-features is superior to the standard methods adapted to
greyscale histogram modelling. The position estimation based
on the HoG-SOAMST points out to be the most reliable and
robust algorithm and is able to track 7 of the 9 evaluated
pedestrian appearances nearly over the full length. The
position estimation still contains a relative small error
compared to the hand labelled ground truth, which cannot be
reduced significantly by the use of the background adaptive
CBWH target model. Since the scale estimation is designed for
the SOAMST, it is the most reliable way to use it.
The HoG-features can be used without a considerable increase
of processing power, since they are already computed and
needed for pedestrian detection. Only the clustering process in
the modelling stage is costly in sense of calculation time. Using
a global dictionary, which is trained offline on a large dataset of
pedestrians and their surroundings, will solve this problem.
Future Work
The results we reached so far are satisfying by the length of
the tracks and their robustness in different situations, but still
can be improved in terms of accuracy in position and scale. A
model update strategy can be introduced since the results are
just based on a single target initialization, to generate longer
tracks and handle the problem of losing approaching
pedestrians with a small initialization size. Therefore reliable
confidence estimation is necessary, using the Bhattacharyya
coefficient as a similarity measure between the final candidate

model and the target model, may be an option. Additionally the
current model is just based on the pixel orientation and leaves
the pixel intensities out. A combination of both features may
improve the estimation. Furthermore the position of the
features inside the object is neglected by the use of histograms
as model-structures. This loss of information may be covered
by an offline trained voting scheme that defines an offset vector
for every feature towards the object center, which can be used
to recalculate the kernel weights or as an additional feature
dimension.
References
1. Yilmaz A., Javed O. and Shah M., Object tracking:
A survey, ACM Comput. Surv., vol. 38, no. 4, p. 13+,
December 2006, doi:10.1145/1177352.1177355.
2. Lucas B. D. and Kanade T., An iterative image registration
technique with an application to stereo vision, in
Proceedings of the 7th international joint conference on
Artificial intelligence - Volume 2, San Francisco, CA, USA,
Morgan Kaufmann Publishers Inc., 1981, pp. 674-679.
3. Comaniciu D., Ramesh V. and Meer P., Kernel-based
object tracking, Pattern Analysis and Machine Intelligence,
IEEE Transactions on, vol. 25, no. 5, pp. 564-577, 2003,
doi:10.1109/TPAMI.2003.1195991.
4. Zitov B. and Flusser J., Image registration methods: a
survey, Image Vision Comput., vol. 21, no. 11, pp. 9771000, 2003, doi:10.1016/S0262-8856(03)00137-9.
5. Fukunaga K. and Hostetler L., The estimation of the
gradient of a density function, with applications in pattern
recognition, Information Theory, IEEE Transactions on, vol.
21, no. 1, pp. 32-40, 1975, doi:10.1109/TIT.1975.1055330.
6. Comaniciu D. and Meer P., Mean shift analysis and
applications, in Computer Vision, 1999. The Proceedings
of the Seventh IEEE International Conference on, Kerkyra,
1999, doi:10.1109/ICCV.1999.790416.
7. Comaniciu D., Ramesh V. and Meer P., Real-time tracking
of non-rigid objects using mean shift, in Computer Vision
and Pattern Recognition, 2000. Proceedings. IEEE
Conference on, Hilton Head Island, SC, 2000, doi:10.1109/
CVPR.2000.854761.
8. Collins R., Mean-shift blob tracking through scale
space, in Computer Vision and Pattern Recognition,
2003. Proceedings. 2003 IEEE Computer Society
Conference on, Madison, Wisconsin, 2003, doi:10.1109/
CVPR.2003.1211475.
9. Yan Y., Huang X., Xu W. and Shen L., Robust KernelBased Tracking with Multiple Subtemplates in Vision
Guidance System, Sensors, vol. 12, no. 2, pp. 1990-2004,
2012, doi:10.3390/s120201990.
10. Bradski G. R., Computer vision face tracking for use in a
perceptual user interface, Intel Technology Journal, vol. 2,
1998.
11. Lee L.-K., An S.-Y. and Oh S.-y., Robust visual object

tracking with extended CAMShift in complex environments,
in IECON 2011 - 37th Annual Conference on IEEE
Industrial Electronics Society, Melbourne, VIC, 2011,
doi:10.1109/IECON.2011.6120057.
12. Zivkovic Z. and Krose B., An EM-like algorithm for colorhistogram-based object tracking, in Computer Vision and
Pattern Recognition, 2004. CVPR 2004. Proceedings of the
2004 IEEE Computer Society Conference on, Washington,
DC, 2004, doi:10.1109/CVPR.2004.1315113.
13. Ning J., Zhang L., Zhang D. and Wu C., Scale and
orientation adaptive mean shift tracking, Computer
Vision, IET, vol. 6, no. 1, pp. 52-61, 2012, doi:10.1049/
ietcvi.2010.0112.
14. Ning J., Zhang L., Zhang D. and Wu C., Robust mean-shift
tracking with corrected background-weighted histogram,
Computer Vision, IET, vol. 6, no. 1, pp. 62-69, 2012,
doi:10.1049/iet-cvi.2009.0075.
15. Dalal N. and Triggs B., Histograms of oriented gradients
for human detection, in Computer Vision and Pattern
Recognition, 2005. CVPR 2005. IEEE Computer Society
Conference on, San Diego, CA, USA, 2005, doi:10.1109/
CVPR.2005.177.
16. Enzweiler M. and Gavrila D., Monocular Pedestrian
Detection: Survey and Experiments, Pattern Analysis and
Machine Intelligence, IEEE Transactions on, vol. 31, no. 12,
pp. 2179-2195, 2009, doi:10.1109/TPAMI.2008.260.
17. Zheng G. and Chen Y., A review on vision-based
pedestrian detection, in Global High Tech Congress
on Electronics (GHTCE), 2012 IEEE, Shenzhen, 2012,
doi:10.1109/GHTCE.2012.6490122.
18. Viola P. a. J. M., Rapid object detection using a boosted
cascade of simple features, in Computer Vision and
Pattern Recognition, 2001. CVPR 2001. Proceedings of the
2001 IEEE Computer Society Conference on, Kauai, HI,
2001, doi:10.1109/CVPR.2001.990517.
The Engineering Meetings Board has approved this paper for publication. It has successfully completed SAEs peer review process under the supervision of the session
organizer. The process requires a minimum of three (3) reviews by industry experts.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical,
photocopying, recording, or otherwise, without the prior written permission of SAE International.
Positions and opinions advanced in this paper are those of the author(s) and not necessarily those of SAE International. The author is solely responsible for the content of the
paper.
ISSN 0148-7191
http://papers.sae.org/2014-01-0170

Meanshif Delphi

Încărcat de

Informații document

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

Meanshif Delphi

Încărcat de

Drepturi de autor:

Formate disponibile

Adaptation of the Mean Shift Tracking Algorithm to

Monochrome Vision Systems for Pedestrian Tracking

Daniel Schugk and Anton Kummert

the object like distance, velocity and acceleration. Tracking in

Mean Shift Tracking

Figure 1. Generalized block-diagram of standard mean shift tracking

The object model histogram is equal to the probability

wherein qu is the kernel weighted probability of the u-th

The candidate model is built up the same way, at position yi

The key step for the tracking algorithm is the computation of a

The kernel function g(x) is defined as the negative derivative of

a similarity measurement between the target and candidate

Scale and Orientation Estimation (Ning et al.)

By enlarging the object bounding box, background features are

The Bhatacharryan coefficient of eq. (5) is an indicator for the

together with the centroid of the distribution of MSWVs

Furthermore the second order central moments

can be combined to a covariance matrix

which eigenvalues 1 and 2 are used to estimate the object

tracking. Comaniciu et al. [3] tried to reduce the influence of

so that the CBWH-target model results in

The relationship between the standard mean shift feature

After calculating the width and height of the estimated

to be reused in the next frame, since it contains the orientation

Figure 3. (a) Standard mean shift target model. (b) Background

Introducing Histograms of Oriented

Figure 4. HoG-features are calculated on the oriented gradients inside

With these high dimensional features an easy histogram

The orientation calculation is not influenced by different feature

90px). The other sequences only contain one walking PED on

Furthermore, the density of the feature grid influences the

Scale Estimation and Background Adaptation

(a). Sequence 1, frame 6

The scale-estimation of SOAMST relies on the area estimation

(b). Sequence 1, frame 60

The proposed algorithm is evaluated in two different

Figure 5 shows the results of all tracking algorithms on Seq. 1.

(a). Sequence 2, frame 18

With a moving vehicle, the performance of all intensity based

(b). Sequence 5, frame 32

Slightly better position estimations can be achieved using the

coefficient as a similarity measure between the final candidate

11. Lee L.-K., An S.-Y. and Oh S.-y., Robust visual object

S-ar putea să vă placă și