
Pattern Recognition 43 (2010) 4028-4041


Multi-object detection and tracking by stereo vision


Ling Cai (a), Lei He (b,*), Yiren Xu (a), Yuming Zhao (a), Xin Yang (a)

(a) Department of Automation, Shanghai Jiao Tong University, Shanghai 200340, China
(b) National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

Article info

Article history: Received 9 August 2009; received in revised form 15 April 2010; accepted 10 June 2010.
Keywords: Stereo vision; Kernel density estimation; Multi-object detection and tracking; Clustering

Abstract

This paper presents a new stereo vision-based model for multi-object detection and tracking in surveillance systems. Unlike most existing monocular camera-based systems, a stereo vision system is constructed in our model to overcome the problems of illumination variation, shadow interference, and object occlusion. In each frame, a sparse set of feature points is identified in the camera coordinate system and then projected onto the 2D ground plane. A kernel-based clustering algorithm is proposed to group the projected points according to their height values and locations on the plane. From the resulting clusters, the number, position, and orientation of objects in the surveillance scene can be determined for online multi-object detection and tracking. Experiments on both indoor and outdoor applications with complex scenes show the advantages of the proposed system.

(c) 2010 Elsevier Ltd. All rights reserved.

1. Introduction

Surveillance of multiple objects is an active research topic in computer vision, widely applied in public safety, traffic control, and intelligent human-machine interaction, to name just a few examples. A surveillance system usually consists of two correlated components, object detection and tracking, both of which have been extensively studied, and numerous approaches have been proposed. The detection component locates the objects of interest, and the tracking component associates object positions over time in a sequence of frames. These two components are correlated with each other: object detection locates regions of interest for tracking, and tracking results can be used for efficient detection in subsequent frames. Current surveillance systems usually use low-cost CCD video cameras.

1.1. Methodology of monocular vision

In object detection, background subtraction is usually used to detect moving objects in a scene. To filter out minor or periodic motion in the background, e.g. illumination variation, swaying trees, and water waves, the Gaussian mixture model [1] and Bayesian model [2] are commonly used. Some learning-based approaches [3-5] focus on specific object detection, such as faces, humans, or vehicles. These approaches usually apply a specific feature descriptor to
characterize objects of interest, such as wavelets [3], motion [4], or histograms of oriented gradients (HOG) [5]. Machine learning algorithms such as neural networks, support vector machines [3], and adaptive boosting [4,5] can then be used to learn the patterns from training samples and classify newly detected samples. In [6], object tracking is modeled as optimal state estimation in a nonlinear system, which applies a sequential Monte Carlo method to track the contour of a single object. Comaniciu and Meer [7] use the Bhattacharyya coefficient to measure the similarity between a given object model and each candidate. Instead of an exhaustive search, mean shift iteration is employed to locate the candidate most similar to the object model. With satisfactory performance in image segmentation, active contour models have also been adopted by some tracking methods to extract the object regions in sequential frames [8]. Unlike the above tracking algorithms [6-8] based on an updating strategy, a predicting strategy is presented in [9] to learn certain compact object regions and train a set of local predictors, which can be applied for real-time object motion estimation.
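For readers unfamiliar with the similarity measure mentioned above, the following minimal sketch (in Python, assuming plain unweighted RGB histograms rather than the kernel-weighted histograms used in [7]) computes the Bhattacharyya coefficient between an object model and a candidate patch.

```python
import numpy as np

def color_histogram(patch, bins=16):
    """Normalized joint RGB histogram of an image patch (H x W x 3, uint8).
    The bin count and the absence of a spatial weighting kernel are
    simplifications of the kernel-weighted histograms used by [7]."""
    hist, _ = np.histogramdd(
        patch.reshape(-1, 3), bins=(bins, bins, bins), range=[(0, 256)] * 3
    )
    hist = hist.ravel()
    return hist / hist.sum()

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two normalized histograms:
    1.0 means identical distributions, 0.0 means no overlap."""
    return float(np.sum(np.sqrt(p * q)))
```

A coefficient close to 1 indicates a good candidate; mean shift then moves the candidate window toward regions of higher similarity.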

1.2. Challenges in existing systems

Existing object detection and tracking systems usually have problems under the following conditions: (1) illumination variations in complex scenes; (2) shadow interference; (3) multiple objects with severe occlusion. Illumination variation is a common effect in surveillance. Because color features are closely related to illumination, the visual appearance of an object varies with illumination changes, which may result in poor tracking performance. To obtain a robust



tracking, researchers have explored different color spaces, including Red-Green-Blue (RGB), Hue-Saturation-Value (HSV), Luminance-Chrominance (YUV), Luminance-In-Quadrature (YIQ), and their normalized versions. However, few of them achieve satisfactory results under illumination variations. Recently, Moreno-Noguer et al. [11] selected learning examples from both background and object regions to construct a Fisher plane that separates these two classes of pixels to the maximum extent with linear discriminant analysis. It was shown theoretically and experimentally that this plane is invariant to illumination changes and that the method outperforms many other approaches. In [12], a set of candidate color spaces is constructed from the RGB channels with different parameters, and an evaluation function is utilized to select the best color space in which the object being tracked is most clearly distinguished from the background. In addition, specific features [6] or combinations of multiple features [13,14] that are robust to illumination have been proposed for object tracking.
Shadow interference poses another challenge for surveillance systems, since a shadow has the same motion and a similar shape as the moving object. If not treated properly, the shadow may be wrongly regarded as part of the object, which can generate incorrect results such as shape distortion, object merging, or object loss. Therefore, correctly detecting moving shadows during tracking significantly reduces the error rate. In [15], a new color model is proposed to separate the brightness from the chromaticity component, and the threshold is automatically determined to distinguish the shadow region from the background and moving objects. Mikic et al. [16] extract the shadow region based on individual pixel appearance and spatial and temporal constraints. Stauder et al. [17] employ heuristic methods to classify shadow and foreground objects. A comprehensive study of the above methods can be found in the survey [18]. Recently, the approach of [19] extends the illumination approximation model from a single lighting source to multiple sources. After extracting moving pixels, a stepwise strategy sequentially removes object pixels until the whole shadow region is obtained. The above approaches are all simplifications of certain complex circumstances, which may not be applicable to real situations. In summary, the approaches in [15,17] are usually used in indoor settings and those in [16,19] are suitable for outdoor applications.
Most existing algorithms (e.g. the particle filter [6] and mean shift [7]) focus only on single-object tracking and cannot be simply extended to multi-object tracking. For example, a direct extension is to construct multiple instances of a single-object tracking algorithm, with each instance tracking one object. Two major problems of this extension are: (1) the expensive computational cost prevents efficient online tracking; and (2) the occlusion, splitting, and merging of multiple objects cannot be handled by a single-object tracking algorithm. Another way is to extend the classical framework to include all objects. In such multi-object tracking systems [20,21], the level set method was first used to handle contour splitting and merging. Recently, MacCormick and Blake [22] proposed a probabilistic exclusion principle for the observation model of the particle filter to prevent two targets with similar configurations from merging.
Meanwhile, their sampling strategy for high-dimensional spaces, called partitioned sampling, reduces the computational complexity that grows with the dimension of the multi-object space. To model the interaction of multiple objects, extended dynamic programming is presented in [23] to search for the optimal correspondence in the path space. In [24], Jiang et al. propose a linear relaxation scheme with lower complexity and a high probability of determining the correct object correspondence. To solve the combined problem of ranking and classification in association-based tracking, Li et al. [25] introduce a new algorithm called HybridBoost to learn the affinity models offline and track multiple

targets in crowded scenes. In [26], an extended Kalman filter employs a global optimization technique to ensure the optimal multi-object matching in each frame, and a management model to handle object occlusion and disappearance.

1.3. Related work based on stereo vision

To our knowledge, no existing system can handle all three problems (i.e., illumination variation, shadow interference, and multiple objects with severe occlusion) successfully. To meet increasing public safety needs, new hardware devices have been utilized in current surveillance systems to substantially enhance system performance, e.g. stereo or multi-camera systems. A multi-camera system observes the scene from two or more different views and obtains more comprehensive information than a monocular camera system, which can be used to handle object occlusion [27,28]. According to the distance between adjacent cameras, multi-camera systems can be classified into two categories: wide and short baseline systems. Current systems usually adopt wide baseline configurations, which do not require prior system calibration and provide more flexible viewing angles than short baseline systems. To avoid noise disturbance, wide baseline systems [28,29] usually estimate the correspondence directly from a sparse set of feature points in different views, while short baseline systems [31-33] usually construct a depth map to relate the camera views. As a wide baseline example, a region-based stereo algorithm, M2Tracker [28], derives the 3D object points directly from the information of 16 cameras, which overcomes object occlusion and produces a globally optimal result for multi-object detection and tracking. In [29], Khan and Shah correspond people's feet in different views under a planar homography constraint, so that people's locations on the ground plane can be determined even under occlusion. Both systems use color as the key feature to build the correspondence among different cameras, which makes it difficult to handle objects with similar appearance. To overcome this problem, a motion model [30] is constructed by learning human movement on the ground plane. These wide baseline systems usually have difficulty deriving an accurate correspondence even at a rather high computational cost. In addition, it is difficult to generalize these systems to public applications due to the use of specific features or techniques.
Short baseline camera systems can easily construct the camera correspondence thanks to the small variation between views. The disparity between cameras can be used to derive the depth map of the scene, which is then used to segment the objects of interest. However, the estimated depth map is sensitive to noise and leads to inaccurate object detection. Moreover, disparity estimation for homogeneous regions usually produces large depth errors. In [31], two additional modules for skin-hue classification and face detection are employed to detect and track the body and face independently, which alleviates the noise problem. Darrell et al. [32] present dynamic-range imagery with an expensive learning approach to recover the dense depth map. To avoid occasional tracking failures, the whole trajectory is determined offline by dynamic programming. People tracking in [33] is based on heterogeneous head detection to overcome the homogeneous region problem. Due to the loss of fine head details, the head map is usually reconstructed as a flat region, so the head size is estimated before head extraction by a scale-adaptive filter. Still, this system cannot detect an occluded head during tracking.
Recently, a new tracking and reconstruction method [34] for a single rigid object was introduced to build the 3D model incrementally from one stereo pair and then estimate the motion parameters from a single camera. Instead of tracking independent


objects in 3D space, Mozerov et al. [35] present an action-specific model of human motion to track the correlated body parts by automatically synchronizing the training sequences and learning 3D human postures. In practice, object occlusion poses a significant challenge for depth map segmentation [31-33] because the occluded object occupies only a very small region in the map. In addition, these systems usually require additional modules (e.g. face or skin detection [31], enhancement [32], scale filtering [33], or a 3D model [34]) to reduce the depth noise.

1.4. Our contributions

In general, sparse feature point-based correspondence construction is not only more cost effective but also provides more accurate results than approaches based on the dense depth map. For example, situations with only a small visible object region, due to low contrast or occlusion, always cause problems for depth map-based systems [31-33] but have little effect on tracking approaches based on feature points. With these inspirations, we present a new stereo vision-based model for multi-object detection and tracking. Instead of estimating the dense depth map and extracting object regions, we use the depth values of a small set of extracted feature points. Given a scene, our model first identifies a set of feature points and then projects them onto the 2D ground plane. The height values and locations of these projected points are used to generate object clusters. The number, position, and orientation of objects in the scene can be determined from the obtained clusters. Compared to current surveillance systems, the new system addresses the problems mentioned above: (1) the illumination variations in a scene are the same in both cameras and can be offset by the normalized cross correlation (NCC); (2) the feature points of shadow regions can be filtered out before clustering according to their heights and positions; (3) after projecting feature points onto the ground plane, there is no overlapping of different object regions because they are viewed from the top; (4) the proposed algorithm employs only a sparse set of feature points, and multi-object tracking is implemented by updating clusters, which significantly reduces the computational cost and enables an online surveillance system. Our system brings the good properties (feature extraction) of the wide baseline model to the short baseline model for more efficient and robust object detection and tracking, specifically for low-contrast and occlusion applications. In addition, the new system can be easily generalized to the detection and tracking of different object categories with only minor parameter changes.
The paper is organized as follows: Section 2 introduces our stereo vision system. Section 3 presents the proposed algorithm for multi-object detection and tracking using a kernel-based clustering approach. Section 4 demonstrates the experimental results, and Section 5 draws the conclusion.

2. Stereo coordinate system

Our stereo vision system consists of two fixed cameras, which are calibrated by Zhang's planar calibration method [36] and rectified by Fusiello's compact rectification method [37]. With the short baseline between the two cameras, the point correspondence in the two views can be simply established using the epipolar line constraint and a correlation similarity measure. With such a stereo setting, the correspondence among homogeneous region points cannot be constructed straightforwardly, i.e., it is difficult for stereo algorithms to recover the 3D structure of the whole scene and generate a dense depth map. Some approaches [32,38,39] have been proposed to solve this problem at a significant computational cost, which limits their implementation in online surveillance systems. Therefore, instead of constructing the dense disparity map for the surveillance scene, we extract a sparse set of feature points (e.g. corner and edge points) in either of the two cameras and estimate their disparities in the rectified image pair with the NCC metric. Such a feature point positioning method is more efficient and accurate than the derivation of the dense depth map, which makes it appropriate for multi-object tracking. With the derived disparities, the image coordinates of the feature points, and the camera parameters, their 3D coordinates in the stereo vision system can be determined by simple triangulation geometry.
One major problem in the camera view is that it is impractical to separate and track multiple objects directly using the feature points because of object occlusion. In the top view, however, the points on different objects are separated from each other and object occlusion can be eliminated. Meanwhile, we assume that there exists a flat ground plane in the surveillance scene, e.g. a floor region without walls or pillars, and an object can be declared in the region if enough points above the ground plane are detected, which is used in our system for new object detection. In order to transform camera view coordinates into top view coordinates, the ground plane equation in the camera coordinate system should be estimated first. In some robotic systems [40,41], inertial sensors are attached to the stereo vision system, providing the gravity direction relative to the ground plane. Unlike these mobile systems, the cameras in a surveillance system are usually fixed, so the ground plane equation remains unchanged. That is, without additional sensors, the ground plane can be determined by a calibration process. Some feature points are manually set (or labeled) on the ground to measure their 3D camera coordinates and fit the ground plane equation, see Fig. 1(a). With the ground plane, the projected points (x_i, y_i) and the heights z_i form the new top view coordinate system, or world coordinate system. The new axes (X_g and Y_g) and the origin are the projections of the axes (X_c and Y_c) and the origin of the camera coordinate system, see Fig. 1(b). Fig. 2(a) shows the surveillance scene from which the feature points are extracted and shown as red crosses. The transformation (or projection) of these points into the world coordinate system is shown in Fig. 2(b). Note that some feature points are outliers from the background. Without loss of generality, we define a region of interest (ROI) as the surveillance region to remove noise points, see the green polygon in Figs. 2(a) and (b). Points outside the ROI (e.g. points on the wall and door) are removed from processing.
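The following sketch illustrates this pipeline under simplifying assumptions: NCC matching along the rectified scanline for one feature point, pinhole triangulation, a least-squares ground-plane fit to manually labeled points, and projection onto the plane. The function names, window size, disparity range, and plane-orientation convention are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def ncc_disparity(left, right, x, y, win=5, max_disp=64):
    """Disparity of one feature point (x, y) in the rectified left image,
    found by maximizing normalized cross correlation along the same
    scanline of the right image. Window size and disparity range are
    placeholders."""
    h = win // 2
    patch = left[y - h:y + h + 1, x - h:x + h + 1].astype(np.float64)
    patch = (patch - patch.mean()) / (patch.std() + 1e-9)
    best_d, best_score = 0, -np.inf
    for d in range(0, min(max_disp, x - h)):
        cand = right[y - h:y + h + 1, x - d - h:x - d + h + 1].astype(np.float64)
        cand = (cand - cand.mean()) / (cand.std() + 1e-9)
        score = (patch * cand).mean()
        if score > best_score:
            best_score, best_d = score, d
    return best_d

def triangulate(x, y, d, f, B, cx, cy):
    """Pinhole triangulation for a rectified pair: f is the focal length in
    pixels, B the baseline in meters, (cx, cy) the principal point.
    Returns the 3D point in the camera coordinate system."""
    Z = f * B / max(d, 1e-6)
    X = (x - cx) * Z / f
    Y = (y - cy) * Z / f
    return np.array([X, Y, Z])

def fit_ground_plane(points_3d):
    """Least-squares plane n.p + d = 0 fitted to manually labeled ground
    points (N x 3), as in the calibration step of Fig. 1(a)."""
    centroid = points_3d.mean(axis=0)
    _, _, vt = np.linalg.svd(points_3d - centroid)
    n = vt[-1]                      # unit normal: direction of least variance
    if n[1] > 0:                    # orient the normal "upward"; this sign
        n = -n                      # convention is an assumption
    return n, -np.dot(n, centroid)

def project_to_ground(p, n, d):
    """Height above the plane and the in-plane (foot-point) coordinates of a
    camera-frame point p. Any orthonormal basis of the plane can serve as
    the world axes X_g, Y_g; assumes the ground is not fronto-parallel."""
    height = np.dot(n, p) + d
    foot = p - height * n
    u = np.cross(n, [0.0, 0.0, 1.0]); u = u / np.linalg.norm(u)
    v = np.cross(n, u)
    return np.array([np.dot(foot, u), np.dot(foot, v)]), height
```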

3. Object detection and tracking

The feature points of an object are close to each other after being projected onto the ground plane; based on this, the projected feature points can be grouped into different clusters that represent the location and orientation of the objects. The number of clusters is not fixed, since the number of objects in the scene changes over time. Such uncertainty causes a critical problem for classical clustering algorithms, which require the cluster number as a prior. In our stereo vision system, the distance between two projected points on the ground plane is equivalent to the distance of the corresponding feature points in the top view, so we can use the object's spatial shape as prior knowledge for the clustering operator. For instance, no matter what a person looks like, he or she can be simply represented by an ellipse from the top view. In addition, such an ellipse may have different rotation angles to approximate the orientation of a person.

3.1. Object detection

Object size plays an important role in our clustering algorithm, which is based on a kernel density estimation (KDE) algorithm.



Fig. 1. (a) Ground plane equation fitting by feature points and (b) the relationship between the camera and world coordinate systems.

Fig. 2. (a) Illustration of the extracted feature points and (b) their projection to the world coordinate system.

KDE is a classical nonparametric method that estimates the local probability density without assuming a generative model for the points. It uses the density of points within a window to approximate the probability at the center of the window. Within the window, each point contributes a different weight according to its value under the kernel function. Therefore, the probability density in a region with dense points is higher than in a region with sparse points. The positions and orientations of clusters can then be determined by the local maxima of the density. Each projected point can be associated with a local maximum by a hill climbing technique, and the points climbing to the same local maximum are grouped into one cluster. To efficiently search for these local maxima in the probability space produced by the KDE, Fukunaga and Hostetler [42] present the mean shift method, which provides a local density gradient estimate and has been used in many applications, including data analysis [43], image segmentation [44], and object tracking [7,45]. Following the mean shift scheme, we propose a kernel-based algorithm to group discrete points and determine the number, position, and orientation of objects in a scene.
Without loss of generality, the position and orientation of an object are considered as independent variables in the probability space. With the orientation kernel H_θ and position kernel H_x, the probability density function E(x, θ) of the projected points is constructed as

E(x, \theta) = \sum_{i=1}^{n_i} H_\theta(d_i(\theta)) \sum_{j=1}^{n_j} w_j H_x(d_j(x, \theta_i))    (1)

where x and θ are the position and orientation variables, and n_i and n_j are the numbers of angle samples and feature points, respectively. d_i(θ) and d_j(x) are the distance metrics of orientation and position with normalization coefficients A_θ and A_x, i.e., d_i(θ) = ||A_θ(θ − θ_i)||^2 and d_j(x) = ||A_x(x − x_j)||^2, where x_j denotes the coordinates of the j-th projected point (x_j, y_j) and θ_i denotes the i-th orientation value in [0, 2π]. w_j is the weight of the j-th projected point. The construction of H_θ(d_i(θ)) and H_x(d_j(x, θ_i)) will be given in Section 3.3. Note that only those points that lie within the surveillance region and whose height values are above a predefined threshold are used to estimate the probability function, i.e., feature points on walls or in shadow areas are filtered out before computing the density function. Figs. 3(a) and (b), respectively, show an example of the projected points x_j on the ground plane and the corresponding probability function E(x, 0).

Fig. 3. (a) The projected feature points and (b) the constructed probability function.

During hill climbing, the gradient direction of E(x, θ) is the steepest ascent direction toward the local maximum. We take the partial derivatives ∂E/∂x and ∂E/∂θ with respect to x and θ. In the mean shift method, the derivative of the probability function is formulated as the sum of the derivatives of the kernel functions, so the derivatives can be expressed as

\frac{\partial E}{\partial x} = 2\left[2A_x^2 \sum_{i=1}^{n_i} H_\theta(d_i(\theta)) \sum_{j=1}^{n_j} w_j K_x(d_j(x, \theta_i))\right]\Delta x

\frac{\partial E}{\partial \theta} = 2\left[2A_\theta^2 \sum_{i=1}^{n_i} K_\theta(d_i(\theta)) \sum_{j=1}^{n_j} w_j H_x(d_j(x, \theta_i))\right]\Delta\theta    (2)

where

\Delta x = \frac{\sum_{i=1}^{n_i} H_\theta(d_i(\theta)) \sum_{j=1}^{n_j} w_j\, x_j\, K_x(d_j(x, \theta_i))}{\sum_{i=1}^{n_i} H_\theta(d_i(\theta)) \sum_{j=1}^{n_j} w_j\, K_x(d_j(x, \theta_i))} - x

\Delta\theta = \frac{\sum_{i=1}^{n_i} \theta_i\, K_\theta(d_i(\theta)) \sum_{j=1}^{n_j} w_j\, H_x(d_j(x, \theta_i))}{\sum_{i=1}^{n_i} K_\theta(d_i(\theta)) \sum_{j=1}^{n_j} w_j\, H_x(d_j(x, \theta_i))} - \theta

K_x and K_θ are the negative derivative functions of H_x and H_θ, i.e., K_x = −H_x' and K_θ = −H_θ'. A local maximum of the function E must satisfy the conditions ∂E(x, θ)/∂x = 0 and ∂E(x, θ)/∂θ = 0. With the gradient ascent approach, a new position x̂ and orientation θ̂ closer to the local maximum in the space E(x, θ) are obtained by the update

\hat{x} = x + \Delta x, \qquad \hat{\theta} = \theta + \Delta\theta    (3)

The update step above is repeated until convergence (Δx and Δθ are smaller than a threshold); this is the mean shift iteration. At convergence, (x̂, θ̂) is a local maximum of E(x, θ). In the mean shift iteration, x starts from a projected point's position and θ starts from a random value, so that a local maximum of E is found. Thus each projected point can be assigned to a local maximum, and the points sharing the same maximum constitute a cluster. The position and orientation of each local maximum are regarded as the position and orientation of the corresponding cluster.
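The hill climbing just described can be sketched as below, continuing the previous sketch (it reuses h_pos and h_ang); the bandwidths, tolerance, and iteration cap are placeholders rather than values from the paper.

```python
import numpy as np

def k_pos(t):
    """Negative derivative of the Eq. (9) profile: 1 on [0, 1], else 0."""
    return np.where((t >= 0) & (t <= 1), 1.0, 0.0)

def k_ang(t):
    """Negative derivative of the Eq. (10) profile: exp(-t)."""
    return np.exp(-t)

def mean_shift(x, theta, pts, weights, thetas, a=0.7, b=0.3,
               A_theta=1.0, tol=1e-3, max_iter=100):
    """Hill-climb (x, theta) to a local maximum of E via the updates of
    Eq. (3). Returns the converged position and orientation."""
    A_x = np.diag([1.0 / a, 1.0 / b])
    for _ in range(max_iter):
        num_x = np.zeros(2); den_x = 0.0
        num_t = 0.0;         den_t = 0.0
        for th_i in thetas:
            d_i = (A_theta * (theta - th_i)) ** 2
            c, s = np.cos(th_i), np.sin(th_i)
            R = np.array([[c, s], [-s, c]])
            d_j = np.sum(((pts - x) @ R.T @ A_x.T) ** 2, axis=1)
            g_pos = weights * k_pos(d_j)          # w_j K_x(d_j)
            g_ang = weights * h_pos(d_j)          # w_j H_x(d_j)
            num_x += h_ang(d_i) * (g_pos[:, None] * pts).sum(axis=0)
            den_x += h_ang(d_i) * g_pos.sum()
            num_t += th_i * k_ang(d_i) * g_ang.sum()
            den_t += k_ang(d_i) * g_ang.sum()
        dx = num_x / (den_x + 1e-12) - x          # mean shift vectors
        dt = num_t / (den_t + 1e-12) - theta
        x, theta = x + dx, theta + dt             # Eq. (3)
        if np.linalg.norm(dx) < tol and abs(dt) < tol:
            break
    return x, theta
```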

The feature points can be easily grouped into different clusters when objects are well separated from each other. However, when multiple objects are very close to each other, the mean shift algorithm may generate one large cluster consisting of the points of adjacent objects, which violates our predefined shape size requirements and needs to be divided into smaller clusters with multiple kernel functions. In our system, a splitting method is used to divide each large cluster into two smaller ones. It is based on the cluster orientation θ̂, which represents the direction in which the kernel covers the maximum number of projected points. In a cluster C_k, the distance measure between the cluster center x̂_k and a point x_j relative to the orientation θ̂_k can be written as

d(x_j, \hat{x}_k) = \frac{(x_j - \hat{x}_k)\tan\hat{\theta}_k + (y_j - \hat{y}_k)}{\sqrt{1 + \tan^2\hat{\theta}_k}}    (4)

To measure the point distribution of C_k, we define the variance of C_k as

\mathrm{var}(C_k) = \frac{\sum_{x_j \in C_k} w_j\, |d(x_j, \hat{x}_k)|}{\sum_{x_j \in C_k} w_j}    (5)

For a small cluster containing one object, the projected points with high weights are close to the center of the cluster, i.e., var(C_k) is small. On the contrary, a large cluster includes points of different objects, so many points with high weights are far away from the center, which results in a large cluster variance. When the variance of a cluster exceeds a certain degree, it is labeled as a large cluster to be split. For a large cluster with the points of multiple objects, two new centers need to be determined to generate two small clusters. These two centers should be separated as far as possible, while both should be close to the points with high weights. Therefore, we define an objective function to search for the new centers within the large cluster:

F(\hat{x}_{s1}, \hat{x}_{s2}) = \sum_{x_j \in H_x(\hat{x}_{s1})} w_j\, d(x_j, \hat{x}_{s1})^2 + \sum_{x_j \in H_x(\hat{x}_{s2})} w_j\, d(x_j, \hat{x}_{s2})^2 - \lambda \log d(\hat{x}_{s1}, \hat{x}_{s2})^2    (6)

where H_x(x̂_{s1}) (respectively H_x(x̂_{s2})) is the simplified form of the position kernel function in Eq. (1), i.e., the kernel centered at x̂_{s1} (x̂_{s2}), and λ is the penalty weight, which pushes the two clusters far away from each other. A practical way is to define λ based on the cluster variance, e.g. λ = f(var(C_k)), where f is a monotonically increasing function. To minimize Eq. (6), the split cluster centers are adjusted using gradient descent type iterations. The adjustment is performed by adding the negative of the scaled gradient to the two new center locations at each step:

\frac{\partial F}{\partial \hat{x}_{s1}} = 2\left(\sum_{x_j \in H_x(\hat{x}_{s1})} w_j\, d(x_j, \hat{x}_{s1}) \frac{\partial d(x_j, \hat{x}_{s1})}{\partial \hat{x}_{s1}}\right) - 2\lambda\, \frac{1}{d(\hat{x}_{s1}, \hat{x}_{s2})} \frac{\partial d(\hat{x}_{s1}, \hat{x}_{s2})}{\partial \hat{x}_{s1}}

\frac{\partial F}{\partial \hat{x}_{s2}} = 2\left(\sum_{x_j \in H_x(\hat{x}_{s2})} w_j\, d(x_j, \hat{x}_{s2}) \frac{\partial d(x_j, \hat{x}_{s2})}{\partial \hat{x}_{s2}}\right) - 2\lambda\, \frac{1}{d(\hat{x}_{s1}, \hat{x}_{s2})} \frac{\partial d(\hat{x}_{s1}, \hat{x}_{s2})}{\partial \hat{x}_{s2}}    (7)

With the above new centers and the orientation of the original large cluster, the new clusters are constructed. All the projected points in the large cluster are assigned to one of the new clusters according to their distances to the new centers, i.e., d(x_j, x̂_{s1}) and d(x_j, x̂_{s2}). This splitting process is repeated until no more clusters exceed the variance threshold.
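A simplified sketch of the variance test and the split search of Eqs. (4)-(6) is given below. It replaces the gradient descent of Eq. (7) with a direct search over pairs of member points, assigns each point to the nearer candidate center instead of evaluating the kernel support exactly, and uses placeholder values for the variance threshold and the penalty weight λ; all helper names are hypothetical, and the weighted mean stands in for the cluster's mode.

```python
import numpy as np

def axis_distance(p, center, theta):
    """Signed distance of Eq. (4) between a point and a cluster center,
    measured relative to the cluster orientation theta."""
    t = np.tan(theta)
    return ((p[0] - center[0]) * t + (p[1] - center[1])) / np.sqrt(1.0 + t * t)

def cluster_variance(pts, weights, center, theta):
    """Weighted variance of Eq. (5); large values flag merged clusters."""
    d = np.array([abs(axis_distance(p, center, theta)) for p in pts])
    return float(np.sum(weights * d) / np.sum(weights))

def split_objective(pts, weights, c1, c2, theta, lam):
    """Objective F of Eq. (6) for two candidate centers; smaller is better."""
    d1 = np.array([axis_distance(p, c1, theta) for p in pts])
    d2 = np.array([axis_distance(p, c2, theta) for p in pts])
    sep = np.linalg.norm(np.asarray(c1) - np.asarray(c2)) + 1e-9
    near1 = np.abs(d1) <= np.abs(d2)   # point contributes via the nearer kernel
    return (np.sum(weights[near1] * d1[near1] ** 2)
            + np.sum(weights[~near1] * d2[~near1] ** 2)
            - lam * np.log(sep ** 2))

def split_cluster(pts, weights, theta, var_thresh=0.35, lam=0.1):
    """If the cluster is too spread out along its orientation, pick the pair
    of member points minimizing F as the two new centers; the paper refines
    the centers by gradient descent instead of this direct search."""
    center = np.average(pts, axis=0, weights=weights)
    if cluster_variance(pts, weights, center, theta) <= var_thresh:
        return [center]
    best, best_f = None, np.inf
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            f = split_objective(pts, weights, pts[i], pts[j], theta, lam)
            if f < best_f:
                best_f, best = f, (pts[i].copy(), pts[j].copy())
    return list(best)
```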

3.2. Object tracking

After object detection, the tracking algorithm is used to associate the object positions in consecutive frames and provide a trajectory for each object. Two methods can be used to track an object: we can detect the object position in each frame by the


clustering algorithm and then correspond the detections across frames, or we can estimate the object position in a new frame by iteratively updating the position from the previous frame. In our system, we choose the second approach for computationally efficient tracking. Given a new frame, the object position in the previous frame is known and can be used to estimate the new position in the current frame according to motion prediction. If an object with N feature points is tracked across M frames, the first tracking strategy needs NM hill-climbing runs to correspond object positions, while the second (updating) strategy needs only N + M - 1. Specifically, this approach does not search for the local maximum of every projected point, which significantly reduces the computational cost of multi-object tracking.
In the updating strategy, the object position prediction and correction are implemented by a Kalman filter. The velocity vector of an object, v = x̂_{t-1} − x̂_{t-2} (where x̂_{t-1} and x̂_{t-2} are the positions of the object in the two previous frames), is modeled in the state and observation variables of the filter. The initial position for hill climbing in the current frame is estimated from the filter prediction. Then the local maximum in the current frame is obtained by the mean shift iteration. Finally, the position and orientation of the object are updated with the new local maximum. Fig. 4 shows the probability functions constructed from the projected points in the four frames following Fig. 3(b). The black points from the Kalman filter prediction are used to initialize the hill-climbing procedure; they are rather close to the local maxima, so the update needs only a few iterations. After updating the local maxima, the projected points covered by the kernel function H_x are assigned to the cluster directly, without a hill-climbing procedure. Unlike object detection, the tracking algorithm runs the hill-climbing procedure only once per object, which enables an online tracking system.
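A minimal sketch of the updating strategy follows; it substitutes a plain constant-velocity prediction for the paper's Kalman filter and reuses the mean_shift routine from the earlier sketch. The class name and its fields are illustrative.

```python
import numpy as np

class TrackedObject:
    """Minimal per-object state for the updating strategy: the last two
    positions give a constant-velocity prediction that seeds the mean shift
    refinement (the paper uses a Kalman filter for this prediction)."""

    def __init__(self, position, orientation):
        self.prev = np.array(position, dtype=float)
        self.curr = np.array(position, dtype=float)
        self.theta = float(orientation)

    def predict(self):
        return self.curr + (self.curr - self.prev)   # v = x_{t-1} - x_{t-2}

    def update(self, pts, weights, thetas):
        """Refine the predicted position on the new frame's projected points
        with the mean_shift sketch defined earlier."""
        x0 = self.predict()
        x_new, th_new = mean_shift(x0, self.theta, pts, weights, thetas)
        self.prev, self.curr, self.theta = self.curr, x_new, th_new
        return x_new, th_new
```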

The summary table of the system is illustrated in Fig. 5. For the projected points from the preprocessing block described in Section 2, the tracking block generates clusters to update the positions of the objects being tracked, and then the detection block searches the remaining clusters for new objects.

3.3. Kernel construction

In Eq. (1), the kernel functions H_x and H_θ represent the position and orientation distributions of object points in space, and the weight w_j reflects the relative importance of each projected point in determining the local maximum. Generally, different types of objects have different spatial attributes, which requires different kernel and weighting functions. Here we take humans as an example category to illustrate the construction of the kernel functions and weights. As indicated in Section 2, a human body observed from the top viewpoint can be approximated by an ellipse with major axis a and minor axis b, i.e., the center and the major axis of the ellipse correspond to the head and the shoulders, respectively. Parameters a and b are the person's width and thickness, so the position bandwidth A_x can be defined as

A_x = \begin{bmatrix} 1/a & 0 \\ 0 & 1/b \end{bmatrix}    (8)

The head is higher than the shoulders. After normalization, the kernel function H_x can be expressed as

H_x(x) = \begin{cases} 1 - x, & 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases}    (9)

Fig. 4. The constructed probability function in the four sequential frames after Fig. 3(b).


Fig. 5. The summary table of the system.

The function H_x(d_j(x, θ_i)) in Eq. (1) is obtained by rotating the kernel by an angle θ_i with respect to its center. The kernel size selection is crucial in our algorithm: a large kernel may merge adjacent clusters together, while a small kernel may mistakenly partition a single cluster. Therefore, an appropriate kernel should balance cost and performance. In our system, the kernel size is set to the average over a group of people of different sizes, which performs well in detecting and tracking people, see Section 4. The orientation in a kernel function represents the rotation angle of a cluster with respect to its center. We derive the object orientation from a set of n_i discrete angles evenly distributed in [0, 2π]. H_θ is defined as a Gaussian-type function:

H_\theta(x) = e^{-x}    (10)

Different from the strict restriction on the position kernel size, the orientation bandwidth A_θ can be assigned roughly according to the application's resolution requirements, e.g. n_i/(2π) or n_i/(4π) for people tracking. The weight w_j represents the contribution of each projected point to the whole cluster. For a human, the center corresponds to the highest points above the ground plane, i.e., the cluster center should be close to the points with large heights. Thus, we can set the weight function as

w_j = z_j    (11)

Our system uses the spatial attributes of objects instead of other image features (e.g. skin and face) for object detection and tracking. Thus it is more general than other existing systems, which are usually limited to specific objects. For example, in traffic surveillance, the height of a vehicle is not important for determining its center, so we can construct a box-like kernel function and set the weight to a constant.
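One way to organize these category-specific choices is shown below; the class, its defaults, and the vehicle dimensions are illustrative assumptions, not part of the paper.

```python
from dataclasses import dataclass, field
from typing import Callable
import numpy as np

@dataclass
class KernelConfig:
    """Bundles the Section 3.3 choices for one object category: the ellipse
    axes (a, b) for A_x of Eq. (8), the number of angle samples n_i, the
    orientation bandwidth A_theta, and the point weight of Eq. (11)."""
    a: float = 0.7                       # major axis (person width), meters
    b: float = 0.3                       # minor axis (person thickness)
    n_i: int = 4                         # angle samples {0, pi/4, pi/2, 3pi/4}
    A_theta: float = 4 / (2 * np.pi)     # e.g. n_i / (2*pi)
    weight: Callable[[float], float] = field(default=lambda z: z)  # w_j = z_j

    def thetas(self):
        return np.arange(self.n_i) * np.pi / self.n_i

    def A_x(self):
        return np.diag([1.0 / self.a, 1.0 / self.b])

# A box-like configuration for vehicles, where height matters little for
# locating the center, so the weight is constant (see the remark above);
# the axis lengths here are purely illustrative.
person = KernelConfig()
vehicle = KernelConfig(a=2.0, b=1.0, weight=lambda z: 1.0)
```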

4. Experiments

Our binocular stereo vision system consists of two cameras with parallel optical axes. After parameter calibration, the ground plane equation in the camera view is estimated as described in Section 2. The feature points are obtained by the Harris operator. Our system has no specific requirements on the surveillance objects, and moving people are used as an example of multi-object detection and tracking in both indoor and outdoor scenes with complex background. Specifically, we show experiments with the problems of illumination variation, shadow interference, and severe occlusion. As discussed in Section 3.3, an ellipse is used to approximate a person, and the axis parameters a, b are set to the rough average size of the objects. An appropriate estimate for a and b is 0.7, 0.3. In our experiments, smaller and bigger sizes are also applied to analyze the system's sensitivity to different parameter configurations. Unlike the size parameters a, b in the position kernel function H_x, the parameter n_i is not related to the object category but to the resolution of the object orientation. It is set to 4 in our experiments, i.e., the set of angle samples θ_i is {0, π/4, π/2, 3π/4}.
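The Harris-based feature extraction mentioned above could be realized, for example, with OpenCV as in the sketch below; the corner count, quality threshold, and minimum distance are placeholders.

```python
import cv2
import numpy as np

def extract_feature_points(gray, max_corners=500):
    """Sparse corner features in the left (rectified) image. OpenCV's
    goodFeaturesToTrack with useHarrisDetector=True applies the Harris
    response; the parameter values are illustrative only."""
    corners = cv2.goodFeaturesToTrack(
        gray, maxCorners=max_corners, qualityLevel=0.01,
        minDistance=5, useHarrisDetector=True, k=0.04)
    if corners is None:
        return np.empty((0, 2), dtype=np.float32)
    return corners.reshape(-1, 2)   # (N, 2) array of (x, y) pixel positions
```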

4.1. Single object experiments

At present, there are few public databases for testing short baseline camera systems, so we collected our own data to evaluate the system performance. The first experiment presents the tracking performance of different methods for a single object. In Fig. 6, the feature point tracking algorithm [9] for a monocular vision system is initialized to track the person in the sequential frames. The algorithm learns the feature point intensity variations offline and generates a number of linear predictors for motion estimation. The second row shows the tracking result of [9]. The predictors' performance is rather sensitive to the object pose. Once the moving object changes its orientation significantly (i.e., non-rigid deformation or rotation), the local features used by the predictors become unavailable, resulting in inaccurate tracking, e.g. the second row of Fig. 6. With stereo vision systems [32], the 3D structure of the observed scene can be obtained and projected onto the ground plane as top-view images. Instead of color features, the stereo systems use spatial features, which are robust to object pose changes. Fig. 7 compares the tracking results of stereo vision systems. With the data of Fig. 6, the first row of Fig. 7 is produced by a state-of-the-art tracking algorithm [32], and our results are illustrated in the second row. Although both algorithms use short baseline stereo vision systems to acquire 3D scene structure, the method in [32] locates objects by the foreground volume projected on the ground plane, while our method searches the density space of projected points for local maxima. Compared to a monocular vision system, stereo vision systems not only detect


Fig. 6. Single object tracking: top row: object feature points initialization; bottom row: tracking results using linear predictors [9].

Fig. 7. Single object tracking trajectories from Fig. 6.

objects automatically (i.e., without user initialization) but also produce accurate object trajectories over frames.

4.2. Indoor multi-object detection and tracking

For multi-object tracking, occlusion among objects is a common yet challenging problem. When an object is mostly obstructed by others, it is difficult to detect and track the occluded object because only very few feature points are available. One example of occlusion occurs when two queues of people walk towards each other, see Fig. 8. The occlusion is a short-term event, but the object motion is hard to predict when two people meet, i.e., they can either continue in their original directions or walk back. The method of [32] is unable to differentiate two occluded objects because they correspond to a single small region in the dense depth map and to a single volume in the ground plane image as well. Although this temporary occlusion lasts only a short time, it causes a large fluctuation in the object trajectories. The tracker takes the nearby larger volume as the object projection, which leads to a wrong tracking result (see the second row in Fig. 8). For our system, the occlusion significantly reduces the available

feature points. However, the object detection and tracking does not depend on the number of points but on their density. Therefore, these points can still generate an independent cluster like the points of an unoccluded object, see the third row of Fig. 8. Compared to the ellipse configuration (a = 0.7, b = 0.3) in the third row, two smaller ellipses (a = 0.6, b = 0.2 and a = 0.5, b = 0.15) in the fourth and fifth rows are selected for the position kernel function H_x. There is little difference among the three results because the projected points lie within compact regions. Fig. 9 presents another occlusion case, where people walk in a queue towards the camera. In this queue, a girl is severely occluded by the people in front of her. Only a small area belonging to the girl is visible in most frames, which gives little response in the dense depth map and the ground plane image (see the second row). In this case, the method in [32] loses the occluded object and gives an incorrect object count and incorrect trajectories. The occlusion also leaves few feature points for the object in our system. However, these points are high above the ground plane and their projections lie within a small region, so they have relatively large weights and high density, based on which the system can correctly generate a cluster to detect and track the object. The experimental result with a = 0.7, b = 0.3 is shown in the third row of Fig. 9. We further reduce the position kernel function H_x to


Fig. 8. Multi-object tracking with two queues moving toward each other.

a = 0.6, b = 0.2, and the experimental result in the fourth row is


nearly identical to the one in the third row. Given a smaller parameter setting, a = 0.5, b = 0.15, one person may generate two clusters corresponding to the two shoulders, so that the number of detected objects is sometimes larger than the true number. To correctly track objects under occlusion, the method in [32] installs multiple stereo vision systems with different views to produce an integrated ground plane image, which demands much higher hardware costs than our system. In addition, the system

calibration and the registration of the ground plane images introduce a high computational cost. All these factors prevent the system in [32] from being used in online applications.

4.3. Outdoor multi-object detection and tracking

In outdoor experiments, the camera system can be installed at a higher position without the limitation of a ceiling. Therefore the object


Fig. 9. Multi-object tracking under occlusion.

occlusion is reduced significantly and the surveillance area is enlarged as well. On the other hand, outdoor environmental effects, such as fast illumination variations and complex background, introduce significant noise into the disparity map, which causes problems for most surveillance systems, e.g. [31-33]. In Fig. 10, we monitor a building exit to track people walking in different directions, where the illumination changes as people walk towards the exit in the surveillance region. In the second row, the illumination variation increases the noise in the depth map and in the object projection area of the ground plane image. The method in [32] cannot differentiate

objects close to each other, and its correspondence-based tracker produces wrong tracking results due to the large displacement of the object. In our system, feature points on the background can be filtered out by the ROI restriction. The disparity based on feature points contains less error than that based on regions. Meanwhile, the projected points carrying disparity noise are dispersed over the whole surveillance region rather than assembled in a small region, which produces low densities in the probability space. Our updating scheme-based tracking strategy predicts the most probable position according to the object velocity in the previous frames. With these two


Fig. 10. Outdoor multi-object tracking.

Fig. 11. Complex multi-object tracking (small distance among objects).

advantages and the ellipse configuration (a = 0.7, b = 0.3), our result in the third row is more satisfactory than that of [32]. The fourth row is the result of a larger ellipse (a = 0.8, b = 0.4). Because the major axis length 0.8 is still less than the distance between objects, this setting does not merge two adjacent objects and obtains the same result as the one with a = 0.7, b = 0.3. Most object detection and tracking algorithms are sensitive to the distance between objects, i.e., they may regard different objects as a single one when they are close to each other. The first row of Fig. 11 shows two adjacent objects in the surveillance scene. The

system of [32] can differentiate the first pair of objects but is unable to separate the second pair in all frames and the third pair in most frames, due to the rather small distance between them. In our system, when the distance between adjacent objects is less than the length of the major axis, they will be grouped into one large cluster. Although the ellipse parameters in the third and fourth rows of Fig. 11 are different, the variances of the clusters produced by them are the same (see Eq. (5)). Similarly, the splitting algorithm defined in Eq. (7) can automatically detect the large cluster and divide it into the same two small clusters for the two different


Fig. 12. Complex multi-object tracking (pedestrians with umbrellas).

ellipse configurations. These smaller clusters have different centers but the same orientation (see the third and fourth rows of Fig. 11(b)-(d)). Note that in Fig. 11(e), the objects' feature points naturally form two separate clusters without splitting and have independent orientations, which shows the robustness and accuracy of our system.
To summarize these experimental results with different parameter settings, the ellipse describing the object's spatial shape should be close to its average size. For example, the settings (a = 0.6, b = 0.2), (a = 0.7, b = 0.3), and (a = 0.8, b = 0.4) obtain the same results. In addition, an extreme setting (a = 0.5, b = 0.15) in the fifth row of Fig. 8 still works well. In this experiment, the tracked objects pass in front of the camera, and only one shoulder and the head can be detected; the distance between them is less than 0.5, so their points are grouped into one cluster. Fig. 9 has a different setting, with people walking towards the camera. There, the smaller ellipse often regards the two shoulders of a person as two independent objects, which leads to a wrong object count. In this condition, a bigger ellipse configuration can detect the object correctly. It may, however, group two objects together, and the splitting process will divide them when they are close to each other, see the third and fourth rows of Fig. 11. Meanwhile, a bigger ellipse groups more projected points into the cluster, and the splitting algorithm divides the large cluster into smaller ones, which increases the computational cost (see Eq. (2)) of efficient online tracking.
We further compare the system performance in more complex outdoor scenes, including serious illumination variations, large shadow areas, and heavy occlusion. In Fig. 12, the radical changes of the outdoor illumination generate noise regions in the ground plane image. In addition, the umbrellas held by pedestrians may cover most of the human bodies, and an umbrella's spatial distribution is much wider than that of a person from the top view. In such a challenging case, a single umbrella is sometimes incorrectly divided into different objects by both the method of [32] and our system, e.g. the first frame of the second row and the fourth frame of the third row. As

Table 1. Quantitative tracking accuracy measures.

Condition                    Rate of accuracy (%)
Two directions (3 ft)        95.08
Two directions (1.5 ft)      91.67
Multiple directions          89.29
Under umbrella               80.95

*3 ft and 1.5 ft denote the distance between adjacent people.

shown in the second row, these fake objects drift far away from the true object as the tracking continues, due to the attraction of noise regions or other nearby objects. In our system, calculating the disparities of feature points generates less noise than computing the whole depth map. In addition, the noise regions contain fewer feature points than the threshold, so they are filtered out before deriving the object trajectories. Meanwhile, as the tracking continues, the fake objects gradually merge together due to the change of the angle to the camera (see the last frame of the third row of Fig. 12). Besides the above qualitative comparison, we also quantitatively evaluate the system performance in different conditions, including queues of people walking in two opposite directions, people walking in multiple directions, and people under umbrellas. Table 1 lists the objective measures, with the rate of accuracy computed as the ratio of the number of correctly tracked persons to the number of all people walking in the scenes.

5. Summary

In this paper we present a new stereo vision-based model for multi-object detection and tracking, which is based on object


feature point extraction and clustering. Feature points are extracted in the 3D camera coordinate system and then projected onto the ground plane. The height values and locations of the projected points are used to generate object clusters by a novel kernel-based algorithm, which determines the number, position, and orientation of the objects in the scene. Specifically, our feature point-based clustering algorithm is less sensitive to illumination variation than color-based approaches. In addition, by projecting the feature points from the 3D camera system onto the 2D ground plane, the shadow effect is overcome by filtering out the feature points with small height values. Meanwhile, with the top view, object occlusion is prevented. Lastly, the computational cost of our method is much lower than that of traditional depth map-based stereo vision systems, which enables an online multi-object detection and tracking system. Both indoor and outdoor experiments in complex environments show the improved performance of our system compared with a monocular method [9] and a state-of-the-art method [32]. In the future, we will make efforts to further enhance the system robustness, for example by correctly detecting completely overlapped objects (e.g. people under umbrellas) and by re-estimating the ground plane when the cameras are shaken by external forces (e.g. strong wind).

Appendix A. Supplementary material

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.patcog.2010.06.012.

References
[1] D. Lee, Effective Gaussian mixture learning for video background subtraction, IEEE Trans. Pattern Anal. Mach. Intell. 27 (5) (2005) 827-832.
[2] Y. Sheikh, M. Shah, Bayesian modeling of dynamic scenes for object detection, IEEE Trans. Pattern Anal. Mach. Intell. 27 (11) (2005) 1778-1792.
[3] C. Papageorgiou, M. Oren, T. Poggio, A general framework for object detection, in: International Conference on Computer Vision, 1998, pp. 552-562.
[4] P. Viola, M. Jones, D. Snow, Detecting pedestrians using patterns of motion and appearance, in: International Conference on Computer Vision, 2003, pp. 734-741.
[5] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 886-893.
[6] M. Isard, A. Blake, Condensation - conditional density propagation for visual tracking, Int. J. Comput. Vision 29 (1) (1998) 5-28.
[7] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, IEEE Trans. Pattern Anal. Mach. Intell. 25 (5) (2003) 564-575.
[8] P. Nikos, D. Rachid, Geodesic active contours and level sets for the detection and tracking of moving objects, IEEE Trans. Pattern Anal. Mach. Intell. 22 (3) (2000) 266-280.
[9] K. Zimmermann, J. Matas, T. Svoboda, Tracking by an optimal sequence of linear predictors, IEEE Trans. Pattern Anal. Mach. Intell. 31 (4) (2009) 677-692.
[11] F. Moreno-Noguer, A. Sanfeliu, D. Samaras, A target dependent colorspace for robust tracking, in: Proceedings of the International Conference on Pattern Recognition, 2006, pp. 43-46.
[12] R. Collins, Y. Liu, M. Leordeanu, On-line selection of discriminative tracking features, IEEE Trans. Pattern Anal. Mach. Intell. 27 (10) (2005) 1631-1643.
[13] E. Ozyildiz, N. Krahnstover, R. Sharma, Adaptive texture and color segmentation for tracking moving objects, Pattern Recognition 35 (10) (2002) 2013-2029.
[14] V. Takala, M. Pietikäinen, Multi-object tracking using color, texture and motion, in: IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1-7.
[15] T. Horprasert, D. Harwood, L.S. Davis, A statistical approach for real-time robust background subtraction and shadow detection, in: International Conference on Computer Vision, 1999, pp. 1-19.

[16] I. Mikic, P.C. Cosman, G.T. Kogut, M.M. Trivedi, Moving shadow and object detection in traffic scenes, in: Proceedings of the International Conference on Pattern Recognition, 2000, p. 1321.
[17] J. Stauder, R. Mech, J. Ostermann, Detection of moving cast shadows for object segmentation, IEEE Trans. Multimedia 1 (1) (1999) 65-76.
[18] A. Prati, I. Mikic, M.M. Trivedi, R. Cucchiara, Detecting moving shadows: algorithms and evaluation, IEEE Trans. Pattern Anal. Mach. Intell. 25 (7) (2003) 918-923.
[19] S. Nadimi, B. Bhanu, Physical models for moving shadow and object detection in video, IEEE Trans. Pattern Anal. Mach. Intell. 26 (8) (2004) 1079-1087.
[20] N. Paragios, R. Deriche, Detecting multiple moving targets using deformable contours, in: Proceedings of the International Conference on Image Processing, 1997, pp. 26-29.
[21] N. Paragios, R. Deriche, A PDE-based level set approach for detection and tracking of moving objects, in: International Conference on Computer Vision, 1998, pp. 1139-1145.
[22] J. MacCormick, A. Blake, A probabilistic exclusion principle for tracking multiple objects, Int. J. Comput. Vision 39 (1) (2000) 57-71.
[23] J.K. Wolf, A.M. Viterbi, G.S. Dixson, Finding the best set of K paths through a trellis with application to multitarget tracking, IEEE Trans. Aerosp. Electron. Syst. 25 (2) (1989) 287-296.
[24] H. Jiang, S. Fels, J. Little, Optimizing multiple object tracking and best view video synthesis, IEEE Trans. Multimedia 10 (6) (2008) 997-1012.
[25] Y. Li, C. Huang, R. Nevatia, Learning to associate: hybridboosted multi-target tracker for crowded scene, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 2953-2960.
[26] R. Pinho, J. Tavares, Tracking features in image sequences with Kalman filtering, global optimization, Mahalanobis distance and a management model, Comput. Modeling Eng. Sci. 46 (1) (2009) 51-75.
[27] S. Dockstader, T.A. Murat, Multiple camera tracking of interacting and occluded human motion, in: Proceedings of the IEEE, 2001, pp. 1441-1455.
[28] A. Mittal, L. Davis, M2Tracker: a multi-view approach to segmenting and tracking people in a cluttered scene using region-based stereo, in: European Conference on Computer Vision, 2002, pp. 18-36.
[29] S. Khan, M. Shah, A multiview approach to tracking people in crowded scenes using a planar homography constraint, in: European Conference on Computer Vision, 2006, pp. 133-146.
[30] J. Berclaz, F. Fleuret, P. Fua, Multi-camera tracking and atypical motion detection with behavioral maps, in: European Conference on Computer Vision, 2008, pp. 112-125.
[31] T. Darrell, G. Gordon, M. Harville, J. Woodfill, Integrated person tracking using stereo, color, and pattern detection, Int. J. Comput. Vision 37 (2) (2000) 175-185.
[32] T. Darrell, D. Demirdjian, N. Checka, P. Felzenszwalb, Plan-view trajectory estimation with dense stereo background models, in: International Conference on Computer Vision, 2001, pp. 628-635.
[33] X. Huang, L. Li, T. Sim, Stereo-based human head detection from crowd scenes, in: Proceedings of the International Conference on Image Processing, 2004, pp. 1353-1356.
[34] K. Zimmermann, T. Svoboda, J. Matas, Multi-view 3D tracking with an incrementally constructed 3D model, in: Proceedings of the Third International Symposium on 3D Data Processing, 2006, pp. 14-16.
[35] M. Mozerov, I. Rius, X. Roca, J. González, Nonlinear synchronization for automatic learning of 3D pose variability in human motion sequences, EURASIP Journal on Advances in Signal Processing 2010 (2010), Article ID 507247, 10 pages.
[36] Z. Zhang, A flexible new technique for camera calibration, IEEE Trans. Pattern Anal. Mach. Intell. 22 (11) (2000) 1330-1334.
[37] A. Fusiello, E. Trucco, A. Verri, A compact algorithm for rectification of stereo pairs, Mach. Vision Appl. 12 (1) (2000) 16-22.
[38] S. Birchfield, C. Tomasi, Depth discontinuities by pixel-to-pixel stereo, Int. J. Comput. Vision 35 (3) (1999) 269-293.
[39] J. Kim, V. Kolmogorov, R. Zabih, Visual correspondence using energy minimization and mutual information, in: International Conference on Computer Vision, 2003, pp. 1033-1040.
[40] J. Lobo, L. Almeida, J. Alves, J. Dias, Registration and segmentation for 3D map building: a solution based on stereo vision and inertial sensors, in: Proceedings of the IEEE International Conference on Robotics and Automation, 2003, pp. 139-144.
[41] J. Lobo, J. Dias, Inertial sensed ego-motion for 3D vision, J. Robotics Syst. 21 (1) (2004) 3-12.
[42] K. Fukunaga, L.D. Hostetler, The estimation of the gradient of a density function, with applications in pattern recognition, IEEE Trans. Inf. Theory 22 (1) (1975) 32-40.
[43] Y. Cheng, Mean shift, mode seeking, and clustering, IEEE Trans. Pattern Anal. Mach. Intell. 17 (8) (1995) 790-799.
[44] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Trans. Pattern Anal. Mach. Intell. 24 (5) (2002) 603-619.
[45] R. Collins, Mean-shift blob tracking through scale space, in: IEEE Conference on Computer Vision and Pattern Recognition, 2003, pp. 234-240.

Ling Cai is now a Ph.D. candidate in the School of Electronic Information and Electrical Engineering at Shanghai Jiao Tong University, Shanghai, China. His research interests are image processing and computer vision, including image segmentation, visual tracking and stereo vision.


Lei He received his Ph.D. degree in Electrical Engineering from the University of Cincinnati in 2003 and a master's degree in Computer Engineering from the Chinese Academy of Sciences. He is an associate professor at Armstrong Atlantic State University and is currently visiting NIH as a research scholar. His research interests include image and video processing and analysis, computer vision and pattern recognition, and intelligent information systems. He has published over forty refereed journal and conference papers, secured a patent with Hewlett-Packard, and implemented a number of industry and medical projects. He served as a research consultant with Siemens Corporate Research.

Yiren Xu is now a master candidate in the Institute of Image Processing and Pattern Recognition at Shanghai Jiao Tong University. His research interests include PDE-based image processing, computer vision and neural networks.

Yuming Zhao received his Ph.D. degree from the Institute of Image Processing and Pattern Recognition at Shanghai Jiao Tong University, where he is now an assistant professor. His research interests include stereo vision and image enhancement.

Xin Yang is a professor in the Institute of Image Processing and Pattern Recognition at Shanghai Jiao Tong University. His research activities focus on medical image analysis and PDE-based image processing. He has a Ph.D. in Electronic Engineering from the Free University of Brussels ETRO/VUB.
