Chu-Song Chen
I. INTRODUCTION
Pedestrian detection from images is an active research area that has become increasingly popular over the last decade. A robust pedestrian detector is a key component in fields such as automotive safety [1], robotics [2], and visual surveillance [3]. The pedestrian detection problem is regarded as a canonical object detection task, whose main goal is to differentiate human bodies from backgrounds. Detecting humans in various scenes is challenging because humans can take many postures and may be partially occluded. Furthermore, the shapes of several human-like objects on the street (such as postboxes and traffic lights) can be confused with human bodies due to the large ambiguity between these two types of objects, making the problem even more challenging.
Deep convolutional neural networks (CNNs) have received considerable attention in recent years and have shown their effectiveness for object recognition. Deep CNNs can learn discriminative feature representations from raw RGB pixels in an end-to-end manner, unifying feature extraction and classification in a single learning model. Deep CNNs have been widely used in image classification [4], [5], object detection and localization [6], [7], and segmentation [8]. Several deep architectures have been proposed for general object recognition tasks, e.g., AlexNet [5], GoogLeNet [9], and VGG [10]. GoogLeNet has achieved great success on both the image-classification and object-detection
[Fig. 1 (figure content): the GoogLeNet building blocks: (a) early layers applying 7x7, 3x3, and 1x1 convolutions and 3x3 max pooling to a 224x224 input image; (b) an inception module with parallel 1x1, 3x3, and 5x5 convolutions and a 3x3 max pooling branch merged by a DepthConcat stage; (c) final layers with 5x5 average pooling, a 1x1 convolution, fully-connected layers (1024 and 1 outputs), and a sigmoid cross-entropy loss.]
[Fig. 2 (figure content): 224x224 ACF proposals are processed by an upper row and a lower row, each consisting of early, middle, and final layers, and the two outputs are combined by average fusion.]
Fig. 2. The overall architecture of Two Parallel Deep Convolutional Neural Networks (TPDCNN)
III. METHOD
In this section, we illustrate the concept and details of the proposed TPDCNN. Before introducing our method, a brief review of GoogLeNet is given. First, an image fed into the network goes through the early layers shown in Figure 1(a), where early representations are formed through a sequence of 7x7, 1x1, and 3x3 convolutions and max pooling operators. The middle layers, built from a repetitive structure called the inception module, are then concatenated with the early layers. An inception module applies several parallel 1x1, 3x3, and 5x5 convolutions and max pooling, summarized by a depth concatenation stage, as shown in Figure 1(b); the depth concatenation output is sent to the next inception. The inception serves as a basic unit module for extracting deep-representation features, and 9 inceptions are used in GoogLeNet. The final layers (shown in Figure 1(c)) integrate the extracted deep features via average pooling and a fully-connected layer, and linear classifiers are then constructed for the classification. The number of learnable parameters (i.e., weights) in a neural network is a pivotal factor: if this number is too large for the limited training data available, it may degrade the performance of the network. Compared to the AlexNet model with almost 60 million parameters, GoogLeNet reduces the learned parameters to approximately 5 million while using a much deeper network.
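To make the inception structure concrete, the following is a minimal PyTorch sketch of an inception-style block, assuming illustrative channel sizes rather than GoogLeNet's exact configuration: four parallel branches (1x1, 3x3, and 5x5 convolutions plus a 3x3 max pooling branch) whose outputs are merged by depth concatenation.

    import torch
    import torch.nn as nn

    class InceptionSketch(nn.Module):
        """Inception-style block: parallel 1x1, 3x3, and 5x5 convolutions and a
        3x3 max pooling branch, merged by depth concatenation (DepthConcat).
        Channel sizes are placeholders, not GoogLeNet's exact values."""

        def __init__(self, in_ch):
            super().__init__()
            self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)
            self.branch3 = nn.Sequential(
                nn.Conv2d(in_ch, 96, kernel_size=1),           # 1x1 reduction
                nn.Conv2d(96, 128, kernel_size=3, padding=1))  # 3x3 convolution
            self.branch5 = nn.Sequential(
                nn.Conv2d(in_ch, 16, kernel_size=1),           # 1x1 reduction
                nn.Conv2d(16, 32, kernel_size=5, padding=2))   # 5x5 convolution
            self.branch_pool = nn.Sequential(
                nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
                nn.Conv2d(in_ch, 32, kernel_size=1))           # 1x1 projection

        def forward(self, x):
            # All branches preserve the spatial size, so their feature maps can
            # be concatenated along the channel (depth) dimension.
            return torch.cat([self.branch1(x), self.branch3(x),
                              self.branch5(x), self.branch_pool(x)], dim=1)

    # Example: a 192-channel 28x28 feature map becomes 64+128+32+32 = 256 channels.
    out = InceptionSketch(192)(torch.randn(1, 192, 28, 28))  # shape (1, 256, 28, 28)

Stacking several such blocks, each followed by its depth concatenation, yields the repetitive middle-layer structure described above.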
The original GoogLeNet uses a softmax layer to classify the 1000 categories of ImageNet. Pedestrian detection needs only two classes, pedestrian and background, and is thus a binary classification problem. In our work, we therefore replace the softmax layer with a sigmoid cross-entropy loss in the final linear classifier layer. Nevertheless, using the whole
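In implementation terms, this head replacement amounts to a single-output fully-connected layer trained with a sigmoid cross-entropy (binary cross-entropy with logits) loss. The following is a minimal PyTorch sketch, assuming a 1024-dimensional feature vector from the final fully-connected stage (the 1024 size follows Figure 1(c); the feature batch and labels are dummy data for illustration only):

    import torch
    import torch.nn as nn

    feature_dim = 1024                      # final FC feature size (see Fig. 1(c))
    classifier = nn.Linear(feature_dim, 1)  # one logit: pedestrian vs. background
    criterion = nn.BCEWithLogitsLoss()      # sigmoid + cross-entropy in one step

    features = torch.randn(8, feature_dim)  # 8 proposal features (dummy data)
    labels = torch.tensor([[1.], [0.], [1.], [0.], [0.], [1.], [0.], [0.]])  # 1 = pedestrian

    loss = criterion(classifier(features), labels)
    loss.backward()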
TABLE I
Test MR with different numbers of inceptions in the upper and lower rows

Networks     # of inceptions    Test MR
Upper row    2                  27.31%
Upper row    3                  29.52%
Lower row    2                  29.84%
Lower row    3                  25.38%
TABLE II
Test MR of different architectures, with (Y) and without (N) bounding-box (BBox) regression

BBox Regression    Test MR
N                  29.76%
N                  36.66%
N                  27.31%
Y                  24.55%
N                  25.38%
Y                  24.73%
N                  21.25%
Y                  19.57%

TABLE III
Comparison with other deep-learning-based pedestrian detection methods

Networks                 Test MR
ConvNet [25]             77.20%
DBN-Isol [26]            53.14%
DBN-Mut [27]             48.22%
MultiSDP [29]            45.39%
JointDeep [28]           39.32%
SDN [30]                 37.87%
LFOV [32]                35.85%
DeepCascade [33]         31.11%
DeepCascade+ [33]        26.21%
R-CNN (AlexNet) [11]     23.30%
TA-CNN [31]              20.86%
TPDCNN                   19.57%
[Figure (plot content): miss rate curves on a log-log scale; legend: 77.20% ConvNet, 53.14% DBN-Isol, 48.22% DBN-Mut, 45.39% MultiSDP, 39.32% JointDeep, 37.87% SDN, 35.85% LFOV, 31.11% DeepCascade, 26.21% DeepCascade+, 23.32% AlexNet+ImageNet, 20.86% TA-CNN, 19.57% TPDCNN.]

A. Dataset
B. Evaluations of TPDCNN
In this section, we present the performance of the proposed TPDCNN and compare it with other deep learning methods for pedestrian detection. We first show how the concise models are selected from GoogLeNet to form the two parallel deep CNNs. Table I shows the performance with different numbers of inceptions for the upper row and the lower row individually. The miss rates (MR) of the upper row are close when the number of inceptions in the middle layers is 2 or 3. However, as the lower row is assigned to deal with more hard negative samples, using more inceptions (here, 3) is advantageous for extracting more discriminative features.
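As a rough illustration of how the two selected rows can be combined at test time (this is our own sketch of the average-fusion scheme in Figure 2, not the exact released implementation), each 224x224 ACF proposal is scored independently by the upper-row and lower-row networks, and the two sigmoid scores are averaged:

    import torch

    def tpdcnn_score(crop_224, upper_row, lower_row):
        # crop_224: one 224x224 proposal crop as a (1, 3, 224, 224) tensor.
        # upper_row / lower_row: the two fine-tuned sub-networks, each returning
        # a single pedestrian-vs-background logit.
        with torch.no_grad():
            s_up = torch.sigmoid(upper_row(crop_224))
            s_low = torch.sigmoid(lower_row(crop_224))
        return 0.5 * (s_up + s_low)  # fused detection score for this proposal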
Next, we show the detection quality of different deep architectures and the proposed TPDCNN in Table II. The original GoogLeNet structure is not promising for pedestrian detection compared to the other structures (even when fine-tuning is performed), and it is even worse than the ACF proposals. This shows that the original, deeper GoogLeNet is not a proper architecture for pedestrian detection. Our upper-row model achieves a more favorable performance than
[16] X. Wang, T. Han, and S. Yan, "An HOG-LBP human detector with partial occlusion handling," in ICCV, 2009.
[17] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," PAMI, 2010.
[18] Z. Lin and L. Davis, "Shape-based human detection and segmentation via hierarchical part-template matching," PAMI, 2010.
[19] L. Zhu, Y. Chen, and A. Yuille, "Learning a hierarchical deformable template for rapid deformable object parsing," PAMI, 2010.
[20] P. Dollar, Z. Tu, P. Perona, and S. Belongie, "Integral channel features," in BMVC, 2009.
[21] P. Dollar, R. Appel, and W. Kienzle, "Crosstalk cascades for frame-rate pedestrian detection," in ECCV, 2012.
[22] P. Dollar, R. Appel, S. Belongie, and P. Perona, "Fast feature pyramids for object detection," PAMI, 2014.
[23] W. Ke, Y. Zhang, P. Wei, Q. Ye, and J. Jiao, "Pedestrian detection via PCA filters based convolutional channel features," in ICASSP, 2015.
[24] S. Zhang, R. Benenson, and B. Schiele, "Filtered channel features for pedestrian detection," in CVPR, 2015.
[25] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, "Pedestrian detection with unsupervised multi-stage feature learning," in CVPR, 2013.
[26] W. Ouyang and X. Wang, "A discriminative deep model for pedestrian detection with occlusion handling," in CVPR, 2012.
[27] W. Ouyang, X. Zeng, and X. Wang, "Modeling mutual visibility relationship with a deep model in pedestrian detection," in CVPR, 2013.
[28] W. Ouyang and X. Wang, "Joint deep learning for pedestrian detection," in ICCV, 2013.
[29] X. Zeng, W. Ouyang, and X. Wang, "Multi-stage contextual deep learning for pedestrian detection," in ICCV, 2013.
[30] P. Luo, Y. Tian, X. Wang, and X. Tang, "Switchable deep network for pedestrian detection," in CVPR, 2014.
[31] Y. Tian, P. Luo, X. Wang, and X. Tang, "Pedestrian detection aided by deep learning semantic tasks," in CVPR, 2015.
[32] A. Angelova, A. Krizhevsky, and V. Vanhoucke, "Pedestrian detection with a large-field-of-view deep network," in ICRA, 2015.
[33] A. Angelova, A. Krizhevsky, V. Vanhoucke, A. Ogale, and D. Ferguson, "Real-time pedestrian detection with deep network cascades," in BMVC, 2015.
[34] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders, "Selective search for object recognition," IJCV, 2013.
[35] P. Dollar, "Piotr's Computer Vision Matlab Toolbox (PMT)," http://vision.ucsd.edu/pdollar/toolbox/doc/index.html.
[36] Princeton Vision Group, "A GPU implementation of GoogLeNet," http://vision.princeton.edu/pvt/GoogLeNet.
[37] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in ACM MM, 2014.