applications
Abstract
Autonomous driving requires reliable and accurate detection and recognition of surrounding objects in real drivable environments. Although different object detection algorithms have been proposed, not all are robust enough to detect and recognize occluded or truncated objects. In this paper, we propose a novel hybrid Local Multiple system (LM-CNN-SVM) based on Convolutional Neural Networks (CNNs) and Support Vector Machines (SVMs), chosen for their powerful feature extraction capability and robust classification property, respectively. In the proposed system, we first divide the whole image into local regions and employ multiple CNNs to learn local object features. Secondly, we select discriminative features by using Principal Component Analysis. We then feed these features into multiple SVMs, which apply both empirical and structural risk minimization, instead of using a CNN directly as the classifier, in order to increase the generalization ability of the classifier system. Finally, we fuse the SVM outputs. In addition, we use both the pre-trained AlexNet and a new CNN architecture. We carry out object recognition and pedestrian detection experiments on the Caltech-101 and Caltech Pedestrian datasets. Comparisons to the best state-of-the-art methods show that the proposed system achieved better results.
Keywords
Convolutional Neural Networks, Support Vector Machines, object recognition, pedestrian detection
discriminative data obtained from some interval process steps is classified using a classifier.

In recent years, deep learning methods have emerged as powerful machine learning methods for object recognition and detection [20–25]. Deep learning methods differ from traditional approaches in that they automatically and quickly learn the features directly from the raw pixels of the input images without using hand-crafted descriptors such as SIFT, HOG, and SURF [20]. In deep learning methods, local receptive fields grow in a layer-by-layer manner. The low-level layers extract fine features, such as lines, borders, and corners, while high-level layers exhibit higher-level features, such as object portions, like pedestrian parts, or the whole object, like cars and traffic signs. In other words, they allow an object to be represented at different granularities end-to-end. Their success has been demonstrated on the challenging ImageNet classification task across thousands of classes [23,24] by a kind of deep neural network called a Convolutional Neural Network (CNN). It has been shown that CNNs outperform classifiers that use conventional feature extraction methods [25–29]. However, the global features extracted using CNNs are significantly affected by illumination, noise, or occlusion when a CNN is applied to the complete image. In this paper, we therefore propose a new system using both the CNN and the SVM to address these challenges. Recently, some works have combined the CNN and the SVM in different ways [30–32]. For example, one study carried out robust face detection using local CNNs and SVMs based on kernel combination [30]. Scene recognition was realized by collecting the features from different layers of CNNs and feeding them to an SVM [31]. Another study [32] proposed a novel single CNN–SVM classifier for recognizing handwritten digits. Different from these works, however, in this paper, we propose to determine local robust features by multiple CNNs defined for local regions of objects and then to use the SVM to recognize and detect all objects. Thus, the proposed hybrid Local Multiple CNN-SVM (LM-CNN-SVM) system can extract more robust and efficient features than a single CNN. In addition, we deploy two CNN architectures. One of them is a pre-trained AlexNet architecture [22] with eight layers, excluding an input layer, and the other is a CNN architecture with nine layers, excluding an input layer, similar to AlexNet. AlexNet [22] has an architecture with five convolutional layers and three fully connected layers. We prefer this network, since it was applied successfully to the ImageNet dataset for object recognition tasks [23]. We perform detailed experiments to evaluate the proposed CNNs. This paper is an extended version of our previous conference paper [33].

The rest of this paper is organized as follows. In Section 2, the principles of the SVM, the CNN, and the hybrid LM-CNN–SVM system are described. Section 3 illustrates and discusses the experimental results. In Section 4, the paper is concluded.

2. The new hybrid Local Multiple Convolutional Neural Network–Support Vector Machine system

This section presents a brief review of the CNN and SVM classifiers used to build the new hybrid LM-CNN-SVM classifier.

2.1. Convolutional Neural Networks

The basic structure of CNNs is inspired by the organization of the animal visual cortex. Hubel and Wiesel presented a work about "local receptive fields" in 1968 [34]. They showed that the animal visual cortex has complex cell arrangements in small sub-regions of the visual field. The first ideas relating to CNNs were given by Fukushima in 1980 [35], who used the concepts of Hubel and Wiesel [34] to build hierarchically organized image transformations that simulate the human visual system. Unlike conventional CNNs, however, his model does not contain shared weights. In the early 1990s, CNNs started to appear in the literature, with the disadvantage of a heavy computational load [36,37]. Nowadays, with the advent of powerful Graphics Processing Units (GPUs), CNNs have gained popularity as both a powerful feature extractor and a classifier. In a short time, CNNs have been successfully applied to many application areas, such as autonomous vehicles, speech recognition, and medical imaging tasks [38–41].

CNNs consist of multiple layers similar to feed-forward neural networks. The outputs and inputs of the layers are given as a set of image arrays. CNNs can be constructed by different combinations of convolutional layers, pooling layers, and fully connected layers with point-wise nonlinear activation functions. A typical CNN architecture is shown in Figure 1.

The layer definitions of CNNs are briefly described as follows.

Input layer: images are directly imported to the input of the network.

Convolutional layer: this layer performs the main workhorse operation of the CNN [20,21]. In CNNs, the convolution is used instead of the matrix multiplication in conventional feed-forward neural networks so as to decrease the number of weights and thereby the network complexity. The input image is convolved with the kernels, or learnable filters. Each convolution produces a feature map as its output. The obtained feature maps are imported to the input of the next convolutional layer.

Pooling layer: this layer is used to reduce the feature dimension. Thus, the resolution of the feature maps is reduced and spatial invariance is achieved. The input images are partitioned into a set of non-overlapping rectangles. Each region is down-sampled by a nonlinear down-sampling operation, such as maximum or average. The
Uçar et al. 761
layer achieves faster convergence, better generalization, and a small degree of invariance to translation and distortion. Pooling layers are usually located between successive convolutional layers to reduce the spatial size.

Rectified Linear Units (ReLU) layer: this layer includes units employing the rectifier to achieve scale invariance. The activation function of this layer is mathematically described as $f(x) = \max(0, x)$ for an input $x$. Thanks to this function, the gradient does not saturate in the positive region, where it stays constant and non-zero, which increases the accuracy of the CNN. Moreover, the computation is realized by a simple threshold and is faster than the equivalents with a sigmoid function and a tanh function. However, the derivative of the activation function is not defined at the origin. In practice, a smoothed approximation $f(x) = \ln(1 + e^{x})$ can be used instead, whose derivative $1/(1 + e^{-x})$ is the logistic function.

Fully Connected layer: this layer is connected after several convolutional, max pooling, and ReLU layers. This layer is similar to the layers in conventional feed-forward neural networks. Its neurons are fully connected to all activations in the former layer. The layer is considered a final feature selecting layer. The outputs are calculated by means of matrix multiplication and bias addition. Moreover, as with conventional feed-forward neural networks, the weights of these layers are estimated by minimizing only the training error.

Loss layer: a loss function is applied to measure the discrepancy between the prediction of the CNN and the real target at the final layer of a CNN. There are several loss functions. Euclidean loss can be used in a real-valued regression problem. Softmax loss is used to predict a single class out of K mutually exclusive classes. Cross-entropy loss is commonly deployed to predict K independent probability outputs in the range between 0 and 1 in classification problems.

CNNs are trained using Stochastic Gradient Descent [20]. Firstly, input data are propagated in the feed-forward direction through the different layers. Secondly, the output values are calculated after the digital filters extract salient features at each layer. Thirdly, the error between the target output and the network output is computed and minimized by back-propagation. The weights of the CNN are then further adjusted to fine-tune them. In this way, the CNN learns end-to-end a direct mapping from the raw input image data to the target class without prior knowledge or human intervention.

2.2. Support Vector Machines

Given $L$ samples of training data $(x_1, y_1), \ldots, (x_L, y_L)$, $x \in \mathbb{R}^n$, $y \in \{-1, 1\}$, the SVM constructs an optimal separating surface as $y_i[w^T \varphi(x_i) + b] \ge 1$, where $\varphi(\cdot)$ is a linear/nonlinear mapping function, $w \in \mathbb{R}^n$ represents the normal vector, and $b \in \mathbb{R}$ determines a measure for the offset of the separating hyperplane. An optimal separating surface is obtained by maximizing the minimum distance of the points closest to the hyperplanes of the two classes, $2/\|w\|$, called the margin. In addition, the SVM formulation allows for misclassified intra-margin training examples, $y_i[w^T \varphi(x_i) + b] \ge 1 - \xi_i$. The SVM formulation with an inequality constraint aims to minimize both the training error and the generalization error, as follows:

\[ \min_{w,\,b,\,\xi} L_{\text{primal}}(w, \xi_i) = C \sum_{i=1}^{L} \xi_i + \frac{1}{2}\|w\|^2 \tag{1} \]

\[ \text{subject to } y_i[w^T \varphi(x_i) + b] \ge 1 - \xi_i, \quad \xi_i \ge 0 \tag{1a} \]

where the first term, the sum of absolute errors $\xi_i$, indicates the measures of the distances between the intra-margin misclassified training samples and the separating hyperplane. The second term defines the margin, and $C$ is a parameter that controls the penalty incurred by each misclassified point in the training set. Generally, larger $C$ values yield an SVM with a smaller margin and better training accuracy. On the other hand, relatively smaller C
762 Simulation: Transactions of the Society for Modeling and Simulation International 93(9)
values generate a larger margin and better generalization accuracy.

The Lagrange multiplier method is first applied to the primal problem in (1), as follows [12,13]:

\[ \hat{L}_{\text{primal}}(w, b, \xi, \lambda, \beta) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{L} \xi_i - \sum_{i=1}^{L} \lambda_i \{ y_i[w^T \varphi(x_i) + b] - 1 + \xi_i \} - \sum_{i=1}^{L} \beta_i \xi_i \tag{2} \]

where $\lambda_i \ge 0$ and $\beta_i \ge 0$ are Lagrange multipliers. The unconstrained problem in (2) is solved by minimizing it with respect to $w$, $b$, and $\xi_i$ and by maximizing it with respect to $\lambda_i$ and $\beta_i$, as follows:

\[ \frac{\partial \hat{L}_{\text{primal}}}{\partial w} = 0 \;\rightarrow\; w = \sum_{i=1}^{L} \lambda_i y_i \varphi(x_i) \tag{3} \]

\[ \frac{\partial \hat{L}_{\text{primal}}}{\partial b} = 0 \;\rightarrow\; \sum_{i=1}^{L} \lambda_i y_i = 0 \tag{4} \]

\[ \frac{\partial \hat{L}_{\text{primal}}}{\partial \xi_i} = 0 \;\rightarrow\; \lambda_i + \beta_i = C \tag{5} \]

and applying Karush–Kuhn–Tucker optimality conditions:

\[ \lambda_i \{ y_i[w^T \varphi(x_i) + b] - 1 + \xi_i \} = 0 \tag{6} \]

\[ \beta_i \xi_i = 0 \;\rightarrow\; 0 \le \lambda_i \le C \quad \text{for } \lambda_i \ge 0,\ \beta_i \ge 0,\ \xi_i \ge 0,\ i = 1, \ldots, L \tag{7} \]

By expressing all primal variables in terms of $\lambda_i$ using (3)–(5), and by introducing the kernel function $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$ (assumed to be symmetric and positive definite), the following quadratic optimization problem, called the dual form of (2), is obtained. Substituting (3) into (2), the terms in $b$ vanish by (4) and the terms in $\xi_i$ vanish by (5), which leaves:

\[ \max_{\lambda} L_{\text{dual}}(\lambda) = -\frac{1}{2} \sum_{i,j=1}^{L} \lambda_i \lambda_j y_i y_j K(x_i, x_j) + \sum_{i=1}^{L} \lambda_i \tag{8} \]

\[ \text{subject to } \sum_{i=1}^{L} y_i \lambda_i = 0, \quad 0 \le \lambda_i \le C, \quad i = 1, \ldots, L \tag{8a} \]

By solving the dual problem (8) for $\lambda_i$ with quadratic programming, the separating hyperplane of the SVM is constructed as follows:

\[ f(x) = \operatorname{sign}\Big( \sum_{\text{support vectors}} y_i \lambda_i K(x_i, x) + b \Big) \tag{9} \]

where the support vectors are the data points $x_i$ corresponding to $\lambda_i > 0$ and $K(x_i, x_j)$ is a kernel function. It is defined as $K(x_i, x_j) = x_i^T x_j$ for the linear basis and $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$ for the radial basis, where $\sigma$ is the spread parameter of the kernel.

2.3. Proposed Local Multiple CNN-SVM system

In this paper, we propose to use a hybrid LM-CNN system instead of a single CNN system for learning the salient features relating to the whole object. We first divide the whole image into local regions and apply LM-CNNs for extracting discriminative features of local regions. The CNN training is based only on training error minimization, similar to that of feed-forward neural networks. Feed-forward neural networks have a lower generalization performance than the SVM because the SVM minimizes both structural risk and empirical risk [13]. Hence, we replace the last output layer of a local CNN with an SVM classifier to compensate for the CNN's limitation. The fully connected layer of the CNN is expressed as a linear combination of the previous hidden layer outputs with weights and a bias term. Moreover, the outputs of this layer are given as the inputs to the last layer of the CNN. The CNN provides the class probabilities for each input image by using the softmax activation function. Our work, on the other hand, uses the outputs of a fully connected layer of a CNN as the SVM inputs. The SVM's generalization performance thus benefits from the features learned by the CNN, and the combination alleviates the limitations of each classifier used alone. The proposed approach is similar to that of feature fusion [40], but the feature extraction process of the proposed method is different from traditional learning methods.

Figure 2 illustrates the block schema of the proposed CNN detection and recognition algorithm. Figure 3 shows the structure of the hybrid LM-CNN-SVM system. The proposed algorithm is applied in eight steps as follows.

(1) All images are divided into local regions.
(2) Each patch is warped to a fixed size, namely 64 × 64 × 3, and converted to a gray image.
(3) Each patch is imported into the pre-trained AlexNet, shown in Table 1, and the proposed CNN, shown in Table 2.
(4) All networks are trained by stochastic gradient descent.
(5) The salient features obtained from the final fully connected layer of the networks are saved.
(6) Principal Component Analysis (PCA) is deployed to reduce the dimension of the saved features.
(7) SVM classifiers are applied to the reduced features. The one-against-rest method is used for solving multi-class classification problems.
(8) Decision fusion is used to combine the outputs of multi-class SVM classifiers.

Note that the pooling layer is used to select the features after convolutions at the CNN, while PCA is used for reducing them before importing the large features obtained from the CNN in this algorithm into the inputs of the
Layer              No.   Size-stride   Output maps
Input              -     64 × 64       1
Convolution        1     3 × 3-1       64
Convolution        2     3 × 3-1       64
Average Pooling    2     2 × 2-2       64
Convolution        3     3 × 3-1       96
Max Pooling        3     2 × 2-2       96
Convolution        4     5 × 5-1       256
Max Pooling        4     3 × 3-2       256
Convolution        5     3 × 3-1       384
Convolution        6     3 × 3-1       384
Convolution        7     3 × 3-1       256
Max Pooling        7     3 × 3-2       256
Fully connected    8     6 × 6-1       4096
Fully connected    9     1 × 1         1 × 1 × Class number
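As a rough illustration of why convolution keeps the number of weights small, the following sketch counts the learnable parameters of each layer listed in the table above. It rests on assumptions that the table does not state explicitly: that "k × k-s" denotes a k-by-k kernel applied with stride s, that the last column is the number of output feature maps, and that each layer takes all feature maps of the preceding layer as input. The value of `NUM_CLASSES` and the names `LAYERS`, `param_count`, and `summarize` are hypothetical and chosen only for this example.

```python
# Layers transcribed from the table above: (name, kernel size, input maps, output maps).
# Assumption: input maps of a layer equal the output maps of the previous row,
# and the input image is single-channel (gray scale).
NUM_CLASSES = 101  # hypothetical value for the "Class number" entry

LAYERS = [
    ("conv1", 3, 1, 64),
    ("conv2", 3, 64, 64),
    ("conv3", 3, 64, 96),
    ("conv4", 5, 96, 256),
    ("conv5", 3, 256, 384),
    ("conv6", 3, 384, 384),
    ("conv7", 3, 384, 256),
    ("fc8", 6, 256, 4096),
    ("fc9", 1, 4096, NUM_CLASSES),
]


def param_count(kernel, in_maps, out_maps):
    """Weights plus biases of one layer whose k-by-k kernels are shared spatially."""
    return kernel * kernel * in_maps * out_maps + out_maps


def summarize(layers):
    """Map each layer name to its number of learnable parameters."""
    return {name: param_count(k, cin, cout) for name, k, cin, cout in layers}


if __name__ == "__main__":
    counts = summarize(LAYERS)
    for name, n in counts.items():
        print(f"{name:6s} {n:>12,d}")
    print(f"total  {sum(counts.values()):>12,d}")
```

Under these assumptions, the first fully connected layer holds most of the weights, while every convolutional layer stays comparatively small because its kernels are shared across spatial positions.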
Figure 3. Structure of the proposed hybrid Local Multiple Convolutional Neural Network-Support Vector Machine (LM-CNN-SVM) system. PCA: Principal Component Analysis.
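The first and last stages of this structure, dividing an image into local regions and fusing the per-region classifier decisions, can be sketched as follows. This is a minimal illustration and not the authors' MATLAB implementation: the image is a plain nested list, the 3 × 3 grid and row-major patch order are assumptions, the per-patch labels stand in for the outputs of the local CNN-SVM classifiers, and the weights are hypothetical values rather than weights derived from any particular rule. Resizing, CNN feature extraction, PCA, and SVM training are deliberately omitted.

```python
from collections import defaultdict


def split_into_patches(image, grid=3):
    """Split a 2-D image (list of rows) into grid*grid non-overlapping patches.

    Patches are returned in row-major order. Integer boundaries are used so that
    sizes not divisible by `grid` (e.g. 64) are still covered completely.
    """
    height, width = len(image), len(image[0])
    patches = []
    for gi in range(grid):
        r0, r1 = gi * height // grid, (gi + 1) * height // grid
        for gj in range(grid):
            c0, c1 = gj * width // grid, (gj + 1) * width // grid
            patches.append([row[c0:c1] for row in image[r0:r1]])
    return patches


def weighted_majority_vote(labels, weights):
    """Fuse one predicted label per local classifier; each vote counts with its weight."""
    if len(labels) != len(weights):
        raise ValueError("labels and weights must have the same length")
    scores = defaultdict(float)
    for label, weight in zip(labels, weights):
        scores[label] += weight
    # Ties are broken in favour of the smallest label (an arbitrary choice).
    return min(scores, key=lambda lab: (-scores[lab], lab))


if __name__ == "__main__":
    # Toy 6 x 6 "image" whose pixel values encode their position.
    image = [[r * 6 + c for c in range(6)] for r in range(6)]
    patches = split_into_patches(image, grid=3)
    print(len(patches), "patches; first patch:", patches[0])

    # Hypothetical per-patch predictions and classifier weights.
    labels = [2, 2, 5, 5, 5, 2, 7, 2, 5]
    weights = [0.9, 0.8, 0.3, 0.4, 0.2, 0.7, 0.5, 0.6, 0.3]
    print("fused label:", weighted_majority_vote(labels, weights))
```

In the example, labels 2 and 5 each receive four votes, so an unweighted majority would be tied; the weights decide the outcome.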
linear SVM classifier is then trained by using these features for object detection.

3. Experiments

To evaluate the performance of the proposed LM-CNN-SVM system, we conducted experiments on two well-known datasets, Caltech-101 [45,46] and Caltech Pedestrian [47]. We carried out the experiments by using the MatConvNet and VLFeat toolboxes in MATLAB [48,49]. MatConvNet is a computer vision toolbox that allows for the use of a GPU, thanks to CUDA support. In our experiments, we used an NVIDIA GTX960.

3.1. Experimental results on object recognition

We used the Caltech-101 database for object recognition. The database contains 9144 images in 101 object categories and one background category, with large inter-class variability [46]. Each class has a different number of images, between 31 and 800. The database includes both rigid objects, such as airplanes, cars, bikes, chairs, and cameras, and non-rigid objects, such as sheep, lions, and cows. We constructed training and testing sets according to the experimental setup used in previous research [45,46] for a fair comparison. We built the training set by randomly selecting 15 or 30 images per category, and limited the testing set to no more than 50 images per category, since some categories are very small. We normalized each image by subtracting the per-pixel mean across all images. We divided each of the images into nine patches and then resized them to 64 × 64 × 3. We converted all images to gray scale. We arranged all the images as arrays in a matrix. We applied stochastic gradient descent with a mini-batch size of 30 and with learning rates of 0.001 and 0.02 for the weights and biases, respectively. After we extracted the features using LM-CNNs, we applied PCA to the outputs of the final fully connected layer. We fed the reduced features as the input to the linear SVM classifier. In order to determine the regularization parameters, we employed five-fold cross-validation in the range from $2^{-10}$ to $2^{10}$. We fused the outputs of the SVM classifiers relating to each patch. To this end, we applied a decision fusion rule based on weighted majority voting.

All experiments were repeated 10 times with different randomly selected training and testing images, and the averages of the per-class recognition rates were recorded for each run. In the figures and tables, CNN-SVM-1 and CNN-SVM-2 denote the systems built on AlexNet and the proposed CNN architecture, respectively. Figure 4 shows the classification accuracy of the proposed single CNN-SVM-1 and 2 systems and LM-CNN-SVM-1 and 2 systems on the Caltech-101 database. As can be seen in Figure 4, for 15 images per class, the averages of the per-class recognition rates of the proposed single CNN-SVM-1 and 2 systems and LM-CNN-SVM-1 and 2 systems are 84.80%, 84.93%, 87.43%, and 89.80%, respectively, while they are 86.80%, 88.80%, 91.13%, and 92.80%, respectively, for 30 images per class. The comparisons between the single CNN-SVM systems and the LM-CNN-SVM systems show that the LM-CNN-SVM system is always better than the single
Figure 4. Comparison of the proposed Local Multiple Convolutional Neural Network-Support Vector Machine (LM-CNN-SVM)
and single CNN-SVM for 15/class (a) and 30/class (b) from the Caltech-101 benchmark dataset.
Table 3. Comparison between the proposed Local Multiple Convolutional Neural Network-Support Vector Machine (LM-CNN-SVM) model and the other state-of-the-art models on the Caltech-101 benchmark dataset.
M-HMP: Multipath Hierarchical Matching Pursuit model; ScSPM: linear Spatial Pyramid Matching model; LCNM: Large Convolutional Network Model;
SPP-Net: CNN model applying a Spatial Pyramid Pooling process layer; CNN S TUNE-CLS: slow CNN model.
CNN-SVM system. The features extracted from CNNs make the SVM classifier more accurate, since each local region exhibits properties with a similar texture structure. In addition, we compared the proposed system with current state-of-the-art methods, including deep learning and Spatial Pyramid Matching (SPM), known as an extended version of the Bag-of-Words (BoW) model. The methods are briefly defined as follows:

M-HMP [50]: Multipath Hierarchical Matching Pursuit model that extracts expressive features from images through multi-pathways, similar to deep learning.

ScSPM [51]: the linear SPM model that uses a linear kernel on Spatial Pyramid Pooling of SIFT sparse codes.

LCNM [52]: Large Convolutional Network Model using a multi-layered deconvolutional network approach to visualize the feature activation.

SPP-Net [53]: CNN model applying a Spatial Pyramid Pooling process layer to remove the fixed-size constraint of the CNN.

CNN S TUNE-CLS [54]: slow CNN model including the filters with small strides using a combination of deep representations and a linear SVM.

Detailed comparison results between the proposed LM-CNN-SVM systems and a number of state-of-the-art methods, given as averages of the per-class accuracies, are shown in Table 3. For 15 images per class, the ScSPM and LCNM performances are 73.20% and 83.80 ± 0.50%, respectively. For 30 images per class, the performances of M-HMP, ScSPM, LCNM, SPP-Net, and CNN S TUNE-CLS are 81.40 ± 0.33%, 84.30%, 86.50 ± 0.5%, 91.44 ± 0.70%, and 88.35 ± 0.56%, respectively. Our best results were 89.80 ± 0.50% for 15 images per class and 92.80 ± 0.43% for 30 images per class. It can be seen that our local multiple hybrid system is superior to the competitors at object recognition. Moreover, increasing the number of training images also improves the performance. This shows that rich features are important for recognition with many classes and large images.

3.2. Experimental results on pedestrian detection

In the object detection application, we used the Caltech pedestrian benchmark dataset [47]. The Caltech dataset was captured over 11 sessions of 640 × 480 30 Hz video taken from a vehicle driving through regular traffic in an urban environment. We used the first five subsets for training and the others for testing. We used pedestrian images that
Method                     Value
ConvNet [56]               77.20
HIKSVM [57]                73.39
HOG-SVM [6]                66
JointDeep [58]             39.3
SDN [59]                   37.8
MT-DPM + Context [60]      37.64
DNNSliding [61]            32.4
DeepCascade [62]           31.11
Katamari [63]              22.0
Proposed LM-CNN-SVM-2      30.0

HIKSVM: Histogram Intersection Kernel Support Vector Machine; HOG-SVM: Histogram of Gradient Support Vector Machine; SDN: CNN model including multiple switchable layers; DNNSliding: CNN model applying a sliding window to the image.

Figure 5. Comparison of single CNN-SVM and the proposed Local Multiple Convolutional Neural Network-Support Vector Machine (LM-CNN-SVM) on the Caltech benchmark pedestrian dataset.
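The model-selection step described in Section 3.1, five-fold cross-validation over regularization values from 2^-10 to 2^10, can be sketched as follows. This is only an outline under stated assumptions: the grid is assumed to step the exponent by 1, the folds are assumed to be contiguous index blocks, and `evaluate` is a hypothetical caller-supplied callback that would train a linear SVM with the given regularization value on the training indices and return its validation accuracy. None of these details are given in the text.

```python
def regularization_grid(min_exp=-10, max_exp=10, step=1):
    """Candidate regularization values 2**min_exp, ..., 2**max_exp."""
    return [2.0 ** e for e in range(min_exp, max_exp + 1, step)]


def k_fold_indices(n_samples, k=5):
    """Yield (train, validation) index lists for k contiguous folds.

    The first n_samples % k folds receive one extra sample so that every
    sample appears in exactly one validation fold.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


def select_c(n_samples, evaluate, k=5):
    """Return (best_c, best_mean_score) over the grid.

    `evaluate(c, train, validation)` is a hypothetical callback returning a
    validation score (higher is better) for regularization value `c`.
    """
    best_c, best_score = None, float("-inf")
    for c in regularization_grid():
        scores = [evaluate(c, tr, va) for tr, va in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_c, best_score = c, mean_score
    return best_c, best_score


if __name__ == "__main__":
    import math

    # Dummy scoring function that peaks at c = 2**3; a real one would train an SVM.
    def dummy(c, train, validation):
        return -abs(math.log2(c) - 3)

    print(select_c(n_samples=12, evaluate=dummy))
```

Keeping the scoring function as a callback separates the search over regularization values from the choice of SVM library, which the sketch leaves open.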
20. Deng L and Yu D. Deep learning: methods and applications. Foundat Trend Signal Proc 2014; 7: 3–4.
21. Christian S, Toshev A and Erhan D. Deep neural networks for object detection. In: Proceedings of advances in neural information processing systems (NIPS), Lake Tahoe, Nevada, 5–10 December 2013, pp.2553–2561. Red Hook, NY: Curran Associates Inc.
22. Krizhevsky A, Sutskever I and Hinton GE. Imagenet classification with deep convolutional neural networks. In: Proceedings of advances in neural information processing systems (NIPS), Lake Tahoe, Nevada, 2012, pp.1097–1105. Red Hook, NY: Curran Associates Inc.
23. Russakovsky O, Deng J, Su H, et al. Imagenet large scale visual recognition challenge. Int J Comput Vision 2015; 115: 211–252. Berlin: Springer.
24. Deng J, Berg A, Satheesh S, et al. ILSVRC-2012, http://www.image-net.org/challenges/LSVRC/2012/ (accessed 1 January 2016).
25. Yan K, Wang , Liang D, et al. CNN vs. SIFT for image retrieval: Alternative or complementary? In: Proceedings of the ACM on multimedia conference (MM), Amsterdam, The Netherlands, 15–19 October 2016, pp.407–411. New York: ACM.
26. Tian Y, Luo P, Wang X, et al. Deep learning strong parts for pedestrian detection. In: Proceedings of IEEE International Conference on Computer Vision (ICCV), Washington, DC, USA, 07–13 December 2015, pp.1904–1912. Piscataway, NJ: IEEE.
27. Ouyang W, Wang X, Zeng X, et al. Deepid-net: Deformable deep convolutional neural networks for object detection. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), Boston, MA, USA, 7–12 June 2015, pp.2403–2412. Piscataway, NJ: IEEE.
28. Zhang S, Benenson R and Schiele B. Filtered channel features for pedestrian detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Boston, MA, USA, 7–12 June 2017, pp.1751–1760. Piscataway, NJ: IEEE.
29. Ren S, He K, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks. In: Proceedings of advances in neural information processing systems (NIPS), Montréal, Canada, 2015, pp.91–99. Cambridge, MA: MIT Press.
30. Tao Q-Q, Zhan S, Li X-H, et al. Robust face detection using local CNN and SVM based on kernel combination. Neurocomputing 2016; 211: 98–105.
31. Tanga P, Wanga H and Kwong S. G-MS2F: GoogLeNet based multi-stage feature fusion of deep CNN for scene recognition. Neurocomputing 2017; 225: 188–197.
32. Niu X-X and Suen CY. A novel hybrid CNN–SVM classifier for recognizing handwritten digits. Pattern Recogn 2012; 45: 1318–1325.
33. Uçar A, Demir Y and Güzeliş C. (2016). Moving towards in object recognition with deep learning for autonomous driving applications. In: Proceedings of IEEE international conference on innovations in intelligent systems and applications (INISTA), Sinaia, Romania, 2–5 August 2016, pp.1–5. Piscataway, NJ: IEEE.
34. Hubel DH and Wiesel TN. Receptive fields and functional architecture of monkey striate cortex. J Physiol 1968; 195: 215–243.
35. Fukushima K. Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol Cybern 1980; 36: 193–202.
36. LeCun Y, Boser B, Denker JS, et al. Backpropagation applied to handwritten zip code recognition. Neural Comput 1989; 1: 541–551.
37. LeCun BB, Denker JS, Henderson D, et al. Handwritten digit recognition with a back-propagation network. In: Proceedings of advances in neural information processing systems (NIPS), Denver, Colorado, USA, 26–29 November 1990, pp.396–404. San Francisco, CA: Morgan Kaufmann Publishers Inc.
38. Kim H, Hong S, Son S, et al. High speed road boundary detection on the images for autonomous vehicle with multi-layer CNN. In: Proceedings of IEEE conference on circuits and systems (ISCAS), Bangkok, Thailand, 25–28 May 2003, vol. 5, pp.V-769–V-772. Piscataway, NJ: IEEE.
39. Shen W, Zhou M, Yang F, et al. Multi-scale convolutional neural networks for lung nodule classification. In: International conference on information processing in medical imaging (IPMI) Isle of Skye (ed S Ourselin, DC Alexander, C-F Westin, et al.), Scotland, 28 June–3 July 2015, LNCS 9123, pp.588–599. Berlin: Springer.
40. Chen D and Mak BKW. Multitask learning of deep neural networks for low resource speech recognition. IEEE/ACM Trans Audio Speech Lang Process 2015; 23: 1172–1183.
41. Narhe MC and Nagmode MS. Vehicle classification using SIFT. Int J Eng Res Technol 2014; 3: 1735–1738.
42. Cimpoi M, Maji S, Kokkinos I, et al. Deep filter banks for texture recognition, description, and segmentation. Int J Comput Vis 2016; 118: 65–94.
43. Kataoka H, Iwata K and Satoh Y. Feature evaluation of deep convolutional neural networks for object recognition and detection. arXiv preprint arXiv:1509.07627, 2015.
44. Harsha SS and Anne KR. Gaussian mixture model and deep neural network based vehicle detection and classification. Int J Adv Comput Sci Appl 2016; 7: 17–25.
45. Fei-Fei L, Fergus R and Perona P. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. Comput Vis Image Underst 2007; 106: 59–70.
46. Fei-Fei L, Fergus R and Perona P. One-Shot learning of object categories. IEEE Trans Pattern Anal Mach Intell 2006; 28: 594–611.
47. Dollar P, Wojek C, Schiele B, et al. Pedestrian detection: an evaluation of the state of the art. IEEE Trans Pattern Anal Mach Intell 2012; 34: 743–761.
48. Vedaldi A and Lenc K. MatConvNet – convolutional neural networks for MATLAB. In: Proceedings of ACM international conference on multimedia (ACMMM), Brisbane, Queensland, Australia, 26–30 October 2015, pp.1–4. New York: ACM.
49. Vedaldi A and Fulkerson B. VLFeat: an open and portable library of computer vision algorithms, http://www.vlfeat.org/ (accessed 1 January 2016).
50. Bo L, Re X and Fox D. Multipath sparse coding using hierarchical matching pursuit. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), Portland, Oregon, USA, 23–28 June 2013, pp.660–667. Piscataway, NJ: IEEE.
51. Jianchao Y, Kai Y, Yihong G, et al. Linear spatial pyramid matching using sparse coding for image classification. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), Miami, FL, USA, 20–25 June 2009, pp.1794–1801. Piscataway, NJ: IEEE.
52. Zeiler MD and Fergus R. Visualizing and understanding convolutional networks. In: European conference on computer vision (ECCV) (ed D Fleet, et al.), Zurich, Switzerland, 1–5 September 2014, Part I, LNCS 8689, pp.818–833. Berlin: Springer.
53. He KK, Zhang X, Ren S, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition. In: European conference on computer vision (ECCV) (ed D Fleet, T Pajdla, B Schiele, et al.), Zürich, Switzerland, 1–5 September 2014, LNCS 8691, pp.346–361. Berlin: Springer.
54. Chatfield K, Simonyan K, Vedaldi A, et al. Return of the devil in details: delving deep into convolutional nets. In: Proceedings of the British machine vision conference (BMVC), Nottingham, UK, 1–5 September 2014, pp.3626–3633. BMVA Press.
55. Sun S. Local within-class accuracies for weighting individual outputs in multiple classifier systems, Pattern Recogn Lett 2010; 31: 119–124.
56. Sermanet P, Kavukcuoglu K, Chintala S, et al. Pedestrian detection with unsupervised multi-stage feature learning. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), Portland, OR, USA, 23–28 June 2013, pp.3626–3633. Piscataway, NJ: IEEE.
57. Maji S, Berg A and Malik J. Classification using intersection kernel support vector machines is efficient. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), Anchorage, Alaska, USA, 23–28 June 2008, pp.1–8. Piscataway, NJ: IEEE.
58. Ouyang W and Wang X. Joint deep learning for pedestrian detection. In: Proceedings of IEEE international conference on computer vision (ICCV), Sydney, NSW, Australia, 1–8 December 2013. Piscataway, NJ: IEEE.
59. Luo P, Zeng X, Wang X, et al. Switchable deep network for pedestrian detection. In: Proceedings of the British machine vision conference (CVPR), Columbus, Ohio, 23–28 June 2014 pp.899–906. Piscataway, NJ: IEEE.
60. Yan J, Zhang X, Lei Z, et al. Robust multi-resolution pedestrian detection in traffic scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Portland, Oregon, USA, 23–28 June 2013, pp.3033–3040. Piscataway, NJ: IEEE.
61. Hosang M, Omran R, Benenson, et al. Taking a deeper look at pedestrians. In: Proceedings of IEEE conference on computer vision and pattern recognition (CVPR), Boston, MA, USA, 7–12 June 2015 pp.4073–4082. Piscataway, NJ: IEEE.
62. Angelova A, Krizhevsky A, Vanhoucke V, et al. Real-time pedestrian detection with deep network cascades. In: Proceedings of the British machine vision conference (BMVC), Swansea, UK, 7–10 September 2015, pp.1–12. Swansea: BMVA Press.
63. Benenson R, Omran M, Hosang J, et al. Ten years of pedestrian detection, what have we learned? In: Agapito L, Bronstein M and Rother C. (eds) Computer Vision - ECCV 2014 Workshops. ECCV 2014. Lecture Notes in Computer Science, Berlin: Springer, 2014, vol. 8926, pp.613–627.

Author biographies

Ayşegül Uçar received a BS degree, MS degree, and PhD degree from the Electrical and Electronics Engineering Department at the University of Firat of Turkey in 1998, 2000, and 2006, respectively. In 2013, she was a visiting professor at Louisiana State University in the USA. She has been an associate professor in the Department of Mechatronics Engineering since 2009. Her research interests include modeling and control, artificial intelligence systems, robotics vision, pattern recognition, and signal processing.

Yakup Demir received a BS degree, MS degree, and PhD degree from the Electrical and Electronics Engineering Department at the University of Firat of Turkey in 1986, 1990, and 1992, respectively. He is currently a professor in the Electrical and Electronics Engineering Department at the University of Firat of Turkey. His research interests include circuits and systems.

Cüneyt Güzeliş received the B.Sc., M.Sc., and Ph.D. degrees in electrical engineering from İstanbul Technical University, İstanbul, Turkey, in 1981, 1984, and 1988, respectively. He was with İstanbul Technical University from 1982 to 2000 where he became a full professor. He worked between 1989 and 1991 in the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California, as a visiting researcher and lecturer. He was with the Department of Electrical and Electronics Engineering from 2000 to 2011 at Dokuz Eylül University, İzmir, Turkey. He was with İzmir University of Economics, Faculty of Engineering and Computer Sciences, Department of Electrical and Electronics Engineering from 2011 to 2015. He is currently working in Yaşar University, Faculty of Engineering, Department of Electrical and Electronics Engineering. His research interests include artificial neural networks, biomedical signal and image processing, nonlinear circuits-systems, and control, and educational systems.