
Histogram of Gradients descriptor applied to vehicles detection

Daniel Tertei (Törtei)


Robotics-Action-Perception research team, Laboratory for Analysis and Architecture of Systems
August, 2013
1 Introduction
The ability to automatically detect and track other vehicles is a core requirement for any autonomous
vehicle designed to operate in traffic. Patch descriptors model an object in a local subregion of
interest of an image as a set of features; in our case the object is a 2-D vehicle. The HoG (Histogram
of Gradients) descriptor, introduced by Dalal and Triggs in 2005 [1], was originally used for
pedestrian detection. It is largely inspired by previous work on Lowe's SIFT (Scale Invariant Feature
Transform) descriptor [2]. It is today one of the best performing object descriptors and the best
pedestrian descriptor in use. Many works have used HoG as a basis for feature extraction, for example
for detecting vehicle orientations [3], [4], [5], [6]. Hedi Harzallah wrote a thesis on combining
HoG features with BoW (Bag of Words, an alphabet-like algorithm) and won the Pascal VOC (Visual
Object Classes) challenge 2008, where he had the best score on 11 of the 20 classes (including
the car object class) [7]. I will present a study of HoG, followed by a practical example of tuning
its parameters and a discussion of the obtained results.
2 HoG by Dalal and Triggs
The creators of this approach trained a Support Vector Machine, or SVM, to recognize HoG descriptors
of people. The descriptor is simpler to understand than, for example, SIFT-based object recognition.
One of the main reasons is that it uses a global feature to describe a person rather than a collection
of local features: the entire person is represented by a single feature vector, as opposed to many
feature vectors representing smaller parts of the person. The HoG person detector uses a sliding
detection window which is moved around the image. At each position of the window, a HoG descriptor
is computed for the detection window. This descriptor is then fed to the trained SVM, which performs
a binary classification task, person or not a person - Figure 1.
Afterwards a non-maxima suppression method is applied to find the best detections and eliminate the
ones that are under a predetermined threshold.
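A minimal Python sketch of this sliding-window procedure could look as follows. It assumes scikit-image's hog() and a linear SVM already trained on person windows; the stride and threshold are illustrative placeholders, not the values from [1]:

```python
# Minimal sketch of sliding-window HoG + linear SVM detection.
# The pretrained classifier `svm`, the stride and the threshold are
# illustrative assumptions, not the exact Dalal-Triggs implementation.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

WIN_H, WIN_W = 128, 64   # Dalal-Triggs person window
STRIDE = 8               # step in pixels (assumption)

def detect(image_gray, svm: LinearSVC, threshold=0.0):
    """Return (y, x, score) for windows the SVM classifies as 'person'."""
    detections = []
    for y in range(0, image_gray.shape[0] - WIN_H + 1, STRIDE):
        for x in range(0, image_gray.shape[1] - WIN_W + 1, STRIDE):
            window = image_gray[y:y + WIN_H, x:x + WIN_W]
            descriptor = hog(window, orientations=9,
                             pixels_per_cell=(8, 8),
                             cells_per_block=(2, 2),
                             block_norm='L2')
            score = svm.decision_function(descriptor.reshape(1, -1))[0]
            if score > threshold:
                detections.append((y, x, score))
    return detections   # non-maxima suppression would follow
```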
Figure 1: Computing HoG from an input image patch
Figure 2: On the left image in red: an 8x8 cell. On the middle image in blue: the first overlapping 2x2
block containing 4 cells. On the right image in green: the second 2x2 overlapping block. Block overlap
is 50%.
2.1 Gamma/Color Normalization
In their paper the authors evaluated grayscale, RGB and LAB color spaces, with gamma equalization of
the input image pixels. The conclusion is that only the grayscale representation reduces performance,
while LAB and RGB give similar results. Square-root gamma compression of each color channel
improves performance slightly.
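As a small illustration, square-root gamma compression of an RGB patch can be sketched as follows (the [0, 255] value range is an assumption about the input encoding):

```python
import numpy as np

def sqrt_gamma_compress(rgb_image):
    """Square-root gamma compression of each color channel (pixel values in [0, 255])."""
    return np.sqrt(rgb_image.astype(np.float64) / 255.0)   # compressed values in [0, 1]
```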
2.2 Gradient histograms
The HoG person detector uses a 64x128 detection window; the size of the window is an empirical
choice. It is divided into overlapping blocks which contain non-overlapping cells. Each cell contains 64
pixels arranged as a square of 8x8 pixel intensities - Figure 2. The authors justify these parameter
choices in [1] with a study based on DET (detection error tradeoff) curves.
A gradient vector is computed at each pixel. The 64 gradient vectors of an 8x8 cell (the x- and
y-direction gradients together with their magnitudes) are then accumulated into a 9-bin histogram. In
an RGB color space the strongest gradient over the three channels is chosen at each pixel.
The histogram covers 0 to 180 degrees, so there are 20 degrees per bin. Other binnings are possible,
but a 9-bin histogram worked best for pedestrian detection. A bilinear function defines how much each
gradient magnitude contributes to neighbouring bins: for example, an 85-degree gradient lies between
the bin centers at 70 and 90 degrees, so we add 1/4 of its magnitude to the bin centered
at 70 degrees and 3/4 of its magnitude to the bin centered at 90 degrees - Figure 3.
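A sketch of this per-cell histogram with bilinear vote splitting is given below. The bin centers at 10, 30, ..., 170 degrees are an assumption consistent with the 85-degree example (neighbouring centers at 70 and 90); the gradient filter is a simple numpy approximation rather than the exact kernel used in [1]:

```python
# Sketch of the per-cell orientation histogram with bilinear vote splitting.
import numpy as np

N_BINS, BIN_WIDTH = 9, 20.0

def cell_histogram(cell):
    """cell: 8x8 grayscale patch -> 9-bin unsigned orientation histogram."""
    gx = np.gradient(cell.astype(np.float64), axis=1)
    gy = np.gradient(cell.astype(np.float64), axis=0)
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0      # unsigned, 0..180 degrees

    hist = np.zeros(N_BINS)
    for mag, ang in zip(magnitude.ravel(), angle.ravel()):
        # fractional bin index relative to the first center at 10 degrees
        pos = (ang - 10.0) / BIN_WIDTH
        lo = int(np.floor(pos)) % N_BINS                # lower neighbouring bin
        hi = (lo + 1) % N_BINS                          # upper neighbouring bin
        w_hi = pos - np.floor(pos)                      # weight given to the upper bin
        hist[lo] += (1.0 - w_hi) * mag
        hist[hi] += w_hi * mag
    return hist
```

For an 85-degree gradient the code gives pos = 3.75, so 1/4 of the magnitude goes to the bin centered at 70 degrees and 3/4 to the bin centered at 90 degrees, as in the example above.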
Dalal and Triggs used unsigned gradients, so that the orientations only range from 0 to 180
degrees instead of 0 to 360. The number of bins corresponds to the angular resolution of the descriptor:
if we choose a low number we get a poor representation of the image (and consequently a
biased classifier), whereas if we choose a very high number the classifier runs into overfitting
problems. Ignoring the sign of the gradients gives the descriptor a certain robustness,
but at the price of losing some possibly valuable information in the image (e.g. the same person could
wear a white shirt and black pants in one photo and a black shirt and white pants in another) [7] -
Figure 4.
The histogram doesn't encode where each gradient is within the cell; it only encodes the distribution
of gradients within the cell.
Figure 3: A 180-degree (9-bin) histogram with bin centers expressed in degrees.
Figure 4: Three different image patches that would obtain the same HoG signature when using only
unsigned gradients.
2.3 Normalization Scheme
Gradient strengths (magnitudes) vary over a wide range owing to local variations in illumination and
foreground-background contrast, so effective local contrast normalization turns out to be essential
for good performance. By normalizing the strengths we make them invariant to multiplications
of the pixel values - Figure 5. The emphasis is on the third grayscale image, as it displays an increase
in contrast: the effect of the multiplication is that bright pixels become much brighter while dark
pixels only become a little brighter, thereby increasing the contrast between the light and dark parts
of the image. If you divide all three gradient vectors by their respective magnitudes, you get the same
result for all three: [0.71 0.71]. So by normalizing the gradient magnitudes we make our descriptor
invariant to contrast and brightness changes. Dividing a vector by its magnitude is referred to as
normalizing the vector to unit length, because the resulting vector has a magnitude of 1. Normalizing
a vector does not affect its orientation, only its magnitude.
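This can be checked in a few lines; the concrete gradient values below are placeholders chosen to match Figure 5 (adding a constant to all pixels leaves the gradient unchanged, multiplying by 1.5 scales it):

```python
import numpy as np

# Illustrative gradient vectors: original patch, +50 brightness, x1.5 contrast.
for g in ([50.0, 50.0], [50.0, 50.0], [75.0, 75.0]):
    v = np.asarray(g)
    print(v / np.linalg.norm(v))   # each prints [0.7071 0.7071]
```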
Here is where Dalal and Triggs make use of grouping the cells into blocks. The cell configuration
within a block and the block-on-block overlap are set to 2x2 cells and 50%, as in Figure 2, for person
detection. Consequently, they concatenate the 4 (2x2) cell histograms (4x9 bins) of a block into a
36-element vector and apply an L2 normalization scheme:
\[ v \leftarrow \frac{v}{\sqrt{\|v\|_2^2 + \epsilon^2}} \tag{1} \]
where $v$ is the unnormalized descriptor vector, $\|v\|_k$ is its $k$-norm and $\epsilon$ is a small
constant. The overlapping of the blocks permits each cell to contribute to the final descriptor vector
more than once, each time normalized with respect to a different neighbouring block. Specifically, the
corner cells appear once, the other edge cells appear twice each, and the interior cells appear four
times each. The rationale of the block normalization approach is that changes in contrast are more
likely to occur over smaller regions within the image, so rather than normalizing over the entire image,
we normalize within a small region around the cell.
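A sketch of this block normalization, following Eq. (1), could be (the exact value of epsilon is an assumption):

```python
import numpy as np

EPS = 1e-3   # the small constant epsilon in Eq. (1); exact value is an assumption

def normalize_block(cell_histograms):
    """cell_histograms: the four 9-bin cell histograms of one 2x2 block."""
    v = np.concatenate(cell_histograms)              # 36-element block vector
    return v / np.sqrt(np.sum(v ** 2) + EPS ** 2)    # L2 normalization, Eq. (1)
```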
The final descriptor is thus a normalized vector over a 64x128-pixel detection window, which is divided
into 7 blocks across and 15 blocks vertically, for a total of 105 blocks. Each block contains 4 cells with
a 9-bin histogram per cell, for a total of 36 values per block. This brings the final vector size to
3,780 values.
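These counts can be verified directly from the parameters above (8x8 cells, 2x2 blocks with 50% overlap in a 64x128 window):

```python
# Descriptor length for the 64x128 person window.
cells_x, cells_y = 64 // 8, 128 // 8            # 8 x 16 cells
blocks_x, blocks_y = cells_x - 1, cells_y - 1   # 7 x 15 overlapping blocks
block_len = 2 * 2 * 9                           # 4 cells x 9 bins = 36 values
print(blocks_x * blocks_y, blocks_x * blocks_y * block_len)   # 105 blocks, 3780 values
```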
Figure 5: Gradient magnitudes in a brightened and a contrast-enhanced image. The first row shows an
image with its surrounding pixel values and the value of its gradient magnitude. In the second row we
added +50 to the pixel values of the first image (increased brightness), while in the third we
multiplied them by 1.5 (increased contrast).
2.4 Methodology
Dalal and Triggs used an INRIA (French Institute for Research in Informatics and Automation) data
set of images which can be found online at http://lear.inrialpes.fr/data. They selected 1239 of them
as positive examples and added their left-right reflections (2478 positive examples in total). A fixed
set of 12180 image patches sampled randomly from person-free images served as negative examples.
For each detector and parameter combination they performed a two-level training. First, a preliminary
detector is trained on the whole set and used to identify the false positives, which are denoted as hard
examples. Second, they combine the hard examples with the negative examples and repeat the
training to obtain the final detector. This retraining process significantly improves the
performance on their data set.
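One common reading of this retraining step can be sketched as follows; the arrays of precomputed HoG descriptors (pos_feats, neg_feats, candidate_neg_feats) and the C value are illustrative assumptions, not the exact setup of [1]:

```python
# Sketch of two-round training with hard-example mining. `candidate_neg_feats`
# stands for windows densely sampled from person-free images.
import numpy as np
from sklearn.svm import LinearSVC

def train_with_hard_examples(pos_feats, neg_feats, candidate_neg_feats):
    X = np.vstack([pos_feats, neg_feats])
    y = np.concatenate([np.ones(len(pos_feats)), np.zeros(len(neg_feats))])

    # Round 1: preliminary detector trained on the initial set
    svm = LinearSVC(C=0.01).fit(X, y)

    # False positives on person-free windows are the "hard examples"
    hard = candidate_neg_feats[svm.predict(candidate_neg_feats) == 1]

    # Round 2: retrain with the hard examples added as extra negatives
    X2 = np.vstack([X, hard])
    y2 = np.concatenate([y, np.zeros(len(hard))])
    return LinearSVC(C=0.01).fit(X2, y2)
```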
3 HoG by other researchers
In [4] the authors compare Haar features (famous from face detection) to HoG features for vehicle
detection. They conclude that HoG outperforms the Haar cascade, and they make an interesting
observation: when the HoG feature space is enlarged, the detection rate increases while the number of
false positives fluctuates around a stable level, whereas with Haar features an enlarged feature space
decreases the number of false positives but at a lower corresponding detection rate. The training is
done in multiple stages (11 and 12) as in the Viola-Jones boosting algorithm [8].
A fused HoG-HCT (Histograms of Census Transform) feature set is created in [3]. First, HoG and
HCT features are extracted from different vehicle viewpoints and then fused together into a single
block of matrix elements by the PCA (Principal Component Analysis) algorithm. Afterwards they are
built into a deformable parts model (front view, rear view, side view). Impressive results on car
object class images from VOC 2007 show that this approach outperforms plain HoG. However, the
authors only present results on images of cars on roadways (no urban traffic); the quality of a
detector is always tied to the specificity of its training set.
One interesting study aimed to train HoG-based classifiers specific to vehicle orientation and compare
them to a coarse HoG-based classifier [5]. The goal was to detect the vehicle orientation and use it to
infer further relevant information. The conclusion was that, counter-intuitively, the coarse classifier
outperformed the specific ones.
There is also a study that unsuccessfully tried to modify the HoG descriptor to separate between edges
(edge gradients) and shades (diffuse gradients), because the descriptor in its current form cannot make
a classifier (SVM) sensitive to, e.g., a black object on a white background without it also becoming
sensitive to texture [6]. HoG also exhibits aliasing problems, where two image patches that are
perceptually very different may end up with very similar HoG descriptors.
Harzallah [7] proposed a two-stage cascade for object detection. First, he uses HoG with a linear SVM
to quickly classify positive and negative detections on a sliding detection window. Then he uses BoW
with a non-linear classifier to attribute confidence scores to each remaining example, followed by
non-maxima suppression. This trade-off between first-stage speed and second-stage performance
outperformed every other contestant in Pascal VOC 2008. The car class was well within the scope of
the high-scoring detections of his algorithm (the data set was Pascal VOC 2007).
The purpose of this section was to give an insight into how HoG is used, what its weaknesses are and
how to exploit its strengths. The descriptor itself is a proven tool, but it should be used in conjunction
with other algorithms, whether in cascades or in boosting.
4 My own HoG
In the scope of my thesis I make use of the KITTI (Karlsruhe Institute of Technology and Toyota
Technological Institute at Chicago) image data set [9]. This data set comprises 7480 labeled training
images and 7517 testing images with an average resolution of 1224x370 pixels. However, for reasons
of speed, I used 362 positive image patches with their left-right reflections and 751 negative image
patches as the training set, and 631 positive with 587 negative image patches as the test set. All
patches are in RGB color space. The balanced number of positive/negative examples is motivated by
the evaluation after the testing phase, which is done by calculating the EER (equal error rate) metric
on the ROC (Receiver Operating Characteristic) curves of the different HoG-based classifiers. I used
the EER metric instead of the AUC (Area Under Curve) because of the high accuracy of the classifiers;
the EER also takes false negatives into account (the high accuracy is due to a small number of false
positives). The smaller the EER, the better the classifier.
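A sketch of how the EER can be read off a ROC curve is given below (using scikit-learn's roc_curve; taking the nearest sampled point instead of interpolating is a simplification):

```python
# Sketch of the equal error rate (EER) computed from a ROC curve.
# y_true: ground-truth labels (1 = vehicle), y_score: classifier scores.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(y_true, y_score):
    fpr, tpr, _ = roc_curve(y_true, y_score)
    fnr = 1.0 - tpr
    # The EER is where the false positive and false negative rates cross;
    # here we simply take the nearest point on the sampled curve.
    idx = np.argmin(np.abs(fpr - fnr))
    return (fpr[idx] + fnr[idx]) / 2.0
```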
The goal of the simulation was to determine which HoG descriptor (in combination with a linear SVM),
with which parameters regarding cell size, block size, number of orientation bins and HoG detection
window size, fits the KITTI data set best. I therefore evaluated, in nested loops, the following
parameters:
1. HoG window size: [120x60], [128x64], [136x68], [144x72], [152x76];
2. block size: [1x1], [2x2], [3x3], [4x4];
3. cell size: [4x4], [6x6], [8x8], [10x10], [12x12];
4. bin size: 9, 10, 11, 12;
5. orientation: 1 (oriented), 0 (non-oriented);
resulting in a total of 800 different HoG-based classifiers; a sketch of this nested-loop evaluation is
given below. It took approximately 26 hours to run a single simulation under Matlab on an Intel Core
i3 CPU platform clocked at a maximum of 1.8 GHz with 6 GB of RAM.
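A rough Python sketch of this grid evaluation (my actual simulation was run in Matlab) could look as follows. The helper names (train_patches, train_labels, test_patches, test_labels), the resizing step and the SVM settings are illustrative assumptions; it reuses the equal_error_rate() sketch above, and scikit-image's hog() only supports unsigned orientations, so the oriented/non-oriented flag is carried along but not applied:

```python
import numpy as np
from itertools import product
from skimage.color import rgb2gray
from skimage.transform import resize
from skimage.feature import hog
from sklearn.svm import LinearSVC

WINDOWS = [(60, 120), (64, 128), (68, 136), (72, 144), (76, 152)]  # (rows, cols)
BLOCKS = [1, 2, 3, 4]
CELLS = [4, 6, 8, 10, 12]
BINS = [9, 10, 11, 12]
ORIENTED = [True, False]

def describe(patch, win, cell, block, nbins):
    """Resize an RGB patch to the detection window and compute its HoG descriptor."""
    gray = resize(rgb2gray(patch), win, anti_aliasing=True)
    return hog(gray, orientations=nbins, pixels_per_cell=(cell, cell),
               cells_per_block=(block, block), block_norm='L2')

def grid_search(train_patches, train_labels, test_patches, test_labels):
    """Return a dict mapping each parameter combination to its test-set EER."""
    results = {}
    for win, block, cell, nbins, oriented in product(WINDOWS, BLOCKS, CELLS,
                                                     BINS, ORIENTED):
        X_tr = np.array([describe(p, win, cell, block, nbins) for p in train_patches])
        X_te = np.array([describe(p, win, cell, block, nbins) for p in test_patches])
        svm = LinearSVC().fit(X_tr, train_labels)
        scores = svm.decision_function(X_te)
        results[(win, block, cell, nbins, oriented)] = equal_error_rate(test_labels,
                                                                        scores)
    return results
```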
4.1 Results
Figure 6 shows the comparative results.
Figure 6: The graphic may be divided into 5 columns; each column represents an n-by-n block-size
group of units. Within each block-size group there are m-by-m cell-size groups. Each cell-size group
contains 8 units (bin sizes 9, 10, 11, 12, non-oriented and oriented). Each unit corresponds to the
equal error rate of one n-by-n block, m-by-m cell, K-bin oriented/non-oriented classifier.
4.2 Discussion
From Figure 6 we may conclude:
1. Increasing the HoG window size by 4 pixels in height and by 8 pixels in width improves performance by at least a 10% smaller EER;
2. Decreasing the block size from 4x4 = 16 cells to 3x3 = 9 cells and further down to 1x1 = 1 cell improves performance for the classifiers whose cells contain fewer than 64 (8x8) pixels;
3. Smaller cell sizes (i.e. more local normalization) tend to increase performance significantly, by rates of over 20% in EER;
4. Regarding bin size, the best bin size is between 22 and 24 (both oriented), and all the non-oriented histogram-based classifiers perform worse than those with oriented histograms.
The best HoG-based classifier on the KITTI data set is the 152x76, 1x1 block, 4x4 cell, 22-bin oriented
classifier. Its EER is 0.0376, which automatically implies an AUC over 97%. It is equivalent to dividing
the 152x76 image patch into non-overlapping 4x4-pixel squares and using a 22-bin oriented histogram.
Overall, for vehicle detection, we may say that HoG works best on larger image patches (than those
used for pedestrian detection) that are divided into a grid of preferably non-overlapping small blocks
containing a small number of cells. These cells map a small number of neighbouring pixels, and
invariance to illumination and contrast is very important, as HoG works best with larger histograms.
5 Future Work
The next step is to wire this descriptor to a detection window, compute confidence scores and perform
non-maxima suppression. The Histogram of Oriented Gradients descriptor with a linear SVM should be
used in conjunction with other algorithms. In the scope of my thesis I plan to try combining it with
AdaBoost and, if time permits, neural networks. One possibility for cooperation would be incorporating
the fuzzy-based detection algorithms developed by the Chair of Informatics at FTN, Novi Sad.
References
[1] Navneet Dalal and Bill Triggs. Histograms of Oriented Gradients for Human Detection, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, Volume 1, pages 886-893, 2005.
[2] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 2004.
[3] Sun Li, Do Wang, ZHniui Zheng, Hailuo Wang. Multi-view vehicle detection in traffic surveillance combining HOG-HCT and deformable part models, Proceedings of the 2012 International Conference on Wavelet Analysis and Pattern Recognition, Xi'an, 15-17 July, 2012.
[4] Pablo Negri, Xavier Clady, Lionel Prevost. Benchmarking HAAR and Histograms of Oriented Gradients features applied to vehicle detection, Universite Pierre et Marie Curie - Paris 6, ISIR, CNRS FRE 2507.
[5] Paul E. Rybski, Daniel Huber, Daniel D. Morris and Regis Hoffman. Visual Classification of Coarse Vehicle Orientation using Histogram of Oriented Gradients Features.
[6] Carl Doersch and Alexei Efros. Improving the HoG descriptor.
[7] H. Harzallah, C. Schmid, F. Jurie and A. Gaidon. Classification aided two stage localization, PASCAL Visual Object Classes Challenge Workshop, in conjunction with ECCV, October 2008.
[8] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features, in CVPR, pages 511-518, 2001.
[9] Andreas Geiger, Philip Lenz, Raquel Urtasun. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite, Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
