are trainable. Our main technical contribution is a training algorithm for support vector machines with Fisher kernels that jointly learns the classifier weight vector and a suitable image representation. The procedure is not only intuitively appealing, but also has a theoretical justification in statistical learning theory.

After training, the architecture is still an instance of a support vector machine with Fisher kernel. This allows us to transfer the learned representation effortlessly to other categorization tasks. In a quantitative evaluation we show that learning the representation indeed leads to improved classification accuracy compared to Fisher kernels with fixed parameters.
2. Deep Fisher Kernel Learning

In this section we first give short summaries of image categorization with Fisher kernel SVMs and with deep architectures, concentrating on convolutional neural networks (CNNs). We then highlight the conceptual similarities between both architectures, showing that the idea of end-to-end training and discriminatively learned feature representations is not limited to neural network architectures, but can also be applied to SVMs with Fisher kernels.
2.1. Image categorization with Fisher kernel SVMs

Fisher kernels were first introduced as a mathematically sound tool for combining generative probabilistic models with discriminative kernel methods [17]. In this work, we concentrate on their practical use for image categorization, following the established setup introduced in [30, 31].
Our base representation of an image is a set of local descriptors, for example SIFT or one of its variants. The descriptors are PCA-projected to reduce their dimensionality and decorrelate their coefficients. We assume a given K-component Gaussian mixture model (GMM) in descriptor space, p(x|π, µ, Σ) = ∑_{k=1}^K π_k g_k(x; µ_k, Σ_k), where π ∈ [0, 1]^K are the mixture weights, and each g_k(x; µ_k, Σ_k) is a Gaussian with mean µ_k ∈ R^D and diagonal covariance matrix Σ_k = diag(σ_k) for σ_k ∈ R^D. For any descriptor, x, we define a vector, ψ(x) = (F_1(x), …, F_K(x), G_1(x), …, G_K(x)) ∈ R^{2KD}. The subvectors

F_k(x) = (1/√π_k) γ_k(x) (x − µ_k)/σ_k ∈ R^D    (1)

G_k(x) = (1/√(2π_k)) γ_k(x) ((x − µ_k)²/(σ_k)² − 1) ∈ R^D    (2)

are the gradients of p(x|π, µ, Σ) with respect to the parameter vectors µ_k and σ_k, respectively, scaled by an empirical estimate of the inverse Fisher information matrix (see [30]). Division and multiplication of vectors should be understood componentwise, and γ_k(x) denotes the posterior of the k-th GMM component, γ_k(x) = π_k g_k(x; µ_k, Σ_k) / ∑_{j=1}^K π_j g_j(x; µ_j, Σ_j).
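As a concrete illustration of eqs. (1)-(2) and the posterior γ_k(x), the following NumPy sketch computes ψ(x) for a single descriptor under a diagonal-covariance GMM. This is not the authors' implementation; the function and array names are our own assumptions (pi, mu, sigma2 hold the K mixture weights, the K×D means, and the K×D variances).

```python
import numpy as np

def gmm_posteriors(x, pi, mu, sigma2):
    """gamma_k(x) for a diagonal-covariance GMM; x has shape (D,)."""
    # log N(x; mu_k, diag(sigma2_k)) for every component k
    # (np.pi is the constant; the argument `pi` holds the mixture weights)
    log_g = -0.5 * (np.sum(np.log(2.0 * np.pi * sigma2), axis=1)
                    + np.sum((x - mu) ** 2 / sigma2, axis=1))
    log_w = np.log(pi) + log_g
    log_w -= log_w.max()                        # for numerical stability
    w = np.exp(log_w)
    return w / w.sum()                          # shape (K,)

def fisher_encoding(x, pi, mu, sigma2):
    """psi(x) = (F_1 .. F_K, G_1 .. G_K) as in eqs. (1)-(2); returns shape (2*K*D,)."""
    gamma = gmm_posteriors(x, pi, mu, sigma2)   # (K,)
    u = (x - mu) / np.sqrt(sigma2)              # (x - mu_k) / sigma_k, componentwise
    F = gamma[:, None] / np.sqrt(pi)[:, None] * u                   # eq. (1)
    G = gamma[:, None] / np.sqrt(2.0 * pi)[:, None] * (u ** 2 - 1)  # eq. (2)
    return np.concatenate([F.ravel(), G.ravel()])
```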
To represent an image, X = {x_1, …, x_M}, one averages the vector representations of all descriptors, ψ(X) = (1/M) ∑_{i=1}^M ψ(x_i). The result is called the Fisher vector of the image, since for any two images X and X′, the inner product ψ(X)^⊤ ψ(X′) is approximately equal to the Fisher kernel, k(X, X′), that is induced by the GMM.

In practice it has proven useful to postprocess the Fisher vector further: we compute the signed square root of each vector dimension, d = 1, …, 2KD, and normalize such that the resulting vector has unit L2-norm [31]

φ_d(X) = sign(ψ_d(X)) √|ψ_d(X)| / √(‖ψ(X)‖_{L1})    (3)

For simplicity we refer to the resulting vectors again as Fisher vectors and to their inner product as the Fisher kernel.
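Continuing the sketch above (still only an illustration, with our own hypothetical names), the image-level Fisher vector with the normalization of eq. (3) could be computed as:

```python
def fisher_vector(X, pi, mu, sigma2, eps=1e-12):
    """Normalized Fisher vector phi(X) for an image given as an (M, D) array of descriptors."""
    # average the per-descriptor encodings: psi(X) = (1/M) * sum_i psi(x_i)
    psi = np.mean([fisher_encoding(x, pi, mu, sigma2) for x in X], axis=0)
    # eq. (3): signed square root of each dimension, scaled to unit L2-norm
    phi = np.sign(psi) * np.sqrt(np.abs(psi))
    return phi / np.sqrt(np.sum(np.abs(psi)) + eps)   # sqrt of the L1-norm of psi
```

The inner product of two such vectors approximates the Fisher kernel k(X, X′) between the corresponding images.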
The main advantage of having such an explicitly computable vector representation is that one can use an SVM in its primal (or linear) form. This allows training a state-of-the-art object categorization system for thousands of classes from millions of training images within just a few hours [1].
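Since φ(X) is an explicit finite-dimensional vector, the kernel matrix never has to be formed; any linear SVM solver can be trained directly on the stacked Fisher vectors. A minimal illustration with scikit-learn (our own example, not the paper's training pipeline; train_images and labels are assumed to be given, and fisher_vector is the sketch above):

```python
import numpy as np
from sklearn.svm import LinearSVC

# train_images: list of (M_i, D) descriptor arrays, one per image; labels: class labels
features = np.stack([fisher_vector(X, pi, mu, sigma2) for X in train_images])
clf = LinearSVC(C=1.0)   # primal linear SVM (squared hinge loss, one-vs-rest for many classes)
clf.fit(features, labels)
```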
2.2. Image categorization with CNNs

Convolutional neural networks (CNNs) [22] are feed-forward architectures that consist of multiple interconnected layers. Each layer computes a non-linear function using the outputs of the previous layer as its inputs. For the first layer the input is the image itself. In the last layer each output is associated with one of the target classes, and its value is used as a measure of confidence whether the class is present in the image or not.
Within each layer, multiple computing elements, the neurons, operate in parallel. Each neuron computes one output value using the same computational rule as the other neurons in the layer, but based on a different receptive field, i.e. a subset of the available inputs. The actual computation within each neuron has two phases: first, a linear map is applied to the inputs. For CNNs, these maps are convolutions of the inputs with (learned) filter masks. Afterwards, a non-linear transformation is applied to the result, for example a sigmoid or a thresholding [29], and often also a spatial pooling or subsampling operation is applied. Improved classification results have been reported when additionally a groupwise normalization of several neurons' outputs is performed, e.g. by dividing each neuron's output by the L_p-norm of the outputs of a neighborhood [18].
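To make the two-phase computation concrete, here is a deliberately naive NumPy sketch of a single layer (our own illustration; a real CNN would use an optimized convolution routine): a linear phase that convolves the input maps with a learned filter bank, followed by a thresholding non-linearity and 2×2 spatial max-pooling.

```python
import numpy as np

def cnn_layer(inputs, filters, bias):
    """One CNN layer: linear map (filter masks), non-linearity, spatial pooling.

    inputs  : (C, H, W)    feature maps from the previous layer (or the image itself)
    filters : (F, C, k, k) learned filter masks
    bias    : (F,)
    """
    C, H, W = inputs.shape
    F, _, k, _ = filters.shape
    Ho, Wo = H - k + 1, W - k + 1
    linear = np.empty((F, Ho, Wo))
    for f in range(F):                                   # linear phase
        for i in range(Ho):
            for j in range(Wo):
                patch = inputs[:, i:i + k, j:j + k]      # receptive field of this neuron
                linear[f, i, j] = np.sum(patch * filters[f]) + bias[f]
    activ = np.maximum(linear, 0.0)                      # non-linear phase: thresholding
    Hp, Wp = Ho // 2, Wo // 2                            # 2x2 max-pooling / subsampling
    return activ[:, :Hp * 2, :Wp * 2].reshape(F, Hp, 2, Wp, 2).max(axis=(2, 4))
```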
Training CNNs is usually done by classical backpropagation [33], which essentially is a joint stochastic gradient descent optimization of all network parameters. The main challenge lies in computing the gradient in an efficient way. For the last layer this is straightforward, since these parameters have an immediate effect on the network's output. For parameters in deeper layers (further away from the output), analytic expressions can be obtained by repeated invocations of the chain rule. CNNs also allow for regularization, either explicitly by weight decay [27] or implicitly by early stopping [8].
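The two regularization mechanisms can be illustrated with a small sketch (ours, not the paper's training code) for the simplest case mentioned above, a single linear output unit trained with squared hinge loss: stochastic gradient descent with explicit weight decay, and early stopping based on a held-out validation set.

```python
import numpy as np

def sgd_last_layer(X, y, X_val, y_val, lr=1e-3, weight_decay=1e-4,
                   max_epochs=100, patience=5):
    """SGD for one linear output unit with squared hinge loss; labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    best_w, best_err, bad_epochs = w.copy(), np.inf, 0
    for epoch in range(max_epochs):
        for i in np.random.permutation(len(X)):
            margin = max(0.0, 1.0 - y[i] * w.dot(X[i]))
            grad = -2.0 * margin * y[i] * X[i] + weight_decay * w   # loss gradient + weight decay
            w -= lr * grad
        val_err = np.mean(np.sign(X_val @ w) != y_val)              # held-out error
        if val_err < best_err:
            best_w, best_err, bad_epochs = w.copy(), val_err, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                              # early stopping
                break
    return best_w
```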
SVM objective function with squared Hinge loss, and treat it as a function not just of the weight vector, w, but also of the GMM, G = (π, µ, σ),