Abstract
Automatic image annotation (AIA) is an effective technology for improving the per-
formance of image retrieval from large image collections by text queries. In this
paper, we propose a novel AIA scheme based on finding the most similar images
using a distance metric and then transferring their annotations to the query image.
The proposed method can effectively deal with the problem of determining the
length of the resulting annotation. We propose an iterative refinement optimization
algorithm to find the best parameters of the annotator. We evaluate the proposed
method with two similarity criteria, the Minkowski distance and the Jensen-Shannon
divergence. The proposed solution outperforms the current state-of-the-art methods
and could be treated as a new baseline for other automatic image annotation methods.
1 Introduction
Traditional search engines were concerned only with textual data when providing
the information the user was looking for. Nowadays we can observe a trend where
other modalities are also becoming important in retrieval tasks. There is an enor-
mous amount of visual data available on the Internet, as well as in off-line image
databases. The automatic analysis of visual data is not a trivial task.
Traditional, so-called Text-Based Image Retrieval (TBIR) dealt with this prob-
lem by retrieving images using textual information available in the same document
as the target image. This approach suffers from the small correlation between
textual descriptions and visual data Goodrum (2000). On the other hand, not
every image database has sufficiently rich meta-data describing its images to perform
efficient retrieval. In contrast to TBIR, Content-Based Image Retrieval (CBIR)
deals only with content-based visual cues. For this purpose, manually created se-
mantic labels (tags, annotations) can be used for image retrieval. The process of
labelling is tedious, costly and error-prone, so there is a clear need for an automatic
method of labelling images. Thus the goal of automatic image annotation is to
assign semantic labels to images. The assigned labels can then be used in several
ways, most notably in search engines.
∗ This work is partially financed from the Ministry of Science and Higher Education Repub-
lic of Poland resources in 2008–2010 years as a Poland–Singapore joint research project 65/N-
SINGAPORE/2007/0.
2 O. Maier, M. Stanek and H. Kwasnicka
There are many reasons why automatic image annotation is a difficult task.
We name just a few among them. The number of classes is usually very large (in
other words: the size of the label dictionary W is large). Available training data
is often weakly annotated, i.e., annotations are often incomplete and may contain
errors as shown in Carneiro et al. (2007). Last but not least, there is no direct
correspondence between visual features and semantic labels.
There has been a plethora of studies on automatic image annotation that utilize
machine learning techniques to learn statistical models from annotated images
and apply them to generate annotations for unseen images. Most state-of-the-art
approaches can be classified into two categories: probabilistic modelling methods
and classification methods.
Among the first category we name just a few especially interesting methods:
Hierarchical Probabilistic Mixture Model (HPMM) by Hironobu et al. (1999),
Translation Model (TM) by Duygulu et al. (2002), Supervised Multi-class La-
belling (SML) by Carneiro et al. (2007), Continuous Relevance Model (CRM) by
Lavrenko et al., and Multiple Bernoulli Relevance Models (MBRM) by Feng et al.
(2004).
The CRM method is based on Bayes' theorem and uses a non-parametric
approach: a Parzen estimator combined with a one-dimensional Gaussian kernel
is used for density estimation. MBRM is an extension of CRM based on the
Bernoulli relevance models, which outperforms the other methods as reported
by the authors in Feng et al. (2004). Following Carneiro et al. (2007); Makadia
et al. (2008), those methods were used as a reference baseline by many researchers
working on the problem of image annotation.
The methods of the second category try to find correlations between words and
visual features by training classifiers. The Bayes Point Machine by Chang et al. (2003),
Support Vector Machines by Cusano et al. (2004) and Decision Trees by Kwasnicka
and Paradowski (2008) estimate the visual feature distributions associated
with each word.
There are also methods that try to improve the output of other image anno-
tation methods. GRWCO, proposed by Kwasnicka and Paradowski (2008), can
be used to improve the average recall and precision of automatic annotators by
reducing the difference between the expected and resulting word count vectors, as
shown by Kwasnicka and Paradowski (2006). Annotation refinement can also be
achieved by using WordNet, which contains semantic relations between words Jin
et al. (2005). Word co-occurrence models coupled with fast random walks are
used in IRWR by Llorente et al. (2009) for re-ranking the output annotations.
Recently, Makadia et al. (2008) proposed a family of baseline methods that
are built on the hypothesis that visually similar images are likely to share the
same annotations. They treat image annotation as a process of transferring labels
from the nearest neighbours. Makadia's method does not solve the fundamental
problem of determining the number of annotations that should be assigned to the
target image; thus they assume a constant number of annotations per image. The
transfer is performed in two steps: all annotations from the most similar image
are rewritten, and then the most frequent words are chosen from the whole neighbour-
hood until a given annotation length has been achieved. They also combine many
similarity measures to obtain the subset of the most similar images.
PATSI – Photo Annotation through Similar Images 3
In this paper we propose a simple method for Photo Annotation through Find-
ing Similar Images (PATSI) based on the hypothesis that similar images should
share a large part of their annotations. The high accuracy achieved by the proposed
method on standard benchmark image datasets, in conjunction with the simplicity
of the method and its computational efficiency, makes it a perfect candidate
for a baseline in the field of automatic image annotation.
The proposed method also solves the difficult problem of choosing the appro-
priate number of annotations assigned to the target image. For this purpose
we propose a transfer-parameter optimization method which tunes the resulting
number of words associated with the image.
This article is organized as follows. In the next section we describe the proposed
method, with particular emphasis on the transfer function, the similarity criteria and
the feature sets used. The following section describes the experiments and achieved
results. The paper finishes with conclusions and remarks on possible further
improvements of the method.
where all visual features are m-dimensional vectors of low-level attributes
v_i^I = [x_1^{i,I}, \dots, x_m^{i,I}]. The visual feature vectors represent statistical
information about color and texture in a selected area of the image I.
To obtain the similarity, or rather dissimilarity, between images, one can mea-
sure the distance between vectors or the divergence between distributions built on
the visual vectors.
d_{MK}(A, B) = \left( \sum_{i=1}^{n} \left| v_i^A - v_i^B \right|^p \right)^{1/p},  (2)
where p is the Minkowski factor for the norm. In particular, when p equals one
or two, it is the well-known L1 or Euclidean distance, respectively.
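The Minkowski distance of Eq. (2) can be sketched in a few lines; the function name and the plain-list vector format are our illustration, not the authors' code:

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance between two feature vectors (Eq. 2).

    p=1 gives the L1 (city-block) distance, p=2 the Euclidean distance.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

# For p=2 this reduces to the familiar Euclidean norm:
# minkowski_distance([0, 0], [3, 4], p=2) == 5.0
```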
d_{JS}(A, B) = \frac{1}{2} D_{KL}(M^A \| M^B) + \frac{1}{2} D_{KL}(M^B \| M^A),  (4)

where M^A, M^B are the models (PDFs) of images A and B, and D_{KL} is the
Kullback-Leibler divergence, which for multivariate normal distributions takes the form:
D_{KL}(M^A \| M^B) = \frac{1}{2} \log_e \frac{\det \Sigma_B}{\det \Sigma_A}
+ \frac{1}{2} \operatorname{tr}\left(\Sigma_B^{-1} \Sigma_A\right)
+ \frac{1}{2} (\mu_B - \mu_A)^\top \Sigma_B^{-1} (\mu_B - \mu_A) - \frac{N}{2},  (5)
where Σ_A, Σ_B and μ_A, μ_B are the covariance matrices and mean vectors of
image models A and B, respectively.
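Equations (4)-(5) can be sketched as follows, assuming each image model is given by a mean vector and a covariance matrix; the function names are ours, and the code follows Eq. (4) as written (a symmetrized KL divergence):

```python
import numpy as np

def kl_gaussian(mu_a, cov_a, mu_b, cov_b):
    """Kullback-Leibler divergence D_KL(A || B) between two
    multivariate normal distributions (Eq. 5)."""
    n = len(mu_a)
    cov_b_inv = np.linalg.inv(cov_b)
    diff = mu_b - mu_a
    return 0.5 * (np.log(np.linalg.det(cov_b) / np.linalg.det(cov_a))
                  + np.trace(cov_b_inv @ cov_a)
                  + diff @ cov_b_inv @ diff
                  - n)

def symmetrized_divergence(mu_a, cov_a, mu_b, cov_b):
    """The symmetrized divergence of Eq. (4) applied to Gaussian models."""
    return 0.5 * (kl_gaussian(mu_a, cov_a, mu_b, cov_b)
                  + kl_gaussian(mu_b, cov_b, mu_a, cov_a))
```

Note that the divergence of identical models is zero, and the symmetrization makes the measure order-independent.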
To assure that labels from more similar images have a larger impact on the resulting
annotation, we define ϕ as

\varphi(r_i) = \frac{1}{i},  (6)

where r_i is the image at position i in the ranking. All words associated with image
r_i are then transferred to the resulting annotation with the value 1/i. If a word has
been transferred before, the transferred values are summed.
The resulting query-image annotation consists of all words whose transfer
value is greater than a specified threshold t. The threshold value t has an
impact on the annotation length; its optimal value, as well as the optimal number
of neighbours k taken into account during annotation, must be found by an
optimization process. The outline of the PATSI annotation method is summarized
in Algorithm 1.
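A minimal sketch of this transfer step (rank the training images by distance, accumulate the 1/i weights of Eq. (6), then keep the words above threshold t); the data layout and function names are our assumptions, not the authors' code:

```python
def patsi_annotate(query_features, train_images, distance, k, t):
    """Sketch of the PATSI label-transfer step.

    train_images: list of (feature_vector, set_of_words) pairs.
    The k nearest training images are ranked by the given distance;
    every word of the image at rank i contributes 1/i (Eq. 6) to its
    transfer value, and words above threshold t form the annotation.
    """
    ranked = sorted(train_images,
                    key=lambda img: distance(query_features, img[0]))
    scores = {}
    for i, (_, words) in enumerate(ranked[:k], start=1):
        for w in words:
            scores[w] = scores.get(w, 0.0) + 1.0 / i
    return {w for w, s in scores.items() if s > t}
```

Raising t shortens the resulting annotation, which is exactly the knob the optimization step below tunes.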
We can denote the PATSI annotator as A_{t,k}(I | d, ϕ, D), where the annotation of
image I depends on the training dataset D, the distance measure d, and the transfer
function ϕ. Annotation quality can be improved by adjusting the threshold t and
the number of neighbours k.
where k* and t* are the optimal settings of k and t with respect to a quality
function. Their values differ greatly not only between databases, but also
between feature sets, distance measures and transfer functions. No single choice
is suitable in all cases, so they need to be adjusted in each particular setting.
Figure 1 shows the dependency of the annotation quality (precision, recall, f-score)
on the parameters t and k.
[Figure 1: Annotation quality (rating) as a surface over the number of neighbours
k and the threshold t, with the 20 highest maxima marked: (a) precision, (b) recall,
(c) f-score.]
The algorithm requires the search area on which it operates, given by the
boundaries k− to k+ and t− to t+. For the continuous threshold value t, an
initial grid step size t_s and a grid step divider must also be given. The grid step size
for the neighbour count k is fixed to 1. The number of interesting areas investigated
further in each iteration step is set by M. Finally, a stop condition serving as the
minimal improvement over the investigated areas has to be supplied.
As the initial step, we create a set of points P lying on the grid within the given
boundaries. For each of these points (k, t) the quality measure φ(k, t) is retrieved,
and from the resulting set S the subset of M elements with the highest quality
measure is selected as S*. Then new points of interest are collected into P
by investigating the small areas around these maxima. These steps are repeated
until the relative improvement ε expected in the next step is lower than the
stop condition.
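The refinement loop described above can be sketched as follows; the neighbourhood probed around each maximum, the parameter names, and the simple absolute-improvement stop test are our simplifying assumptions, not the authors' implementation:

```python
def iterative_refinement(quality, k_bounds, t_bounds, t_step,
                         divider=2.0, m=3, eps=1e-3, max_iter=10):
    """Sketch of the iterative grid-refinement search for (k*, t*).

    quality(k, t) plays the role of the quality measure phi(k, t).
    Returns (best quality, k*, t*).
    """
    cache = {}  # buffer of already queried (k, t) values

    def q(k, t):
        key = (k, round(t, 9))
        if key not in cache:
            cache[key] = quality(k, t)
        return cache[key]

    k_lo, k_hi = k_bounds
    t_lo, t_hi = t_bounds
    # initial grid; the step for k is fixed to 1 as in the paper
    steps = int(round((t_hi - t_lo) / t_step))
    points = {(k, t_lo + i * t_step)
              for k in range(k_lo, k_hi + 1) for i in range(steps + 1)}
    best = max((q(k, t), k, t) for k, t in points)
    for _ in range(max_iter):
        # keep the m points with the highest quality measure
        maxima = sorted(((q(k, t), k, t) for k, t in points),
                        reverse=True)[:m]
        t_step /= divider  # refine the grid around the maxima
        points = {(min(max(k + dk, k_lo), k_hi),
                   min(max(t + dt * t_step, t_lo), t_hi))
                  for _, k, t in maxima
                  for dk in (-1, 0, 1) for dt in (-1, 0, 1)}
        new_best = max((q(k, t), k, t) for k, t in points)
        improvement = new_best[0] - best[0]
        if new_best[0] > best[0]:
            best = new_best
        if improvement < eps:  # stop condition
            break
    return best
```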
The complexity of the approach can be further reduced by introducing a
buffer that stores the already queried function values, thus omitting some of the
costly function calls, since the investigated areas often overlap.
3 Evaluation
In this section we present an experimental evaluation of the proposed method.
The experiments were divided into two parts. The first part served to check the
effectiveness of the proposed annotation method with different similarity measures
and different types of visual features. In this part we also investigated the
performance of the proposed iterative refinement optimization algorithm. The second
part of the experiments focuses on the quality of annotation obtained by the proposed
PATSI method in comparison with state-of-the-art literature methods.
For evaluation purposes we use three quality measures: precision, recall and
F-score. The precision of an annotation determines how often the word w was
used correctly in the annotated image collection, i.e., it is the ratio of correct
occurrences of word w to all occurrences of word w. Precision is usually supple-
mented with recall, a measure that indicates how many of the images that should
be annotated with the word w have been annotated correctly with this word. The
higher the precision and recall, the better. Usually both measures are combined
into the F-score, i.e., the harmonic mean of precision and recall.
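The three measures can be sketched per word as follows; representing each image's annotation as a set of words is our assumption about the data format:

```python
def word_scores(true_ann, pred_ann, word):
    """Per-word precision, recall and F-score.

    true_ann, pred_ann: lists of word sets, one entry per image.
    """
    tp = sum(1 for t, p in zip(true_ann, pred_ann)
             if word in t and word in p)           # correct uses of the word
    pred = sum(1 for p in pred_ann if word in p)   # all predicted uses
    rel = sum(1 for t in true_ann if word in t)    # all images needing the word
    precision = tp / pred if pred else 0.0
    recall = tp / rel if rel else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```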
For the evaluation process we use three benchmark data sets: ICPR 2004 (2004),
MGV 2006 Paradowski (2008) and IAPR TC-12 Grubinger et al. (2006),
whose characteristics are shown in Table 1.
1. Mean values of H, S and V (in HSV color space) and their standard deviations
2. Mean values of R, G and B (in RGB color space) and their standard deviations
3. Normalized X and Y coordinates of the region center, mean R, G and B (in
RGB color space), standard deviations of R, G and B, and the mean eigenvalue of
the color Hessian computed in RGB color space
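As an illustration, the first feature set (per-region means and standard deviations) might be computed as below; the pixel-array layout and function name are our assumptions, not the authors' extraction pipeline:

```python
import numpy as np

def region_features_set1(region_hsv):
    """Feature Set 1: mean and std. deviation of H, S, V over a region.

    region_hsv: array of shape (num_pixels, 3) holding the HSV values
    of all pixels in the region; returns a 6-dimensional feature vector.
    """
    return np.concatenate([region_hsv.mean(axis=0), region_hsv.std(axis=0)])
```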
Similarity        Features  Precision  Recall   F-score  t*       k*  Evaluations  Iterations

Standard Brute-Force
Jensen–Shannon    Set 1     0.33135    0.38358  0.35556  0.9      12  513          -
                  Set 2     0.39700    0.43754  0.41629  1.0      27  513          -
                  Set 3     0.41712    0.41477  0.41594  1.1      28  513          -
Minkowski L1      Set 1     0.04321    0.27027  0.07451  0.1      27  513          -
                  Set 2     0.22210    0.58392  0.32180  0.3      5   513          -
                  Set 3     0.24078    0.48260  0.32127  0.4      6   513          -
Minkowski L2      Set 1     0.37524    0.30783  0.33821  1.1      29  513          -
                  Set 2     0.33137    0.29354  0.31131  1.1      30  513          -
                  Set 3     0.31881    0.27287  0.29406  1.1      24  513          -

Iterative Refinement
Jensen–Shannon    Set 1     0.37353    0.34348  0.35788  1.06     14  258/334      8
                  Set 2     0.40165    0.42775  0.41429  0.785    8   236/313      9
                  Set 3     0.41599    0.43371  0.42467  1.015    23  229/292      8
Minkowski L1      Set 1     0.04321    0.27027  0.07451  0.1      29  81/81        1
                  Set 2     0.22210    0.58392  0.32180  0.26     4   129/141      3
                  Set 3     0.24078    0.48260  0.32127  0.38     6   182/214      5
Minkowski L2      Set 1     0.37575    0.30922  0.33926  1.06625  25  270/386      10
                  Set 2     0.32703    0.29916  0.31247  1.06     29  234/309      9
                  Set 3     0.30948    0.29102  0.29997  1.00875  24  324/462      11
4 Conclusion
In this paper, we proposed an automatic image annotation method based on the hy-
pothesis that images similar in appearance are likely to share the same annotations.
The proposed method is efficient (in terms of both accuracy and low computational
complexity) even for very basic grid image segmentation and feature sets. PATSI
is similar to the work of Makadia et al. (2008) in its main idea and basic hy-
pothesis, but differs in a few important ways. First, we have improved the
method for transferring annotations to the query image. We use simpler models
both for computing similarities between images and for extracting features from
images. Last but not least, PATSI is able to automatically determine the number of
annotations that should be transferred to the query image. We believe that these
properties make PATSI a better baseline method than the work of Makadia et al.
(2008).
In further work, we plan to analyse the method's performance with different feature
sets. We would also like to explore an automatic method for optimizing the
parameter t separately for each annotated word. It would also be interesting
to combine many similarity measures in the annotation transfer process.
A very important advantage of the PATSI method is the high recall obtained
on all datasets, which indicates that the final annotations can be further
improved by wrapper methods Kwasnicka and Paradowski (2008); Jin et al.
(2005); Llorente et al. (2009).
References
ICPR 2004 (2004), image database, http://www.cs.washington.edu/research/.
Gustavo Carneiro, Antoni Chan, Pedro Moreno, and Nuno Vasconcelos (2007),
Supervised Learning of Semantic Classes for Image Annotation and Retrieval, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 29(3):394–410, ISSN 0162-
8828.
E. Chang, Kingshy Goh, G. Sychay, and Gang Wu (2003), CBSA: content-based soft
annotation for multimodal image retrieval using Bayes point machines, Circuits and
Systems for Video Technology, IEEE Transactions on, 13(1):26–38.
C. Cusano, G. Ciocca, and R. Schettini (2004), Image annotation using SVM, Pro-
ceedings of SPIE, 5304:330–338.
P. Duygulu, Kobus Barnard, J. F. G. de Freitas, and David A. Forsyth (2002),
Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image
Vocabulary, in Proc. of the 7th European Conf. on Computer Vision, Springer, London,
UK, ISBN 3540437487.
S. L. Feng, R. Manmatha, and V. Lavrenko (2004), Multiple Bernoulli Relevance
Models for Image and Video Annotation, Computer Vision and Pattern Recognition,
IEEE Computer Society Conference on, 2:1002–1009, ISSN 1063-6919.
Abby Goodrum (2000), Image Information Retrieval: An Overview of Current Research,
Informing Science, 3:2000.
Michael Grubinger, Clough Paul D., Müller Henning, and Deselaers Thomas (2006),
The IAPR Benchmark: A New Evaluation Resource for Visual Information Systems,
in International Conference on Language Resources and Evaluation, Genoa, Italy.
Yasuhide M. Hironobu, Hironobu Takahashi, and Ryuichi Oka (1999), Image-to-Word
Transformation Based on Dividing and Vector Quantizing Images With Words, in
Neural Networks, volume 4.
Yohan Jin, Latifur Khan, Lei Wang, and Mamoun Awad (2005), Image annotations
by combining multiple evidence & wordNet, in MULTIMEDIA ’05: Proceedings of
the 13th annual ACM international conference on Multimedia, ACM, New York, NY,
USA, ISBN 1-59593-044-2.
Halina Kwasnicka and Mariusz Paradowski (2006), Multiple Class Machine Learning
Approach for an Image Auto-Annotation Problem, in ISDA ’06: Proceedings of the
Sixth International Conference on Intelligent Systems Design and Applications, pp.
347–352, IEEE Computer Society, Washington, DC, USA, ISBN 0-7695-2528-8.
Halina Kwasnicka and Mariusz Paradowski (2008), Resulted word counts
optimization-A new approach for better automatic image annotation, Pattern Recogn.,
41(12), ISSN 0031-3203.
V. Lavrenko, R. Manmatha, and J. Jeon (2003), A Model for Learning the Semantics of
Pictures, in Advances in Neural Information Processing Systems.
Ainhoa Llorente, Enrico Motta, and Stefan Ruger (2009), Image Annotation Refine-
ment Using Web-Based Keyword Correlation, in SAMT ’09: Proceedings of the 4th
International Conference on Semantic and Digital Media Technologies, pp. 188–191,
Springer-Verlag, Berlin, Heidelberg, ISBN 978-3-642-10542-5.
Ameesh Makadia, Vladimir Pavlovic, and Sanjiv Kumar (2008), A New Baseline for
Image Annotation, in ECCV ’08: Proceedings of the 10th European Conference on
Computer Vision, pp. 316–329, Springer-Verlag, Berlin, Heidelberg, ISBN 978-3-540-
88689-1.
Geoffrey J. McLachlan and Thriyambakam Krishnan (2008), The EM Algorithm and
Extensions (Wiley Series in Probability and Statistics), Wiley-Interscience, 2 edition,
ISBN 0471201707.
Mariusz Paradowski (2008), Metody automatycznej anotacji jako wydajne narzedzie
opisujace kolekcje obrazow [Automatic annotation methods as an efficient tool for
describing image collections], Ph.D. thesis, Wroclaw University of Technology.