Abstract
Automatic image annotation (AIA) is an effective technology for improving the per-
formance of image retrieval from large image collections by text queries. In this
paper, we propose a novel AIA scheme based on finding the most similar images
using a distance metric and then transferring their annotations to the query image.
The proposed method can effectively deal with the problem of determining the
length of the resulting annotation. We propose an iterative refinement optimization
algorithm to find the best parameters of the annotator. We evaluate the proposed
method with two similarity criteria, the Minkowski distance and the Jensen-Shannon
divergence. The proposed solution outperforms the current state-of-the-art methods
and could be treated as a new baseline for other automatic image annotation methods.
1 Introduction
Traditional search engines were concerned only with textual data when providing
the information the user was looking for. Nowadays we can observe a trend where
other modalities are also becoming important in retrieval tasks. There is an enor-
mous amount of visual data available on the Internet, as well as in off-line image
databases. The automatic analysis of visual data is not a trivial task.
Traditional, so-called Text-Based Image Retrieval (TBIR) dealt with this prob-
lem by retrieving images using textual information available in the same document
as the target image. This approach suffers from the small correlation between
textual descriptions and visual data Goodrum (2000). On the other hand, not
every image database has sufficiently rich meta-data describing its images to perform
efficient retrieval. In contrast to TBIR, Content-Based Image Retrieval (CBIR)
deals only with content-based visual cues. For this purpose, manually created se-
mantic labels (tags, annotations) can be used for image retrieval. The process of
labelling is tedious, costly and error-prone, so there is a clear need for an automatic
method of labelling images. Thus the goal of automatic image annotation is to
assign semantic labels to images. The assigned labels can then be used in several
ways, most notably in search engines.
∗ This work is partially financed from the Ministry of Science and Higher Education Repub-
lic of Poland resources in 2008–2010 years as a Poland–Singapore joint research project 65/N-
SINGAPORE/2007/0.
2 O. Maier, M. Stanek and H. Kwasnicka
There are many reasons why automatic image annotation is a difficult task.
We name just a few among them. The number of classes is usually very large (in
other words: the size of the label dictionary W is large). Available training data
is often weakly annotated, i.e., annotations are often incomplete and may contain
errors as shown in Carneiro et al. (2007). Last but not least, there is no direct
correspondence between visual features and semantic labels.
There has been a plethora of studies on automatic image annotation that utilize
machine learning techniques to learn statistical models from annotated images
and apply them to generate annotations for unseen images. Most state-of-the-art
approaches can be classified into two categories: probabilistic modelling methods
and classification methods.
Among the first category we name just a few especially interesting methods:
Hierarchical Probabilistic Mixture Model (HPMM) by Hironobu et al. (1999),
Translation Model (TM) by Duygulu et al. (2002), Supervised Multi-class La-
belling (SML) by Carneiro et al. (2007), Continuous Relevance Model (CRM) by
Lavrenko et al., and Multiple Bernoulli Relevance Models (MBRM) by Feng et al.
(2004).
The CRM method is based on Bayes' theorem and uses a non-parametric
approach: a Parzen estimator combined with a one-dimensional Gaussian kernel
is used for density estimation. MBRM is an extension of CRM based on the
Bernoulli relevance models, which outperforms the other methods as reported
by the authors in Feng et al. (2004). Following Carneiro et al. (2007); Makadia
et al. (2008), those methods were used as a reference baseline by many researchers
working on the problem of image annotation.
The methods of the second category try to find correlations between words and
visual features by training classifiers. The Bayes Point Machine by Chang et al. (2003),
Support Vector Machines by Cusano et al. (2004) and Decision Trees by Kwasnicka
and Paradowski (2008) estimate the visual feature distributions associated
with each word.
There are also methods that try to improve the output of other image anno-
tation methods. GRWCO, proposed by Kwasnicka and Paradowski (2008), can
be used to improve the average recall and precision of automatic annotators by
reducing the difference between the expected and resulting word count vectors, as
shown by Kwasnicka and Paradowski (2006). Annotation refinement can also be
achieved by using WordNet, which contains semantic relations between words Jin
et al. (2005). Word co-occurrence models coupled with fast random walks are
used in IRWR by Llorente et al. (2009) for re-ranking the output annotations.
Recently, Makadia et al. (2008) proposed a family of baseline methods that
are built on the hypothesis that visually similar images are likely to share the
same annotations. They treat image annotation as a process of transferring labels
from the nearest neighbours. Makadia's method does not solve the fundamental
problem of determining the number of annotations that should be assigned to the
target image; thus they assume a constant number of annotations per image. The
transfer is performed in two steps: all annotations from the most similar image
are rewritten, and then the most frequent words are chosen from the whole neighbour-
hood until a given annotation length has been achieved. They also combine many
similarity measures to obtain the subset of the most similar images.
PATSI – Photo Annotation through Similar Images 3
In this paper we propose a simple method for Photo Annotation through Find-
ing Similar Images (PATSI) based on the hypothesis that similar images should
share a large part of their annotations. The high accuracy achieved by the proposed
method on standard benchmark image datasets, in conjunction with the simplicity
of the method and its computational efficiency, makes it a perfect candidate
for a baseline in the field of automatic image annotation.
The proposed method also solves the difficult problem of choosing the appro-
priate number of annotations assigned to the target image. For this purpose
we propose a transfer-parameter optimization method which tunes the resulting
number of words associated with the image.
This article is organized as follows. In the next section we describe the proposed
method, with particular emphasis on the transfer function, the similarity criteria and
the feature sets used. The following section describes the experiments and achieved
results. The paper finishes with conclusions and remarks on possible further
improvements of the method.
where all visual features are m-dimensional vectors of low-level attributes
v_i^I = [x_1^{i,I}, \dots, x_m^{i,I}]. The visual feature vectors represent statistical
information about color and texture in a selected area of the image I.
To obtain the similarity, or rather dissimilarity, between images, one can mea-
sure the distance between vectors or the divergence between distributions built on
the visual vectors.
d_{MK}(A, B) = \left( \sum_{i=1}^{n} \left| v_i^A - v_i^B \right|^p \right)^{1/p},  (2)
where p is the Minkowski factor for the norm. In particular, when p equals one
or two, it is the well-known L1 or Euclidean distance, respectively.
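The Minkowski distance of Eq. (2) can be sketched in a few lines; the function name and the plain-list vector format are our illustration, not the authors' code:

```python
import numpy as np

def minkowski_distance(a, b, p=2):
    """Minkowski distance between two feature vectors (Eq. 2).

    p=1 gives the L1 (city-block) distance, p=2 the Euclidean distance.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

# For p=2 this reduces to the familiar Euclidean norm:
# minkowski_distance([0, 0], [3, 4], p=2) == 5.0
```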
d_{JS}(A, B) = \frac{1}{2} D_{KL}(M^A \| M^B) + \frac{1}{2} D_{KL}(M^B \| M^A),  (4)

where M^A, M^B are the models (PDFs) of images A and B, and D_{KL} is the
Kullback-Leibler divergence, which for multivariate normal distributions takes the form:
D_{KL}(M^A \| M^B) = \frac{1}{2} \log_e \frac{\det \Sigma_B}{\det \Sigma_A}
+ \frac{1}{2} \operatorname{tr}\left(\Sigma_B^{-1} \Sigma_A\right)
+ \frac{1}{2} (\mu_B - \mu_A)^\top \Sigma_B^{-1} (\mu_B - \mu_A) - \frac{N}{2},  (5)
where Σ_A, Σ_B and μ_A, μ_B are the covariance matrices and mean vectors of
image models A and B, respectively.
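Equations (4)-(5) can be sketched as follows, assuming each image model is given by a mean vector and a covariance matrix; the function names are ours, and the code follows Eq. (4) as written (a symmetrized KL divergence):

```python
import numpy as np

def kl_gaussian(mu_a, cov_a, mu_b, cov_b):
    """Kullback-Leibler divergence D_KL(A || B) between two
    multivariate normal distributions (Eq. 5)."""
    n = len(mu_a)
    cov_b_inv = np.linalg.inv(cov_b)
    diff = mu_b - mu_a
    return 0.5 * (np.log(np.linalg.det(cov_b) / np.linalg.det(cov_a))
                  + np.trace(cov_b_inv @ cov_a)
                  + diff @ cov_b_inv @ diff
                  - n)

def symmetrized_divergence(mu_a, cov_a, mu_b, cov_b):
    """The symmetrized divergence of Eq. (4) applied to Gaussian models."""
    return 0.5 * (kl_gaussian(mu_a, cov_a, mu_b, cov_b)
                  + kl_gaussian(mu_b, cov_b, mu_a, cov_a))
```

Note that the divergence of identical models is zero, and the symmetrization makes the measure order-independent.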
To assure that labels from more similar images have a larger impact on the resulting
annotation, we define ϕ as

\varphi(r_i) = \frac{1}{i},  (6)

where r_i is the image at position i in the ranking. All words associated with image
r_i are then transferred to the resulting annotation with the value 1/i. If a word has
been transferred before, the transferred values are summed.
The resulting query-image annotation consists of all words whose transfer
value is greater than a specified threshold t. The threshold value t has an
impact on the annotation length; its optimal value, as well as the optimal number
of neighbours k taken into account during annotation, must be found by an
optimization process. The outline of the PATSI annotation method is summarized
in Algorithm 1.
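A minimal sketch of this transfer step (rank the training images by distance, accumulate the 1/i weights of Eq. (6), then keep the words above threshold t); the data layout and function names are our assumptions, not the authors' code:

```python
def patsi_annotate(query_features, train_images, distance, k, t):
    """Sketch of the PATSI label-transfer step.

    train_images: list of (feature_vector, set_of_words) pairs.
    The k nearest training images are ranked by the given distance;
    every word of the image at rank i contributes 1/i (Eq. 6) to its
    transfer value, and words above threshold t form the annotation.
    """
    ranked = sorted(train_images,
                    key=lambda img: distance(query_features, img[0]))
    scores = {}
    for i, (_, words) in enumerate(ranked[:k], start=1):
        for w in words:
            scores[w] = scores.get(w, 0.0) + 1.0 / i
    return {w for w, s in scores.items() if s > t}
```

Raising t shortens the resulting annotation, which is exactly the knob the optimization step below tunes.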
We can denote the PATSI annotator as A_{t,k}(I | d, ϕ, D), where the annotation of
image I depends on the training dataset D, the distance measure d, and the transfer
function ϕ. Annotation quality can be improved by adjusting the threshold t and
the number of neighbours k.
where k* and t* are the optimal settings of k and t with respect to a quality
function. Their values differ greatly not only between databases, but also
between feature sets, distance measures and transfer functions. No single choice
is suitable in all cases, so they need to be adjusted in each particular setting.
Figure 1 shows the dependency of the annotation quality (precision, recall, f-score)
on the parameters t and k.
[Figure 1: Annotation quality (rating) as a surface over the number of neighbours
k and the threshold t, with the 20 highest maxima marked: (a) precision, (b) recall,
(c) f-score.]
The algorithm requires the search area on which it operates, given by the
boundaries k− to k+ and t− to t+. For the continuous threshold value t, an
initial grid step size t_s and a grid step divider must also be given. The grid step size
for the neighbour count k is fixed to 1. The number of interesting areas investigated
further in each iteration step is set by M. Finally, a stop condition serving as the
minimal improvement over the investigated areas has to be supplied.
As the initial step, we create a set of points P lying on the grid within the given
boundaries. For each of these points (k, t) the quality measure φ(k, t) is retrieved,
and from the resulting set S the subset of M elements with the highest quality
measure is selected as S*. Then new points of interest are collected into P
by investigating the small areas around these maxima. These steps are repeated
until the relative improvement ε expected in the next step is lower than the
stop condition.
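The refinement loop described above can be sketched as follows; the neighbourhood probed around each maximum, the parameter names, and the simple absolute-improvement stop test are our simplifying assumptions, not the authors' implementation:

```python
def iterative_refinement(quality, k_bounds, t_bounds, t_step,
                         divider=2.0, m=3, eps=1e-3, max_iter=10):
    """Sketch of the iterative grid-refinement search for (k*, t*).

    quality(k, t) plays the role of the quality measure phi(k, t).
    Returns (best quality, k*, t*).
    """
    cache = {}  # buffer of already queried (k, t) values

    def q(k, t):
        key = (k, round(t, 9))
        if key not in cache:
            cache[key] = quality(k, t)
        return cache[key]

    k_lo, k_hi = k_bounds
    t_lo, t_hi = t_bounds
    # initial grid; the step for k is fixed to 1 as in the paper
    steps = int(round((t_hi - t_lo) / t_step))
    points = {(k, t_lo + i * t_step)
              for k in range(k_lo, k_hi + 1) for i in range(steps + 1)}
    best = max((q(k, t), k, t) for k, t in points)
    for _ in range(max_iter):
        # keep the m points with the highest quality measure
        maxima = sorted(((q(k, t), k, t) for k, t in points),
                        reverse=True)[:m]
        t_step /= divider  # refine the grid around the maxima
        points = {(min(max(k + dk, k_lo), k_hi),
                   min(max(t + dt * t_step, t_lo), t_hi))
                  for _, k, t in maxima
                  for dk in (-1, 0, 1) for dt in (-1, 0, 1)}
        new_best = max((q(k, t), k, t) for k, t in points)
        improvement = new_best[0] - best[0]
        if new_best[0] > best[0]:
            best = new_best
        if improvement < eps:  # stop condition
            break
    return best
```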
The complexity of the approach can be further reduced by introducing a
buffer that stores the already queried function values, thus omitting some of the
costly function calls, since the investigated areas often overlap.
3 Evaluation
In this section we present an experimental evaluation of the proposed method.
The experiments were divided into two parts. The first part served to check the
effectiveness of the proposed annotation method with different similarity measures
and different types of visual features. In this part we also investigated the
performance of the proposed iterative refinement optimization algorithm. The second
part of the experiments focuses on the quality of annotation obtained by the proposed
PATSI method in comparison with state-of-the-art literature methods.
For evaluation purposes we use three quality measures: precision, recall and
F-score. The precision of an annotation determines how often the word w was
used correctly in the annotated image collection, i.e., it is the ratio of correct
occurrences of word w to all occurrences of word w. Precision is usually supple-
mented with recall, a measure that indicates how many of the images that should
be annotated with the word w have been annotated correctly with this word. The
higher the precision and recall, the better. Usually both measures are combined
into the F-score, i.e., the harmonic mean of precision and recall.
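The three measures can be sketched per word as follows; representing each image's annotation as a set of words is our assumption about the data format:

```python
def word_scores(true_ann, pred_ann, word):
    """Per-word precision, recall and F-score.

    true_ann, pred_ann: lists of word sets, one entry per image.
    """
    tp = sum(1 for t, p in zip(true_ann, pred_ann)
             if word in t and word in p)           # correct uses of the word
    pred = sum(1 for p in pred_ann if word in p)   # all predicted uses
    rel = sum(1 for t in true_ann if word in t)    # all images needing the word
    precision = tp / pred if pred else 0.0
    recall = tp / rel if rel else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```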
For the evaluation process we use three benchmark data sets: ICPR 2004 (2004),
MGV 2006 Paradowski (2008) and IAPR TC-12 Grubinger et al. (2006),
whose characteristics are shown in Table 1.
1. Mean values of H, S and V (in HSV color space) and their standard deviations
2. Mean values of R, G and B (in RGB color space) and their standard deviations
3. Normalized X and Y coordinates of the region center, mean R, G and B (in
RGB color space), standard deviations of R, G and B, and the mean eigenvalue of
the color Hessian computed in RGB color space
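As an illustration, the first feature set (per-region means and standard deviations) might be computed as below; the pixel-array layout and function name are our assumptions, not the authors' extraction pipeline:

```python
import numpy as np

def region_features_set1(region_hsv):
    """Feature Set 1: mean and std. deviation of H, S, V over a region.

    region_hsv: array of shape (num_pixels, 3) holding the HSV values
    of all pixels in the region; returns a 6-dimensional feature vector.
    """
    return np.concatenate([region_hsv.mean(axis=0), region_hsv.std(axis=0)])
```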
Similarity        Features  Precision  Recall   F-score  t*       k*  Evaluations  Iterations

Standard Brute-Force
Jensen–Shannon    Set 1     0.33135    0.38358  0.35556  0.9      12  513          -
                  Set 2     0.39700    0.43754  0.41629  1.0      27  513          -
                  Set 3     0.41712    0.41477  0.41594  1.1      28  513          -
Minkowski L1      Set 1     0.04321    0.27027  0.07451  0.1      27  513          -
                  Set 2     0.22210    0.58392  0.32180  0.3      5   513          -
                  Set 3     0.24078    0.48260  0.32127  0.4      6   513          -
Minkowski L2      Set 1     0.37524    0.30783  0.33821  1.1      29  513          -
                  Set 2     0.33137    0.29354  0.31131  1.1      30  513          -
                  Set 3     0.31881    0.27287  0.29406  1.1      24  513          -

Iterative Refinement
Jensen–Shannon    Set 1     0.37353    0.34348  0.35788  1.06     14  258/334      8
                  Set 2     0.40165    0.42775  0.41429  0.785    8   236/313      9
                  Set 3     0.41599    0.43371  0.42467  1.015    23  229/292      8
Minkowski L1      Set 1     0.04321    0.27027  0.07451  0.1      29  81/81        1
                  Set 2     0.22210    0.58392  0.32180  0.26     4   129/141      3
                  Set 3     0.24078    0.48260  0.32127  0.38     6   182/214      5
Minkowski L2      Set 1     0.37575    0.30922  0.33926  1.06625  25  270/386      10
                  Set 2     0.32703    0.29916  0.31247  1.06     29  234/309      9
                  Set 3     0.30948    0.29102  0.29997  1.00875  24  324/462      11
4 Conclusion
In this paper, we proposed an automatic image annotation method based on the hy-
pothesis that images similar in appearance are likely to share the same annotations.
The proposed method is efficient (in terms of both accuracy and low computational
complexity) even for very basic grid image segmentation and feature sets. PATSI
is similar to the work of Makadia et al. (2008) in its main idea and basic hy-
pothesis, but differs in a few important ways. First, we have improved the
method for transferring annotations to the query image. We use simpler models
both for computing similarities between images and for extracting features from
images. Last but not least, PATSI is able to automatically determine the number of
annotations that should be transferred to the query image. We believe that these
properties make PATSI a better baseline method than the work of Makadia et al.
(2008).
In further work, we plan to analyse the method's performance with different feature
sets. We would also like to explore an automatic method for optimizing the
parameter t separately for each annotated word. It would also be interesting
to combine many similarity measures in the annotation transfer process.
A very important advantage of the PATSI method is the high recall obtained
on all datasets, which indicates that the final annotations can be further
improved by wrapper methods Kwasnicka and Paradowski (2008); Jin et al.
(2005); Llorente et al. (2009).
References
ICPR 2004 (2004), image database, http://www.cs.washington.edu/research/.
Gustavo Carneiro, Antoni Chan, Pedro Moreno, and Nuno Vasconcelos (2007),
Supervised Learning of Semantic Classes for Image Annotation and Retrieval, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 29(3):394–410, ISSN 0162-
8828.
E. Chang, Kingshy Goh, G. Sychay, and Gang Wu (2003), CBSA: content-based soft
annotation for multimodal image retrieval using Bayes point machines, Circuits and
Systems for Video Technology, IEEE Transactions on, 13(1):26–38.
C. Cusano, G. Ciocca, and R. Schettini (2004), Image annotation using SVM, Pro-
ceedings of SPIE, 5304:330–338.
P. Duygulu, Kobus Barnard, J. F. G. de Freitas, and David A. Forsyth (2002),
Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image
Vocabulary, in Proc. of the 7th European Conf. on Computer Vision, Springer, London,
UK, ISBN 3540437487.
S. L. Feng, R. Manmatha, and V. Lavrenko (2004), Multiple Bernoulli Relevance
Models for Image and Video Annotation, Computer Vision and Pattern Recognition,
IEEE Computer Society Conference on, 2:1002–1009, ISSN 1063-6919.
Abby Goodrum (2000), Image Information Retrieval: An Overview of Current Research,
Informing Science, 3:2000.
Michael Grubinger, Clough Paul D., Müller Henning, and Deselaers Thomas (2006),
The IAPR Benchmark: A New Evaluation Resource for Visual Information Systems,
in International Conference on Language Resources and Evaluation, Genoa, Italy.
Yasuhide M. Hironobu, Hironobu Takahashi, and Ryuichi Oka (1999), Image-to-Word
Transformation Based on Dividing and Vector Quantizing Images With Words, in
Neural Networks, volume 4.
Yohan Jin, Latifur Khan, Lei Wang, and Mamoun Awad (2005), Image annotations
by combining multiple evidence & wordNet, in MULTIMEDIA ’05: Proceedings of
the 13th annual ACM international conference on Multimedia, ACM, New York, NY,
USA, ISBN 1-59593-044-2.
Halina Kwasnicka and Mariusz Paradowski (2006), Multiple Class Machine Learning
Approach for an Image Auto-Annotation Problem, in ISDA ’06: Proceedings of the
Sixth International Conference on Intelligent Systems Design and Applications, pp.
347–352, IEEE Computer Society, Washington, DC, USA, ISBN 0-7695-2528-8.
Halina Kwasnicka and Mariusz Paradowski (2008), Resulted word counts
optimization-A new approach for better automatic image annotation, Pattern Recogn.,
41(12), ISSN 0031-3203.
V. Lavrenko, R. Manmatha, and J. Jeon (2003), A Model for Learning the Semantics of
Pictures, in Advances in Neural Information Processing Systems.
Ainhoa Llorente, Enrico Motta, and Stefan Ruger (2009), Image Annotation Refine-
ment Using Web-Based Keyword Correlation, in SAMT ’09: Proceedings of the 4th
International Conference on Semantic and Digital Media Technologies, pp. 188–191,
Springer-Verlag, Berlin, Heidelberg, ISBN 978-3-642-10542-5.
Ameesh Makadia, Vladimir Pavlovic, and Sanjiv Kumar (2008), A New Baseline for
Image Annotation, in ECCV ’08: Proceedings of the 10th European Conference on
Computer Vision, pp. 316–329, Springer-Verlag, Berlin, Heidelberg, ISBN 978-3-540-
88689-1.
Geoffrey J. McLachlan and Thriyambakam Krishnan (2008), The EM Algorithm and
Extensions (Wiley Series in Probability and Statistics), Wiley-Interscience, 2 edition,
ISBN 0471201707.
Mariusz Paradowski (2008), Metody automatycznej anotacji jako wydajne narzedzie
opisujace kolekcje obrazow [Automatic annotation methods as an efficient tool for
describing image collections], Ph.D. thesis, Wroclaw University of Technology.