Abstract—Choosing the optimal terms to represent a search engine query is not trivial, and may involve an iterative process such as relevance feedback, repeated unaided attempts by the user, or the automatic suggestion of additional terms, which the user may select or reject. This is particularly true of a multimedia search engine which searches on concepts as well as user-input terms, since the user is unlikely to be familiar with all the system-known concepts. We propose three concept suggestion strategies: suggestion by normalised textual matching, by semantic similarity, and by the use of a similarity matrix. We have evaluated these three strategies by comparing machine suggestions with the suggestions produced by professional annotators, using the measures of micro- and macro-precision and recall. The semantic similarity strategy outperformed the use of a similarity matrix at a range of thresholds. Normalised textual matching, which is the simplest strategy, performed almost as well as the semantic similarity one on recall-based measures, and even better on precision-based and F-based measures.

I. INTRODUCTION

We propose three concept suggestion strategies: suggestion by normalised textual matching, by semantic similarity, and by the use of a similarity matrix. All three strategies have been evaluated by comparing machine suggestions with the observations made by professional annotators, using the measures of micro- and macro-precision and recall. The semantic similarity strategy, where pictures are characterised by the TF-IDF weights of the words in their captions, outperformed the use of a similarity matrix at a range of thresholds. Normalised textual matching performed almost as well as the semantic similarity technique on recall-based measures, and even better on precision- and F-based measures.

The remainder of this document is structured as follows: Section II describes the multimedia collection that we employed in our study. Section III presents the VITALAS concept vocabulary and elaborates on the acquisition of ground-truth annotation. Section IV explains the three different strategies.
274 PROCEEDINGS OF THE IMCSIT. VOLUME 4, 2009
To ensure that we had enough material, Belga granted us access to a set of 1,727,159 pictures and captions that were published on its website between 22 June 2007 and 2 October 2007. Belga also gave us access to its query logs for exactly the same period. Figure 1 displays an example of a typical picture and caption posted on Belga's website.

Fig. 1. Example of a Belga Picture and Caption: Nicole Vaidisova of the Czech Republic receives a service during her quarterfinal match against Amelie Mauresmo of France at the Kremlin Cup tennis tournament in Moscow, Friday 13 October 2006.

III. VITALAS ANNOTATED CONCEPT VOCABULARY

The VITALAS concept vocabulary is largely derived from the automatic extraction of keywords that characterise Belga's archive. However, it has been refined manually over time, in order to improve its functionality.

Originally, the VITALAS vocabulary was produced from a comparison between Belga's captions and a model of general English. The words that deviated from the model were very specific to the captions and thus made appropriate keywords to characterise the archive. Professional annotators evaluated the keywords and removed those that they considered unsuitable. The remaining keywords became the first entries in the VITALAS concept vocabulary. Later on, these entries were extended manually to guarantee that the vocabulary comprised as many categories as are available in the news domain. Finally, we mined Belga's query logs programmatically to extract keywords that reflected the most important concepts from the users' perspective. These concepts were also added to the vocabulary.

Further details on the VITALAS concept vocabulary have been published by Palomino et al. [5], and the entire concept vocabulary for the VITALAS project can be browsed from the authors' website (http://osiris.sunderland.ac.uk/~cs0mpl/VITALAS/).

A. Ground-Truth Annotation

The availability of multimedia annotations is essential for large-scale retrieval systems to work in practice. Hence, VITALAS selected 96,600 pictures published on Belga's website, and employed professional annotators to determine some of the concepts that best described them. For annotation purposes, the presence of a concept was assumed to be binary: it was either visible in a picture or not. The location of a concept in the picture was not taken into account.

To secure a sound basis for our experimental setup, a total of 1,000 pictures were annotated for each of the 525 concepts. Approximately 500 of the pictures annotated for a particular concept were chosen from the results of queries submitted by Belga users who had included that concept's name. The rest of the pictures were chosen randomly. The first half of the pictures annotated for each concept contains mostly positive samples, whereas the second half supplies mostly negative samples. Achieving this balance was vital for the evaluation of our experiments, and it will also be useful when applying supervised learning methods, which we envisage doing in later stages of our work.

As part of the ground-truth annotation, we included the provision of a glossary: a document containing a textual definition, or description, for each concept in the vocabulary, together with relevant keywords and references to positive images. Table I shows an example of a VITALAS concept, food, accompanied by its description and references to positive images. Table I also shows a picture that has been annotated positively as an image that corresponds with the concept food.

TABLE I
EXAMPLE OF CONCEPT DESCRIPTION

Concept name: food

Concept description: An image showing any substance reasonably expected to be ingested by a human or an animal for nutrition or pleasure.

Relevant keywords: Cooking, meal.

Examples of positive images: A picture of a table showing a served meal; a picture of dishes ready to be consumed; a picture of meat, fish, fruit or vegetables for sale in a market.

The VITALAS manual annotation process has yielded an incomplete, but reliable, ground-truth for our concept vocabulary. Certainly, we would like to have all the pictures annotated for all of the concepts; yet, despite resource limitations, we have gathered a reasonably large subset of annotated pictures.

IV. CONCEPT SUGGESTION STRATEGIES

We identify three different strategies for suggesting to users the most relevant concepts related to a particular picture caption:

• Normalised textual matching: a method based on the discovery of straight textual matches between the contents of a caption and the concept names in the vocabulary.
MARCO A. PALOMINO ET. AL: AN EVALUATION OF CONCEPT SUGGESTION STRATEGIES 275
• Semantic similarity: a method based on the calculation of the cosine measure of similarity between the caption and the textual description of each concept, as found in the glossary.

• Similarity matrix: a method based on the creation of a similarity matrix that shows the degree of association between every pair of concepts. Given a particular caption, whose relation to a certain set of concepts has been detected by means of straight textual matches, we employ the similarity matrix to establish all the concepts that are relevant to it, even though their names do not appear in the caption.

In the following subsections, we detail these strategies.

A. Normalised Textual Matching

This approach begins by normalising the caption and the concept names: all text is converted to lower case, punctuation is removed, and extremely common, semantically non-selective words are deleted; the stop-word list that we use was built by Salton and Buckley for the experimental SMART information retrieval system [6]. The normalisation process also stems all text, reducing inflectionally and derivationally related forms of a word to a common base; the particular algorithm for stemming English words that we use is Porter's algorithm [7].

As a second step in this method, we look for exact matches between the words in the normalised caption and those available in the normalised concept names. For illustration purposes, Table II displays an example of a caption, its normalised version and the resulting matches with the concept vocabulary.

TABLE II
EXAMPLE OF NORMALISED TEXTUAL MATCH

Original caption: Red Cross delegation chief for Peru, Michel Minnig waves to photographers as he leaves the Japanese Ambassador's residence after meeting with rebels of the Tupac Amaru Revolutionary Movement (MRTA) in Lima, Peru.

Normalised caption: red cross deleg chief peru michel minnig wave photograph leav japanes ambassador resid meet rebel tupac amaru revolutionari movement mrta lima peru

Concept names: ambassador ... government ... photographers ...

Normalised names: ambassador ... govern ... photograph ...

In the case of concept names made of more than one word, such as davis_cup, different heuristics may be applied. We may look for precise matches of all the words contained in the concept name, which would limit the number of matches considerably, but would ensure that only captions referring explicitly to the concept name are matched.

A more relaxed approach is to select one or more words as headwords. For instance, we may say that davis is the headword for the concept davis_cup, and we will automatically associate all the matches of davis with this concept. This is the approach that we have pursued.

Due to space limitations, we cannot list here all the headwords for the concepts in the vocabulary. Readers are welcome to visit the authors' website for further details on this matter [8]. Some concepts, such as ac_milan_soccer, have two headwords, milan and soccer, though they do not have to appear together to provide a match for the concept.

An appropriate selection of headwords is not a trivial task. Indeed, such a task should be undertaken by a group of specialists who are familiar with Belga's collection and can decide which words are relevant to each particular concept. Nevertheless, the goal of this study is just to analyse the feasibility of the approach; refining the list of headwords and improving the quality of our results is left for future work. Future versions of our suggestion engine will evaluate the impact of different headwords on the precision of the results retrieved, and the final choice of headwords associated with each concept will be left to specialists.

B. Semantic Similarity

As explained in Subsection III-A, each concept ω in the VITALAS concept vocabulary is associated with a textual description d_ω. We can therefore measure the semantic similarity between a caption and the textual description of each different concept. Both captions and descriptions are normalised before examining their semantic similarity.

We represent each concept description as a vector whose entries correspond to unique normalised words. Since concept descriptions are written in natural language, their word distribution corresponds, roughly, with Zipf's law [9]. Therefore, the vector space model proposed by Salton et al. [10] is appropriate for our semantic analysis. Specifically, with a collection of descriptions D, a concept description d_ω in D, a total of NCon concepts, a caption q and a total of NCap captions, we use the following formula to compute the cosine similarity between caption q and concept description d_ω:

    sim(q, d_ω) = Σ_{k∈(q∩d_ω)} w_{kq}·w_{kd_ω} / ( √(Σ_{k∈q} (w_{kq})²) · √(Σ_{k∈d_ω} (w_{kd_ω})²) )

where

    w_{kq} = f_{kq}·idf_{kq},   idf_{kq} = log₂(NCap / D_{kq}),
    w_{kd_ω} = f_{kd_ω}·idf_{kd_ω},   idf_{kd_ω} = log₂(NCon / D_{kd_ω}).

Note that f_{kq} is the frequency with which term k occurs in caption q, and D_{kq} is the number of captions containing k.
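To make the two steps concrete, the sketch below pairs a simplified normalisation with the TF-IDF weighting and cosine formula above. It is illustrative only: the stop-word list, concept descriptions, caption and threshold are invented, and stemming is omitted (the paper uses Porter's algorithm and the SMART stop-word list).

```python
import math
import re

# Toy stop-word list; the paper uses Salton and Buckley's SMART list.
STOP_WORDS = {"the", "of", "a", "to", "in", "for", "and", "at"}

def normalise(text):
    """Lower-case, strip punctuation, drop stop words (no stemming here)."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def tf_idf(tokens, doc_freq, n_docs):
    """Sparse vector of tf * log2(N / df) weights, one entry per unique token."""
    vec = {}
    for t in tokens:
        vec[t] = vec.get(t, 0) + 1  # term frequency f_k
    return {t: f * math.log2(n_docs / doc_freq.get(t, 1)) for t, f in vec.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Invented concept descriptions standing in for the glossary.
descriptions = {
    "food": "Any substance reasonably expected to be ingested by a human "
            "or an animal for nutrition or pleasure",
    "tennis": "A match played on a tennis court with rackets and a ball",
}
desc_tokens = {c: normalise(d) for c, d in descriptions.items()}

# Document frequencies over the description collection.
df = {}
for toks in desc_tokens.values():
    for t in set(toks):
        df[t] = df.get(t, 0) + 1

vectors = {c: tf_idf(toks, df, len(descriptions)) for c, toks in desc_tokens.items()}
caption = tf_idf(normalise("Dishes of food ready to be consumed at the market"), df,
                 len(descriptions))

# Suggest every concept whose similarity to the caption exceeds a threshold.
suggested = [c for c, v in vectors.items() if cosine(caption, v) > 0.05]
```

With these toy inputs, the caption shares vocabulary only with the food description, so food is the single concept suggested; a real run would use the 525 glossary descriptions and a tuned threshold.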
Similarly, f_{kd_ω} is the frequency with which term k occurs in description d_ω, and D_{kd_ω} is the number of descriptions containing k.

It can be demonstrated that the resulting similarity between q and d_ω ranges from 0, meaning no match, to 1, meaning complete match, with in-between values indicating intermediate similarity [11]. Hence, we may choose a threshold and suggest concept ω as a possible match for q only if the similarity between q and d_ω is above the threshold.

Considering the relatively small size of both captions and concept descriptions, it is computationally inexpensive to calculate their semantic similarity. Using an Intel Xeon 5150 processor with 2 GB of RAM, running under Microsoft Windows XP 2002 SP2, our Java-based suggestion engine can calculate the semantic similarity between a single caption and all of the 525 concept descriptions in only a couple of seconds.

Although the strategy described in this subsection is based on term frequencies (TF) and inverse document frequencies (IDF), we refer to it as semantic similarity, rather than as a TF-IDF approach, because the words that are contained in the textual description of a concept and have high TF-IDF weights give us the best indication as to what the textual description is about and, therefore, reflect the semantic similarity between captions and textual descriptions of concepts.

C. Similarity Matrix

Our final strategy aims to discover related concepts for each picture, which are hopefully relevant to the users' interests even though they do not appear in the picture caption. The method requires the creation of a concept-to-concept similarity matrix showing the degree of association between every pair of concepts. To produce each entry in this matrix, we first represent the relation between each concept and all of the captions in a different vector. The entries in this vector contain the TF-IDF weights of the concept headwords in each caption of the collection. For concepts with more than one headword, we sum the TF-IDF weights of the different headwords. Then, the similarity of a pair of concepts is given by the cosine similarity of their corresponding vectors; each concept has a similarity of 1 with itself. Given a particular caption, we first detect the headwords that appear on it. Afterwards, we derive from the similarity matrix the "similarity" between the appearing headwords and all the concepts in the vocabulary. As in the case of the semantic similarity approach, we define a threshold and suggest to the user only those concepts whose corresponding values are above the threshold.

In the following section, we report on the use of different thresholds for this strategy and the previous one. As we lower the threshold, a larger number of false positives is suggested, but recall also increases. Recall is more important than precision for concept suggestion, because users will benefit from being able to choose from a variety of possible additional concepts, and can easily reject unsuitable ones.

V. EVALUATION

To evaluate the efficiency of our suggestion strategies, we have employed two standard measures used in information retrieval: precision and recall [13].

Precision and recall are defined in terms of a set of retrieved documents and a set of relevant documents. Given the particular characteristics of Belga's archive, and the conditions of the ground-truth annotation that we have exploited, we have taken a modified version of the traditional definitions. For the remainder of this document, we refer to the recall for caption q (R_q) and the precision for caption q (P_q) as

    R_q ≡ |Ω_A^q ∩ Ω_M^q| / |Ω_A^q|,    P_q ≡ |Ω_A^q ∩ Ω_M^q| / |Ω_M^q|,

where Ω_A^q is the set of concepts that the annotators associated with the picture whose caption is q, and Ω_M^q is the set of concepts that our automatic suggestion strategy proposed for the same picture.

Averaging over the total number of pictures in the collection C, we made use of the following definitions for micro-recall (µ_R), micro-precision (µ_P), macro-recall (M_R) and macro-precision (M_P) [14]:

    µ_R ≡ Σ_{q∈C} |Ω_A^q ∩ Ω_M^q| / Σ_{q∈C} |Ω_A^q|,    µ_P ≡ Σ_{q∈C} |Ω_A^q ∩ Ω_M^q| / Σ_{q∈C} |Ω_M^q|,

    M_R ≡ (1/|C|) Σ_{q∈C} R_q,    M_P ≡ (1/|C|) Σ_{q∈C} P_q.
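Under hypothetical annotator and machine suggestion sets, the four measures above can be sketched as follows; the helper assumes every caption has at least one annotated and one suggested concept, so no denominator is zero.

```python
def micro_macro(annotated, suggested):
    """Micro/macro recall and precision for concept suggestion.

    annotated[q]: set of concepts the annotators associated with caption q.
    suggested[q]: set of concepts the suggestion strategy proposed for q.
    Assumes both sets are non-empty for every caption.
    """
    captions = list(annotated)
    hits = {q: annotated[q] & suggested[q] for q in captions}
    # Micro-averages: pool the counts over all captions, then divide.
    micro_r = sum(len(hits[q]) for q in captions) / sum(len(annotated[q]) for q in captions)
    micro_p = sum(len(hits[q]) for q in captions) / sum(len(suggested[q]) for q in captions)
    # Macro-averages: mean of the per-caption recall R_q and precision P_q.
    macro_r = sum(len(hits[q]) / len(annotated[q]) for q in captions) / len(captions)
    macro_p = sum(len(hits[q]) / len(suggested[q]) for q in captions) / len(captions)
    return micro_r, micro_p, macro_r, macro_p

# Invented ground truth and machine suggestions for two captions.
annotated = {"q1": {"food", "tennis"}, "q2": {"soccer"}}
suggested = {"q1": {"food"}, "q2": {"soccer", "food"}}
mu_r, mu_p, ma_r, ma_p = micro_macro(annotated, suggested)  # 2/3, 2/3, 0.75, 0.75
```

Micro-averaging weights captions by how many concepts they carry, whereas macro-averaging gives every caption equal weight, which is why the two pairs of values differ in the toy example.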
TABLE III
EVALUATION RESULTS

Measure   Threshold   Semantic Similarity   Similarity Matrix   Textual Match
µ_R       -           -                     -                   0.75
µ_P       -           -                     -                   0.17
M_R       -           -                     -                   0.77
M_P       -           -                     -                   0.21
µ_R       0.20        0.34                  0.32
µ_P       0.20        0.26                  0.11
M_R       0.20        0.35                  0.33
M_P       0.20        0.35                  0.17
µ_R       0.15        0.49                  0.34

...textual matching, where the annotators pick up the headwords, and possibly extend them manually to better reflect concept relations, may yield very good results.

Figure 2 compares the efficiency of our suggestion strategies with a curve that represents the observations made by the annotators. The annotators curve displays the cumulative number of concepts chosen by them every time they annotate a new picture. The curves representing our suggestion strategies indicate how many concepts we have proposed that match the observations of the annotators.

Fig. 2. Efficiency Curves (annotators curve; semantic similarity, threshold = 0.05; normalised textual match; similarity matrix, threshold = 0.05).
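The similarity-matrix strategy compared above (Subsection IV-C) can be sketched as follows. The captions and headwords are invented, and plain headword counts stand in for the TF-IDF weights described earlier; headword detection in the caption is likewise simplified.

```python
import math

# Toy caption collection; one vector entry per caption.
captions = [
    "federer wins the davis cup final",
    "davis cup tie moves to madrid",
    "milan celebrate soccer title",
    "soccer fans fill the san siro in milan",
]

# Invented headword assignments (the paper's real lists are on the authors' website).
headwords = {"davis_cup": ["davis"], "ac_milan_soccer": ["milan", "soccer"], "tennis": ["federer"]}

def concept_vector(words):
    """Summed headword occurrences per caption (the paper sums TF-IDF weights)."""
    return [sum(c.split().count(w) for w in words) for c in captions]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vectors = {c: concept_vector(ws) for c, ws in headwords.items()}
# Concept-to-concept similarity matrix; each concept has similarity 1 with itself.
matrix = {a: {b: cosine(vectors[a], vectors[b]) for b in vectors} for a in vectors}

# A caption that directly matches tennis also suggests concepts whose matrix
# entry against tennis exceeds the threshold, even if their names are absent.
related = [c for c, s in matrix["tennis"].items() if c != "tennis" and s > 0.05]
```

Because federer and davis co-occur in the toy captions, davis_cup is surfaced for a tennis caption that never mentions the Davis Cup, which is exactly the behaviour the strategy is designed to provide.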
...query expansion [17] and query rewriting [18]. While our work may certainly benefit from some of these techniques, our recommendations to the user must be limited to a specific concept vocabulary, and our users' queries may need to be replaced completely, rather than just expanded or spell-checked, because concepts may be grammatically and orthographically unrelated. Additionally, our particular setting does not require a fully automated system: we are supposed to give users the opportunity to select related terms, which allows for user-aided disambiguation.

The study published by Hoogs et al. [19] is among the first to add semantics to concept detection, by establishing links with a general-purpose ontology which connected a limited set of visual attributes to WordNet [20]. However, combining low-level visual attributes with concepts in an ontology is a rather difficult task, due to the so-called semantic gap between them [21].

To cope with the demand for ground-truth, Lin et al. initiated a collaborative annotation effort for the TRECVID 2003 benchmark [22]. Using tools from Christel et al. [23] and Volkmer et al. [24], a common annotation effort was again made for the TRECVID 2005 benchmark, yielding a large set of annotated examples for 39 concepts taken from a predefined collection [25]. We have provided a larger compilation, increasing the concept vocabulary to 525 concepts, and obtaining 1,000 annotated pictures per concept.

Snoek et al. [26] have published a study closely related to ours. They also produced a concept suggestion strategy based on semantic similarity; yet, they made use of the Lucene search engine [27] as part of their implementation, and the goal of their work was different, as they attempted to obtain semantic descriptions and structure from WordNet [20]. Even though the results presented by Snoek et al. are not conclusive, we may consider following their recommendations on ontology querying in future versions of our suggestion engine.

Another existing solution to the problem that we have approached involves the use of latent semantic indexing (LSI) [28], where relations between queries and documents are determined according to a co-occurrence analysis. Unfortunately, LSI is computationally expensive, especially considering the size of Belga's collection and its continuously growing nature: Belga claims to add between 5,000 and 10,000 new pictures and captions to its archive on a daily basis [4].

A potentially better alternative to LSI has been proposed by Wang et al. [29], who learnt concept relations from short natural language texts, and stored them in a structure called a fuzzy associated concept mapping. New concepts, not explicitly present in the original texts, were recommended to the users based on this mapping.

Rather than using a mapping in which only selected relations between concepts are stored, our suggestion strategies favour the utilisation of a similarity matrix, where every concept has a relation with every other concept, namely a real value in the range 0 to 1.

VII. CONCLUSIONS

We have described three methods of concept suggestion, with the aim of helping multimedia search engine users enhance their initial keyword queries with additional terms corresponding to system-known concepts, namely suggestion by normalised textual matching, by semantic similarity and by the use of a similarity matrix. Although normalised textual matching is the simplest technique that we have assessed, and the fastest one to execute, it performed very well on the recall-based and F-based measures, and only slightly less well on the precision-based measures. High recall is more important than high precision for query term suggestion, since the user will benefit from being able to choose from a range of possible additional concepts, and can easily reject unsuitable ones. However, in approaches such as the one proposed by Palomino et al. [30], where discovered additional concepts are automatically added to a query without prior user approval, precision is more important, since the inclusion of non-relevant concepts in the query can severely degrade performance.

Even though we have evaluated the quality of our concept selection using recall- and precision-based measures, we still need to measure the effect of our concept suggestion facilities on the overall search engine performance. Such evaluations are being undertaken by our research partners at the Institut National de l'Audiovisuel [31], using recall, precision, and subjective measures of user satisfaction with the overall system.

ACKNOWLEDGMENT

This research was supported under the EU-funded VITALAS project (project number FP6-045389). The authors are very grateful to the Belga News Agency for providing the data used to carry out their research.

REFERENCES

[1] A. W. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, "Content-Based Image Retrieval at the End of the Early Years," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 12, pp. 1349–1380, December 2000.
[2] M. Worring, C. G. M. Snoek, B. Huurnink, J. C. van Gemert, D. C. Koelma, and O. de Rooij, "The MediaMill Large-Lexicon Concept Suggestion Engine," in Proceedings of the 14th ACM International Conference on Multimedia. Santa Barbara, CA: Association for Computing Machinery, October 2006, pp. 785–786.
[3] VITALAS, Video and Image Indexing and Retrieval in the Large Scale, http://www.vitalas.org/.
[4] Belga, Belga News Agency, http://www.belga.be/.
[5] M. A. Palomino, M. P. Oakes, and T. Wuytack, "Automatic Extraction of Keywords for a Multimedia Search Engine Using the Chi-Square Test," in Proceedings of the 9th Dutch-Belgian Information Retrieval Workshop, Enschede, The Netherlands, February 2009, pp. 3–10.
[6] C. Buckley, "Implementation of the SMART Information Retrieval System," Computer Science Department, Cornell University, Ithaca, New York, Tech. Rep. TR85-686, May 1985.
[7] M. Porter, "An Algorithm for Suffix Stripping," Program, vol. 14, no. 3, pp. 130–137, July 1980.
[8] M. A. Palomino, VITALAS Concept-Suggestion Engine, http://osiris.sunderland.ac.uk/~cs0mpl/VITALAS/.
[9] W. J. Reed, "The Pareto, Zipf and Other Power Laws," Economics Letters, vol. 74, no. 1, pp. 15–19, December 2001.
[10] G. Salton, A. Wong, and C. Yang, "A Vector Space Model for Automatic Indexing," Communications of the ACM, vol. 18, no. 11, pp. 613–620, November 1975.
[11] D. Widdows, "Measuring Similarity and Distance," in Geometry and Meaning. CSLI Publications, November 2004.
[12] G. Salton and C. Buckley, "On the Use of Spreading Activation Methods in Automatic Information Retrieval," in Proceedings of the 11th ACM SIGIR Conference on Research and Development in Information Retrieval, Grenoble, France, June 1988, pp. 147–160.
[13] R. K. Belew, Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW. Cambridge, UK: Cambridge University Press, February 2001.
[14] T. Joachims, Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Kluwer Academic Publishers, April 2002.
[15] K. Church and B. Thiesson, "The Wild Thing Goes Local," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval. Amsterdam, The Netherlands: Association for Computing Machinery, July 2007, pp. 901–901.
[16] S.-P. Cucerzan and E. Brill, "Spelling Correction as an Iterative Process that Exploits the Collective Knowledge of Web Users," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, July 2004, pp. 293–300.
[17] R. W. White and G. Marchionini, "Examining the Effectiveness of Real-Time Query Expansion," Information Processing and Management, vol. 43, no. 3, pp. 685–704, May 2007.
[18] R. Jones, B. Rey, O. Madani, and W. Greiner, "Generating Query Substitutions," in Proceedings of the International World Wide Web Conference. Edinburgh, Scotland: Association for Computing Machinery, May 2006, pp. 387–396.
[19] A. Hoogs, J. Rittscher, G. Stein, and J. Schmiederer, "Video Content Annotation Using Visual Analysis and a Large Semantic Knowledgebase," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Madison, WI, June 2003, pp. 327–334.
[20] C. Fellbaum, WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press, May 1998.
[21] C. Dorai, "Bridging the Semantic Gap in Content Management Systems: Computational Media Aesthetics," in Proceedings of the International Conference on Computational Semiotics for Games and New Media. Amsterdam, The Netherlands: Kluwer Academic Publishers, September 2001, pp. 94–99.
[22] C.-Y. Lin, B. L. Tseng, and J. R. Smith, "Video Collaborative Annotation Forum: Establishing Ground-Truth Labels on Large Multimedia Datasets," in Proceedings of the TRECVID 2003 Workshop, Gaithersburg, MD, November 2003.
[23] M. Christel, T. Kanade, M. Mauldin, R. Reddy, M. Sirbu, S. Stevens, and H. Wactlar, "Informedia Digital Video Library," Communications of the ACM, vol. 38, no. 4, pp. 57–58, April 1995.
[24] T. Volkmer, S. Tahaghoghi, and J. A. Thom, "Modelling Human Judgement of Digital Imagery for Multimedia Retrieval," IEEE Transactions on Multimedia, vol. 9, no. 5, pp. 967–974, August 2007.
[25] M. Naphade, J. R. Smith, J. Tesic, S.-F. Chang, W. Hsu, L. Kennedy, A. Hauptmann, and J. Curtis, "Large-Scale Concept Ontology for Multimedia," IEEE MultiMedia, vol. 13, no. 3, pp. 86–91, 2006.
[26] C. G. Snoek, B. Huurnink, L. Hollink, M. de Rijke, G. Schreiber, and M. Worring, "Adding Semantics to Detectors for Video Retrieval," IEEE Transactions on Multimedia, vol. 9, no. 5, pp. 975–986, August 2007.
[27] Lucene, The Lucene Search Engine, http://lucene.apache.org/.
[28] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by Latent Semantic Analysis," Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391–407, 1990.
[29] W. M. Wang, C. F. Cheung, W. B. Lee, and S. K. Kwok, "Mining Knowledge from Natural Language Texts Using Fuzzy Associated Concept Mapping," Information Processing and Management, vol. 44, no. 5, pp. 1707–1719, September 2008.
[30] M. A. Palomino, M. P. Oakes, and Y. Xu, "An Adaptive Method to Associate Pictures with Indexing Terms," in Proceedings of the 2nd International Workshop on Adaptive Information Retrieval, London, UK, October 2008, pp. 38–43.
[31] INA, Institut National de l'Audiovisuel, http://www.ina.fr/.