
Lecture Notes: Multimedia Information Systems

Marcel Worring
Intelligent Sensory Information Systems, University of Amsterdam
worring@science.uva.nl
http://www.science.uva.nl/~worring
tel 020 5257521

Contents
1 Introduction

2 Data, domains, and applications
  2.1 Data and applications
  2.2 Categorization of data
  2.3 Data acquisition
  2.4 Ontology creation
  2.5 Produced video documents
    2.5.1 Semantic index
    2.5.2 Content
    2.5.3 Recording the scene
    2.5.4 Layout
  2.6 Discussion

3 Data representation
  3.1 Categorization of multimedia descriptions
  3.2 Basic notions
  3.3 Audio representation
    3.3.1 Audio features
    3.3.2 Audio segmentation
    3.3.3 Temporal relations
  3.4 Image representation
    3.4.1 Image features
    3.4.2 Image segmentation
    3.4.3 Spatial relations
  3.5 Video representation
    3.5.1 Video features
    3.5.2 Video segmentation
    3.5.3 Spatio-temporal relations
  3.6 Text representation
    3.6.1 Text features
    3.6.2 Text segmentation
    3.6.3 Textual relations
  3.7 Other data

4 Similarity
  4.1 Perceptual similarity
  4.2 Layout similarity
  4.3 Semantic similarity

5 Multimedia Indexing Tools
  5.1 Introduction
  5.2 Pattern recognition methods
  5.3 Statistical methods
  5.4 Dimensionality reduction
  5.5 Invariance

6 Video Indexing
  6.1 Introduction
  6.2 Video document segmentation
    6.2.1 Layout reconstruction
    6.2.2 Content segmentation
  6.3 Multimodal analysis
    6.3.1 Conversion
    6.3.2 Integration
  6.4 Semantic video indexes
    6.4.1 Genre
    6.4.2 Sub-genre
    6.4.3 Logical units
    6.4.4 Named events
    6.4.5 Discussion
  6.5 Conclusion

7 Data mining
  7.1 Classification learning
  7.2 Association learning
  7.3 Clustering
  7.4 Numeric prediction

8 Information visualization
  8.1 Introduction
  8.2 Visualizing video content

9 Interactive Retrieval
  9.1 Introduction
  9.2 User goals
  9.3 Query Space: definition
    9.3.1 Query Space: data items
    9.3.2 Query Space: features
    9.3.3 Query Space: similarity
    9.3.4 Query Space: interpretations
  9.4 Query Space: processing steps
    9.4.1 Initialization
    9.4.2 Query specification
    9.4.3 Query output
    9.4.4 Query output evaluation
  9.5 Query space interaction
    9.5.1 Interaction goals
    9.5.2 Query space display
    9.5.3 User feedback
    9.5.4 Evaluation of the interaction process
  9.6 Discussion and conclusion

10 Data export
  10.1 Introduction
  10.2 MPEG-7
  10.3 SMIL: Synchronized Multimedia Integration Language


Chapter 1

Introduction

A staggering amount of digital data goes around in the world, brought to you by television or the Internet, as communication between businesses, or acquired through various sensors like your digital (video) camera. Increasingly, the digital data stream is composed of multimedia data items, i.e. a combination of pictorial, auditory, and textual data. With the increasing capabilities of computers, processing the multimedia data stream for further exploitation will be common practice. Processors are still expected to double their speed every 18 months for the foreseeable future; memory sizes will double every 12 months and grow to terabytes of storage; network bandwidth will improve significantly every 9 months, increasing these resources even faster. So there is enough capacity to fill with multimedia data streams.

Observing how the Internet is evolving, it is obvious that users prefer a multimedia style of information exchange and interaction, including pictures, video, and sound. We foresee that the same turnover will occur in professional life. The emphasis in information handling will gradually shift from categorical and numerical information to information in multimedia form: pictorial, auditory, and free-text data. So, multimedia is the message [after McLuhan's phrase of the sixties that "the medium is the message", predicting the effect television would have on society]. Multimedia information - when offered as an option with equal accessibility as the current information forms - will be the preferred format, as it is much closer to human forms of communication. Multimedia information presentation from the system to the user, and interpretation from the user to the system, employs our coordinated communication skills, enabling us to control system interactions in a more transparent experience than ever before. More natural interaction with information and communication services for a wide range of users will generate significant societal and economic value. Hence, it is safe to predict that multimedia will enter all information systems: in business-to-business communication, on the professional desk, in communication with clients, in education, in science and the arts, and in broadcasting via the Internet. Each of these worlds will eventually pose its own questions, but for now a set of tools is needed as keys to unlock this future.

Multimedia may become the natural form to communicate; however, it does not come entirely for free. The semantics of a message is much more immediate for categorical and numeric information than for the bit stream of pictures and sounds. What a picture shows, what is meant by a text, and what is said in an audio signal are not easy to deduce with a computer. These semantics need to be conquered from the raw data by structuring, transformation, analysis, and classification.
(This introduction is an adaptation of the introduction of Smeulders et al. in the BSIK proposal MultimediaN.)


Most of the multimedia information digitally available is irrelevant to the user, but some tiny fraction hidden in the information flood is highly relevant. The challenge is to locate these nuggets of multimedia experiences or knowledge. Much of the information is directed to an anonymous receiver, many pieces just sit there waiting to be accessed, and some valuable information is hidden in the stream of information passing by.

Multimedia in its full exploitation requires the handling of video, pictures, audio, and language. It includes man-machine interaction in multimodal form. In addition, knowledge and intelligence about multimedia forms will lead to insights into the multimedia information flows. Truly multimedia information systems will hold and provide access to the ever-increasing data streams, handling information in all its formats. In this context, multimedia can be described as all handling in a computer of information in pictorial, free-language, and audio form.

Some obvious application patterns of multimedia are consumer electronics devices, the archiving of large digital multimedia repositories, e-document handling of multimedia, digital delivery of information, digital publishing, multimedia in education, e-health, and e-government. Multimedia handling is becoming an integral part of workflows in many economic clusters. Integrated multimedia information analysis will be used for surveillance, forensic evidence gathering, and public safety. Multimedia knowledge generation and knowledge engineering are employed for making picture, film, and audio archives accessible in cultural heritage and the entertainment industry. Multimedia information systems and standards will be used to efficiently store large piles of information: video from the laboratories, wearable video from official observations, and ultimately our own senses to become our autobiographic memory. Indexing and annotating multimedia to gain access to its semantics remains an important topic for anyone with a large pile of information. Multimedia presentation will replace the current graphical user interfaces for a much more effective information transfer, and so on.

Four urgent themes in multimedia computing


It is clear from our everyday life that information is overly present and yet underemployed. This also holds for digital multimedia information. The old idiom of information handling, largely settled by paper, the tape recorder, or the television, is simply not good enough to be translated straightforwardly into the digital age. The following four main themes arise at the societal level.

Information handling is frequently fragmented by the distinctly different information types of former days: image by pictures and maps, sound by recordings, and language by text. What is needed is to use the potential of the digital age to combine information of different kinds: images, sound, text, speech, maps, and video. Information integration across types is needed in the analysis stage, in the keeping of information in a storage system, and in the presentation and interaction with the user, each time using the most appropriate information channel for the circumstances.

Much of the information is not easily accessible, at least by the content of multimedia information forms. Although we have learned to deal with all forms of information instantaneously, such knowledge is not readily transferable to computer handling of information. Gaining access to the meaning of pictures, sounds, speech, and language is not a trivial task for a digital computer. It is hard to learn from the diversity of information flows without seriously studying methods of multimedia learning.

The information reaching the user is inflexibly packaged in forms not very different from the one-modal communication of paper and records.

In addition, the information comes in enormous quantities, effectively reducing our ability to handle the information stream with care. What are needed are information filtering, information exploration, and effective interaction. These can be achieved by personalized and adaptive information delivery and by information condensation.

The digital multimedia information is frequently poorly organized. This is due to many different causes - ancient and new ones - such as the barriers that standards create, the heterogeneity between formal systems of terminology, the inflexibility of information systems, the variable capacity of communication channels, and so on. What is needed is an integrated approach to break down the effective barriers of multimedia information handling systems.

Solutions
To provide solutions to the problems identified above, a multi-disciplinary approach is required. To analyze the sensory and textual data, techniques from signal processing, speech recognition, computer vision, and natural language processing are needed. To understand the information, knowledge engineering and machine learning play an important role. For storing the tremendous amounts of data, database systems with high performance are required; the same holds for the network technology involved. Apart from technology, the role of the user in multimedia is even more important than in traditional systems, hence visualization and human-computer interfacing form another fundamental basis in multimedia.

In this course the focus is not on one of the underlying fields; the aim is to provide the necessary insight into these fields to be able to use them in building solutions through multimedia information systems. To that end we decompose the design of the multimedia information system into four modules:

1. Import module: this module is responsible for acquiring the data and storing it in the database with a set of relevant descriptors.

2. Data mining module: based on the descriptors, this module aims at extracting new knowledge from the large set of data in the database.

3. Interaction module: this module deals with the end user who has a specific search task, and includes the information visualization which helps the user in understanding the content of the information system.

4. Export module: this module deals with the delivery of information to other systems and external users.

Of course there is another important module for data storage, which should provide efficient access to large collections of data. For these lecture notes, we will assume that such a database management system is available, but in fact this is a complete research field on its own.

In figure 1.1 the overview of the different modules and their interrelations is shown. This dataflow diagram is drawn following the Gane-Sarson notation [41]. The diagrams consist of the following elements:

Interfaces: drawn as rectangles; they denote an abstract interface to a process. Data can originate from an interface.

Processes: drawn as rounded rectangles. Each process must have both input and output dataflow(s).

Datastores: drawn as open rectangles.

Dataflows: drawn as arrows going from source to target.

Figure 1.1: Overall architecture of a Multimedia Information System, giving a blueprint for the lecture notes.

The Gane-Sarson method is a top-down method where each process can be detailed further, as long as the dataflows remain consistent with the overall architecture. In fact, in subsequent chapters the individual processing steps in the system are detailed further, and the dataflow pictures exhibit this consistency throughout the notes. Thus, the architecture gives a blueprint for the lecture notes.

Overview of the lecture notes


The lecture notes start off in chapter 2 by giving a general overview of different domains in which multimedia information systems are encountered, and consider the general steps involved in the process of putting data into the database. We consider the representation of the data, in particular for images, audio, video, and text, in chapter 3, and how to compare multimedia items in chapter 4. Data analysis is then explored in chapter 5. We then move on to the specific domain of video information systems and consider their indexing in chapter 6. The above modules comprise the import module. Having stored all data and descriptors in the database allows us to start the data mining module, which is considered in chapter 7. The visualization part of the interaction module is considered in chapter 8, whereas the actual interaction with the user is described in chapter 9. Finally, in chapter 10 the data export module is explored.

Key terms
Every chapter ends with a set of key terms. Make sure that after reading each chapter you know the meaning of each of these key terms and you know the examples and methods associated with them, especially those which are explained in more detail during the lectures.

Check whether you know the meaning after each lecture in which they were discussed, as subsequent lectures often assume these terms are known. Enjoy the course, Marcel

Keyterms in this chapter


Multimedia information, Gane-Sarson dataflow diagrams, data import, data export, interaction



Chapter 2

Data, domains, and applications


As indicated in the introduction, multimedia information is appearing in all different kinds of applications in various domains. Here we focus on video documents as, in their richest form, they contain visual, auditory, and textual information. In this chapter we will consider how to analyze these domains and how to prepare the data for insertion into the database. To that end we first describe, in section 2.1, different domains and the way video data is produced and used. From there we categorize the data from the various applications in order to be able to select the right class of tools later (section 2.2). Then we proceed to the way the data is actually acquired in section 2.3. The role of external knowledge is considered in section 2.4. We then consider in detail how a video document is created, as this forms the basis for later indexing (section 2.5). Finally, we consider how this leads to a general framework which can be applied in different domains.

2.1 Data and applications

In the broadcasting industry the use of video documents is of course obvious. Large amounts of data are generated in the studios creating news, films, soaps, sports programs, and documentaries. In addition, their sponsors create a significant amount of material in the form of commercials. Storing the produced material in a multimedia information system allows it to be reused later. Currently, one of the most important applications for the broadcasting industry is multi-channelling, i.e. distributing essentially the same information via television, the internet, and mobile devices. In addition, interactive television is slowly but steadily growing, e.g. allowing viewers to vote for their favorite idol. More and more of the video documents in broadcasting are created digitally; however, the related information is still not distributed alongside the programs and hence not always available for storage in the information system.

Whereas the above video material is professionally produced, we also see a lot of non-professional videos being shot by people with their private video camera, their webcam, or more recently with their PDA or telephone equipped with a camera. The range of videos one can encounter here is very large, as cameras can be used for virtually everything. However, in practice, many of the videos will mostly contain people or holiday scenes. The major challenge in consumer applications is organizing the data in such a way that you can later find all the material you shot, e.g. by location, time, persons, or event.

Figure 2.1: Overview of the steps involved in importing data and domain knowledge into the database. DataRepresentation and DataAnalysis will be described in the next chapters; here we focus on the top part of the picture.


In education the use of video material is somewhat related to the creation of documentaries in broadcasting, but has added interactive possibilities. Furthermore, more and more lectures and scientific presentations are being recorded with a camera and made accessible via the web. They can form the basis for new teaching material.

For businesses, the use of electronic courses is an effective way of reducing the time needed for instruction. Next to this, videos are used in many cases for recording business meetings. Rather than scribing the meeting, action lists are sufficient, as the videos can replace extensive written notes. Furthermore, it allows people who were not attending the meeting to understand the atmosphere in which certain decisions were made. Another business application field is the observation of customers, which is of great importance for marketing applications.

Let us now move to the public sector, where, among other reasons but for a large part due to September 11, there has been an increased interest in surveillance applications guarding public areas and detecting specific people, riots, and the like. What is characteristic of these kinds of applications is that the interest is only in a very limited part of the video, but clearly one does not know beforehand which part contains the events of interest. Not only surveillance, but also cameras on cash machines and the like provide the police with large amounts of material. In forensic analysis, video is therefore also becoming of great importance. Another application in the forensic field is the detection and identification of videos found on PCs or the Internet containing material like child pornography or racism. Finally, video would be a good way of recording a crime scene for later analysis.

Somewhat related to surveillance is the use of video observation in health care. Think, for example, of a video camera in an old people's home automatically identifying whether someone falls down or has some other problem.

2.2 Categorization of data

Although all of the above applications use video, the nature of the videos is quite different for the different applications. We now give some categorizations to make it easier to understand the characteristics of the videos in a domain. A major distinction is between produced and observed video.

Definition 1 (Produced video data) Videos that are created by an author who is actively selecting content and where the author has control over the appearance of the video.

Typical situations where the above is found are in the broadcasting industry. Most of the programs are made according to a given format. The people and objects in the video are known and planned. For analysis purposes it is important to further subdivide this category into three levels, depending on the stage of the process in which we receive the data:

raw data: the material as it is shot.

edited data: the material that is shown in the final program.

recorded data: the data as we receive it in our system.

Edited data is the richest form of video as it has both content and a layout. When appropriate actions are taken directly when the video is produced, many indices can be stored with the data right away. In practice, however, this production info is often not stored. In that case recorded data becomes difficult to analyze, as things like layout information have to be reconstructed from the data. The other major category is formed by:


Definition 2 (Observed video data) Videos where a camera is recording some scene and where the author does not have means to manipulate or plan the content.

This kind of video is found in most of the applications; the most typical examples are surveillance videos and meetings. However, it is also found in broadcasting. E.g. in soccer videos the program as a whole is planned, but the content itself cannot be manipulated by the author of the document.

Two other factors concerning the data are important. The first is the quality of the data: what are the resolution and signal-to-noise ratio of the data? The quality of the video can vary significantly over the different applications, ranging from high-resolution, high-quality videos in the broadcasting industry, to very low-quality data from small cameras incorporated in mobile phones. A final important factor is the application control: how much control does one have over the circumstances under which the data is recorded? Again the broadcasting industry covers the extreme cases here. E.g. in films the content is completely described in a script and lighting conditions are almost completely controlled, if needed enhanced using filters. In security applications the camera can be put at a fixed position which we know. For mobile phones the recording conditions are almost arbitrary.

Finally, a video is always shot for some reason. Following [59] we define:

Purpose: the reason for which the video document is made, being entertainment, information, communication, or data analysis.

2.3 Data acquisition

A lot of video information is already recorded in digital form and hence can easily be transferred to the computer. If the video is shot in an analog way, a capture device has to be used in the computer to digitize the sensory data. In both cases the result is a stream of frames where each frame is typically an RGB image. A similar thing holds for audio. Next to the audiovisual data there can be a lot of accompanying textual information, like the teletext channel containing text in broadcasting, the script of a movie, scientific documents related to a presentation given, or the documents which are the subject of discussion at a business meeting.

Now let us make the result of acquisition of multimedia data more precise. For each modality the digital result is a temporal sequence of fundamental units, which in themselves do not have a temporal dimension. The nature of these units is the main factor discriminating the different modalities. The visual modality of a video document is a set of ordered images, or frames. So the fundamental units are the single image frames. Similarly, the auditory modality is a set of samples taken within a certain time span, resulting in audio samples as fundamental units. Individual characters form the fundamental units for the textual modality.

As multimedia data streams are very large, data compression is usually applied, except when the data can be processed directly and there is no need for storing the video. For video the most common compression standards are MPEG-1, MPEG-2, and MPEG-4. For audio, mp3 is the best known compression standard. Finally, apart from the multimedia data, there is a lot of factual information related to the video sources, like e.g. the date of a meeting, the value on the stock market of the company discussed in a video, or viewing statistics for a broadcast.
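As a minimal illustration of the acquisition step, the sketch below reads a video file as a stream of frames, the fundamental units of the visual modality. It assumes the OpenCV library (cv2) is available; the file name is a made-up example, not one from these notes.

    import cv2

    cap = cv2.VideoCapture("meeting.mpg")   # hypothetical input file
    frames = []
    while True:
        ok, frame = cap.read()              # frame is an H x W x 3 array (BGR channel order)
        if not ok:                          # no more frames in the stream
            break
        frames.append(frame)
    cap.release()
    print("acquired", len(frames), "frames")

Note that OpenCV delivers frames in BGR rather than RGB channel order; a single call to cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) converts between the two.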


2.4 Ontology creation

As indicated above, many information sources are used for a video. Furthermore, videos rarely occur in isolation. Hence, there is always a context in which the video has to be considered. Therefore, for indexing and analysis purposes it is important to make use of external knowledge sources. Examples abound. For news broadcasts, news sites on the Internet provide information on current issues, and the CIA World Factbook contains information on current countries and presidents. Indexing films is greatly helped by consulting the Internet Movie Database, and so on. Furthermore, within a company or institute, local knowledge sources might be present that can help in interpreting the data. For security applications a number of images of suspect people might be available.

All of the above information and knowledge sources have their own format and style of use. It is therefore important to structure the different knowledge sources into ontologies and make the information elements instantiations of concepts in the ontology. In this manner the meaning of a concept is uniquely defined and can hence be used in a consistent way.

2.5 Produced video documents

As a baseline for video we consider how video documents are created in a production environment. In chapter 6 we will then consider the indexing of recorded videos in such an environment, as in this manner all different aspects of a video will be covered.

An author uses visual, auditory, and textual channels to express his or her ideas. Hence, the content of a video is intrinsically multimodal. Let us make this more precise. In [94] multimodality is viewed from the system domain and is defined as the capacity of a system to communicate with a user along different types of communication channels and to extract and convey meaning automatically. We extend this definition from the system domain to the video domain, by using an author's perspective, as:

Definition 3 (Multimodality) The capacity of an author of the video document to express a predefined semantic idea, by combining a layout with a specific content, using at least two information channels.

We consider the following three information channels, or modalities, within a video document:

Visual modality: contains the mise-en-scène, i.e. everything, either naturally or artificially created, that can be seen in the video document;

Auditory modality: contains the speech, music, and environmental sounds that can be heard in the video document;

Textual modality: contains textual resources that describe the content of the video document.

For each of those modalities, definition 3 naturally leads to a semantic perspective, i.e. which ideas did the author have; a content perspective, indicating which content is used by the author and how it is recorded; and a layout perspective, indicating how the author has organized the content to optimally convey the message. We will now discuss each of the three perspectives involved.
(This section is adapted from [131].)


2.5.1 Semantic index

The first perspective expresses the intended semantic meaning of the author. Defined segments can have a different granularity, where granularity is defined as the descriptive coarseness of a meaningful unit of multimodal information [30]. To model this granularity, we define segments on five different levels within a semantic index hierarchy. The first three levels are related to the video document as a whole. The top level is based on the observation that an author creates a video with a certain purpose. We define:

Purpose: set of video documents sharing a similar intention.

The next two levels define segments based on consistent appearance of layout or content elements. We define:

Genre: set of video documents sharing a similar style;

Sub-genre: a subset of a genre where the video documents share similar content.

The next level of our semantic index hierarchy is related to parts of the content, and is defined as:

Logical units: a continuous part of a video document's content consisting of a set of named events or other logical units which together have a meaning;

where named event is defined as:

Named events: short segments which can be assigned a meaning that does not change in time.

Note that named events must have a non-zero temporal duration. A single image extracted from the video can have meaning, but this meaning will never be perceived by the viewer when it is not consistent over a set of images.

At the first level of the semantic index hierarchy we use purpose. As we only consider video documents that are made within a production environment, the purpose of data analysis is excluded. Genre examples range from feature films and news broadcasts to commercials. This forms the second level. On the third level are the different sub-genres, which can be e.g. a horror movie or an ice hockey match. Examples of logical units, at the fourth level, are a dialogue in a drama movie, a first quarter in a basketball game, or a weather report in a news broadcast. Finally, at the lowest level, consisting of named events, examples range from explosions in action movies and goals in soccer games to a visualization of stock quotes in a financial news broadcast.

2.5.2 Content

The content perspective relates segments to elements that an author uses to create a video document. The following elements can be distinguished [12]:

Setting: time and place in which the video's story takes place; can also emphasize atmosphere or mood;

Objects: noticeable static or dynamic entities in the video document;

People: human beings appearing in the video document.

Typically, setting is related to logical units, as one logical unit is taken in the same location. Objects and people are the main elements in named events.


2.5.3 Recording the scene

The appearance of the different content elements can be influenced by the author of the video document by using modality-specific style elements. For the visual modality an author can apply different style elements. She can use specific colors and lighting which, combined with the natural lighting conditions, define the appearance of the scene. Camera angles and camera distance can be used to define the scale and observed pose of objects and people in the scene. Finally, camera movement in conjunction with the movement of objects and people in the scene defines the dynamic appearance. Auditory style elements are the loudness and the rhythmic and musical properties. The textual appearance is determined by the style of writing and the phraseology, i.e. the choice of words and the manner in which something is expressed in words. All these style elements contribute to expressing an author's intention.

2.5.4 Layout

The layout perspective considers the syntactic structure an author uses for the video document. Upon the fundamental units an aggregation is imposed, which is an artifact of creation. We refer to these aggregated fundamental units as sensor shots, defined as a continuous sequence of fundamental units resulting from an uninterrupted sensor recording. For the visual and auditory modality this leads to:

Camera shots: result of an uninterrupted recording of a camera;

Microphone shots: result of an uninterrupted recording of a microphone.

For text, sensor recordings do not exist. In writing, uninterrupted textual expressions can be exposed on different granularity levels, e.g. word level or sentence level; therefore we define:

Text shots: an uninterrupted textual expression.

Note that sensor shots are not necessarily aligned. Speech, for example, can continue while the camera switches to show the reaction of one of the actors. There are, however, situations where camera and microphone shots are recorded simultaneously, for example in live news broadcasts.

An author of the video document is also responsible for concatenating the different sensor shots into a coherent structured document by using transition edits. He or she aims to guide our thoughts and emotional responses from one shot to another, so that the interrelationships of separate shots are clear, and the transitions between sensor shots are smooth [12]. For the visual modality abrupt cuts, or gradual transitions (a gradual transition actually contains pieces of two camera shots; for simplicity we regard it as a separate entity), like wipes, fades, or dissolves can be selected. This is important for visual continuity, but sound is also a valuable transitional device in video documents, not only to relate shots, but also to make changes more fluid or natural. For the auditory transitions an author can have a smooth transition using music, or an abrupt change by using silence [12]. To indicate a transition in the textual modality, e.g. closed captions, an author typically uses >>>, or different colors. These can be viewed as corresponding to abrupt cuts, as their use is only to separate shots, not to connect them smoothly.

The final component of the layout is the optional visual or auditory special effects, used to enhance the impact of the modality, or to add meaning.

Figure 2.2: A unifying framework for multimodal video indexing based on an author's perspective. The letters S, O, P stand for setting, objects, and people. An example layout of the auditory modality is highlighted; the same holds for the others.

Overlaid text, which is text that is added to the video frames at production time, is also considered a special effect. It provides the viewer of the document with descriptive information about the content. Moreover, the size and spatial position of the text in the video frame indicate its importance to the viewer. Whereas visual effects add descriptive information or stretch the viewer's imagination, audio effects add a level of meaning and provide sensual and emotional stimuli that increase the range, depth, and intensity of our experience far beyond what can be achieved through visual means alone [12]. Note that we do not consider artificially created content elements as special effects, as these are meant to mimic true settings, objects, or people.

2.6 Discussion

Based on the discussion in the previous sections we come to a unifying multimodal video indexing framework based on the perspective of an author. This framework is visualized in figure 2.2. It forms the basis for the discussion of methods for video indexing in chapter 6. Although the framework is meant for produced video, it is in fact a blueprint for other videos as well. Basically, all of the above categories of videos can be mapped to the framework. However, not in all cases will all modalities be used, and maybe there is no layout because no editing has taken place, and so on.

Keyterms in this chapter


ontologies, sensory data, digital multimedia data, data compression, purpose, genre, sub-genre, logical unit, named events, setting, objects, people, sensor shots, camera shots, microphone shots, text shots, transition edits, special effects.

Chapter 3

Data representation
3.1 Categorization of multimedia descriptions

To be able to retrieve multimedia objects with the same ease as we are used to when accessing text or factual data, as well as to be able to filter out irrelevant items in the large streams of multimedia reaching us, requires appropriate descriptions of the multimedia data. Although it might seem that multimedia retrieval is a trivial extension of text retrieval, it is in fact far more difficult. Most of the data is of sensory origin (image, sound, video) and hence techniques from digital signal processing and computer vision are required to extract relevant descriptions. Such techniques in general yield features which do not relate directly to the user's perception of the data, the so-called semantic gap, more precisely defined as [128]:

The semantic gap is the lack of coincidence between the information that one can extract from the sensory data and the interpretation that the same data have for a user in a given situation.

(In the reference this definition is restricted to visual data, but it is applicable to other sensory data like audio as well.)

Consequently, there are two levels of description of multimedia content, one on either side of the semantic gap. In addition to the content descriptions there are also descriptions which are mostly concerned with the carrier of the information. Examples are the pixel size of an image, or the sampling rate of an audio fragment. They can also describe information like the owner of the data or the time the data was generated. This leads to the following three descriptive levels [69]:

perceptual descriptions: descriptions that can be derived from the data;

conceptual descriptions: descriptions that cannot be derived directly from the data, as an interpretation is necessary;

non-visual/auditory descriptions: descriptions that cannot be derived from the data at all.

The non-visual/auditory descriptions therefore have to be supplied along with the data. Standards for the exchange of multimedia information like MPEG-7 [88][89][80] give explicit formats for descriptions attached to the multimedia object. However, these standards do not indicate how to find the values/descriptions to be used for a specific multimedia data object, and especially not how to do this automatically. Clearly the semantic gap is exactly what separates the perceptual and conceptual levels.


In this chapter we will consider the representation of the different modalities at the perceptual level. In later chapters we will consider indexing techniques which derive descriptions at the conceptual level. If such an indexing technique can be developed for a specific conceptual term, this term basically moves to the perceptual category, as the system can then generate such a description automatically without human intervention.

3.2 Basic notions

Multimedia data can be described at different levels. The first distinction to be made is between the complete data item, like an entire video, a song, or a text document, and subparts of the data, which can take different forms. A second distinction is between the content of the multimedia data and the layout. The layout is closely related to the way the content is presented to the viewer, listener, or reader.

Now let us take a closer look at what kind of subparts one can consider. This is directly related to the methods available for deriving those parts, and it has a close resemblance to the different classes of descriptions considered. Four different objects are considered:

perceptual object: a subpart in the data which can be derived by weak segmentation, i.e. segmentation of the data field based on perceptual characteristics of the data;

conceptual object: a part of the data which can be related to conceptual descriptions and which cannot be derived from the data without an interpretation; the process of finding these objects is called strong segmentation;

partition: the result of a partitioning of the data field, which is not dependent on the data itself;

layout object: the basic data elements which are structured and related by the layout.

Examples of the above categories for the different modalities will be considered in the subsequent sections. In the rest of the notes we will use:

multimedia item: a full multimedia data item, a layout element, or an element in the partitioning of a full multimedia item;

multimedia object: either a conceptual or a perceptual object.

Hence, a multimedia item can be, for example, a complete video, an image, or an audio CD, but also a shot in a video, or the upper left quadrant of an image. A multimedia object can e.g. be a paragraph in a text, or a house in an image.

3.3 Audio representation

(The information in this section is mostly based on [78].)

Figure 3.1: An example of a signal in the time-domain (top) and its spectrum (bottom).

3.3.1 Audio features

An audio signal is a digital signal in which amplitude is given as a function of time. When analyzing audio one can consider the time-domain signal directly, but it can be advantageous to also consider the signal on the basis of the frequencies it contains. The Fourier transform is a well-known technique to compute the contribution of the different frequencies in the signal. The result is called the audio spectrum of the signal. An example of a signal and its spectrum is shown in figure 3.1.

We first consider audio features which can be derived directly from the time-domain signal:

Average Energy: the average squared amplitude of the signal, an indication of the loudness.

Zero Crossing Rate: a measure indicating how often the amplitude switches from positive to negative and vice versa.

Rhythm: measures based on the pattern produced by emphasis and duration of notes in music, or by stressed and unstressed syllables in spoken words.

Linear Prediction Coefficients (LPC): a measure of how well a sample in the signal can be predicted based on previous samples.

In the frequency domain there is also an abundance of features to compute. Often encountered features are:

Bandwidth: the frequency range of the sound; in its simplest form the difference between the largest and smallest non-zero frequency in the spectrum.

Fundamental frequency: the dominant frequency in the spectrum.

Brightness (or spectral centroid): the normalized sum over all frequencies of each frequency times how much that frequency is present in the signal.

Mel-Frequency Cepstral Coefficients (MFCC): a set of coefficients designed such that they correspond to how humans hear sound.

Pitch: degree of highness or lowness of a musical note or voice.

Harmonicity: degree to which the signal is built out of multiples of the fundamental frequency.

Timbre: characteristic quality of sound produced by a particular voice or instrument (subjective).
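To make a few of these features concrete, the sketch below computes the average energy, the zero crossing rate, and the spectral centroid (brightness) of a short signal segment with NumPy. The 440 Hz test tone and the 16 kHz sampling rate are illustrative choices, not values prescribed by the text.

    import numpy as np

    def average_energy(x):
        return np.mean(x ** 2)                           # average squared amplitude

    def zero_crossing_rate(x):
        return np.mean(np.abs(np.diff(np.sign(x))) > 0)  # fraction of sign changes

    def spectral_centroid(x, sr):
        spectrum = np.abs(np.fft.rfft(x))                # magnitude spectrum
        freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)      # frequency of each bin
        return np.sum(freqs * spectrum) / np.sum(spectrum)

    sr = 16000
    t = np.arange(0, 0.010, 1.0 / sr)                    # one 10 ms segment
    x = np.sin(2 * np.pi * 440 * t)                      # synthetic 440 Hz tone
    print(average_energy(x), zero_crossing_rate(x), spectral_centroid(x, sr))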

3.3.2 Audio segmentation

An audio signal can be long, and over time the characteristics of the audio signal, and hence its features, will change. Therefore, in practice one partitions the signal into small segments of, say, 10 milliseconds duration. For each of those segments any of the features mentioned can then be calculated.

An audio signal can contain music, speech, silence, and other sounds (like cars etc.). Weak segmentation of audio aims at decomposing the signal into these four components. An important step is decomposing the signal based on different ranges of frequency. A second important factor is harmonicity, as this distinguishes music from other audio. Strong segmentation of audio aims at detecting different conceptual objects, like cars, or individual instruments in a musical piece. This requires building models for each of the different concepts.

3.3.3 Temporal relations

Given two different audio segments A and B, there is a fixed set of temporal relations that can hold between A and B, namely precedes, meets, overlaps, starts, equals, finishes, during, and their inverses, denoted by adding i at the end of the name (if B precedes A, the relation precedes i holds between A and B). These are known as Allen's relations [5]. As equals is symmetric, there are 13 such relations in total. The relations are such that for any two intervals A and B one and exactly one of the relations holds.
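A small sketch of how these relations can be determined for two intervals, each given as a (start, end) pair, is shown below; the function name and the use of the suffix "_i" for the inverse relations are our own conventions.

    def allen_relation(a, b):
        """Return which of Allen's 13 relations holds between intervals a and b."""
        (a1, a2), (b1, b2) = a, b
        if a2 < b1:                    return "precedes"
        if a2 == b1:                   return "meets"
        if a1 == b1 and a2 == b2:      return "equals"
        if a1 == b1 and a2 < b2:       return "starts"
        if a1 > b1 and a2 == b2:       return "finishes"
        if a1 > b1 and a2 < b2:        return "during"
        if a1 < b1 and b1 < a2 < b2:   return "overlaps"
        # none of the seven basic relations holds from a's side,
        # so the corresponding inverse relation must hold
        return allen_relation(b, a) + "_i"

    print(allen_relation((0, 5), (5, 9)))   # meets
    print(allen_relation((2, 4), (0, 9)))   # during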

3.4 Image representation

3.4.1 Image features

A digital image is an array of pixels, where each pixel has a color. The basic representation for the color of a pixel is the triple R(ed), G(reen), B(lue). There are, however, many other color spaces which are more appropriate for certain tasks. We consider HSV and Lab here. A first thing to realize is that the color of an object is actually a color spectrum, indicating how much of each wavelength is present (white light contains an equal amount of all wavelengths). This is the basis for defining HSV. To be precise, the three components of HSV are as follows. H(ue) is the dominant wavelength in the color spectrum. It is what you typically mean when you say the object is red, yellow, blue, green, purple, etc. S(aturation) is a measure for the amount of white in the spectrum. It defines the purity of a color, distinguishing for example signal-red from pink. Finally, the V(alue) is a measure for the brightness or intensity of the color. This makes the difference between a dark and a light color if they have the same H and S values. Lab is another color space that is used often. The L is similar to the V in HSV. The a and b are similar to H and S. The important difference is that in the Lab space the distance between colors in the color space is approximately equal to the perceived difference between the colors. This is important in defining similarity; see chapter 4.

In the above, a color is assigned to individual pixels. All these colors together generate patterns in the image, which can be small or large. These patterns are denoted with the general term texture. Texture is of great importance in classifying different materials, like the line-like pattern in a brick wall, or the dot-like pattern of sand. In an image there will in general be different colors and/or textures. This means there will be many positions in the image where there is a significant change in image data, in particular a change in color or texture. These changes form (partial) lines called edges.

The above measures give a local description of the data in the image. In many cases global measures, summarizing the information in the image, are used. Most commonly these descriptions are in the form of color histograms, counting how many pixels have a certain color. It can, however, also be a histogram of the directions of the different edge pixels in the image. An image histogram loses all information on spatial configurations of the pixels. If there is a peak in the histogram at the color red, the pixels can be scattered all around the image, or they can form one big red region. Color coherence vectors are an alternative representation which considers how many pixels in the neighborhood have the same color. A similar representation can be used for edge direction.

The above histograms and coherence vectors can be considered summaries of the data; they are non-reversible: one cannot recover the original image from its histogram. The following two descriptions do have that property. The Discrete Cosine Transform (DCT) is a transform which takes an image and computes its frequency-domain description. This is exactly the same as considered for audio earlier, but now in two dimensions. Coefficients of the low-frequency components give a measure of the amount of large-scale structure, whereas the high-frequency components give information on local detail. The (Haar) wavelet transform is a similar representation, which also takes into account where in the image the specific structure is found.
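As an example of such a global description, the sketch below computes a coarse color histogram for an image stored as an H x W x 3 array of 8-bit RGB values; the random stand-in image and the choice of 4 bins per channel are only illustrative.

    import numpy as np

    def color_histogram(img, bins_per_channel=4):
        # quantize each channel into a few bins and count pixels per (R, G, B) bin
        q = (img // (256 // bins_per_channel)).reshape(-1, 3).astype(int)
        idx = q[:, 0] * bins_per_channel ** 2 + q[:, 1] * bins_per_channel + q[:, 2]
        hist = np.bincount(idx, minlength=bins_per_channel ** 3)
        return hist / hist.sum()        # normalize so images of different size are comparable

    img = np.random.randint(0, 256, size=(120, 160, 3), dtype=np.uint8)  # stand-in image
    print(color_histogram(img).shape)   # (64,) bins for 4 bins per channel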

3.4.2 Image segmentation

For images we can also consider the different ways of segmenting an image. A partition decomposes the image into fixed regions. Commonly this is either a fixed set of rectangles, or one fixed rectangle in the middle of the image and a further partition of the remaining space into a fixed number of equal parts. Weak segmentation boils down to grouping pixels in the image based on a homogeneity criterion on color or texture, or by connecting edges. It leads to a decomposition of the image where each region in the decomposition has a uniform color or texture. For strong segmentation, finding specific conceptual objects in the image, we again have to rely on models for each specific object, or on a large set of hand-annotated examples.


3.4.3 Spatial relations

If we have two rectangular regions, we can consider Allen's relations separately for the x- and y-coordinates. In general, however, regions have arbitrary shape. There are various 2D spatial relations that one can consider. Relations like left-of, above, surrounded-by, and nearest-neighbor are an indication of the relative positions of regions in the image. Constraints like inside and enclosed-by are indications of topological relations between regions.
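For axis-aligned rectangular regions, given as (x_min, y_min, x_max, y_max) tuples, two of these relations can be checked as in the sketch below; the example coordinates are made up.

    def left_of(a, b):
        return a[2] < b[0]                       # a ends before b starts along the x-axis

    def inside(a, b):
        return (a[0] >= b[0] and a[1] >= b[1] and
                a[2] <= b[2] and a[3] <= b[3])   # a is completely contained in b

    window = (10, 20, 30, 40)
    facade = (0, 0, 100, 60)
    print(left_of(window, facade), inside(window, facade))   # False True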

3.5 Video representation

3.5.1 Video features

As a video is a set of temporally ordered images, its representation clearly shares many of the representations considered above for images. However, the addition of a time component also adds many new aspects. In particular, we can consider the observed movement of each pixel from one frame to another, called the optic flow, or we can consider the motion of individual objects segmented from the video data.
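The sketch below estimates the optic flow between two consecutive frames with the Farneback algorithm as implemented in OpenCV; this is just one common estimator among many, and the two random arrays stand in for real video frames.

    import cv2
    import numpy as np

    prev_frame = np.random.randint(0, 256, (120, 160, 3), dtype=np.uint8)  # stand-ins for
    next_frame = np.random.randint(0, 256, (120, 160, 3), dtype=np.uint8)  # two real frames

    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)

    # flow[y, x] holds the (dx, dy) displacement of the pixel at (x, y)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    print(flow.shape)                                                      # (120, 160, 2)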

3.5.2 Video segmentation

A partition of a video can be any combination of a temporal and a spatial partition. For weak segmentation we have, in addition to color- and texture-based grouping of pixels, motion-based grouping, which groups pixels if they have the same optic flow, i.e. move in the same direction with the same speed. Strong segmentation again requires object models. The result of either weak or strong segmentation is called a spatio-temporal object. For video there is one special case of weak segmentation, namely temporal segmentation. Here the points in time are detected where there is a significant change in the content of the frames. This will be considered in a more elaborate form later.
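A minimal sketch of this temporal segmentation is given below: consecutive frames are compared by the L1 distance between their color histograms, and a boundary is reported whenever the distance exceeds a threshold. The histogram quantization and the threshold value are illustrative choices, not the method prescribed by these notes.

    import numpy as np

    def frame_histogram(frame, bins=8):
        q = (frame // (256 // bins)).reshape(-1, 3).astype(int)
        idx = q[:, 0] * bins * bins + q[:, 1] * bins + q[:, 2]
        h = np.bincount(idx, minlength=bins ** 3).astype(float)
        return h / h.sum()

    def shot_boundaries(frames, threshold=0.5):
        cuts = []
        for i in range(1, len(frames)):
            d = np.abs(frame_histogram(frames[i]) -
                       frame_histogram(frames[i - 1])).sum()   # L1 histogram distance
            if d > threshold:                                  # large change in content
                cuts.append(i)
        return cuts

Such a detector only finds abrupt cuts; gradual transitions like dissolves require a more elaborate analysis, as discussed in chapter 6.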

3.5.3 Spatio-temporal relations

Spatio-temporal relations are clearly a combination of spatial and temporal relations. One should note, however, that in a video the spatial relations between two objects can vary over time. Two objects A and B can be in the relation A left-of B at some point in time, while the movement of A and B can yield the relation B left-of A at a later point in time.

3.6 Text representation

3.6.1 Text features

The basic representation for text is the so-called bag-of-words approach. In this approach a kind of histogram is made, indicating how often a certain word is present in the text. This histogram construction is preceded by a stop-word elimination step in which words like "the", "in", etc. are removed. One also performs stemming on the words, bringing each word back to its base; e.g. "biking" will be reduced to the verb "to bike". The bag-of-words model commonly used is the Vector Space Model [116]. The model is based on linear algebra. A document is modeled as a vector of words, where each word is a dimension in Euclidean space.

          D1   D2   D3   D4   ...   Dm
    T1     1    3    0    4
    T2     2    0    0    0
    T3     0    1    0    0
    T4     0    0    1    2
    ...
    Tn

Figure 3.2: A term-document matrix.

set of terms in the collection. Then we can represent the terms dT in document dj j as a vector x = (x1 , x2 , . . . , xn ) with: xi = ti j 0 if ti dT j if ti dT j ; . (3.1)

Where ti represents the frequency of term ti in document dj . Combining all docj ument vectors creates a term-document matrix. An example of such a matrix is shown in gure 3.2. Depending on the context, a word has an amount of information. In an archive about information retrieval, the word retrieval does not add much information about this document, as the word retrieval is very likely to appear in many documents in the collection. The underlying rationale is that words that occur frequently in the complete collection have low information content. However, if a single document contains many occurrences of a word, the document is probably relevant. Therefore, a term can be given a weight, depending on its information content. A weighting scheme has two components: a global weight and a local weight. The global importance of a term is indicating its overall importance in the entire collection, weighting all occurrences of the term with the same value. Local term weighting measures the importance of the term in a document. Thus, the value for a term i in a document j is L(i, j) G(i), where L(i, j) is the local weighting for term i in document j and G(i) is the global weight. Several dierent term weighting schemes have been developed. We consider the following simple form here known i as Inverse Term Frequency Weigthing. The weight wj of a term ti in a document j is given by
    w_j^i = t_i^j · log( N / t_i )    (3.2)

where N is the number of documents in the collection and t_i denotes the total number of times word t_i occurs in the collection. The logarithm dampens the effect of very high term frequencies. Going one step further one can also consider the co-occurrence of certain words, in particular which words follow each other in the text. If one applies this to a large collection of documents to be used in analyzing other documents, it is called a bi-gram language model. It gives the probability that a certain word is followed by another word. It is therefore also an instantiation of a Markov model. When three or, more generally, n subsequent words are used we have a 3-gram or n-gram language model.
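To make the vector space model and the weighting of equation (3.2) concrete, the following minimal sketch builds a weighted term-document matrix for a toy collection; the stop-word list and the example documents are made-up placeholders, and stemming is omitted.

    import math
    from collections import Counter

    STOP_WORDS = {"the", "in", "a", "of", "is"}   # toy stop-word list

    def bag_of_words(text):
        # Lowercase, split on whitespace, drop stop words.
        words = [w for w in text.lower().split() if w not in STOP_WORDS]
        return Counter(words)

    def weighted_term_document_matrix(documents):
        """Rows are documents, columns are terms, entries follow eq. (3.2)."""
        bags = [bag_of_words(d) for d in documents]
        terms = sorted(set(t for b in bags for t in b))
        n_docs = len(documents)
        # Total number of occurrences of each term in the whole collection.
        collection_count = Counter()
        for b in bags:
            collection_count.update(b)
        matrix = []
        for b in bags:
            row = [b[t] * math.log(n_docs / collection_count[t]) if b[t] else 0.0
                   for t in terms]
            matrix.append(row)
        return terms, matrix

    documents = ["the cat sat on the mat", "the dog chased the cat"]
    terms, matrix = weighted_term_document_matrix(documents)
    # "cat" occurs in every document, so its weight becomes log(2/2) = 0,
    # reflecting its low information content in this tiny collection.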


3.6.2 Text segmentation

Different parts of a document may deal with different topics and therefore it can be advantageous to partition a document into segments of fixed size for which the word histogram is computed. A useful technique, which can be considered an equivalent of weak segmentation, is part-of-speech tagging [79]. In this process each word is assigned the proper class, e.g. verb, noun, etc. A simple way of doing this is by making use of a dictionary and a bi-gram language model. One can also take the tagged result and find larger chunks as aggregates of the individual words. This technique is known as chunking [1]. A more sophisticated, but less general, approach is to generate a grammar for the text and use a parser to do the part-of-speech tagging. It is, however, very difficult to create grammars which can parse an arbitrary text.
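The combination of a dictionary and a bi-gram language model mentioned above can be sketched as follows. The tiny dictionary and the tag-transition probabilities are invented for illustration, and a greedy left-to-right choice is used instead of a full search over tag sequences.

    # Possible tags per word (a toy dictionary).
    DICTIONARY = {
        "the": ["DET"],
        "dog": ["NOUN"],
        "barks": ["VERB", "NOUN"],
        "loudly": ["ADV"],
    }

    # P(tag | previous tag), a toy bi-gram model over tags.
    TRANSITIONS = {
        ("START", "DET"): 0.6, ("START", "NOUN"): 0.3, ("START", "VERB"): 0.1,
        ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.1,
        ("NOUN", "VERB"): 0.7, ("NOUN", "NOUN"): 0.3,
        ("VERB", "ADV"): 0.8, ("VERB", "NOUN"): 0.2,
    }

    def tag(sentence):
        """Greedy part-of-speech tagging: for each word pick the dictionary
        tag with the highest transition probability from the previous tag."""
        tags = []
        prev = "START"
        for word in sentence.lower().split():
            candidates = DICTIONARY.get(word, ["NOUN"])  # unknown words default to NOUN
            best = max(candidates, key=lambda t: TRANSITIONS.get((prev, t), 0.0))
            tags.append((word, best))
            prev = best
        return tags

    # tag("The dog barks loudly")
    # -> [('the', 'DET'), ('dog', 'NOUN'), ('barks', 'VERB'), ('loudly', 'ADV')]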

3.6.3 Textual relations

As text can be considered a sequence of words, the order in which words are placed in the text directly yields a relation "precedes". This is similar to Allen's relations for time, but as words cannot overlap, there is no need for the other relations. In the case where the layout of the text is also taken into account, i.e. how it is printed on paper, many other relations appear, as we can then consider the spatial relations between blocks in the document.

3.7 Other data

As indicated earlier, next to the multimodal stream of text, audio, and video there is also other related information which is not directly part of the video document. This is often factual data. For these we use the common categorization into [156]:

nominal data: values taken from a selected set of symbols, where no relation is assumed between the different symbols.

ordinal data: values taken from a selected set of symbols, where an ordering relation is assumed between the different symbols.

interval data: quantities measured in fixed and equal units.

ratio data: quantities for which the measurement scheme inherently defines a zero point.

Keyterms in this chapter


Perceptual metadata, conceptual metadata, non-visual/auditory metadata, multimedia item, multimedia object, partitioning, perceptual object, weak segmentation, conceptual object, strong segmentation, audio spectrum, Allen's relations, color space, texture, edges, spatial relations, topological relations, optic flow, spatio-temporal object, n-gram, part-of-speech tagging, chunking

Chapter 4

Similarity
When considering collections of multimedia items, it is not sufficient to describe individual multimedia items. It is equally important to be able to compare different multimedia items. To that end we have to define a so-called (dis)similarity function S to compare two multimedia items I1 and I2. The function S measures to what extent I1 and I2 look or sound similar, to what extent they share a common style, or to what extent they have the same interpretation. Thus, in general we distinguish three different levels of comparison:

Perceptual similarity

Layout similarity

Semantic similarity

Each of these levels will now be described.

4.1 Perceptual similarity

Computing the dissimilarity of two multimedia items based on their data is mostly done by comparing their feature vectors. For an image this can for example be an HSV color histogram, but it can also be an HSV histogram followed by a set of values describing the texture in the image. One dissimilarity function between two such vectors is the Euclidean distance. As not all elements in the vector might be equally important, the distances between the individual elements can be weighted. To make the above more precise, let F^1 = {f_i^1}_{i=1,...,n} and F^2 = {f_i^2}_{i=1,...,n} be the two vectors describing multimedia items I1 and I2 respectively. Then the dissimilarity S_E according to the weighted Euclidean distance is given by:

    S_E(F^1, F^2) = Σ_{i=1}^{n} w_i (f_i^2 − f_i^1)^2

with w = {w_i}_{i=1,...,n} a weighting vector. For color histograms (or any other histogram for that matter) the histogram intersection is also often used as a measure of similarity. It does not measure Euclidean distance but takes the bin-wise minimum of the two entries. Using the same notation as above we find:

    S_∩(F^1, F^2) = Σ_{i=1}^{n} min(f_i^2, f_i^1)
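A minimal sketch of the two measures above, assuming the feature vectors, weights and histograms are given as equal-length numpy arrays.

    import numpy as np

    def weighted_euclidean(f1, f2, w=None):
        """Weighted Euclidean dissimilarity between two feature vectors,
        following the summation above (no square root is taken)."""
        f1, f2 = np.asarray(f1, float), np.asarray(f2, float)
        w = np.ones_like(f1) if w is None else np.asarray(w, float)
        return float(np.sum(w * (f2 - f1) ** 2))

    def histogram_intersection(h1, h2):
        # Sum of bin-wise minima; larger values mean more similar histograms.
        return float(np.sum(np.minimum(h1, h2)))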


In [133] it is shown that it is advantageous to compute the cumulative histogram F̂ rather than the regular histogram F when computing the similarity. This is due to the fact that the use of cumulative histograms reduces the effect of having to limit the number of different bins in a histogram. The entries of the cumulative histogram are computed as:

    F̂(m) = Σ_{k=0}^{m} F_k ,

where F is the regular histogram. After computing the cumulative histogram, histogram intersection can be used to compute the similarity between the two histograms.

For comparing two texts it is common to compute the similarity rather than the dissimilarity. Given the vector-space representation, the similarity between two text vectors q and d, both containing n elements, is defined as:

    S = q · d ,    (4.1)

where · denotes the inner product, computed as

    q · d = Σ_{i=1}^{n} q_i d_i

If you consider the vector as a histogram indicating how many times a certain word occurs in the document (possibly weighted by the importance of the term), this boils down to histogram multiplication. Hence, when a term occurs often in both texts, the contribution of that term to the value of the inner product will be high. A problem with the above formulation is the fact that larger documents will contain more terms and hence are on average more similar than short pieces of text. Therefore, in practice the vectors are normalized by their length

    ||q|| = √( Σ_{i=1}^{n} q_i^2 )

leading to the so-called cosine similarity:

    S = (q · d) / (||q|| ||d||)

This measure is called the cosine similarity as it can be shown that it equals the cosine of the angle between the two length normalized vectors. Feature vectors and similarity will play a crucial role in interactive multimedia retrieval which will be considered in chapter 9.
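A minimal sketch of the cosine similarity on term-count vectors; the toy vectors are made up for illustration.

    import numpy as np

    def cosine_similarity(q, d):
        """Inner product of the two vectors divided by the product of
        their lengths; returns 0 if either vector is all zeros."""
        q, d = np.asarray(q, float), np.asarray(d, float)
        norm = np.linalg.norm(q) * np.linalg.norm(d)
        return float(np.dot(q, d) / norm) if norm > 0 else 0.0

    # Two documents over the same term vocabulary:
    # identical direction -> ~1.0, no shared terms -> 0.0.
    print(cosine_similarity([1, 2, 0], [2, 4, 0]))   # ~1.0
    print(cosine_similarity([1, 0, 0], [0, 3, 5]))   # 0.0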

4.2 Layout similarity

To measure the (dis)similarity of two multimedia items based on their layout, a common approach is to transform the layout of the item into a string containing the essential layout structure. For simplicity we consider the layout of a video, as this is easily transformed into a 1D string; the approach can however be extended to two dimensions, e.g. when comparing the layout of two printed documents. When the layouts of two multimedia items are described using strings they can be compared by making use of the edit distance, in [2] defined as follows:


Figure 4.1: Example of the semantic distance of two instantiations c1, c2 of concepts C1 and C2 organized in a tree structure. In this case the distance would be equal to 3.

Definition 4 (Edit Distance) Given two strings A = a1, a2, a3, ..., an and B = b1, b2, b3, ..., bm over some alphabet Σ, a set of allowed edit operations E, and a unit cost for each operation, the edit distance of A and B is the minimum cost to transform A into B by making use of edit operations.

As an example let's take two video sequences and consider the transitions cut (c), wipe (w), and dissolve (d), and let a shot be denoted by the symbol s. Hence, Σ = {c, w, d, s}. Two example sequences could now look like

    A = scswsdscs
    B = sdswscscscs

Now let E = {insertion, deletion, substitution} and for simplicity let each operation have equal cost. To transform A into B we have to do two substitutions to change the effects c and d used, and two insert operations to add the final "cs". Hence, in this case the edit distance would be equal to 4.
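The edit distance can be computed with standard dynamic programming. The sketch below uses unit costs for insertion, deletion, and substitution and reproduces the distance of 4 for the example sequences above.

    def edit_distance(a, b):
        """Minimum number of insertions, deletions and substitutions
        (all with unit cost) needed to transform string a into string b."""
        n, m = len(a), len(b)
        # dist[i][j] = edit distance between a[:i] and b[:j]
        dist = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            dist[i][0] = i
        for j in range(m + 1):
            dist[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                                 dist[i][j - 1] + 1,          # insertion
                                 dist[i - 1][j - 1] + cost)   # substitution or match
        return dist[n][m]

    print(edit_distance("scswsdscs", "sdswscscscs"))  # 4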

4.3 Semantic similarity

When multimedia items are described using concepts which are derived from an ontology, either annotated by the user or derived using pattern recognition, we can compare two multimedia items based on their semantics. When the ontology is organized as a hierarchical tree, an appropriate measure for the semantic distance, i.e. the dissimilarity of two instantiations c1, c2 of the concepts C1 and C2, is the number of steps in the tree one has to follow to move from concept C1 to C2. If the ontology is organized as a set of hierarchical views this leads to a vector where each element in the vector is related to one of the views. A very simple example is shown in figure 4.1.
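A minimal sketch of the semantic distance in a tree-structured ontology: the distance is the number of edges on the path between the two concepts, found here via their paths to the root. The toy hierarchy is made up for illustration and does not reproduce figure 4.1.

    def path_to_root(node, parent):
        """List of nodes from the given node up to the root."""
        path = [node]
        while node in parent:
            node = parent[node]
            path.append(node)
        return path

    def semantic_distance(c1, c2, parent):
        """Number of edges between two concepts in a tree given as a
        child -> parent dictionary."""
        p1, p2 = path_to_root(c1, parent), path_to_root(c2, parent)
        ancestors1 = {n: depth for depth, n in enumerate(p1)}
        for depth2, n in enumerate(p2):
            if n in ancestors1:                  # lowest common ancestor
                return ancestors1[n] + depth2
        raise ValueError("concepts are not in the same tree")

    # A toy concept hierarchy (child -> parent).
    parent = {"cat": "mammal", "dog": "mammal", "mammal": "animal",
              "sparrow": "bird", "bird": "animal"}
    print(semantic_distance("cat", "dog", parent))      # 2
    print(semantic_distance("cat", "sparrow", parent))  # 4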

Keyterms in this chapter


(dis)similarity function, histogram intersection, Euclidean distance, histogram cross-correlation, cosine distance, edit distance, semantic distance


Chapter 5

Multimedia Indexing Tools


5.1 Introduction

In the previous chapters we have made a first step in limiting the size of the semantic gap by representing multimedia data in terms of various features and similarities. In particular, we have considered features which are of the perceptual class. In this chapter we will consider general techniques that use these features to find the conceptual label of a multimedia object by considering its perceptual features. The techniques required to do the above task are commonly known as pattern recognition, where pattern is the generic term used for any set of data elements or features. The descriptions in this chapter are taken for the largest part from the excellent review of pattern recognition in [57].

5.2 Pattern recognition methods

Many methods for pattern recognition exist. Most of the methods fall into one of the following four categories:

Template matching: the pattern to be recognized is compared with a learned template, allowing changes in scale and pose. This simple and intuitive method can work directly on the data. For images a template is usually a small image (let's say 20x20 pixels); for audio it is a set of samples. Given a set of templates in the same class, one template representing the class is computed, e.g. by pixelwise averaging. In its simplest form any new pattern is compared pixelwise (for images) or samplewise (for audio) to the set of stored templates. The new pattern is then assigned to the class for which the correlation between the stored template and the new pattern is highest. In practice template matching becomes more difficult as one cannot assume that two templates to be compared are near exact copies of one another. An image might have a different scale, the object in the image might have a different pose, or the audio template might have a different loudness. Hence, substantial preprocessing is required before template matching can take place. Invariant features can help in this problem (see section 5.5).

Statistical classification: the pattern to be recognized is classified based on the distribution of patterns in the space spanned by pattern features.

Syntactic or structural matching: the pattern to be recognized is compared to a small set of learned primitives and grammatical rules for combining primitives.


Figure 5.1: Overview of the multimedia indexing steps.


For applications where the patterns have an apparent structure these methods are appealing. They allow the introduction of knowledge on how the patterns in the different classes are built from the basic primitives. As an example consider graphical symbols. Every symbol is built from lines, curves, and corners, which are combined into more complex shapes like squares and polygons. Symbols can be distinguished by looking at how the primitives are combined in creating the symbol. A major disadvantage of such methods is that they require all primitives in the pattern to be detected correctly, which is not the case if the data is corrupted by noise.

Neural networks: the pattern to be recognized is input to a network which has learned nonlinear input-output relationships. Neural networks mimic the way humans recognize patterns. They can be viewed as massively parallel computing systems consisting of an extremely large number of simple processors with many interconnections. In the human brain those simple processors are called neurons; when simulated on a computer one calls them perceptrons. In both cases, the processors have a set of weighted inputs and fire if the weighted sum of their inputs is above a certain threshold. To train a neural network, input patterns are fed to the system and the expected output is defined. Then the weights for the different connections are automatically learned by the system. In most systems there are also many perceptrons which are connected neither to the input nor to the output, but are part of the so-called hidden layer. Such a neural network is called a multi-layer perceptron. For specific cases neural networks and statistical classifiers coincide. Neural networks are appealing as they can be applied in many different situations. It is, however, difficult to understand why a neural network assigns a certain class to an input pattern.

Examples of the above methods are found throughout the lecture notes. The statistical methods are the most commonly encountered ones; they will be considered in more detail next.

5.3 Statistical methods

We consider methods here which assume that the patterns to be recognized are described as a feature vector; thus every pattern can be associated with a specific point in the feature space. In addition a distance function should be provided, which in our case would be a similarity function as defined in chapter 4. As an example consider the following very simple classification problem: a set of images of characters (i, o, a) for which only two features are calculated, namely the width and the height of the bounding box of the character, and similarity is defined as the Euclidean distance. Before we start describing different methods, let us first consider the general scheme for building classifiers. The whole process is divided into two stages:

Training: in this stage the system builds a model of the data and yields a classifier, based on a set of training patterns for which class information is provided by a supervisor. Since a supervisor is involved, such methods are in literature called supervised classification methods. The process is as follows. Every pattern is preprocessed to remove noise etc. Then a set of features is calculated, and a classifier is learned through a statistical analysis of the dataset.


                 classified i   classified o
    true i            71              4
    true o             2             38

    Figure 5.2: Example confusion table.

To improve the results, the system can adjust the preprocessing, select the best features, or try other features.

Testing: patterns from a test set are given to the system and the classifier outputs the optimal label. In this process the system should employ exactly the same preprocessing steps and extract the same features as in the learning phase. To evaluate the performance of the system, the output of the classifier for the test patterns is compared to the label the supervisor has given to the test patterns. This leads to a confusion matrix, where one can see how often patterns are confused. In the above example it would for instance indicate how many i's were classified as being part of the class o. An example confusion table is shown in figure 5.2. After testing, the performance of the system is known and the classifier is used for classifying unknown patterns into their class.

To make things more precise, let the c categories be given by ω_1, ω_2, ..., ω_c. Furthermore, let the vector consisting of n features be given as x = (x_1, x_2, ..., x_n). A classifier is a system that takes a feature vector x and assigns to it the optimal class ω_i. The confusion matrix is the matrix C where the elements c_ij denote the number of elements which have true class ω_i, but are classified as being in class ω_j.

Conceptually the most simple classification scheme is the

k-Nearest Neighbor: assigns a pattern to the majority class among the k patterns with smallest distance in feature space.

For this method, after selection of the relevant features a distance measure d has to be defined. In principle, this is the similarity function described in chapter 4. Very often it is simply the Euclidean distance in feature space. Having defined d, classification boils down to finding the nearest neighbor(s) in feature space. In 1-nearest neighbor classification the pattern x is assigned to the same class as its nearest neighbor. In k-nearest neighbor classification one uses majority voting on the class labels of the k points nearest in feature space. An example is shown in figure 5.3. The major disadvantage of using k-nearest neighbors is that it is computationally expensive to find the k nearest neighbors if the number of data objects is large.

In statistical pattern recognition a probabilistic model for each class is derived. Hence, one only has to compare a new pattern with one model for each class. The crux of statistical pattern recognition is then to model the distribution of feature values for the different classes, i.e. to give a way to compute the conditional probability P(x|ω_i).

Bayes Classifier: assigns a pattern to the class which has the maximum estimated posterior probability.


Figure 5.3: Illustration of classification based on k-nearest neighbors.
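A minimal sketch of the k-nearest neighbor rule illustrated in figure 5.3, using Euclidean distance and majority voting; the toy training set of character bounding boxes is made up for illustration.

    import numpy as np
    from collections import Counter

    def knn_classify(x, training_set, k=3):
        """Assign x to the majority class among its k nearest neighbors."""
        x = np.asarray(x, float)
        # Euclidean distance from x to every training pattern.
        distances = [(np.linalg.norm(x - np.asarray(f, float)), label)
                     for f, label in training_set]
        distances.sort(key=lambda pair: pair[0])
        votes = Counter(label for _, label in distances[:k])
        return votes.most_common(1)[0][0]

    # Toy example: width/height of character bounding boxes.
    training = [((2, 8), "i"), ((3, 9), "i"), ((7, 7), "o"), ((8, 8), "o")]
    print(knn_classify((6, 7), training, k=3))  # 'o'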

Figure 5.4: Example of a 2D linear separator.

Thus, assign input pattern x to class ω_i if

    P(ω_i | x) > P(ω_j | x)   for all j ≠ i    (5.1)
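To make decision rule (5.1) concrete, the sketch below estimates, for each class, a prior and an independent Gaussian per feature from the training data (as discussed in the next paragraph) and assigns the class with the highest resulting posterior. It is a naive-Bayes-style simplification, not a full multivariate treatment, and the toy data is made up.

    import numpy as np

    def fit_gaussians(training_set):
        """Estimate per-class prior, mean and standard deviation per feature."""
        models = {}
        n_total = len(training_set)
        labels = set(label for _, label in training_set)
        for label in labels:
            X = np.array([f for f, l in training_set if l == label], float)
            models[label] = (len(X) / n_total, X.mean(axis=0), X.std(axis=0) + 1e-9)
        return models

    def classify(x, models):
        """Assign x to the class with the maximum estimated posterior
        (proportional to prior times class-conditional likelihood)."""
        x = np.asarray(x, float)
        best_label, best_score = None, -np.inf
        for label, (prior, mean, std) in models.items():
            # Log of the product of per-feature Gaussian densities.
            log_likelihood = -0.5 * np.sum(((x - mean) / std) ** 2
                                           + np.log(2 * np.pi * std ** 2))
            score = np.log(prior) + log_likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    training = [((2, 8), "i"), ((3, 9), "i"), ((7, 7), "o"), ((8, 8), "o")]
    print(classify((6, 7), fit_gaussians(training)))  # 'o'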

When feature values can be expected to be normally distributed (i.e. to follow a Gaussian distribution), the above can be used to find optimal linear boundaries in the feature space. An example is shown in figure 5.4. Note that in general we do not know the parameters of the distribution (the mean and the standard deviation in the case of the normal distribution). These have to be estimated from the data. Most commonly, the mean and standard deviation of the training samples in each class are used. If only a limited set of samples per class is available, a Parzen estimator can be used. In this method a normal distribution with fixed standard deviation is placed at every sample. The probability that a new sample belongs to the class is computed as a weighted linear combination of all these distributions, where the contribution of an individual training sample decreases with its distance to the new sample.

In the Bayes classifier all features are considered at once, which is more accurate, but also more complicated than using features one by one. The latter leads to the

Decision Tree: assigns a pattern to a class based on a hierarchical division of feature space.

To learn a decision tree, a feature and a threshold are selected which give a decomposition of the training set into two parts, such that one part contains all elements for which the feature value is smaller than the threshold, and the other part the remaining patterns. Then for each of the parts the process is repeated until the patterns in a part are assumed to be all of the same class.



Figure 5.5: Example of a hierarchical division of space based on a decision tree.

All the decisions made can be stored in a tree, hence the name. The relation between a decision tree and the corresponding division of feature space is illustrated in figure 5.5. At classification time a pattern is taken and, for the feature in the root node of the decision tree, the value of the corresponding feature is compared to the threshold. If the value is smaller, the left branch of the tree is followed; otherwise the right branch is followed. This is repeated until a leaf of the tree is reached, which corresponds to a single class.

All of the above models assume that the features of an object remain fixed. For data which has a temporal dimension this is often not the case; when time progresses the features might change. In such cases it is necessary to consider models which explicitly take the dynamic aspects into account. The Hidden Markov Model is a suitable tool for describing such time-dependent patterns, which can in turn be used for classification. It bears a close relation to the Markov model considered in chapter 3.

Hidden Markov Model (HMM): assigns a pattern to a class based on a sequential model of state and transition probabilities [79, 109].

Let us first describe the components which make up a Hidden Markov Model.

1. A set S = {s_1, ..., s_m} of possible states in which the system can be.
2. A set of symbols V which can be output when the system is in a certain state.
3. The state transition probabilities A, indicating the conditional probability that at time t+1 the system moves to state s_i if before it was in state s_j: p(q_{t+1} = s_i | q_t = s_j).
4. The probabilities that a certain symbol in V is output when the system is in a certain state.
5. Initial probabilities that the system starts in a certain state.

An example of an HMM is shown in figure 5.6. There are two basic tasks related to the use of HMMs for which efficient methods exist in literature:

1. Given a sequential pattern, how likely is it that it is generated by some given Hidden Markov Model?
2. Given a sequential pattern and a Hidden Markov Model, what is the most likely sequence of states the system went through?

To use HMMs for classification we first have to find models for each class. This is often done in a supervised way.


Figure 5.6: Example of a Hidden Markov Model.

Figure 5.7: An example of dimension reduction, from 2 dimensions to 1 dimension. (a) Two-dimensional datapoints. (b) Rotating the data so the vectors responsible for the most variation are perpendicular. (c) Reducing 2 dimensions to 1 dimension so that the dimension responsible for the most variation is preserved.

From there all the probabilities required are estimated using the training set. Now, to classify a pattern, task 1 mentioned above is used to find the most likely model and thus the most likely class. Task 2 can be used to take a sequence and label each part of the sequence with its most likely state. Many variations of HMMs exist. For example, in the above description discrete variables were used. One can also use continuous variables, leading to continuous observation density Markov models. Another extension is the product HMM, where first an HMM is trained for every individual feature, the results of which are then combined in a new HMM for integration.
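A minimal sketch of task 1 above: the forward algorithm computes how likely an observation sequence is under a given Hidden Markov Model, and classification then amounts to evaluating this likelihood under the model of each class and picking the largest. The tiny two-state model is made up for illustration.

    import numpy as np

    def sequence_likelihood(observations, initial, transition, emission):
        """Forward algorithm: P(observation sequence | model).
        initial[i]       = P(first state is i)
        transition[i, j] = P(next state is j | current state is i)
        emission[i, k]   = P(symbol k is output | state is i)
        observations     = list of symbol indices."""
        alpha = initial * emission[:, observations[0]]
        for obs in observations[1:]:
            alpha = (alpha @ transition) * emission[:, obs]
        return float(alpha.sum())

    # A toy 2-state, 2-symbol model.
    initial = np.array([0.6, 0.4])
    transition = np.array([[0.7, 0.3],
                           [0.2, 0.8]])
    emission = np.array([[0.9, 0.1],
                         [0.3, 0.7]])
    print(sequence_likelihood([0, 1, 1], initial, transition, emission))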

5.4 Dimensionality reduction

In many practical pattern recognition problems the number of features is very high and hence the feature space is of a very high dimension. A 20-dimensional feature space is hard to imagine and visualize, but is not very large in pattern recognition. Luckily there is often redundant information in the space and one can reduce the number of dimensions to work with. One of the best known methods for this purpose is Principal Component Analysis (PCA). Explaining the full details of PCA is beyond the scope of the lecture notes, so we explain its simplest form, reducing a 2D feature space to a 1D feature space. See figure 5.7. The first step is to find the line which best fits the data. Then every datapoint is projected onto this line. Now, to build a classifier we only consider how the different categories can be distinguished along this 1D line, which is much simpler than the equivalent 2D problem. For higher dimensions the principle is the same, but instead of a line one can now also use the best fitting plane, which corresponds to using 2 of the principal components.


In fact, the number of components in component analysis is equal to the dimension of the original space. By leaving out one or more of the least important components, one gets the principal components.
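A minimal numpy sketch of principal component analysis: center the data, compute the covariance matrix, keep the eigenvectors belonging to the largest eigenvalues, and project the data onto them. The random 2D data is only there to make the example runnable.

    import numpy as np

    def pca(X, n_components=1):
        """Project the rows of X onto the n_components directions
        of largest variance."""
        X = np.asarray(X, float)
        X_centered = X - X.mean(axis=0)
        # Covariance matrix of the features.
        cov = np.cov(X_centered, rowvar=False)
        eigenvalues, eigenvectors = np.linalg.eigh(cov)
        # eigh returns eigenvalues in ascending order; take the largest ones.
        components = eigenvectors[:, ::-1][:, :n_components]
        return X_centered @ components

    # Reduce 2D points that lie roughly along a line to a single dimension.
    rng = np.random.default_rng(0)
    t = rng.normal(size=100)
    X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=100)])
    print(pca(X, n_components=1).shape)  # (100, 1)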

5.5 Invariance

Selecting the proper classifier is important in obtaining good results, but finding good features is even more important. This is a hard task as it depends on both the data and the classification task. A key notion in finding the proper features is invariance, defined as [128]: a feature f of t is invariant under W if and only if f_t remains the same regardless of the unwanted condition expressed by W:

    t_1 ≅_W t_2  ⇒  f_{t_1} = f_{t_2}    (5.2)

To illustrate, consider again the simple example of bounding boxes of characters. If I want to say something about the relation between the width w and the height h of the bounding box, I could consider computing the difference w − h, but if I scale the picture by a factor 2 I get a result which is twice as large. If, however, I take the aspect ratio w/h, it is invariant under scaling. In general, we observe that invariant features are related to the intrinsic properties of the element in the image or audio. An object in the image will not change if I use a different lamp; a feature invariant to such changes in color would be good for classifying the object. On the other hand, the variant properties do have a great influence on how I perceive the image or sound. A loud piece of music might be far more annoying than hearing the same piece of music at normal level.

Keyterms in this chapter


Classification, training, testing, confusion matrix, principal component analysis, k-nearest neighbor, template matching, Bayes classifier, decision tree, Hidden Markov Model, invariance

Chapter 6

Video Indexing

6.1 Introduction

For browsing, searching, and manipulating video documents, an index describing the video content is required. It forms the crux for applications like digital libraries storing multimedia data, or filtering systems [95] which automatically identify relevant video documents based on a user profile. To cater for these diverse applications, the indexes should be rich and as complete as possible. Until now, construction of an index is mostly carried out by documentalists who manually assign a limited number of keywords to the video content. The specialist nature of the work makes manual indexing of video documents an expensive and time consuming task. Therefore, automatic classification of video content is necessary. This mechanism is referred to as video indexing and is defined as the process of automatically assigning content-based labels to video documents [48].

When assigning an index to a video document, three issues arise. The first is related to granularity and addresses the question: what to index, e.g. the entire document or single frames. The second issue is related to the modalities and their analysis and addresses the question: how to index, e.g. a statistical pattern classifier applied to the auditory content only. The third issue is related to the type of index one uses for labelling and addresses the question: which index, e.g. the names of the players in a soccer match, their time dependent position, or both. Which element to index clearly depends on the task at hand, but is for a large part also dictated by the capabilities of the automatic indexing methods, as well as by the amount of information that is already stored with the data at production time. As discussed in chapter 2, one of the most complex tasks is the interpretation of a recording of a produced video, as we have to reconstruct the layout and analyze the content. If, however, we are analyzing the edited video and all layout information as well as scripts are available in, for example, MPEG-7 format (see chapter 10), the layout reconstruction and much of the indexing is not needed and one can focus on the remaining indexing tasks.

Most solutions to video indexing address the how question with a unimodal approach, using the visual [27, 45, 102, 134, 141, 165, 169], auditory [35, 43, 82, 99, 100, 104, 157], or textual modality [17, 52, 170]. Good books [40, 51] and review papers [15, 18] on these techniques have appeared in literature. Instead of using one modality, multimodal video indexing strives to automatically classify (pieces of) a video document based on multimodal analysis.
1 This chapter is adapted from [131].


Figure 6.1: Data flow in unimodal video document segmentation.

Only recently, approaches using combined multimodal analysis have been reported [4, 8, 34, 55, 92, 105, 120] or commercially exploited, e.g. [28, 108, 151]. One review of multimodal video indexing is presented in [153]. The authors focus on approaches and algorithms available for processing of auditory and visual information to answer the how and what question. We extend this by adding the textual modality, and by relating the which question to multimodal analysis. Moreover, we put forward a unifying, multimodal framework. Our work should therefore be seen as an extension of the work of [15, 18, 153]. Combined they form a complete overview of the field of multimodal video indexing.

The multimodal video indexing framework is defined in section 2.5. This framework forms the basis for structuring the discussion on video document segmentation in section 6.2. In section 6.3 the role of conversion and integration in multimodal analysis is discussed. An overview of the index types that can be distinguished, together with some examples, is given in section 6.4. Finally, in section 6.5 we end with a perspective on open research questions.

As indicated earlier, we focus here on the indexing of a recording of produced and authored documents. Hence, we start off from the recorded data stream without making use of any descriptions that could have been added at production time. Given that this is the most elaborate task, many of the methods are also applicable in the other domains that we have considered in chapter 2.

6.2 Video document segmentation

For analysis purposes the process of authoring should be reversed. To that end, first a segmentation should be made that decomposes a video document into its layout and content elements. Results can be used for indexing specific segments. In many cases segmentation can be viewed as a classification problem, and hence pattern recognition techniques are appropriate. However, in video indexing literature many heuristic methods are proposed. We will first discuss reconstruction of the layout for each of the modalities. Then we will focus on segmentation of the content. The data flow necessary for analysis is visualized in figure 6.1.


6.2.1 Layout reconstruction

Layout reconstruction is the task of detecting the sensor shots and transition edits in the video data. For analysis purposes layout reconstruction is indispensable. Since the layout guides the spectator in experiencing the video document, it should also steer analysis.

For reconstruction of the visual layout, several techniques already exist to segment a video document on the camera shot level, known as shot boundary detection2. Various algorithms have been proposed in video indexing literature to detect cuts in video documents, all of which rely on computing the dissimilarity of successive frames. The computation of dissimilarity can be at the pixel, edge, block, or frame level. Which one is appropriate is largely dependent on the kind of changes in content present in the video, whether the camera is moving, etc. The resulting dissimilarities as a function of time are compared with some fixed or dynamic threshold. If the dissimilarity is sufficiently high a cut is declared. Block level features are popular as they can be derived from motion vectors, which can be computed directly from the visual channel when coded in MPEG, saving decompression time. For an extensive overview of different cut detection methods we refer to the survey of Brunelli in [18] and the references therein.

Detection of transition edits in the visual modality can be done in several ways. Since the transition is gradual, comparison of successive frames is insufficient. The first researchers exploiting this observation were Zhang et al. [164]. They introduced the twin-comparison approach, using a dual threshold that accumulates significant differences to detect gradual transitions. For an extensive coverage of other methods we again refer to [18]; we just summarize the methods mentioned. First, so-called plateau detection uses every k-th frame. Another approach is based on effect modelling, where video production-based mathematical models are used to spot different edit effects using statistical classification. Finally, a third approach models the effect of a transition on intensity edges in subsequent frames.

Detection of abrupt cuts in the auditory layout can be achieved by detection of silences and transition points, i.e. locations where the category of the underlying signal changes. In literature different methods have been proposed for their detection. In [99] it is shown that the average energy, En, is a sufficient measure for detecting silence segments. En is computed for a window, i.e. a set of n samples. If the averages for all the windows in a segment are found to be lower than a threshold, a silence is marked. Another approach is taken in [166]. Here En is combined with the zero-crossing rate (ZCR), where a zero-crossing is said to occur if successive samples have different signs. A segment is classified as silence if En is consistently lower than a set of thresholds, or if most ZCRs are below a threshold. This method also includes unnoticeable noise. Li et al. [73] use silence detection for separating the input audio segment into silence segments and signal segments. For the detection of silence periods they use a three-step procedure. First, raw boundaries between silence and signal are marked in the auditory data. In the succeeding two steps a fill-in process and a throwaway process are applied to the results. In the fill-in process short silence segments are relabelled signal, and in the throwaway process low energy signal segments are relabelled silence.
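A minimal sketch of window-based silence detection in the spirit of the energy and zero-crossing-rate measures just described; the window size and thresholds are placeholders that would have to be tuned to the material, and the exact decision criteria of [99] and [166] are not reproduced.

    import numpy as np

    def window_energy(window):
        # Average energy of a window of samples.
        return float(np.mean(np.asarray(window, float) ** 2))

    def zero_crossing_rate(window):
        # Fraction of successive sample pairs with different signs.
        signs = np.sign(np.asarray(window, float))
        return float(np.mean(signs[1:] != signs[:-1]))

    def mark_silence(samples, window_size=1024, energy_threshold=1e-4,
                     zcr_threshold=0.1):
        """Return one boolean per window: True if the window is classified
        as silence (low energy and low zero-crossing rate)."""
        samples = np.asarray(samples, float)
        labels = []
        for start in range(0, len(samples) - window_size + 1, window_size):
            w = samples[start:start + window_size]
            is_silent = (window_energy(w) < energy_threshold
                         and zero_crossing_rate(w) < zcr_threshold)
            labels.append(is_silent)
        return labels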
Besides silence detection, [73] also detects transition points in the signal segments by using break detection and break merging. They compute an onset break, when a clear increase of signal energy is detected, and an offset break, when a clear decrease is found, to indicate a potential change in the category of the underlying signal. This is done by moving a window over the signal segment and comparing En of the two halves of the window at each sliding position. In the second step, adjacent breaks of the same type are merged into a single break.

In [166] music is distinguished from speech, silence, and environmental sounds based on features of the ZCR and the fundamental frequency. To assign the probability of being music to an audio segment, four features are used: the degree of being harmonic (based on the fundamental frequency), the degree to which the audio spectrum exhibits a clear peak during a period of time (an indication for the presence of a fundamental frequency), the variance of the ZCR, and the range of the amplitude of the ZCR.

The first step in reconstructing the textual layout is referred to as tokenization; in this phase the input text is divided into units called tokens or characters. Detection of text shots can be achieved in different ways, depending on the granularity used. If we are only interested in single words we can use the occurrence of white space as the main clue. However, this signal is not necessarily reliable, because of the occurrence of periods, single apostrophes and hyphenation [79]. When more context is taken into account one can reconstruct sentences from the textual layout. Detection of periods is a basic heuristic for the reconstruction of sentences; about 90% of periods are sentence boundary indicators [79]. Transitions are typically found by searching for predefined patterns.

Since layout is very modality dependent, a multimodal approach for its reconstruction won't be very effective. The task of layout reconstruction can currently be performed quite reliably. However, results might improve even further when more advanced techniques are used, for example methods exploiting the learning capabilities of statistical classifiers.

2 As an ironic legacy from early research on video parsing, this is also referred to as scene-change detection.

6.2.2 Content segmentation

In subsection 2.5.2 we introduced the elements of content. Here we will discuss how to detect them automatically, using different detection algorithms exploiting visual, auditory, and textual information sources.

People detection

Detection of people in video documents can be done in several ways. They can be detected in the visual modality by means of their faces or other body parts, in the auditory modality by the presence of speech, and in the textual modality by the appearance of names. In the following, those modality specific techniques will be discussed in more detail. For an in-depth coverage of the different techniques we refer to the cited references.

Most approaches using the visual modality simplify the problem of people detection to detection of a human face. Face detection techniques aim to identify all image regions which contain a face, regardless of its three-dimensional position, orientation, and the lighting conditions used, and if present return their image location and extents [161]. This detection is by no means trivial because of variability in location, orientation, scale, and pose. Furthermore, facial expressions, facial hair, glasses, make-up, occlusion, and lighting conditions are known to make detection error prone. Over the years various methods for the detection of faces in images and image sequences have been reported; see [161] for a comprehensive and critical survey of current face detection methods. Of all methods currently available the one proposed by Rowley in [110] performs best [106]. This neural network-based system is able to detect about 90% of all upright and frontal faces, and more importantly the system only sporadically mistakes non-face areas for faces.


When a face is detected in a video, face recognition techniques aim to identify the person. A commonly used method for face recognition is matching by means of Eigenfaces [103]. In eigenface methods templates of size, let's say, 20x20 are used. For these example numbers this leads to a 20x20 = 400 dimensional space. Using Principal Component Analysis a subspace capturing the most relevant information is computed. Every component in itself is again a 20x20 template. This allows one to identify which information is most important in the matching process. A drawback of applying face recognition for video indexing is its limited generic applicability [120]. Reported results [9, 103, 120] show that face recognition works in constrained environments, preferably showing a frontal face close to the camera. When using face recognition techniques in a video indexing context one should account for this limited applicability.

In [85] people detection is taken one step further, detecting not only the head, but the whole human body. The algorithm presented first locates the constituent components of the human body by applying detectors for head, legs, left arm, and right arm. Each individual detector is based on the Haar wavelet transform using specific examples. After ensuring that these components are present in the proper geometric configuration, a second example-based classifier combines the results of the component detectors to classify a pattern as either a person or a non-person. A similar part-based approach is followed in [38] to detect naked people. First, large skin-colored components are found in an image by applying a skin filter that combines color and texture. Based on geometrical constraints between detected components an image is labelled as containing naked people or not. Obviously this method is suited for specific genres only.

The auditory channel also provides strong clues for the presence of people in video documents, through speech in the segment. When layout segmentation has been performed, classification of the different signal segments as speech can be achieved based on the features computed. Again different approaches can be chosen. In [166] five features are checked to distinguish speech from other auditory signals. The first one is the relation between the amplitudes of the ZCR and energy curves. The second one is the shape of the ZCR curve. The third and fourth features are the variance and the range of the amplitude of the ZCR curve. The fifth feature concerns the behaviour of the fundamental frequency within a short time window. A decision value is defined for each feature. Based on these features, classification is performed using the weighted average of these decision values. A more elaborate audio segmentation algorithm is proposed in [73]. The authors are able to segment not only speech, but also speech mixed with noise or music, with an accuracy of about 90%. They compared different auditory feature sets, and conclude that temporal and spectral features perform poorly, as opposed to Mel-frequency cepstral coefficients (MFCC) and linear prediction coefficients (LPC), which achieve a much better classification accuracy.

When a segment is labelled as speech, speaker recognition can be used to identify a person based on his or her speech utterance. Different techniques have been proposed, e.g. [91, 100]. A generic speaker identification system consisting of three modules is presented in [100]. In the first module feature extraction is performed using a set of 14 MFCCs from each window.
In the second module those features are used to classify each moving window using a nearest neighbor classifier. The classification is performed using a ground truth. In the third module the results of each moving window are combined to generate a single decision for each segment. The authors report encouraging performance using speech segments of a feature film.

A strong textual cue for the appearance of people in a video document is the occurrence of names. In [120], for example, natural language processing techniques using a dictionary, thesaurus, and parser are used to locate names in transcripts. The system calculates four different scores.


The first measure is a grammatical score based on part-of-speech tagging to find candidate nouns. The second is a lexical score indicating whether a detected noun is related to persons. The situational score is the third score, giving an indication whether the word is related to social activities involving people. Finally, the positional score for each word in the transcripts measures where in the text of the newsreader the word is mentioned. A net likelihood score is then calculated which, together with the name candidate and segment information, forms the system's output. Related to this problem is the task of named entity recognition, which is known from the field of computational linguistics. Here one seeks to classify every word in a document into one of eight categories: person, location, organization, date, time, percentage, monetary value, or none of the above [11]. In that reference name recognition is viewed as a classification problem, where every word is either part of some name, or not. The authors use a variant of an HMM for the name recognition task, based on a bi-gram language model. Compared to any other reported learning algorithm, their name recognition results are consistently better.

In conclusion, people detection in video can be achieved using different approaches, all having limitations. Variance in orientation and pose, together with occlusion, makes visual detection error prone. Speech detection and recognition is still sensitive to noise and environmental sounds. Also, more research on detection of names in text is needed to improve results. As the errors in different modalities are not necessarily correlated, a multimodal approach to detection of persons in video documents can be an improvement. Besides improved detection, fusion of different modalities is interesting with respect to recognition of specific persons.

Object detection

Object detection forms a generalization of the problem of people detection. Specific objects can be detected by means of specialized visual detectors, motion, sounds, and appearance in the textual modality. Object detection methods for the different modalities will be highlighted here.

Approaches for object detection based on visual appearance can range from detection of specific objects to detection of more general objects. An example of the former is given in [122], where the presence of passenger cars in image frames is detected by using multiple histograms. Each histogram represents the joint statistics of a subset of wavelet coefficients and their position on the object. The authors use statistical modelling to account for variation, which enables them to reliably detect passenger cars over a wide range of points of view. In the above, we know what we are looking for and the number of classes is small, so one can perform strong segmentation. If not, grouping based on motion, i.e. weak segmentation, is the best in absence of other knowledge. Moreover, since the appearance of objects might vary widely, rigid object motion is often the most valuable feature. Thus, when considering the approach for general object detection, motion is a useful feature. A typical method to detect moving objects of interest starts with a segmentation of the image frame. Regions in the image frame sharing similar motion are merged in the second stage. The result is a motion-based segmentation of the video. In [93] a method is presented that segments a single video frame into independently moving visual objects.
The method follows a bottom-up approach, starting with a color-based decomposition of the frame. Regions are then merged based on their motion parameters via a statistical test, resulting in superior performance over other methods, e.g. [6, 159].

Specific objects can also be detected by analyzing the auditory layout segmentation of the video document. Typically, segments in the layout segmentation first need to be classified as environmental sounds. Subsequently, the environmental sounds are further analyzed for the presence of specific object sound patterns.


In [157, 166] for example, specific object sound patterns, e.g. dog barks, ringing telephones, and different musical instruments, are detected by selecting the appropriate auditory features.

Detecting objects in the textual modality also remains a challenging task. A logical intermediate step in detecting objects of interest in the textual modality is part-of-speech tagging. Though limited, the information we get from tagging is still quite useful. By extracting and analyzing the nouns in tagged text, for example, and applying chunking [1], one can make some assumptions about the objects present. To our knowledge chunking has not yet been used in combination with detection of objects in video documents. Its application, however, might prove to be a valuable extension to unimodal object detection.

Successful detection of objects is limited to specific examples. A generic object detector still forms the holy grail in video document analysis. Therefore, multimodal object detection seems interesting. It helps if objects of interest can be identified within different modalities. Then the specific visual appearance, the specific sound, and its mentioning in the accompanying textual data can yield the evidence for robust recognition.

Setting detection

For the detection of setting, motion is not so relevant, as the setting is usually static. Therefore, techniques from the field of content-based image retrieval can be used. See [129] for a complete overview of this field. By using for example key frames, those techniques can easily be used for video indexing. We focus here on methods that assign a setting label to the data, based on analysis of the visual, auditory, or textual modality.

In [137] images are classified as either indoor or outdoor, using three types of visual features: one for color, texture, and frequency information. Instead of computing features on the entire image, the authors use a multi-stage classification approach. First, sub-blocks are classified independently, and afterwards another classification is performed using the k-nearest neighbor classifier. Outdoor images are further classified into city and landscape images in [145]. Features used are color histograms, color coherence vectors, Discrete Cosine Transform (DCT) coefficients, edge direction histograms, and edge direction coherence vectors. Classification is done with a weighted k-nearest neighbor classifier with the leave-one-out method. Reported results indicate that the edge direction coherence vector has good discriminatory power for city vs. landscape. Furthermore, it was found that color can be an important cue in classifying natural landscape images into forests, mountains, or sunset/sunrise classes. By analyzing sub-blocks, the authors detect the presence of sky and vegetation in outdoor image frames in another paper. Each sub-block is independently classified, using a Bayesian classification framework, as sky vs. non-sky or vegetation vs. non-vegetation based on color, texture, and position features [144].

Detecting setting based on auditory information can be achieved by detecting specific environmental sound patterns. In [157] the authors reduce an auditory segment to a small set of parameters using various auditory features, namely loudness, pitch, brightness, bandwidth, and harmonicity. By using statistical techniques over the parameter space the authors accomplish classification and retrieval of several sound patterns including laughter, crowds, and water.
In [166] classes of natural and synthetic sound patterns are distinguished by using an HMM, based on timbre and rhythm. The authors are capable of classifying different environmental setting sound patterns, including applause, explosions, rain, river flow, thunder, and windstorm.

The transcript is used in [25] to extract geographic reference information for the video document. The authors match named places to their spatial coordinates. The process begins by using the text metadata as the source material to be processed. A known set of places along with their spatial coordinates, i.e. a gazetteer, is created to resolve geographic references. The gazetteer used consists of approximately 300 countries, states and administrative entities, and 17000 major cities worldwide. After post-processing steps, e.g. including related terms and removing stop words, the end result is a set of segments in a video sequence indexed with latitude and longitude.

We conclude that the visual and auditory modality are well suited for recognition of the environment in which the video document is situated. By using the textual modality, a more precise (geographic) location can be extracted. Fusion of the different modalities may provide the video document with semantically interesting setting terms such as "outside vegetation in Brazil near a flowing river", which can never be derived from one of the modalities in isolation.

Figure 6.2: Role of conversion and integration in multimodal video document analysis.

6.3 Multimodal analysis

After reconstruction of the layout and content elements, the next step in the inverse analysis process is analysis of the layout and content to extract the semantic index. At this point the modalities should be integrated. However, before analysis, it might be useful to apply modality conversion of some elements into a more appropriate form. The role of conversion and integration in multimodal video document analysis will be discussed in this section, and is illustrated in figure 6.2.

6.3.1 Conversion

For analysis, conversion of elements of the visual and auditory modalities to text is most appropriate. A typical component we want to convert from the visual modality is overlayed text. Video Optical Character Recognition (OCR) methods for detection of text in video frames can be divided into component-based, e.g. [125], and texture-based methods, e.g. [74]. A method utilizing the DCT coefficients of compressed video was proposed in [168]. By using Video OCR methods, the visual overlayed text object can be converted into a textual format.


The quality of the results of Video OCR varies, depending on the kind of characters used, their color, their stability over time, and the quality of the video itself.

From the auditory modality one typically wants to convert the uttered speech into transcripts. Available speech recognition systems are known to be mature for applications with a single speaker and a limited vocabulary. However, their performance degrades when they are used in real world applications instead of a lab environment [18]. This is especially caused by the sensitivity of the acoustic model to different microphones and different environmental conditions. Since conversion of speech into transcripts still seems problematic, integration with other modalities might prove beneficial. Note that other conversions are possible, e.g. computer animation can be viewed as converting text to video. However, these are relevant for presentation purposes only.

6.3.2 Integration

The purpose of integration of multimodal layout and content elements is to improve classification performance. To that end the addition of modalities may serve as a verification method, a method compensating for inaccuracies, or as an additional information source. An important aspect, indispensable for integration, is synchronization and alignment of the different modalities, as all modalities must have a common timeline. Typically the time stamp is used. We observe that in literature modalities are converted to a format conforming to the researcher's main expertise. When audio is the main expertise, image frames are converted to (milli)seconds, e.g. [55]. In [4, 34] image processing is the main expertise, and audio samples are assigned to image frames or camera shots. When a time stamp isn't available, a more advanced alignment procedure is necessary. Such a procedure is proposed in [60]. The error prone output of a speech recognizer is compared and aligned with the accompanying closed captions of news broadcasts. The method first finds matching sequences of words in the transcript and closed caption by performing a dynamic-programming based alignment between the two text strings. Segments are then selected when sequences of three or more words are similar in both resources.

To achieve the goal of multimodal integration, several approaches can be followed. We categorize those approaches by their distinctive properties with respect to the processing cycle, the content segmentation, and the classification method used. The processing cycle of the integration method can be iterated, allowing for incremental use of context, or non-iterated. The content segmentation can be performed by using the different modalities in a symmetric, i.e. simultaneous, or asymmetric, i.e. ordered, fashion. Finally, for the classification one can choose between a statistical or a knowledge-based approach. An overview of the different integration methods found in literature is given in table 6.1.

Most integration methods reported are symmetric and non-iterated. Some follow a knowledge-based approach for classification of the data into classes of the semantic index hierarchy [37, 90, 105, 119, 142]. In [142] for example, the auditory and visual modality are integrated to detect speech, silence, speaker identities, and no face / face / talking face shots using knowledge-based rules. First, talking people are detected by detecting faces in the camera shots; subsequently a knowledge-based measure is evaluated based on the amount of speech in the shot. Many methods in literature follow a statistical approach [4, 22, 33, 34, 55, 60, 61, 92, 120, 155].
3 Dynamic programming is a programming technique in which intermediate results in some iterative process are stored so they can be reused in later iterations, rather than recomputed.
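To make the footnote concrete, the following is a minimal sketch, in Python, of how such a dynamic-programming alignment between two word sequences could look; it is not the actual procedure of [60], and the example sentences and the threshold of three matching words are purely illustrative.

# Hedged sketch: find runs of three or more consecutive matching words between a
# transcript and a closed-caption word list, using a longest-common-run DP table.

def align_word_sequences(transcript, captions, min_run=3):
    """Return (start_in_transcript, start_in_captions, length) for maximal matching runs."""
    n, m = len(transcript), len(captions)
    # run[i][j] = length of the matching run ending at transcript[i-1], captions[j-1]
    run = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if transcript[i - 1].lower() == captions[j - 1].lower():
                run[i][j] = run[i - 1][j - 1] + 1
    matches = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            length = run[i][j]
            # keep only maximal runs that are long enough to be reliable anchors
            if length >= min_run and (i == n or j == m or run[i + 1][j + 1] == 0):
                matches.append((i - length, j - length, length))
    return matches

transcript = "the president said he will veto the bill today".split()
captions   = "president says he will veto the bill on monday".split()
print(align_word_sequences(transcript, captions))   # [(3, 2, 5)] for this toy example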


Table 6.1: An overview of different integration methods, categorizing [4, 8, 22, 33, 34, 37, 55, 60, 61, 90, 92, 105, 119, 120, 132, 142, 155] along three dimensions: content segmentation (symmetric vs. asymmetric), classification method (statistical vs. knowledge-based), and processing cycle (iterated vs. non-iterated).

An example of a symmetric, non-iterated statistical integration method is the Name-It system presented in [120]. The system associates detected faces and names by calculating a co-occurrence factor that combines the analysis results of face detection and recognition, name extraction, and caption recognition. A high co-occurrence factor indicates that a certain visual face template is often associated with a certain name in either the caption in the image or in the associated text, hence a relation between face and name can be concluded.

Hidden Markov Models are frequently used as a statistical classification method for multimodal integration [4, 33, 34, 55]. A clear advantage of this framework is that it is not only capable of integrating multimodal features, but also of including sequential features. Moreover, an HMM can also be used as a classifier combination method. When modalities are independent, they can easily be included in a product HMM. In [55] such a classifier is used to train two modalities separately, which are then combined symmetrically by computing the product of the observation probabilities. It is shown that this results in significant improvement over a unimodal approach. In contrast to the product HMM method, a neural network-based approach doesn't assume features are independent. The approach presented in [55] trains an HMM for each modality and category. A three-layer perceptron is then used to combine the outputs from each HMM in a symmetric and non-iterated fashion.

Another advanced statistical classifier for multimodal integration was recently proposed in [92]. A probabilistic framework for semantic indexing of video documents based on so-called multijects and multinets is presented. The multijects model content elements, which are integrated in the multinets to model the relations between objects, allowing for symmetric use of modalities. For the integration in the multinet the authors propose a Bayesian belief network [101], which is a probabilistic description of the relation between different variables. Significant improvement of detection performance is demonstrated. Moreover, the framework supports detection based on iteration. Viability of the Bayesian network as a symmetric integrating classifier was also demonstrated in [61]; however, that method doesn't support iteration.

In contrast to the above symmetric methods, an asymmetric approach is presented in [55]. A two-stage HMM is proposed which first separates the input video document into three broad categories based on the auditory modality; in the second stage another HMM is used to split those categories based on the visual modality.
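As an aside, the product combination itself is simple once per-modality likelihoods are available. The sketch below assumes hypothetical, already-computed log-likelihoods from an audio HMM and a visual HMM for a few classes, and combines them under the independence assumption (a product of probabilities, i.e. a sum of log-probabilities); it only illustrates the combination rule, not the system of [55].

# Hedged sketch: the per-class log-likelihoods are made-up numbers standing in for
# the outputs of two separately trained HMMs (one per modality).

def combine_product(audio_loglik, visual_loglik):
    """Combine per-class log P(obs | class) from two modalities under independence."""
    combined = {c: audio_loglik[c] + visual_loglik[c] for c in audio_loglik}
    best = max(combined, key=combined.get)
    return best, combined

audio  = {"news": -120.5, "commercial": -131.2, "sports": -128.9}
visual = {"news": -201.3, "commercial": -188.4, "sports": -210.7}

label, scores = combine_product(audio, visual)
print(label)    # "commercial" for these made-up numbers
print(scores)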


A drawback of this method is its application dependency, which may result in less effectiveness in other classification tasks. An asymmetric knowledge-based integration method, supporting iteration, was proposed in [8]. First, the visual and textual modality are combined to generate semantic index results. Those form the input for a post-processing stage that uses those indexes to search the visual modality for the specific time of occurrence of the semantic event.

For exploration of other integration methods, we again take a look in the field of content-based image retrieval. From this field methods are known that integrate the visual and textual modality by combining images with associated captions or HTML tags. Early reported methods used a knowledge base for integration, e.g. the Piction system [132]. This system uses modalities asymmetrically: it first analyzes the caption to identify the expected number of faces and their expected relative positions. Then a face detector is applied to a restricted part of the image; if no faces are detected an iteration step is performed that relaxes the thresholds. More recently, Latent Semantic Indexing (LSI) [31] has become a popular means for integration [22, 155]. LSI is symmetric and non-iterated and works by statistically associating related words to the conceptual context of the given document. In effect it relates documents that use similar terms, which for images are related to features in the image. Thus, it has a strong relation to co-occurrence based methods. In [22] LSI is used to capture text statistics in vector form from an HTML document. Words with specific HTML tags are given higher weights. In addition, the position of the words with respect to the position of the image in the document is also accounted for. The image features, that is the color histogram and the dominant orientation histogram, are also captured in vector form, and combined they form a unified vector that the authors use for content-based search of a WWW-based image database. Reported experiments show that maximum improvement was achieved when both visual and textual information are employed.

In conclusion, video indexing results improve when a multimodal approach is followed, not only because of enhancement of content findings, but also because more information is available. Most methods integrate in a symmetric and non-iterated fashion. Usage of incremental context by means of iteration can be a valuable addition to the success of the integration process. Usage of combined statistical classifiers in the multimodal video indexing literature is still scarce, though various successful statistical methods for classifier combination are known, e.g. bagging, boosting, or stacking [57]. So results can probably be improved even more substantially when advanced classification methods from the field of statistical pattern recognition, or other disciplines, are used, preferably in an iterated fashion.
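To illustrate the core of LSI (not the specific system of [22]), the following minimal sketch applies a truncated SVD to a tiny, made-up term-document matrix and compares documents in the resulting low-dimensional concept space; the terms, counts, and the choice of two dimensions are purely illustrative.

import numpy as np

def lsi_document_vectors(term_doc, k=2):
    """Project the columns (documents) of a term-document matrix into a k-dim concept space."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k]).T          # one k-dimensional vector per document

# rows: terms "goal", "match", "market", "stocks"; columns: four toy documents
A = np.array([[2, 0, 1, 0],
              [1, 0, 2, 0],
              [0, 3, 0, 1],
              [0, 2, 0, 2]], dtype=float)

docs = lsi_document_vectors(A, k=2)
for j in range(1, docs.shape[0]):
    cos = docs[0] @ docs[j] / (np.linalg.norm(docs[0]) * np.linalg.norm(docs[j]) + 1e-12)
    print(f"similarity(doc0, doc{j}) = {cos:.3f}")   # doc2, which shares the sports terms, scores highest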

6.4 Semantic video indexes

The methodologies described in section 6.3 have been applied to extract a variety of the different video indexes described in subsection 2.5.1. In this section we systematically report on the different indexes and the information from which they are derived. As methods for extraction of purpose are not mentioned in the literature, this level is excluded. Figure 6.3 presents an overview of all indexes and the methods in the literature which can derive them.

6.4.1 Genre

Editing is an important stylistic element because it affects the overall rhythm of the video document [12]. Hence, layout related statistics are well suited for indexing a video document into a specific genre.


The most obvious element of this editorial style is the average shot length. Generally, the longer the shots, the slower the rhythm of the video document. The rate of shot changes together with the presence of black frames is used in [53] to detect commercials within news broadcasts. The rationale behind detection of black frames is that they are often broadcast for a fraction of a second before, after, and between commercials. However, black frames can also occur for other reasons. Therefore, the authors use the observation that advertisers try to make commercials more interesting by rapidly cutting between different shots, resulting in a higher shot change rate. A similar approach is followed in [75] for detecting commercials within broadcast feature films. Besides the detection of monochrome frames and shot change rate, the authors use the edge change ratio and motion vector length to capture the high action in commercials. Average shot length, the percentage of different types of edit transitions, and six visual content features are used in [141] to classify a video document into the cartoon, commercial, music, news, and sports genres. As a classifier a specific decision tree called C4.5 is used [67], which can work both on real and symbolic values.

In [33] the observation is made that different genres exhibit different temporal patterns of face locations. They furthermore observe that the temporal behavior of overlaid text is genre dependent. In fact the following genre-dependent functions of text can be identified:

News: annotation of people, objects, setting, and named events;

Sports: player identification, game related statistics;

Movies/TV series: credits, captions, and language translations;

Commercials: product name, claims, and disclaimers.

Based on results of face and text tracking, each frame is assigned one of 15 labels, describing variations on the number of appearing faces and/or text lines together with the distance of a face to the camera. These labels form the input for an HMM, which classifies an input video document into news, commercials, sitcoms, and soaps based on maximum likelihood.

Detection of generic sport video documents seems almost impossible due to the large variety in sports. In [66], however, a method is presented that is capable of identifying mainstream sports videos. Discriminating properties of sport videos are the presence of slow-motion replays, large amounts of overlaid text, and specific camera/object motion. The authors propose a set of eleven features to capture these properties, and obtain 93% accuracy using a decision tree classifier. Analysis showed that motion magnitude and direction of motion features yielded the best results.

Methods for indexing video documents into a specific genre using a multimodal approach are reported in [37, 55, 61]. In [55] news reports, weather forecasts, commercials, basketball, and football games are distinguished based on audio and visual information. The authors compare different integration methods and classifiers and conclude that a product HMM classifier is most suited for their task, see also 6.3.2. The same modalities are used in [37]. The authors present a three-step approach. In the first phase, content features such as color statistics, motion vectors, and audio statistics are extracted. Secondly, layout features are derived, e.g. shot lengths, camera motion, and speech vs. music. Finally, a style profile is composed and an educated guess is made as to the genre to which a shot belongs.
They report promising results by combining different layout and content attributes of video for analysis, and can find five (sub)genres, namely news broadcasts, car racing, tennis, commercials, and animated cartoons.
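As an illustration of the kind of classifier used in work like [141] (though with entirely hypothetical feature values and a generic scikit-learn tree rather than C4.5), a decision tree can be trained directly on layout statistics such as average shot length and shot-change rate:

from sklearn.tree import DecisionTreeClassifier

# Hedged sketch: the feature values and genres below are made up for illustration.
# columns: [average shot length (s), cuts per minute, fraction of gradual transitions]
X = [
    [2.1, 28.0, 0.05],   # commercial: short shots, fast cutting
    [1.8, 33.0, 0.10],
    [6.5,  9.0, 0.02],   # news: longer anchor shots
    [7.2,  8.0, 0.03],
    [4.0, 15.0, 0.30],   # music clip: many gradual transitions
    [3.5, 17.0, 0.35],
]
y = ["commercial", "commercial", "news", "news", "music", "music"]

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(clf.predict([[2.5, 25.0, 0.08]]))   # likely "commercial" given these toy examples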


Besides auditory and visual information, [61] also exploits the textual modality. The segmentation and indexing approach presented uses three layers to process low-, mid-, and high-level information. At the lowest level features such as color, shape, MFCC, ZCR, and the transcript are extracted. Those are used in the mid-level to detect faces, speech, keywords, etc. At the highest level the semantic index is extracted through the integration of mid-level features across the different modalities, using Bayesian networks, as noted in subsection 6.3.2. In its current implementation the presented system classifies segments as either part of a talk show, a commercial, or financial news.

6.4.2 Sub-genre

Research on indexing sub-genres, or specific instances of a genre, has been geared mainly towards sport videos [37, 55, 114] and commercials [27]. Obviously, future index techniques may also extract other sub-genres, for example westerns, comedies, or thrillers within the feature film genre.

Four sub-genres of sport video documents are identified in [114]: basketball, ice hockey, soccer, and volleyball. The full motion fields in consecutive frames are used as a feature. To reduce the feature space, Principal Component Analysis is used. For classification two different statistical classifiers were applied. It was found that a continuous observation density Markov model gave the best results. The sequences analyzed were post-edited to contain only the play of the sports, which is a drawback of the presented system: for instance, no crowd scenes or time outs were included. Some sub-genres of sport video documents are also detected in [37, 55], as noted in section 6.4.1.

An approach to index commercial videos based on semiotic and semantic properties is presented in [27]. The general field of semiotics is the study of signs and symbols, what they mean and how they are used. For indexing of commercials the semiotic approach classifies commercials into four different sub-genres that relate to the narrative of the commercial. The following four sub-genres are distinguished: practical, critical, utopic, and playful commercials. Perceptual features, e.g. saturated colors, horizontal lines, and the presence or absence of recurring colors, are mapped onto the semiotic categories. Based on research in the marketing field, the authors also formalized a link between editing, color, and motion effects on the one hand, and feelings that the video arouses in the observer on the other. Characteristics of a commercial are related to those feelings and have been organized in a hierarchical fashion. A main classification is introduced between commercials that induce feelings of action and those that induce feelings of quietness. The authors subdivide action further into suspense and excitement. Quietness is further specified into relaxation and happiness.

6.4.3 Logical units

Detection of logical units in video documents is extensively researched with respect to the detection of scenes or Logical Story Units (LSU) in feature films and sitcoms. An overview and evaluation of such methods is presented in [148]. A summary of that paper follows. After that we consider how to give the LSU a proper label.

Logical story unit detection

In cinematography an LSU is defined as a series of shots that communicate a unified action with a common locale and time [13]. Viewers perceive the meaning of a video at the level of LSUs [14, 113].


A problem for LSU segmentation using visual similarity is that it seems to conflict with its definition based on the semantic notion of common locale and time. There is no one-to-one mapping between the semantic concepts and the data-driven visual similarity. In practice, however, most LSU boundaries coincide with a change of locale, causing a change in the visual content of the shots. Furthermore, usually the scenery in which an LSU takes place does not change significantly, or foreground objects will appear in several shots, e.g. talking heads in the case of a dialogue. Therefore, visual similarity provides a proper base for common locale.

There are two complicating factors regarding the use of visual similarity. Firstly, not all shots in an LSU need to be visually similar. For example, one can have a sudden close-up of a glass of wine in the middle of a dinner conversation showing talking heads. This problem is addressed by the overlapping links approach [50], which assigns visually dissimilar shots to an LSU based on temporal constraints. Secondly, at a later point in the video, time and locale from one LSU can be repeated in another, not immediately succeeding LSU. The two complicating factors apply to the entire field of LSU segmentation based on visual similarity. Consequently, an LSU segmentation method using visual similarity depends on the following three assumptions:

Assumption 1 The visual content in an LSU is dissimilar from the visual content in a succeeding LSU.

Assumption 2 Within an LSU, shots with similar visual content are repeated.

Assumption 3 If two shots x and y are visually similar and assigned to the same LSU, then all shots between x and y are part of this LSU.

For parts of a video where the assumptions are not met, segmentation results will be unpredictable. Given the assumptions, LSU segmentation methods using visual similarity can be characterized by two important components, viz. the shot distance measurement and the comparison method. The former determines the (dis)similarity mentioned in assumptions 1 and 2. The latter component determines which shots are compared in finding LSU boundaries. Both components are described in more detail.

Shot distance measurement. The shot distance represents the dissimilarity between two shots and is measured by combining (typically multiplying) measurements for the visual distance δ_v and the temporal distance δ_t. The two distances will now be explained in detail.

Visual distance measurement consists of a dissimilarity function δ_v^f for a visual feature f measuring the distance between two shots. Usually a threshold τ_v^f is used to determine whether two shots are close or not. δ_v^f and τ_v^f have to be chosen such that the distance between shots in an LSU is small (assumption 2), while the distance between shots in different LSUs is large (assumption 1). Segmentation methods in the literature do not depend on specific features or dissimilarity functions, i.e. the features and dissimilarity functions are interchangeable amongst methods.

Temporal distance measurement consists of a temporal distance function δ_t. As observed before, shots from not immediately succeeding LSUs can have similar content. Therefore, it is necessary to define a time window τ_t, determining which shots in a video are available for comparison. The value for τ_t, expressed in shots or frames, has to be chosen such that it resembles the length of an LSU. In practice, the value has to be estimated since LSUs vary in length. Function δ_t is either binary or continuous.
A binary δ_t results in 1 if two shots are less than τ_t shots or frames apart and infinity otherwise [162]. A continuous δ_t reflects the distance between two shots more precisely. In [113], δ_t ranges from 0 to 1.


As a consequence, the further two shots are apart in time, the closer the visual distance has to be to assign them to the same LSU. The time window τ_t is still used to mark the point after which shots are considered dissimilar. The shot distance is then set to infinity regardless of the visual distance.

The comparison method is the second important component of LSU segmentation methods. In sequential iteration, the distance between a shot and other shots is measured pair-wise. In clustering, shots are compared group-wise. Note that in the sequential approach still many comparisons can be made, but always of one pair of shots at a time.

Methods from the literature can now be classified according to the framework. The visual distance function is not discriminatory, since it is interchangeable amongst methods. Therefore, the two discriminating dimensions for classification of methods are the temporal distance function and the comparison method. Their names in the literature and references to methods are given in table 6.2. Note that in [65] two approaches are presented.

Table 6.2: Classification of LSU segmentation methods.

                                 Temporal distance function
  Comparison method    binary                            continuous
  sequential           overlapping links                 continuous video coherence
                       [50], [68], [71], [135], [77]     [3], [65], [24]
  clustering           time constrained clustering       time adaptive grouping
                       [162], [76], [115]                [113], [65], [150]
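To make the two components concrete, here is a minimal sketch of a shot-distance computation in the spirit described above; the histogram feature, the intersection-based dissimilarity, and the window of 20 shots are all assumptions for illustration, not the choices of any particular method in table 6.2.

import math

def visual_distance(hist_a, hist_b):
    """Histogram intersection turned into a dissimilarity in [0, 1]."""
    inter = sum(min(a, b) for a, b in zip(hist_a, hist_b))
    return 1.0 - inter / (sum(hist_a) + 1e-12)

def shot_distance(shot_a, shot_b, tau_t=20, continuous=True):
    """shot_x = (shot index, color histogram); combine visual and temporal distance."""
    gap = abs(shot_a[0] - shot_b[0])
    if gap > tau_t:
        return math.inf                        # outside the time window: never merged
    d_v = visual_distance(shot_a[1], shot_b[1])
    d_t = gap / tau_t if continuous else 1.0   # continuous vs. binary temporal distance
    return d_v * d_t

shot1 = (3,  [10, 4, 1, 0])
shot2 = (7,  [9, 5, 2, 0])
shot3 = (40, [10, 4, 1, 0])
print(shot_distance(shot1, shot2))   # small: the shots may belong to the same LSU
print(shot_distance(shot1, shot3))   # inf: too far apart in time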

Labelling logical units

Detection of LSU boundaries alone is not enough. For indexing, we are especially interested in the accompanying label. A method that is capable of detecting dialogue scenes in movies and sitcoms is presented in [4]. Based on audio analysis, face detection, and face location analysis the authors generate output labels which form the input for an HMM. The HMM labels a scene as either an establishing scene, a transitional scene, or a dialogue scene. According to the results presented, combined audio and face information gives the most consistent performance across different observation sets and training data. However, in its current design, the method is incapable of differentiating between dialogue and monologue scenes.

A technique to characterize and index violent scenes in general TV drama and movies is presented in [90]. The authors integrate cues from both the visual and auditory modality symmetrically. First, a measure of activity for each video shot is computed as a measure of action. This is combined with detection of flames and blood using a predefined color table. The corresponding audio information provides supplemental evidence for the identification of violent scenes. The focus is on the abrupt change in energy level of the audio signal, computed using the energy entropy criterion. As a classifier the authors use a knowledge-based combination of feature values on scene level.

By utilizing a symmetric and non-iterated multimodal integration method, four different types of scenes are identified in [119]. The audio signal is segmented into silence, speech, music, and miscellaneous sounds. This is combined with a visual similarity measure, computed within a temporal window.


Dialogues are then detected based on the occurrence of speech and an alternating pattern of visual labels, indicating a change of speaker. When the visual pattern exhibits a repetition the scene is labeled as story. When the audio signal isn't labeled as speech, and the visual information exhibits a sequence of visually non-related shots, the scene is labeled as action. Finally, scenes that don't fit in the aforementioned categories are indexed as generic scenes. In contrast to [119], a unimodal approach based on the visual information source is used in [163] to detect dialogues, actions, and story units. Shots that are visually similar and temporally close to each other are assigned the same (arbitrary) label. Based on the patterns of labels in a scene, it is indexed as either dialogue, action, or story unit. A scheme for reliably identifying logical units, which clusters sensor shots according to detected dialogues, similar settings, or similar audio, is presented in [105]. The method starts by calculating specific features for each camera and microphone shot. Auditory, color, and orientation features are supported, as well as face detection. Next a Euclidean metric is used to determine the distance between shots with respect to the features, resulting in a so-called distance table. Based on the distance tables, shots are merged into logical units using absolute and adaptive thresholds.

News broadcasts are far more structured than feature films. Researchers have exploited this to classify logical units in news video using a model-based approach. Especially anchor shots, i.e. shots in which the newsreader is present, are easy to model and therefore easy to detect. Since there is only minor body movement, they can be detected by comparison of the average difference between (regions in) successive frames, which will be minimal. This observation is used in [45, 124, 165]. In [45, 124] the restricted position and size of detected faces is also used. Another approach for the detection of anchor shots is taken in [10, 49, 56], where the repetition of visually similar anchor shots throughout the news broadcast is exploited. To refine the classification of the similarity measure used, [10] requires anchor shot candidates to have a motion quantity below a certain threshold. Each shot is classified as either anchor or report. Moreover, textual descriptors are added based on extracted captions and recognized speech. To classify report and anchor shots, the authors in [56] use face and lip movement detection. To distinguish anchor shots, the aforementioned classification is extended with the knowledge that anchor shots are graphically similar and occur frequently in a news broadcast. The largest cluster of similar shots is therefore assigned to the class of anchor shots. Moreover, the detection of a title caption is used to detect anchor shots that introduce a new topic. In [49] anchor shots are detected together with silence intervals to indicate report boundaries. Based on a topics database the presented system finds the most probable topic per report by analyzing the transcribed speech. As opposed to [10, 56], final descriptions are not added to shots, but to a sequence of shots that constitute a complete report on one topic. This is achieved by merging consecutive segments with the same topic in their list of most probable topics. Besides the detection of anchor persons and reports, other logical units can be identified.
In [34] six main logical units for TV broadcast news are distinguished, namely begin, end, anchor, interview, report, and weather forecast. Each logical unit is represented by an HMM. For each frame of the video one feature vector is calculated consisting of 25 features, including motion and audio features. The resulting feature vector sequence is assigned to a logical unit based on the sequence of HMMs that maximizes the probability of having generated this feature vector sequence. By using this approach, parsing and indexing of the video is performed in one pass through the video only. Other examples of highly structured TV broadcasts are talk and game shows.


In [62] a method is presented that detects guest and host shots in those video documents. The basic observation used is that in most talk shows the same person is host for the duration of the program, but guests keep on changing. Also, host shots are typically shorter since only the host asks questions. For a given show, the key frames of the N shortest shots containing one detected face are correlated in time to find the shot most often repeated. The key host frame is then compared against all key frames to detect all similar host shots, and guest shots.

In [160] a model for segmenting soccer video into the logical units break and play is given. A grass-color ratio is used to classify frames into three views according to video shooting scale, namely global, zoom-in, and close-up. Based on segmentation rules, the different views are mapped onto the logical units: global views are classified as play and close-ups as breaks if they have a minimum length. Otherwise a neighborhood voting heuristic is used for classification.

6.4.4 Named events

Named events are at the lowest level in the semantic index hierarchy. For their detection different techniques have been used. A three-level event detection algorithm is presented in [47]. The first level of the algorithm extracts generic color, texture, and motion features, and detects spatio-temporal objects. The mid-level employs a domain-dependent neural network to verify whether the detected objects belong to conceptual objects of interest. The generated shot descriptors are then used by a domain-specific inference process at the third level to detect the video segments that contain events of interest. To test the effectiveness of the algorithm the authors applied it to detect animal hunt events in wildlife documentaries.

Violent events and car chases in feature films are detected in [86], based on analysis of environmental sounds. First, low-level sounds such as engines, horns, explosions, or gunfire are detected, which constitute part of the high-level sound events. Based on the dominance of those low-level sounds in a segment, it is labeled with a high-level named event.

Walking shots, gathering shots, and computer graphics shots in broadcast news are the named events detected in [56]. A walking shot is classified by detecting the characteristic repetitive up and down movement of the bottom of a facial region. When more than two similarly sized facial regions are detected in a frame, a shot is classified as a gathering shot. Finally, computer graphics shots are classified by a total lack of motion in a series of frames.

The observation that authors use lighting techniques to intensify the drama of certain scenes in a video document is exploited in [140]. An algorithm is presented that detects flashlights, which are used as an identifier for dramatic events in feature films, based on features derived from the average frame luminance and the frame area influenced by the flashing light. Five types of dramatic events are identified that are related to the appearance of flashlights, i.e. supernatural power, crisis, terror, excitement, and generic events of great importance. Whereas a flashlight can indicate a dramatic event in feature films, slow-motion replays are likely to indicate semantically important events in sport video documents. In [97] a method is presented that localizes such events by detecting slow-motion replays. The slow-motion segments are modelled and detected by an HMM.

One of the most important events in a sport video document is a score. In [8] a link between the visual and textual modalities is made to identify events that change the score in American football games. The authors investigate whether a chain of keywords, corresponding to an event, is found in the closed caption stream or not. In the time frames corresponding to those keywords, the visual stream is analyzed.

Key frames of camera shots in the visual stream are compared with predefined templates using block matching based on the color distribution. Finally, the shot is indexed by the most likely score event, for example a touchdown. Besides American football, methods for detecting events in tennis [84, 134, 167], soccer [16, 44], baseball [111, 167], and basketball [121, 169] are reported in the literature. Commonly, the methods presented exploit domain knowledge and simple (visual) features related to color, edges, and camera/object motion to classify typical sport-specific events, e.g. smashes, corner kicks, and dunks, using a knowledge-based classifier. An exception to this common approach is [111], which presents an algorithm that identifies highlights in baseball video by analyzing the auditory modality only. Highlight events are identified by detecting excited speech of the commentators and the occurrence of a baseball pitch and hit.

Besides semantic indexing, detection of named events also forms a great resource for reuse of video documents. Specific information can be retrieved and reused in different contexts, or reused to automatically generate summaries of video documents. This seems especially interesting for, but is not limited to, video documents from the sport genre.

[Figure 6.3: Semantic index hierarchy with instances as found in literature. The top level splits the video document into entertainment, information, and communication. Below that, from top to bottom, are instances of genre (talk show, music, sport, feature film, cartoon, sitcom, soap, documentary, news, commercial), sub-genre (basketball, ice hockey, soccer, volleyball, tennis, car racing, football, wildlife, financial, semiotic, semantic), logical units (guest, host, play, break, dialogue, story, action, anchor, interview, report, weather), and named events (sport events, highlights, dramatic events, car chase, violence, hunts, walking, gathering, graphical). A dashed box is used to group similar nodes. Note: this picture is taken from another paper; the numbers indicated as references are not correct.]

6.4.5 Discussion

Now that we have described the different semantic indexing techniques, as encountered in the literature, we are able to distinguish the most prominent content and layout properties per genre. As variation in the textual modality is in general too diverse for differentiation of genres, and more suited to attach semantic meaning to logical units and named events, we focus here on properties derived from the visual and auditory modality only. Though a large number of genres can be distinguished, we limit ourselves to the ones mentioned in the semantic index hierarchy in figure 6.3, i.e. talk show, music, sport, feature film, cartoon, sitcom, soap, documentary, news, and commercial. For each of those genres we describe the characteristic properties.


The most prominent property of the first genre, i.e. talk shows, is their well-defined structure, uniform setting, and prominent presence of dialogues, featuring mostly non-moving frontal faces talking close to the camera. Besides closing credits, there is in general a limited use of overlaid text.

Whereas talk shows have a well-defined structure and limited setting, music clips show great diversity in setting and mostly have an ill-defined structure. Moreover, music clips will have many short camera shots, showing lots of camera and object motion, separated by many gradual transition edits, and long microphone shots containing music. The use of overlaid text is mostly limited to information about the performing artist and the name of the song on a fixed position.

Sport broadcasts come in many different flavors, not only because there exists a tremendous amount of sport sub-genres, but also because they can be broadcast live or in summarized format. Despite this diversity, most authored sport broadcasts are characterized by a voice-over reporting on named events in the game, a watching crowd, a high frequency of long camera shots, and overlaid text showing game and player related information on a fixed frame position. Usually sport broadcasts contain a vast amount of camera motion, objects, and players within a limited uniform setting. Structure is sport-specific, but in general a distinction between different logical units can be made easily. Moreover, a typical property of sport broadcasts is the use of replays showing events of interest, commonly introduced and ended by a gradual transition edit.

Feature film, cartoon, sitcom, and soap share similar layout and content properties. They are all dominated by people (or toons) talking to each other or taking part in action scenes. They are structured by means of scenes. The setting is mostly limited to a small number of locales, sometimes separated by means of visual, e.g. gradual, or auditory, e.g. music, transition edits. Moreover, the setting in cartoons is characterized by the usage of saturated colors; also the audio in cartoons is almost noise-free due to studio recording of speech and special effects. For all mentioned genres the usage of overlaid text is limited to opening and/or closing credits. Feature film, cartoon, sitcom, and soap differ with respect to people appearance, usage of special effects, presence of object and camera motion, and shot rhythm. Appearing people are usually filmed frontally in sitcoms and soaps, whereas in feature films and cartoons there is more diversity in the appearance of people or toons. Special effects are most prominent in feature films and cartoons; laughter of an imaginary public is sometimes added to sitcoms. In sitcoms and soaps there is limited camera and object motion. In general cartoons also have limited camera motion, though object motion appears more frequently. In feature films both camera and object motion are present. With respect to shot rhythm it seems legitimate to state that this has stronger variation in feature films and cartoons. The perceived rhythm will be slowest for soaps, resulting in more frequent use of camera shots with relatively long duration.

Documentaries can also be characterized by their slow rhythm. Another typical property of this genre is the dominant presence of a voice-over narrating about the content in long microphone shots. Motion of camera and objects might be present in a documentary; the same holds for overlaid text. Mostly there is no well-defined structure. Special effects are seldom used in documentaries.

The most obvious property of news is its well-defined structure.
Different news reports and interviews are alternated with anchor persons introducing, and narrating about, the different news topics. A news broadcast is commonly ended by a weather forecast. These logical units are mostly dominated by monologues, e.g. people talking in front of a camera showing little motion. Overlaid text is frequently used on fixed positions for annotation of people, objects, setting, and named events. A report on an incident may contain camera and object motion. Similarity of studio setting is also a prominent property of news broadcasts, as is the abrupt nature of transitions between sensor shots.


Some prominent properties of the final genre, i.e. commercials, are similar to those of music. They have a great variety in setting and share no common structure, although they are authored carefully, as the message of the commercial has to be conveyed in twenty seconds or so. Frequent usage of abrupt and gradual transitions, in both the visual and auditory modality, is responsible for the fast rhythm. Usually lots of object and camera motion, in combination with special effects, such as a loud volume, are used to attract the attention of the viewer. The difference with music is that black frames are used to separate commercials, the presence of speech, the superfluous and non-fixed use of overlaid text, a disappearing station logo, and the fact that commercials usually end with a static frame showing the product or brand of interest.

Due to the large variety in broadcasting formats, which is a consequence of guidance by different authors, it is very difficult to give a general description of the structure and characterizing properties of the different genres. When considering sub-genres this only becomes more difficult. Is a sports program showing highlights of today's sport matches a sub-genre of sport or news? Reducing the prominent properties of broadcasts to instances of layout and content elements, and splitting the broadcasts into logical units and named events, seems a necessary intermediate step to arrive at a more consistent definition of genre and sub-genre. More research on this topic is still necessary.

6.5 Conclusion

Viewing a video document from the perspective of its author enabled us to present a framework for multimodal video indexing. This framework formed the starting point for our review of different state-of-the-art video indexing techniques. Moreover, it allowed us to answer the three different questions that arise when assigning an index to a video document. The question what to index was answered by reviewing different techniques for layout reconstruction. We presented a discussion on reconstruction of content elements and integration methods to answer the how to index question. Finally, the which index question was answered by naming different present and future index types within the semantic index hierarchy of the proposed framework. At the end of this review we stress that multimodal analysis is the future. However, more attention, in the form of research, needs to be given to the following factors:

1. Content segmentation
Content segmentation forms the basis of multimodal video analysis. In contrast to layout reconstruction, which is largely solved, there is still a lot to be gained in improved segmentation for the three content elements, i.e. people, objects, and setting. Contemporary detectors are well suited for detection and recognition of content elements within certain constraints. Most methods for detection of content elements still adhere to a unimodal approach. A multimodal approach might prove to be a fruitful extension, as it allows additional context to be taken into account. Bringing the semantic index to a higher level is the ultimate goal for multimodal analysis. This can be achieved by the integrated use of different robust content detectors or by choosing a constrained domain that ensures the best detection performance for a limited detector set.

2. Modality usage
Within the research field of multimodal video indexing, focus is still geared too much towards the visual and auditory modality.


The semantically rich textual modality is largely ignored in combination with the visual or auditory modality. Specific content segmentation methods for the textual modality will have their reflection on the semantic index derived. Ultimately this will result in semantic descriptions that make a video document as accessible as a text document.

3. Multimodal integration
The integrated use of different information sources is an emerging trend in video indexing research. All reported integration methods indicate an improvement of performance. Most methods integrate in a symmetric and non-iterated fashion. Usage of incremental context by means of iteration can be a valuable addition to the success of the integration process. The most successful integration methods reported are based on the HMM and Bayesian network frameworks, which can be considered the current state-of-the-art in multimodal integration. There seems to be a positive correlation between usage of advanced integration methods and multimodal video indexing results. This paves the road for the exploration of classifier combinations from the field of statistical pattern recognition, or other disciplines, within the context of multimodal video indexing.

4. Technique taxonomy
We presented a semantic index hierarchy that groups the different index types found in the literature. Moreover, we characterized the different genres in terms of their most prominent layout and content elements, and by splitting their structure into logical units and named events. What the field of video indexing still lacks is a taxonomy of different techniques that indicates why a specific technique is best suited, or unsuited, for a specific group of semantic index types.

The impact of the above mentioned factors on automatic indexing of video documents will not only make the process more efficient and more effective than it is now, it will also yield richer semantic indexes. This will form the basis for a range of new innovative applications.

Keyterms in this chapter


Shot boundary detection, edit detection, layout reconstruction, people detection, object detection, setting detection, conversion, synchronization, multimodal integration, iterated/non-iterated integration, symmetric/asymmetric integration, statistics- or knowledge-based integration, semantic video index, Logical Story Unit detection, overlapping links, visual distance, temporal distance, logical unit labelling, named event detection


Chapter 7

Data mining

Having stored all data and its classifications in the database, we can explore the database for interesting patterns, non-obvious groupings, and the like. This is the field known as data mining. In [156] four different classes of data mining are distinguished:

Classification learning: a learning scheme that takes a set of classified examples from which it is expected to learn a way of classifying unseen instances.

Association learning: learning any association between features, not just ones that predict the class label.

Clustering: grouping of instances that belong together as they share common characteristics.

Numeric prediction: predicting the outcome of an attribute of an instance, where the attribute is not discrete, but a numeric quantity.

7.1 Classification learning

This type of data mining we already encountered in chapter 5. Hence, this could in principle be done at data import time. It can still be advantageous to also do it after data import. In the data-mining process, using any of the techniques described in the following sections might lead to new knowledge and new class definitions. Hence, new relations between the data and conceptual labels could be learned. Furthermore, if a user has used manual annotation to label the multimedia items in the database with concepts from an ontology, we can see whether automatic classification could be done based on the data. Finally, if the user would interact with the system in the data-mining process by labelling a limited number of multimedia items, we could learn automatic classification methods by learning from these examples. This is commonly known as active learning. The output of classification learning could be a decision tree or a classification rule. Looking at the rules which are generated gives us a perspective on the data and helps in understanding the relations between data, attributes, and concepts.
1 This chapter is mostly based on [156]; implementations of the techniques can be found at http://www.cs.waikato.ac.nz/~ml/weka/index.html


[Figure 7.1: Architecture of the data-mining modules, relating data and its datatypes (nominal, ordinal, interval, ratio), machine learning techniques (classification, association rules, regression), knowledge representations (trees, instance-based representations, clusters), and new knowledge.]

7.2 Association learning

Association learning has a strong connection to classification learning, but rather than predicting the class, any attribute or set of attributes can be predicted. As an example, consider a set of analyzed broadcasts and viewing statistics indicating how many people watched the program. Some associations that might be learned would be:

if (commercial_starts) then (number-of-people-zapping="high")

if (commercial_starts AND (previous-program=documentary)) then (number-of-people-zapping="low")

When performing association learning we should be aware that we can always find an association between any two attributes. The target of course is to find the ones that are meaningful. To that end two measures have been defined to measure the validity of a rule:

Coverage: the number of instances which are correctly predicted by the rule.

Accuracy: the proportion of correctly predicted instances with respect to the total number of instances to which the rule applies.

Of course these two are conflicting. The smaller the number of instances the rule applies to, the easier it is to generate rules with sufficient accuracy. In the limit, one could write a separate rule for each instance, leading to an accuracy of 100% but a coverage of 1 for each rule. Practical association rule learning methods first find a large set of rules with sufficient coverage and from there optimize the set of rules to improve the accuracy.
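As a small illustration of these two measures (with hypothetical attribute names and made-up viewing records, not data from any cited study), the coverage and accuracy of a single rule can be computed as follows:

# Hedged sketch: toy broadcast records and a single association rule.
broadcasts = [
    {"commercial_starts": True,  "previous_program": "documentary", "zapping": "low"},
    {"commercial_starts": True,  "previous_program": "news",        "zapping": "high"},
    {"commercial_starts": True,  "previous_program": "sports",      "zapping": "high"},
    {"commercial_starts": False, "previous_program": "news",        "zapping": "low"},
]

def rule_stats(instances, antecedent, consequent):
    """antecedent/consequent are predicates over a single instance."""
    applies = [x for x in instances if antecedent(x)]
    correct = [x for x in applies if consequent(x)]
    coverage = len(correct)                                  # correctly predicted instances
    accuracy = len(correct) / len(applies) if applies else 0.0
    return coverage, accuracy

# rule: if commercial_starts then zapping = "high"
cov, acc = rule_stats(broadcasts,
                      lambda x: x["commercial_starts"],
                      lambda x: x["zapping"] == "high")
print(cov, acc)   # coverage 2, accuracy 2/3 for this toy data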


[Figure 7.2: Illustration of the initial steps in clustering based on the Minimum Spanning Tree, with data points plotted in a two-dimensional feature space (Feat 1 versus Feat 2). Note: two colors are used to indicate the notion of classes; however, at this point in the analysis no classes have been defined yet.]

7.3 Clustering

We now turn our attention to the case where we want to find categories in a large set of patterns without having a supervisor labelling the elements of the different classes. Although again there are many different methods, we only consider two, namely k-means and Minimum Spanning Tree based clustering.

In k-means the system has to be initialized by indicating a set of k representative samples, called seeds, one for each of the expected categories. If such seeds cannot be selected, the user should at least indicate how many categories are expected; the system then automatically selects k seeds for initialization. Now every pattern is assigned to the nearest seed. This yields an initial grouping of the data. For each of the groups the mean is selected as the new seed for that particular cluster. Then the above process of assignment to the nearest seed and finding the mean is repeated till the clusters become stable.

The Minimum Spanning Tree approach considers the two patterns which are closest to each other and connects them. Then the process is repeated for all other pairs of points till all patterns are connected. This leads to a hierarchical description of the data elements. One can use any level in the tree to decompose the dataset into clusters. In figure 7.2 the process is illustrated.
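The k-means loop just described fits in a few lines. The following is a minimal sketch on made-up 2-D feature vectors (the points, k = 2, and the iteration cap are illustrative choices only):

import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100):
    seeds = random.sample(points, k)          # initial seeds picked from the data
    assignment = None
    for _ in range(iterations):
        # assign every pattern to its nearest seed
        new_assignment = [min(range(k), key=lambda c: dist2(p, seeds[c])) for p in points]
        if new_assignment == assignment:      # clusters are stable
            break
        assignment = new_assignment
        # recompute each cluster mean as the new seed
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                seeds[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return seeds, assignment

points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (8.1, 7.9), (7.8, 8.0)]
print(kmeans(points, k=2))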

7.4 Numeric prediction

Our final data mining algorithm considers numeric prediction. Whereas classification rules and association rules mostly deal with nominal or ordinal attributes, numeric prediction deals with numeric quantities. The most common techniques come directly from statistics and are known there as linear regression methods. Let's assume you want to predict some numeric attribute x. In linear regression we assume that for an instance i the attribute x^(i) can be predicted by taking a linear combination of the values of k other attributes a_1^(i), ..., a_k^(i). Thus,

x^(i) = w_0 + w_1 a_1^(i) + w_2 a_2^(i) + ... + w_k a_k^(i)

where the w_j are weights for the different attributes. Now, given a sufficiently large dataset where the a_j^(i) and x^(i) are known, we have to find the w_j in such a way that for a new instance, for which x is not known, we can predict x by taking the weighted sum defined above.


If we had only as many training instances as there are weights, the above equations could in general be solved uniquely and every x^(i) for the training samples would be estimated without error. This is however not desirable, as a new instance would then not be predicted correctly. So in practice one takes a set of n samples where n >> k and computes the weights w_j such that the prediction on the training set is as good as possible. For that purpose the mean squared error is used:

sum_{i=1}^{n} ( x^(i) - sum_{j=0}^{k} w_j a_j^(i) )^2

where a_0^(i) = 1 for every instance, so that w_0 acts as the constant term.
Although this might look complicated, standard algorithms for solving it are available and are in fact quite simple.
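For instance, with NumPy the weights minimizing the squared error can be obtained with a single least-squares call; the data below is made up purely for illustration.

import numpy as np

# Hedged sketch: n = 5 toy instances with k = 2 attributes and a numeric target x.
A = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
x = np.array([8.1, 6.9, 17.0, 15.1, 21.0])

A1 = np.hstack([np.ones((A.shape[0], 1)), A])     # prepend a column of ones for w_0
w, *_ = np.linalg.lstsq(A1, x, rcond=None)        # weights minimizing the squared error
print(w)                                          # [w_0, w_1, w_2]
print(A1 @ w)                                     # predictions on the training set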

Keyterms in this chapter


Clustering, association rules, expectation maximization, linear regression, Minimum Spanning Tree, k-means

Chapter 8

Information visualization
8.1 Introduction

When a lot of information is stored in the database it is important for a user to be able to visualize the information. Shneiderman1 makes this more precise and defines the purpose of information visualization as: provide a compact graphical presentation AND user interface for manipulating large numbers of items (10^2 - 10^6), possibly extracted from far larger datasets, enabling users to make discoveries, decisions, or explanations about patterns (trend, cluster, gap, outlier, ...), groups of items, or individual items. To do so a number of basic tasks are defined:

Overview: gain an overview of the entire collection.

Zoom: zoom in on items of interest.

Filter: filter out uninteresting items.

Details-on-demand: select an item or group and get details when needed.

Relate: view relationships among items.

And to support the overall process we need:

History: keep a history of actions to support undo, replay, and progressive refinement.

Extract: allow extraction of sub-collections and of the query parameters.

Now let us consider the visualization process in more detail. The visualization module gets as input a structured dataset (e.g. the result of user interaction, see chapter 9) or a database request, after which data and its structure (possibly obtained by data mining) is retrieved from the database. The first step is to generate a view on the dataset. Steered by parameters from the user, a selection of the data is made for visualization purposes, catering for the five basic tasks overview, zoom, filter, details-on-demand, and relate. When visualizing information we have to take into account the large set of possible datatypes that we can encounter. The first categorization is formed by the different standard types of information we considered in chapter 3.
1 http://www.cs.umd.edu/hcil/pubs/presentations/INFORMSA3/INFORMSA3.ppt

These are nominal, ordinal, interval, and ratio. Next to these we have the multimedia types text, audio, image, and video.

For each of these items we will also have computed features and the similarity between different multimedia items. To be able to understand and explore the datasets we need to visualize the data as well as the features and the similarity in such a way that the user gets sufficient insight to make decisions. The data may have an inherent structure, e.g. because the corresponding ontology is organized as a hierarchical tree, or because the different items have been grouped into different clusters. For structure the following three basic types are defined:

Temporal structures: data with a temporal ordering, e.g. elements related to a timeline.

Tree structures: data with a hierarchical structure.

Network structures: data with a graph-like structure, e.g. a network where the nodes are formed by the multimedia items and where the weighted edges are labelled with the similarity between the items.

Each of the datatypes has an intrinsic data dimension d. A set of nominal data elements has d equal to 0 as the elements are not ordered. For feature vectors of size n describing multimedia items, we have d equal to n. A set of ordinal elements has d equal to 1. There are different ways to visualize any of the above types, several of which can be found in the literature. The choice for a specific type of visualization depends on the characteristics of the data, in particular d, and on user preferences. Each type of visualization takes some datatype as input and generates a display space with perceived dimension d*. E.g. a row of elements has d* = 1, a wireframe of a cube in which data elements are placed has d* = 3. If the appearance of the cube is changing over time the perceived dimension can go up to 4. Thus, for d <= 4 it is in principle possible to have d* = d. For higher-dimensional data it is necessary to employ a projection operator to reduce the number of dimensions to a maximum of 4. For a vector space containing feature vectors, typically principal component analysis is used to limit the number of dimensions to 2 or 3. Clearly the perceived dimension is not equal to the screen dimension, which for regular displays is always 2; hence for every visualization type a mapping has to be made to a 2D image or graph, possibly with an animation-like use of time. Clearly it depends on the device characteristics, most importantly screen size, how the mapping has to be done. Finally, all visualizations have to be combined and presented to the user, who will then navigate and issue new commands to change the view, and so on. An overview of the visualization process is presented in figure 8.1.
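A minimal sketch of such a projection, using a plain SVD-based principal component analysis on randomly generated 64-dimensional feature vectors (the data and the choice of two display dimensions are illustrative only):

import numpy as np

def pca_project(X, dims=2):
    """Project high-dimensional feature vectors onto their top principal components."""
    Xc = X - X.mean(axis=0)                    # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:dims].T                    # coordinates in the display space

X = np.random.default_rng(0).normal(size=(100, 64))   # 100 items, 64-D features
coords = pca_project(X)
print(coords.shape)                                    # (100, 2): one screen position per item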

8.2 Visualizing video content

Having considered the visualization of the whole information space, we now consider the visualization of the multimedia items themselves, in particular video. Visualization of video content depends on the structural level of the unit. Visualization of frame content is trivial.
2 This is adapted from [146]

[Figure 8.1: Overview of the visualization process. A structured data set or a data request yields data and metadata; view generation, steered by view parameters, produces a data view; data type separation splits it into temporal structures, tree structures, network structures, the standard types (nominal, ordinal, interval, ratio), and the multimedia types (text, audio, image, video); a visualization type selector, guided by visualization control, device characteristics, and user preferences, produces display spaces with perceived dimensions d*, which are mapped to 2D screen coordinates (+ time) and combined into a set of 2D displays.]

When aiming to visualize structure we should realize that an LSU has a tree structure and hence the above mentioned methods for tree visualization can be used. What remains, and what is specific for video, is the visualization of the shot. The pictorial content of a shot can be visualized in various ways. Showing the video as a stream is trivial. Stream visualization is used only when the user is very certain the shot content is useful. The advantage of stream visualization is that it shows every possible detail in the video content. Also, it is easy to synchronize with the other modalities, especially audio. The obvious disadvantage is the time needed to evaluate the shot content. Also, the viewer is forced to watch shots one at a time. A user is able to view several images in parallel, but for shots with moving content this is confusing. Alternatively, a static visual video summary could be displayed, allowing the viewer to evaluate large video sequences quickly. The most important summary visualizations are key frames, mosaicking, and slice summaries.

Key frame visualization is the most popular visualization of shot content. For each shot one or more key frames are selected representing in some optimal way the content of the shot. Key frames are especially suited for visualizing static video shots, such as in dialog scenes, because the spatiotemporal aspect hardly plays a role then. In such cases, viewing one frame reveals the content of the entire shot.

Mosaicking combines the pictorial content of several (parts of) frames into one image. Salient stills3 [139] are a noteworthy instance of mosaicking in which the output image has the same size as the input frames. The advantage of mosaicking is the visualization beyond frame borders and the visualization of events by showing the same object several times at different positions. The disadvantage is its dependence on uniform movement of camera or objects, making mosaicking only applicable in very specific situations. A mosaic is shown in figure 8.2.

Slice summaries [147] are used for giving a global impression of shot content. They are based on just one line of pixels of every frame, meaning the value for the vertical dimension is fixed in the case of a horizontal slice, which consequently has the size width*length of the sequence. A vertical slice has size height*length. The fixed values are usually the center of the frame. Video slices were originally used for detecting cuts and camera movement [154], but they are now used for visualization as well, such as in the OM-representation [87].
See http://www.salientstills.com/images/images other.html for examples.


Figure 8.2: An example of a mosaic.

Figure 8.3: An example of a vertical slice.

In special cases, a slice summary gives a very exact view on the visual content in a shot, so that one understands immediately what is happening in the video. This does, however, require the object to move slowly relative to the camera in a direction perpendicular to the slice direction. The disadvantage of slice summaries is the assumption that relevant content is positioned in the slice areas. Although generally the region of interest is near the center of a frame, it is possible a slice summary misses the action in the corners of a frame. Even better is to use a dual slice summary that simultaneously shows both a horizontal and a vertical video slice. An example of a vertical slice can be seen in figure 8.3. The use of two directions increases the chance of showing a visually meaningful summary. The vertical slice provides most information about the video content since in practice it is more likely that the camera or object moves from left to right than from bottom to top. Slices in both directions provide supporting information about shot boundaries and global similarities in content over time. Thus, the viewer can evaluate the relationship between shots quickly, e.g. in a dialog.


Recurrent visual content can be spotted easily in a large number of frames, typically 600 frames on a 15 inch screen. Finally, we can also create a video abstract, which is a short video consisting of parts from the original video. This is very much what you see when you view a trailer of a movie.
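
As an illustration of how a slice summary could be computed, the sketch below stacks the center column of every frame into one image; it assumes OpenCV and NumPy are available and uses a hypothetical input file name.

import cv2
import numpy as np

def vertical_slice(path):
    """Build a vertical slice summary: the center column of every frame, side by side."""
    cap = cv2.VideoCapture(path)
    columns = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        columns.append(frame[:, frame.shape[1] // 2])    # center column, shape (height, 3)
    cap.release()
    return np.stack(columns, axis=1)                     # image of size height x num_frames

summary = vertical_slice("video.mpg")                    # hypothetical file name
cv2.imwrite("slice_summary.png", summary)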

Keyterms in this chapter


Overview, filter, details, zoom, relate, tree structure, network structure, temporal structure, intrinsic data dimension, perceived dimension, key frames, mosaic, slice summary, video abstract.


Chapter 9

Interactive Content Based Multimedia Retrieval



9.1 Introduction

Finding information in a multimedia archive is a notoriously difficult task due to the semantic gap. Therefore, the user and the system should support each other in finding the required information. Interaction of users with a data set has been studied most thoroughly in categorical information retrieval [98]. The techniques reported there need rethinking when used for image retrieval as the meaning of an image, due to the semantic gap, can only be defined in context. Multimedia retrieval requires active participation of the user to a much higher degree than required by categorized querying. In content-based image retrieval, interaction is a complex interplay between the user, the images, and their semantic interpretations.

9.2 User goals

Before defining software tools for supporting the interactive retrieval of multimedia items we first consider the different goals a user can have when accessing a multimedia archive. Assume a user is accessing a multimedia archive I in which a large collection of multimedia items are stored. We distinguish the following five different goals.

answer search: finding an answer to a specific question by locating the required evidence in the multimedia item set.

Finding an answer in a dataset is a well known problem in text retrieval (see e.g. the Question Answering track of the TREC conference [152]). Now consider an example question like "How many spikes are there on the statue of liberty?". If the system would have to answer this question based on an image archive, a set of pictures should be located on which the statue of liberty is shown from different sides. From there, in each image, the statue has to be segmented from the background, and from there the spikes have to be found. Clearly this is far beyond the current state of the art in image databases. Therefore, we will not consider this goal here. The first of the three goals we will consider is:
This chapter is adapted from [158].

target search: finding a specific multimedia item in the database.

This is a task in which the assumption is that the user has a clear idea what he is looking for. E.g. he wants to find the picture of the Annapurna in moonlight which he took on one of his holidays. The generic required answer has the following form: At = m with m ∈ I. The next search goal is:

category search: finding a set of images from the same category.

As categories can be defined in various ways, this search goal covers a lot of different subgoals. The multimedia items can for example have a similar style, e.g. a set of horror movies, contain similar objects, e.g. pictures of cars in a catalogue, or be recorded in the same setting, like different bands at an outdoor concert. Its generic form is: Ac = {mi}i=1,..,ni with mi ∈ I and ni the number of elements in the category searched for. The final search goal that will be considered in this chapter is:

associative search: a search with no other goal than interesting findings.

This is the typical browsing users do when surfing on the web or other heterogeneous datasets. Users are guided by their associations and move freely from a multimedia item in one category to another. It can be viewed as multi-category search. Users are not looking for specific multimedia items, thus whenever they find an interesting item any other element from the same category would be of interest also. So the final generic form to consider is: Aa = {{mij}j=1,..,ni}i=1,..,nc with mij ∈ I, nc the number of categories the user is interested in, and ni the number of elements in category i.

In addition to the above search tasks, which are trying to find a multimedia item, we can also consider

object search: finding a subpart of the content of a multimedia item.

This goal addresses elements within the content of a multimedia item. E.g. a user is looking for a specific product appearing in a sitcom on television. In fact, when dealing with objects, one can again consider answer, target, and category search. Conceptually the difference is not that large, but it can have a significant impact on the design of methods supporting the different tasks. In this chapter we focus on searching for multimedia items.

The goals mentioned are not independent. In fact they can be special cases of one another. If a user doing an associative search finds only one category of multimedia items of interest it reduces to category search. In turn, if this category consists of a single element in the dataset, it becomes target search. Finally, if the target multimedia item contains one object only which can be clearly distinguished from its background, it becomes object search. This is illustrated in figure 9.1. In the general case, the different goals have a great impact on the system supporting the retrieval.


Figure 9.1: Illustration of the relations between the different search goals.

9.3 Query Space: definition

There are a variety of methods in literature for interactive retrieval of multimedia items. Here we focus in particular on interactive image retrieval, but indications will be given how to generalize to other modalities.

9.3.1 Query Space: data items

To structure the description of methods we first define query space. The first component of query space is the selection of images IQ from the large image archive I. Typically, the choice is based on factual descriptions like the name of the archive, the owner, date of creation, or web-site address. Any standard retrieval technique can be used for the selection.

9.3.2 Query Space: features

The second component is a selection of the features FQ ⊆ F derived from the images in IQ. In practice, the user is not always capable of selecting the features most fit to reach the goal. For example, how should a general user decide between image description using the HSV or Lab color space? Under all circumstances, however, the user should be capable of indicating the class of features relevant for the task, like shape, texture, or both. In addition to feature class, [42] has the user indicate the requested invariance. The user can, for example, specify an interest in features robust against varying viewpoint, while the expected illumination is specified as white light in all cases. The appropriate features can then automatically be selected by the system.


Figure 9.2: Illustration of query space showing a set I consisting of 6 images only, for which two features are computed, and two labels are present for each image.

9.3.3 Query Space: similarity

As concerns the third component of query space, the user should also select a similarity function, SQ. To adapt to different data-sets and goals, SQ should be a parameterized function. Commonly, the parameters are weights for the different features.
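
A minimal sketch of such a parameterized similarity function, here simply a weighted combination of per-feature distances; the feature names and weights are invented for illustration.

import numpy as np

def similarity(item_a, item_b, weights):
    """Parameterized similarity: weighted sum of per-feature similarities."""
    score = 0.0
    for name, weight in weights.items():
        distance = np.linalg.norm(item_a[name] - item_b[name])
        score += weight * 1.0 / (1.0 + distance)    # turn a distance into a similarity
    return score

# Hypothetical items described by a color histogram and a texture vector.
a = {"color": np.random.rand(64), "texture": np.random.rand(16)}
b = {"color": np.random.rand(64), "texture": np.random.rand(16)}
print(similarity(a, b, weights={"color": 0.7, "texture": 0.3}))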

9.3.4 Query Space: interpretations

The fourth component of query space is a set of labels ZQ ⊆ Z to capture goal-dependent semantics. Given the above, we define an abstract query space:

The query space Q is the goal-dependent 4-tuple {IQ, FQ, SQ, ZQ}.

The concept of query space is illustrated in figure 9.2.
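
Purely as an illustration, the 4-tuple could be represented by a small container such as the one sketched below; the field types are assumptions and not part of the definition above.

import numpy as np
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class QuerySpace:
    """The goal-dependent 4-tuple {IQ, FQ, SQ, ZQ}."""
    images: List[str]                                       # IQ: identifiers of the active images
    features: Dict[str, np.ndarray]                         # FQ: feature vector per image
    similarity: Callable[[np.ndarray, np.ndarray], float]   # SQ: parameterized similarity function
    labels: Dict[str, Dict[str, float]] = field(default_factory=dict)  # ZQ: P_i(z) per image and label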

9.4 Query Space: processing steps

When a user is interactively accessing a multimedia dataset the system performs a set of five processing steps, namely initialization, specification, visualization, feedback, and output. Of these five steps the visualization and feedback step form the iterative part of the retrieval task. In each of the steps every component of the query space can be modified. The remaining three steps are relevant for any retrieval system. These processing steps are illustrated in figure 9.3.

9.4.1 Initialization

To start a query session, an instantiation Q = {IQ, FQ, SQ, ZQ} of the abstract query space is created. When no knowledge about past or anticipated use of the system is available, the initial query space Q0 should not be biased towards specific images, or make some image pairs a priori more similar than others. The active set of images is therefore equal to all of IQ.

[Figure 9.3 (schematic): an information request leads to query specification and query space initialization based on the structured data set and multimedia descriptions; the prepare-for-display and feedback-processing steps iterate, turning the initial query space into updated query spaces, until the final query space is passed to the query space output step that produces the results.]

Figure 9.3: The processing steps in interactive content based retrieval.

Furthermore, the features of FQ are normalized based on the distribution of the feature values over IQ, e.g. [39, 112]. To make SQ unbiased over FQ, the parameters should be tuned, arriving at a natural distance measure. Such a measure can be obtained by normalization of the similarity between individual features to a fixed range [149, 112]. For the instantiation of a semantic label, the semantic gap prevents attachment to an image with full certainty. Therefore, in the ideal case, the instantiation of ZQ assigns for each i ∈ IQ and each z ∈ ZQ a probability Pi(z) rather than a strict label. Without prior information no labels are attached to any of the multimedia items. However, one can have an automatic system which indexes the data, for example the ones discussed in [131] for video indexing. As this is the result of an automatic process, P captures the uncertainty of the result. If a user has manually annotated a multimedia item i with label z this can be considered as ground truth data and Pi(z) = 1.

The above is of course not the ideal situation. While systems for interactive retrieval are currently only starting to appear, they can be expected to be commonplace in the years to come. Thus, there might already be a large amount of previous use of the systems. The results of these earlier sessions can be stored in the form of one or more profiles. Any of such profiles can be considered as a specification of how the query space should be initialized. The query space forms the basis for specifying queries, display of query results, and for interaction.
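
A sketch of the feature normalization step, assuming the features of IQ are stored as one matrix with a row per image; z-score normalization is used here as one possible choice, not necessarily the one of the cited references.

import numpy as np

def normalize_features(feature_matrix):
    """Normalize each feature column based on its distribution over IQ (zero mean, unit variance)."""
    mean = feature_matrix.mean(axis=0)
    std = feature_matrix.std(axis=0)
    std[std == 0] = 1.0                      # guard against constant features
    return (feature_matrix - mean) / std

# Hypothetical archive of 1000 images with 32 features each.
FQ = np.random.rand(1000, 32)
FQ_normalized = normalize_features(FQ)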

9.4.2 Query specification

For specifying a query q in Q, many different interaction methodologies have been proposed. A query falls in one of two major categories:

exact query, where the query answer set A(q) equals the images in IQ satisfying a set of given criteria.

approximate query, where A(q) is a ranking of the images in IQ with respect to the query.


Figure 9.4: Example queries for each of the 6 different query types and possible results from the Corel image database.

Within each of the two categories, three subclasses can be defined depending on whether the query relates to the spatial content of the image, to the global image information, or to groups of images. An overview of specification of queries in image retrieval is shown in figure 9.4. Figure 9.5 and figure 9.6 show the equivalent classes for video and text retrieval respectively.

Exact queries are expressed in terms of the standard boolean operators and, or, and not. What makes them special is the use of the specific image based predicates. There are three different exact query classes:

Exact query by spatial predicate is based on the location of silhouettes, i.e. the result of strong segmentation, homogeneous regions resulting from strong segmentation, or signs. Query on silhouette location is applicable in narrow domains only. Typically, the user queries using an interpretation z ∈ ZQ. To answer the query, the system then selects an appropriate algorithm for segmenting the image and extracting the domain-dependent features. In [126] the user interactively indicates semantically salient regions to provide a starting point. The user also provides sufficient context to derive a measure for the probability of z. Implicit spatial relations between regions sketched by the user in [130] yield a pictorial predicate. Other systems let the user explicitly define the predicate on relations between homogeneous regions [21]. In both cases, to be added to the query result, the homogeneous regions as extracted from the image must comply with the predicate. A web search system in which the user places icons representing categories like human, sky, and water in the requested spatial order is presented in [72]. In [117], users pose spatial-predicate queries on geographical signs located in maps based on their absolute or relative positions.
A sign is an object in the image with fixed shape and orientation, e.g. a logo.



Figure 9.5: idem for video.

Figure 9.6: idem for text.

Exact query by image predicate is a specification of predicates on global image descriptions, often in the form of range predicates. Due to the semantic gap, range predicates on features are seldom used in a direct way. In [96], ranges on color values are pre-defined in predicates like MostlyBlue and SomeYellow. Learning from user annotations of a partitioning of the image allows for feature range queries like: amount of sky > 50% and amount of sand > 30% [107].

Exact query by group predicate is a query using an element z ∈ ZQ where ZQ is a set of categories that partitions IQ. Both in [23] and [143] the user queries on a hierarchical taxonomy of categories. The difference is that the categories are based on contextual information in [23] while they are interpretations of the content in [143].

In the approximate types of query specifications the user specifies a single feature vector or one particular spatial configuration in FQ, where it is anticipated that no image will satisfy the query exactly.

Approximate query by spatial example results in an image or spatial structure corresponding to literal image values and their spatial relationships. Pictorial specification of a spatial example requires a feature space such that feature values can be selected or sketched by the user. Low-level feature selectors use color pickers, or selections from shape and texture examples [39, 46]. Kato [64] was the first to let users create a sketch of the global image composition which was then matched to the edges in IQ. Sketched outlines of objects in [70] are first normalized to remove irrelevant detail from the query object, before matching it to objects segmented from the image. When specification is by parameterized template [32, 123] each image in IQ is processed to find the best match with edges of the images. The segmentation result is improved if the user may annotate the template with salient details like color corners and specific textures. Pre-identification of all salient details in images in IQ can then be employed to speed up the search process [127]. When weak segmentation of the query image and all images in IQ is performed, the user can specify the query by indicating example regions [21, 130].

Approximate query by image example feeds the system a complete array of pixels and queries for the most similar images, in effect asking for the k-nearest neighbors in feature space. Most of the current systems have relied upon this form of querying [39, 46]. The general approach is to use a SQ based on global image features. Query by example queries are subclassified [149] into query by external image example, if the query image is not in the database, versus query by internal image example. The difference between external and internal example is minor for the user, but affects the computational support as for internal examples all relations between images can be pre-computed. Query by image example is suited for applications where the target is an image of the same object or set of objects under different viewing conditions [42]. In other cases, the use of one image cannot provide sufficient context for the query to select one of its many interpretations [118].

Approximate image query by group example is specification through a selection of images which as an ensemble defines the goal. The rationale is to put the image in its proper semantic context to make one of the possible interpretations z ∈ ZQ clearly stand out with respect to other possible interpretations. One option is that the user selects m > 1 images from a palette of images presented to find images best matching the common characteristics of the m images [29].


An m-query set is capable of defining the target more precisely. At the same time, the m-query set defines relevant feature value variations and nullifies irrelevant variations in the query. Group properties are amplified further by adding negative examples. This is achieved in [7] by constructing a query q best describing positive and negative examples indicated by the user. When for each group in the database a small set of representative images can be found, they are stored in a visual dictionary from which the user creates the query [118]. Of course, the above queries can always be combined into more complex queries. For example, both [21, 130] compare the similarity of regions using features. In addition, they encode spatial relations between the regions in predicates. Even with such complex queries, a single query q is rarely sufficient to make A(q) the user-desired answer set. For most image queries, the user must engage in active interaction with the system on the basis of the query results as displayed.
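
To illustrate how positive and negative examples can be folded into a single query vector, the sketch below uses a generic Rocchio-style combination; this is not necessarily the construction used in [7], and the weights alpha and beta are arbitrary choices.

import numpy as np

def combine_examples(positives, negatives, alpha=1.0, beta=0.5):
    """Rocchio-style query: amplify what positives share, subtract what negatives share."""
    query = alpha * np.mean(positives, axis=0)
    if len(negatives) > 0:
        query -= beta * np.mean(negatives, axis=0)
    return query

# Hypothetical feature vectors of user-selected examples.
pos = np.random.rand(5, 64)    # m = 5 positive examples
neg = np.random.rand(2, 64)    # 2 negative examples
q = combine_examples(pos, neg)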

9.4.3 Query output

In most current retrieval systems the output is an unordered set of multimedia items for exact queries and a ranked set of images for approximate search. However, when considering the different user search goals identified in section 9.2 the output should depend on the user goal. As the system should base its result on query space, we let the output or answer operator A work on query space rather than on the query as used in section 9.4.2. To differentiate between the different goals we again use the superscript t for target, c for category, and a for associative. This leads to the following operators, where we assume that the number of answers is given by n.

For target search the operator is very simple as only one element is requested, thus the most likely one is selected:

At(Q) = arg max_{m ∈ IQ} P(m = target)

Category search yields a set of multimedia items, which should all be in the same category. To generate the answer the user should specify the label z corresponding to the category and a threshold Tc on the probability measure for the label on a multimedia item:

Ac(Q, z, Tc) = {m ∈ IQ | Pm(z) ≥ Tc}

Finally the associative search yields a set of results for a vector of labels (z1, ..., zn) corresponding to interesting categories:

Aa(Q, (z1, ..., zn), Tc) = (Ac(Q, zi, Tc))i=1,..,n
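
A sketch of the three output operators, assuming the probabilities Pm(z) are stored in a dictionary per item; the item identifiers and label names are invented.

def target_output(P, target="target"):
    """At(Q): the single most likely item."""
    return max(P, key=lambda m: P[m].get(target, 0.0))

def category_output(P, z, Tc):
    """Ac(Q, z, Tc): all items whose probability for label z reaches the threshold."""
    return {m for m, labels in P.items() if labels.get(z, 0.0) >= Tc}

def associative_output(P, zs, Tc):
    """Aa(Q, (z1,...,zn), Tc): one category result per label of interest."""
    return [category_output(P, z, Tc) for z in zs]

# Hypothetical probabilities for three items and two labels.
P = {"img1": {"beach": 0.9}, "img2": {"beach": 0.4, "car": 0.8}, "img3": {"car": 0.7}}
print(category_output(P, "beach", 0.5))    # {'img1'}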

9.4.4 Query output evaluation

Having arrived at a final output after the search quest of the user, we can evaluate the search performance of the system. In information retrieval there are two basic measures, namely precision and recall. The first measure tells us how much of the results returned are indeed correct. The second measure quantifies how many of the relevant items in the database we actually found. In practice, the two measures are competing: if the parameters of the system are tuned to favor precision, the recall will go down and vice versa. To make this more precise, let A be the user goal (as formalized in section 9.2), and let Â denote the result returned by the system. Precision p and recall r are then given as:


p = |A ∩ Â| / |Â|
r = |A ∩ Â| / |A|

where |.| denotes the number of elements in a set. So let us now see how we can apply these measures for evaluation of the three search goals, namely target, category, and associative search. For target search it is very easy to evaluate the result as either the target is found or not; we can simply fill in the definitions given earlier for formalizing the search and the output:

pt(Q) = |At ∩ Ât(Q)| / |Ât(Q)|
rt(Q) = |At ∩ Ât(Q)| / |At|

Clearly for target search this value is either 1 (found) or 0 (not found). For category search precision and recall can also be used directly:

pc(Q, z, Tc) = |Ac ∩ Âc(Q, z, Tc)| / |Âc(Q, z, Tc)|
rc(Q, z, Tc) = |Ac ∩ Âc(Q, z, Tc)| / |Ac|

Evaluating associative search is somewhat more difficult. As we have interpreted this task as multi-category search we have to consider precision and recall for all different categories. Thus we get:

pa(Q, (z1, ..., zn), Tc) = (pc(Q, zi, Tc))i=1,..,n
ra(Q, (z1, ..., zn), Tc) = (rc(Q, zi, Tc))i=1,..,n

For object search two types of evaluation are needed. First, we can use the evaluation measures given for category search to evaluate whether the right set of multimedia items is found. Second, given that a multimedia item contains at least one instance of the object searched for, we should evaluate how well the object is localized. This can in fact be done with two measures which are closely related to precision and recall, measuring the overlap between the objects found and the desired object.
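
The measures translate directly into code; the item identifiers below are made up.

def precision(goal, result):
    """Fraction of the returned items that are indeed in the user goal."""
    return len(goal & result) / len(result) if result else 0.0

def recall(goal, result):
    """Fraction of the items in the user goal that were actually returned."""
    return len(goal & result) / len(goal) if goal else 0.0

# Hypothetical category search: the goal contains 4 items, the system returned 5.
A = {"img1", "img2", "img3", "img4"}
A_hat = {"img2", "img3", "img5", "img6", "img7"}
print(precision(A, A_hat), recall(A, A_hat))    # 0.4 and 0.5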

9.5 Query space interaction

In early systems, the process of query specification and output is iterated, where in each step the user revises the query. Updating the query is often still appropriate for exact queries. For approximate queries, however, the interactive session should be considered in its entirety. During the session the system updates the query space, attempting to learn the goals from the user's feedback. This leads to the framework depicted in figure 9.3, including all 5 processing steps mentioned earlier. We define:


An interactive query session is a sequence of query spaces {Q0, Q1, ..., Qn-1, Qn} where the interaction of the user yields a relevance feedback RFi in every iteration i of the session. In this process, the transition from Qi to Qi+1 materializes the feedback of the user. In a truly successful session Â(Qn) = A, i.e. the user had a search goal A based on the initial query space he/she started with, and the output on the last query space does indeed give this goal as an answer. In the following subsections we will first consider what the optimal change in each step of the interaction process is. Then we consider how the query space can be displayed as this is the means through which the user can interact. We subsequently consider how the user can give feedback and finally we explain how the query space can be updated based on the user's feedback and how this whole process should be evaluated.

9.5.1 Interaction goals

In the process of interaction the query space should get closer and closer to the result requested by the user. We now consider what this means for the query space update in every step, where we make a distinction between the different search goals.

When doing an object search the set I should be reduced in every step in such a way that no multimedia items containing the object are removed from the query space. The features F should be such that within the items containing the object the object is optimally discriminated from the background. Similarity S should be putting items containing the object closer to each other, while putting them further away from the other items. Finally, items containing the object should have a high probability in Z while others get a lower probability.

For target search we encounter similar things. I should ideally be reduced to the single target item. The features F should be such that they highlight important information in the target item. S should make the target stand out clearly from the other items. When the label target is used clearly the probability of the target should go to 1, while the probability for other items goes to 0.

Category search is not much different from target search and hence the goals are very similar, now focussing on a category of items. As browsing can be viewed as multi-category search it yields a similar analysis; however, as the desired categories might in fact be very different from one another we have to allow for multiple possible query spaces, which at output time should be combined.

9.5.2 Query space display

There are several ways to display the query space to the user (see chapter 8). In addition, system feedback can be given to help the user in understanding the result. Recall that d as defined in chapter 8 is equal to the intrinsic dimension of the data. Hence, for our purposes we have to consider the intrinsic dimension of the query space, or, as the initial query space is the result of a query, equal to the intrinsic dimension of the query result. When the query is exact, the result of the query is a set of images fulfilling the predicate. As an image either fulfills the predicate or not, there is no intrinsic order in the query result and hence d = 0. For approximate queries, the images in IQ are given a similarity ranking based on SQ with respect to the query. In many systems the role of V is limited to bounding the number of images displayed which are then displayed in a 2D rectangular grid [39, 23]. Note, however, that we should have d = 1.


If the user refines the query using query by example, the images displayed do not have to be the images closest to the query. In [149], images are selected that together provide a representative overview of the whole active database. An alternative display model displays the image set minimizing the expected number of total iterations [29]. The space spanned by the features in FQ is a high-dimensional space. When images are described by feature vectors, every image has an associated position in this space. In [118, 138, 54] the operator V maps the high-dimensional feature space onto a display space with d* = 3. Images are placed in such a way that distances between images in D reflect SQ. A simulated fisheye lens is used to induce perception of depth in [138]. In the reference, the set of images to display depends on how well the user selections conform to selections made in the community of users. To improve the user's comprehension of the information space, [54] provides the user with a dynamic view on FQ through continuous variation of the active feature set. The display in [63] combines exact and approximate query results. First, the images in IQ are organized in 2D-layers according to labels in ZQ. Then, in each layer, images are positioned based on SQ. In exact queries based on accumulative features, back-projection [136, 36] can be used to give system feedback, indicating which parts of the image fulfill the criteria. For example, in [107] each tile in the partition of the image shows the semantic label, like sky, building, or grass, that the tile received. For approximate queries, in addition to mere rank ordering, in [21] system feedback is given by highlighting the subparts of the images contributing most to the ranking result.

9.5.3 User feedback

For target search, category search, or associative search various ways of user feedback have been considered. All are balancing between obtaining as much information from the user as possible and keeping the burden on the user minimal. The simplest form of feedback is to indicate which images are relevant [29]. In [26, 81], the user in addition explicitly indicates non-relevant images. The system in [112] considers five levels of significance, which gives more information to the system, but makes the process more difficult for the user. When d* ≥ 2, the user can manipulate the projected distances between images, putting away non-relevant images and bringing relevant images closer to each other [118]. The user can also explicitly bring in semantic information by annotating individual images, groups of images, or regions inside images [83] with a semantic label. In general, user feedback leads to an update of query space:
{IQ^i, FQ^i, SQ^i, ZQ^i} --RFi--> {IQ^(i+1), FQ^(i+1), SQ^(i+1), ZQ^(i+1)}    (9.1)

Different ways of updating Q are open. In [149] the images displayed correspond to a partitioning of IQ. By selecting an image, one of the sets in the partition is selected and the set IQ is reduced. Thus the user zooms in on a target or a category. The method follows the pattern:
IQ^i --RFi--> IQ^(i+1)    (9.2)

In current systems, the feature vectors in FQ corresponding to images in IQ are assumed fixed. This has great advantages in terms of efficiency. When features are parameterized, however, feedback from the user could lead to optimization of the parameters. For example, in parameterized detection of objects based on salient contour details, the user can manipulate the segmentation result to have the system select a more appropriate salient detail based on the image evidence [127]. The general pattern is:

FQ^i --RFi--> FQ^(i+1)    (9.3)


For associative search users typically interact to teach the system the right associations. Hence, the system updates the similarity function:
SQ^i --RFi--> SQ^(i+1)    (9.4)

In [26, 112] SQ is parameterized by a weight vector on the distances between individual features. The weights in [26] are updated by comparing the variance of a feature in the set of positive examples to the variance in the union of positive and negative examples. If the variance for the positive examples is significantly smaller, it is likely that the feature is important to the user. The system in [112] first updates the weight of different feature classes. The ranking of images according to the overall similarity function is compared to the rankings corresponding to each individual feature class. Both positive and negative examples are used to compute the weight of the feature, computed as the inverse of the variance over the positive examples. The feedback RFi in [118] leads to an update of the user-desired distances between pairs of images in IQ. The parameters of the continuous similarity function should be updated to match the new distances. A regularization term is introduced limiting the deviation from the initial natural distance function. This ensures a balance between the information the data provides and the impact of the feedback of the user. With such a term the system can smoothly learn the feedback from the user. The final set of methods follow the pattern:
ZQ^i --RFi--> ZQ^(i+1)    (9.5)

The system in [83] pre-computes a hierarchical grouping of partitionings (or images for that matter) based on the similarity for each individual feature. The feedback from the user is employed to create compound groupings corresponding to a user-given z ∈ ZQ. The compound groupings are such that they include all of the positive and none of the negative examples. Unlabelled images in the compound group receive label z. The update of probabilities P is based on different partitionings of IQ. For category and target search, a system may refine the likelihood of a particular interpretation, updating the label based on feature values or on similarity values. The method in [81] falls in this class. It considers category search, where ZQ is {relevant, non-relevant}. In the limit case for only one relevant image, the method boils down to target search. All images indicated by the user as relevant or non-relevant in current or previous iterations are collected and a Parzen estimator is constructed incrementally to find optimal separation between the two classes. The generic pattern which uses similarity in updating probabilities is the form used in [29] for target search with ZQ = {target}. In the reference an elaborate Bayesian framework is derived to compute the likelihood of any image in the database to be the target, given the history of actions RFi. In each iteration, the user selects examples from the set of images displayed. Image pairs are formed by taking one selected and one displayed, but non-selected image. The probability of being the target for an image in IQ is increased or decreased depending on the similarity to the selected and the non-selected example in the pair.
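
As an illustration of the variance-based weight update described for [26], the sketch below recomputes feature weights from the user's positive and negative examples; it is a simplified reading of that scheme, not an exact reimplementation.

import numpy as np

def update_feature_weights(positives, negatives, eps=1e-6):
    """Give a feature a high weight when the positive examples agree on it more
    than the positive and negative examples taken together."""
    all_examples = np.vstack([positives, negatives])
    var_pos = positives.var(axis=0) + eps
    var_all = all_examples.var(axis=0) + eps
    weights = var_all / var_pos              # small positive-example variance -> large weight
    return weights / weights.sum()           # normalize so the weights sum to one

# Hypothetical relevance feedback: 4 positive and 3 negative examples, 8 features each.
pos = np.random.rand(4, 8)
neg = np.random.rand(3, 8)
print(update_feature_weights(pos, neg))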

9.5.4 Evaluation of the interaction process

In section 9.4.4 we evaluated the output of the search task. When the user is interacting with the system this is not sufficient. Given some output performance it makes a great difference whether the user needed one or two iteration steps to reach this result, or was interacting over and over to get there. So one evaluation measure to consider is:

Iteration count: the number of iterations needed to reach the result.

However, this is not sufficient: we can design very simple iteration strategies where the user selects one multimedia item to continue, or a very elaborate one where the user has to indicate the relevance of each and every item on display. As it is hard to compare the different feedback mechanisms we take a general approach and make the assumption that for a user the most difficult step is to assess whether a multimedia item is relevant, whether the similarity between two items is correct, etc. Thus we consider the following general evaluation measure:

Assessment count: the number of times a user had to assess some element in the query space.

Clearly, like in the case of precision and recall, the design of an iteration process will always be a balance between iteration and assessment count. Reducing one will introduce an increase in the other.

9.6 Discussion and conclusion

We consider the emphasis on interaction in image retrieval as one of the major changes with respect to the computer vision tradition, as was already cited in the 1992 workshop [58]. Interaction was first picked up by frontrunners, such as the NEC laboratories in Japan and the MIT Media Lab, to name a few. Now, interaction and feedback have moved into the focus of attention. Putting the user in control and visualization of the content has always been a leading principle in information retrieval research. It is expected that more and more techniques from traditional information retrieval will be employed or reinvented in content-based image retrieval. Text retrieval and image retrieval share the need for visualizing the information content in a meaningful way as well as the need to accept a semantic response of the user rather than just providing access to the raw data. User interaction in image retrieval has, however, some different characteristics from text retrieval. There is no sensory gap and the semantic gap from keywords to full text in text retrieval is of a different nature. No translation is needed from keywords to pictorial elements.

A critical point in the advancement of content-based retrieval is the semantic gap, where the meaning of an image is rarely self-evident. Use of content-based retrieval for browsing will not be within the grasp of the general public as humans are accustomed to rely on the immediate semantic imprint the moment they see an image, and they expect a computer to do the same. The aim of content-based retrieval systems must be to provide maximum support in bridging the semantic gap between the simplicity of available visual features and the richness of the user semantics. Any information the user can provide in the search process should be employed to provide the rich context required in establishing the meaning of a picture. The interaction should form an integral component in any modern image retrieval system, rather than a last resort when the automatic methods fail.

Already at the start, interaction can play an important role. Most current systems perform query space initialization irrespective of whether a target search, a category search, or an associative search is requested. But the fact of the matter is that the set of appropriate features and the similarity function depend on the user goal. Asking the user for the required invariance yields a solution for a specific form of target search. For category search and associative search the user-driven initialization of query space is still an open issue.


For image retrieval we have identified six query classes, formed by the Cartesian product of the result type {exact, approximate} and the level of granularity of the descriptions {spatial content, image, image groups}. The queries based on spatial content require segmentation of the image. For large data sets such queries are only feasible when some form of weak segmentation can be applied to all images, or when signs are selected from a predefined legend. A balance has to be found between flexibility on the user side and scalability on the system side. Query by image example has been researched most thoroughly, but a single image is only suited when another image of the same object(s) is the aim of the search. In other cases there is simply not sufficient context. Queries based on groups as well as techniques for prior identification of groups in data sets are promising lines of research. Such group-based approaches have the potential to partially bridge the semantic gap while leaving room for efficient solutions.

Due to the semantic gap, visualization of the query space in image retrieval is of great importance for the user to navigate the complex query space. While currently 2- or 3-dimensional display spaces are mostly employed in query by association, target and category search are likely to follow. It is often overlooked that the query result has an inherent display dimension. Most methods simply display images in a 2D grid. Enhancing the visualization of the query result is, however, a valuable tool in helping the user navigate query space. As apparent from the query space framework, there is an abundance of information available for display. New visualization tools are called for to allow for user- and goal-dependent choices on what to display. In all cases, an influx of computer graphics and virtual reality is foreseen in the near future.

Through manipulation of the visualized result, the user gives feedback to the system. The interaction patterns as enumerated here reveal that in current methods feedback leads to an update of just one of the components of query space. There is no inherent reason why this should be the case. In fact, joint updates could indeed be effective and well worth researching. For example the pattern which updates category membership based on a dynamic similarity function would combine the advantages of browsing with category and target search.

One final word about the impact of interactivity on the architecture of the system. The interacting user brings about many new challenges for the response time of the system. Content-based image retrieval is only scalable to large data sets when the database is able to anticipate what interactive queries will be made. A frequent assumption is that the image set, the features, and the similarity function are known in advance. In a truly interactive session, the assumptions are no longer valid. A change from static to dynamic indexing is required.

Interaction helps a great deal in limiting the semantic gap. Another important direction in limiting the gap is by making use of more than one modality. Often a picture has a caption, is part of a website, etc. This context can be of great help. Although there are currently several methods for indexing which do take multiple modalities into account, there are no methods which integrate the modalities in interactive search.

Keyterms in this chapter


Answer search, target search, category search, associative search, object search, query space, exact query, approximate query, spatial predicate, image predicate, group predicate, spatial example, image example, group example, interactive query session, system feedback, relevance feedback, precision, recall, iteration count, assessment count, visualization operator, display space, perceived dimension, projection operator, screen space


Chapter 10

Data export
10.1 Introduction

When we aim to distribute items in the form of multimedia documents from the database, e.g. the result of some specific query, or if we want to put a multimedia document on the web, the format should be self-descriptive so that the client end can interpret it in the correct way. Proprietary standards require the use of the same tools at the client and the server side. The use of standards makes this far easier as the client and server are free in choosing their tools as long as they conform to the standard. Of course currently the most important standard used for exchange is XML. For use in a multimedia context several extensions have been proposed. Given the nature of multimedia documents we need formats for exchanging the following:

multimedia document content: formats dealing with the different forms of data representations and content descriptors.

multimedia document layout: formats dealing with the spatial and temporal layout of the document.

There are a number of different standards that consider the two elements above. We focus on MPEG-7 [80] for content description and SMIL as a standard for multimedia document layout.

10.2 MPEG-7

MPEG-7 is an ISO standard which provides a rich set of tools for completely describing multimedia content. It does so at all the levels of description identified in chapter 3, i.e. at the conceptual, perceptual, and non audio-visual level; in addition it provides descriptions which are aimed more at the data management level:

- Classical archival oriented description, i.e. non audio-visual descriptions:
  - information regarding the content's creation and production, e.g. director, title, actors, and location;
  - information related to using the content, e.g. broadcast schedules and copyright pointers;
  - information on storing and representing the content, e.g. storage format and encoding format.
- Perceptual descriptions of the information in the content:
  - information regarding the content's spatial, temporal, or spatio-temporal structure, e.g. scene cuts, segmentation in regions, and region tracking;
  - information about low-level features in the content, e.g. colors, textures, sound timbres, and melody descriptions.
- Descriptions at the semantic level:
  - semantic information related to the reality captured by the content, e.g. objects, events, and interactions between objects.
- Information for organizing, managing, and accessing the content:
  - information about how objects are related and gathered in collections;
  - information to support efficient browsing of content, e.g. summaries, variations, and transcoding information;
  - information about the interaction of the user with the content, e.g. user preferences and usage history.

As an example, below is the annotation of a two-shot video where the first shot is annotated with some information. Note that the description is bulky, but contains all information required.

<?xml version="1.0" encoding="iso-8859-1"?>
<Mpeg7 xmlns="urn:mpeg:mpeg7:schema:2001"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:mpeg7="urn:mpeg:mpeg7:schema:2001"
       xsi:schemaLocation="urn:mpeg:mpeg7:schema:2001 Mpeg7-2001.xsd">
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="VideoType">
      <Video>
        <TemporalDecomposition>
          <VideoSegment>
            <TextAnnotation type="scene" relevance="1" confidence="1">
              <FreeTextAnnotation>Outdoors</FreeTextAnnotation>
              <FreeTextAnnotation>Nature(High-level)</FreeTextAnnotation>
              <FreeTextAnnotation>Forest</FreeTextAnnotation>
            </TextAnnotation>
            <MediaTime>
              <MediaTimePoint>T00:00:00:0F30</MediaTimePoint>
              <MediaIncrDuration mediaTimeUnit="PT1N30F">160</MediaIncrDuration>
            </MediaTime>
            <TemporalDecomposition>
              <VideoSegment>
                <MediaTime>
                  <MediaTimePoint>T00:00:02:19F30</MediaTimePoint>
                </MediaTime>
              </VideoSegment>
            </TemporalDecomposition>
          </VideoSegment>
        </TemporalDecomposition>
      </Video>
    </MultimediaContent>
  </Description>
</Mpeg7>
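
Reading such a description back is straightforward with a standard XML library. The sketch below (Python standard library only; annotation.xml is a hypothetical file holding the description above) extracts the free-text annotations and start time of every video segment.

import xml.etree.ElementTree as ET

NS = "urn:mpeg:mpeg7:schema:2001"

def video_segment_annotations(path):
    """Return (start time, free-text annotations) for every VideoSegment in an MPEG-7 file."""
    tree = ET.parse(path)
    result = []
    for segment in tree.iter(f"{{{NS}}}VideoSegment"):
        texts = [el.text for el in segment.iter(f"{{{NS}}}FreeTextAnnotation")]
        start = segment.find(f"{{{NS}}}MediaTime/{{{NS}}}MediaTimePoint")
        result.append((start.text if start is not None else None, texts))
    return result

for start, texts in video_segment_annotations("annotation.xml"):
    print(start, texts)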

It should be noted that MPEG-7 is not geared towards a specific application, but it aims to provide a generic framework to facilitate exchange and reuse of multimedia content across different application domains.

10.3 SMIL: Synchronized Multimedia Integration Language

To describe SMIL we will follow [19] and [20]. In these references SMIL is defined as: "SMIL is a collection of XML elements and attributes that we can use to describe the temporal and spatial coordination of one or more media objects." It has been implemented in many media players, in particular in the RealPlayer and to some extent in Internet Explorer. An important component of the format is the definition of timing and synchronization of different media components. To that end three different basic media containers are defined:

seq, or sequential time container, where the different elements in the container are played in the order given.

par, or parallel time container, indicating that the elements in the container can be played at the same time.

excl, or exclusive container, indicating that only one of the elements of the container can be active at some given point in time.

To specify the timing of an individual media item or the container as a whole, the attributes begin, end, and dur (for duration) are used. The above containers can be nested to create a presentation hierarchy. To that end, it should be noted that the timing mentioned above can be dynamic. The composite timeline is computed by considering the whole hierarchy. For example the duration of a par container equals the time of the longest playing media item. In many cases this will not be known till the document is actually played. Individual items are specified with relative starts and ends. For example an item in a seq sequence can indicate that it will start 2 seconds after the preceding one. Finally, in an excl container the item to select is typically based on some event like another item that starts playing or a button selected by the user. Examples of the three containers are given in figure 10.1.

In a perfect world any system using any connection should be able to adhere to the timings as given in the attributes. However, this can not always be assured. SMIL deals with these problems in two different ways. First, it has three high level attributes to control synchronization:

syncBehavior: lets a presentation define whether there can be slippage in implementing the presentation's composite timeline.

syncTolerance: defines how much slip is allowed.

syncMaster: lets a particular element become the master timebase against which all others are measured.

The above deals with small problems occurring over the communication channel. However, SMIL also allows one to define the presentation in such a way that it can adapt to the particular device or communication channel used.


<seq>
  <img id="a" dur="6s" begin="0s" src="..." />
  <img id="b" dur="4s" begin="0s" src="..." />
  <img id="c" dur="5s" begin="2s" src="..." />
</seq>

<par>
  <img id="a" dur="6s" begin="0s" src="..." />
  <img id="b" dur="4s" begin="0s" src="..." />
  <img id="c" dur="5s" begin="2s" src="..." />
</par>

<excl>
  <img id="a" dur="6s" src="..." begin="0s;button1.activateEvent" />
  <img id="b" dur="4s" src="..." begin="0s;button2.activateEvent" />
  <img id="c" dur="5s" src="..." begin="2s;button3.activateEvent" />
</excl>

Figure 10.1: Abstract examples with timelines and associated code of the seq, par, and excl container.

In this manner a presentation can be developed that can be run both on a full-fledged PC where for example a complete video presentation is shown and on a mobile phone where a slide show with thumbnail images is presented.

The second solution provided by SMIL is in the form of the <switch> statement. In this way we can for example test on the systemBitrate (typically defined by the connection) and choose the appropriate media accordingly. So, for example, the following code fragment selects the video if the bitrate is 112 Kbit/s or above, a slideshow if the bitrate is at least 56 Kbit/s, and finally, if the connection has an even lower bitrate, a text is shown.

<switch>
  <video src="videofile.avi" systemBitrate="115200"/>
  <seq systemBitrate="57344">
    <img src="img1.png" dur="5s"/>
    ...
    <img src="img12.png" dur="9s"/>
  </seq>
  <text src="desc.html" dur="30s"/>
</switch>

The layout in a SMIL document is specified by a hierarchical specification of regions on the page. As subregions are defined relative to the position of their parents, they can be moved as a whole. An example is given in figure 10.2.

Keyterms in this chapter


XML, SMIL, MPEG-7, multimedia document layout.



<toplayout id="id1" width="800" height="600">
  <region id="id2" left="80" top="40" width="30" height="40">
    <region id="id3" left="40" top="60" width="20" height="10">
    </region>
  </region>
  <region id="id4" left="420" top="80" width="30" height="40">
  </region>
</toplayout>

Figure 10.2: Simple layout example in SMIL. The toplayout defines the size of the presentation as a whole. The other regions are defined as subregions of their parent.


Bibliography
[1] S. Abney. Part-of-speech tagging and partial parsing. In S. Young and G. Bloothooft, editors, Corpus-Based Methods in Language and Speech Processing, pages 118-136. Kluwer Academic Publishers, Dordrecht, 1997.
[2] D.A. Adjeroh, I. King, and M.C. Lee. A distance measure for video sequences. Computer Vision and Image Understanding, 75(1):25-45, 1999.
[3] Ph. Aigrain, Ph. Joly, and V. Longueville. Medium Knowledge-Based Macro-Segmentation of Video Into Sequences, chapter 8, pages 159-173. Intelligent Multimedia Information Retrieval. AAAI Press, 1997.
[4] A.A. Alatan, A.N. Akansu, and W. Wolf. Multi-modal dialogue scene detection using hidden Markov models for content-based multimedia indexing. Multimedia Tools and Applications, 14(2):137-151, 2001.
[5] J. Allen. Maintaining knowledge about temporal intervals. Communications of the ACM, 26:832-843, 1983.
[6] Y. Altunbasak, P.E. Eren, and A.M. Tekalp. Region-based parametric motion segmentation using color information. Graphical Models and Image Processing, 60(1):13-23, 1998.
[7] J. Assfalg, A. del Bimbo, and P. Pala. Using multiple examples for content based retrieval. In Proc. Int. Conf. on Multimedia and Expo. IEEE Press, 2000.
[8] N. Babaguchi, Y. Kawai, and T. Kitahashi. Event based indexing of broadcasted sports video by intermodal collaboration. IEEE Transactions on Multimedia, 4(1):68-75, 2002.
[9] P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711-720, 1997.
[10] M. Bertini, A. Del Bimbo, and P. Pala. Content-based indexing and retrieval of TV news. Pattern Recognition Letters, 22(5):503-516, 2001.
[11] D. Bikel, R. Schwartz, and R.M. Weischedel. An algorithm that learns what's in a name. Machine Learning, 34(1-3):211-231, 1999.
[12] J.M. Boggs and D.W. Petrie. The Art of Watching Films. Mayfield Publishing Company, Mountain View, USA, 5th edition, 2000.
[13] J.M. Boggs and D.W. Petrie. The art of watching films. Mayfield Publishing Company, Mountain View, CA, 5th edition, 2000.
[14] R.M. Bolle, B.-L. Yeo, and M. Yeung. Video query: Research directions. IBM Journal of Research and Development, 42(2):233-252, 1998.


[15] R.M. Bolle, B.-L. Yeo, and M.M. Yeung. Video query: Research directions. IBM Journal of Research and Development, 42(2):233-252, 1998.
[16] A. Bonzanini, R. Leonardi, and P. Migliorati. Event recognition in sport programs using low-level motion indices. In IEEE International Conference on Multimedia & Expo, pages 1208-1211, Tokyo, Japan, 2001.
[17] M. Brown, J. Foote, G. Jones, K. Sparck-Jones, and S. Young. Automatic content-based retrieval of broadcast news. In ACM Multimedia 1995, San Francisco, USA, 1995.
[18] R. Brunelli, O. Mich, and C.M. Modena. A survey on the automatic indexing of video data. Journal of Visual Communication and Image Representation, 10(2):78-112, 1999.
[19] D.C.A. Bulterman. SMIL 2.0: Part 1: overview, concepts, and structure. IEEE Multimedia, pages 82-88, July-September 2001.
[20] D.C.A. Bulterman. SMIL 2.0: Part 2: examples and comparisons. IEEE Multimedia, pages 74-84, January-March 2002.
[21] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Region-based image querying. In Proc. Int. Workshop on Content-Based Access of Image and Video Libraries. IEEE Press, 1997.
[22] M. La Cascia, S. Sethi, and S. Sclaroff. Combining textual and visual cues for content-based image retrieval on the world wide web. In IEEE Workshop on Content-Based Access of Image and Video Libraries, 1998.
[23] S-F. Chang, J. R. Smith, M. Beigi, and A. Benitez. Visual information retrieval from large distributed online repositories. Comm. ACM, 40(12):63-71, 1997.
[24] P. Chiu, Girgensohn, W. Polak, E. Rieffel, and L. Wilcox. A genetic algorithm for video segmentation and summarization. In IEEE International Conference on Multimedia and Expo, volume 3, pages 1329-1332, 2000.
[25] M. Christel, A. Olligschlaeger, and C. Huang. Interactive maps for a digital video library. IEEE Multimedia, 7(1):60-67, 2000.
[26] G. Ciocca and R. Schettini. Using a relevance feedback mechanism to improve content-based image retrieval. In Proc. Visual99: Information and Information Systems, number 1614 in Lect. Notes in Comp. Sci., pages 107-114. Springer Verlag GmbH, 1999.
[27] C. Colombo, A. Del Bimbo, and P. Pala. Semantics in visual information retrieval. IEEE Multimedia, 6(3):38-53, 1999.
[28] Convera. http://www.convera.com.
[29] I. J. Cox, M. L. Miller, T. P. Minka, and T. V. Papathomas. The Bayesian image retrieval system, PicHunter: theory, implementation, and psychophysical experiments. IEEE trans. IP, 9(1):20-37, 2000.
[30] G. Davenport, T. Aguierre Smith, and N. Pincever. Cinematic principles for multimedia. IEEE Computer Graphics & Applications, pages 67-74, 1991.
[31] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.


[32] A. del Bimbo and P. Pala. Visual image retrieval by elastic matching of user sketches. IEEE trans. PAMI, 19(2):121–132, 1997.
[33] N. Dimitrova, L. Agnihotri, and G. Wei. Video classification based on HMM using text and faces. In European Signal Processing Conference, Tampere, Finland, 2000.
[34] S. Eickeler and S. Müller. Content-based video indexing of TV broadcast news using hidden Markov models. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 2997–3000, Phoenix, USA, 1999.
[35] K. El-Maleh, M. Klein, G. Petrucci, and P. Kabal. Speech/music discrimination for multimedia applications. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 2445–2448, Istanbul, Turkey, 2000.
[36] F. Ennesser and G. Medioni. Finding Waldo, or focus of attention using local color information. IEEE trans. PAMI, 17(8):805–809, 1995.
[37] S. Fischer, R. Lienhart, and W. Effelsberg. Automatic recognition of film genres. In ACM Multimedia 1995, pages 295–304, San Francisco, USA, 1995.
[38] M.M. Fleck, D.A. Forsyth, and C. Bregler. Finding naked people. In European Conference on Computer Vision, volume 2, pages 593–602, Cambridge, UK, 1996.
[39] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker. Query by image and video content: the QBIC system. IEEE Computer, 1995.
[40] B. Furht, S.W. Smoliar, and H.-J. Zhang. Video and Image Processing in Multimedia Systems. Kluwer Academic Publishers, Norwell, USA, 2nd edition, 1996.
[41] C. Gane and T. Sarson. Structured Systems Analysis: Tools and Techniques. IST, Inc., New York, 1977.
[42] T. Gevers and A.W.M. Smeulders. Pictoseek: combining color and shape invariant features for image retrieval. IEEE trans. IP, 9(1):102–119, 2000.
[43] A. Ghias, J. Logan, D. Chamberlin, and B.C. Smith. Query by humming: musical information retrieval in an audio database. In ACM Multimedia 1995, San Francisco, USA, 1995.
[44] Y. Gong, L.T. Sin, and C.H. Chuan. Automatic parsing of TV soccer programs. In IEEE International Conference on Multimedia Computing and Systems, pages 167–174, 1995.
[45] B. Günsel, A.M. Ferman, and A.M. Tekalp. Video indexing through integration of syntactic and semantic features. In Third IEEE Workshop on Applications of Computer Vision, Sarasota, USA, 1996.
[46] A. Gupta and R. Jain. Visual information retrieval. Comm. ACM, 40(5):71–79, 1997.
[47] N. Haering, R. Qian, and I. Sezan. A semantic event-detection approach and its application to detecting hunts in wildlife video. IEEE Transactions on Circuits and Systems for Video Technology, 10(6):857–868, 2000.


[48] A. Hampapur, R. Jain, and T. Weymouth. Feature based digital video indexing. In IFIP 2.6 Third Working Conference on Visual Database Systems, Lausanne, Switzerland, 1995.
[49] A. Hanjalic, G. Kakes, R.L. Lagendijk, and J. Biemond. Dancers: Delft advanced news retrieval system. In IS&T/SPIE Electronic Imaging 2001: Storage and Retrieval for Media Databases 2001, San Jose, USA, 2001.
[50] A. Hanjalic, R.L. Lagendijk, and J. Biemond. Automated high-level movie segmentation for advanced video retrieval systems. IEEE Transactions on Circuits and Systems for Video Technology, 9(4):580–588, 1999.
[51] A. Hanjalic, G.C. Langelaar, P.M.B. van Roosmalen, J. Biemond, and R.L. Lagendijk. Image and Video Databases: Restoration, Watermarking and Retrieval. Elsevier Science, Amsterdam, The Netherlands, 2000.
[52] A.G. Hauptmann, D. Lee, and P.E. Kennedy. Topic labeling of multilingual broadcast news in the Informedia digital video library. In ACM DL/SIGIR MIDAS Workshop, Berkeley, USA, 1999.
[53] A.G. Hauptmann and M.J. Witbrock. Story segmentation and detection of commercials in broadcast news video. In ADL-98 Advances in Digital Libraries, pages 168–179, Santa Barbara, USA, 1998.
[54] A. Hiroike, Y. Musha, A. Sugimoto, and Y. Mori. Visualization of information spaces to retrieve and browse image data. In Proc. Visual99: Information and Information Systems, volume 1614 of Lect. Notes in Comp. Sci., pages 155–162. Springer Verlag GmbH, 1999.
[55] J. Huang, Z. Liu, Y. Wang, Y. Chen, and E.K. Wong. Integration of multimodal features for video scene classification based on HMM. In IEEE Workshop on Multimedia Signal Processing, Copenhagen, Denmark, 1999.
[56] I. Ide, K. Yamamoto, and H. Tanaka. Automatic video indexing based on shot classification. In First International Conference on Advanced Multimedia Content Processing, volume 1554 of Lecture Notes in Computer Science, Osaka, Japan, 1999. Springer-Verlag.
[57] A.K. Jain, R.P.W. Duin, and J. Mao. Statistical pattern recognition: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4–37, 2000.
[58] R. Jain, editor. NSF Workshop on Visual Information Management Systems, Redwood, CA, 1992.
[59] R. Jain and A. Hampapur. Metadata in video databases. ACM SIGMOD, 23(4):27–33, 1994.
[60] P.J. Jang and A.G. Hauptmann. Learning to recognize speech by watching television. IEEE Intelligent Systems, 14(5):51–58, September/October 1999.
[61] R.S. Jasinschi, N. Dimitrova, T. McGee, L. Agnihotri, J. Zimmerman, and D. Li. Integrated multimedia processing for topic segmentation and classification. In IEEE International Conference on Image Processing, pages 366–369, Thessaloniki, Greece, 2001.
[62] O. Javed, Z. Rasheed, and M. Shah. A framework for segmentation of talk & game shows. In IEEE International Conference on Computer Vision, Vancouver, Canada, 2001.


[63] T. Kakimoto and Y. Kambayashi. Browsing functions in three-dimensional space for digital libraries. Int. Journ. of Digital Libraries, 2:68–78, 1999.
[64] T. Kato, T. Kurita, N. Otsu, and K. Hirata. A sketch retrieval method for full color image database – query by visual example. In Proc. of the ICPR, Computer Vision and Applications, pages 530–533, 1992.
[65] J.R. Kender and B.L. Yeo. Video scene segmentation via continuous video coherence. In CVPR98, Santa Barbara, CA. IEEE, June 1998.
[66] V. Kobla, D. DeMenthon, and D. Doermann. Identification of sports videos using replay, text, and camera motion features. In SPIE Conference on Storage and Retrieval for Media Databases, volume 3972, pages 332–343, 2000.
[67] R. Kohavi, D. Sommerfield, and J. Dougherty. Data mining using MLC++: A machine learning library in C++. In Proceedings of the 8th International Conference on Tools with Artificial Intelligence, pages 234–245, 1996. http://www.sgi.com/Technology/mlc.
[68] Y.-M. Kwon, C.-J. Song, and I.-J. Kim. A new approach for high level video structuring. In IEEE International Conference on Multimedia and Expo, volume 2, pages 773–776, 2000.
[69] L. Hollink, A. Schreiber, B. Wielinga, and M. Worring. Classification of user image descriptions. Submitted for publication, 2002.
[70] L.J. Latecki and R. Lakämper. Contour-based shape similarity. In Proc. Visual99: Information and Information Systems, number 1614 in Lect. Notes in Comp. Sci., pages 617–624, 1999.
[71] S.-Y. Lee, S.-T. Lee, and D.-Y. Chen. Automatic Video Summary and Description, volume 1929 of Lecture Notes in Computer Science, pages 37–48. Springer-Verlag, Berlin, 2000.
[72] M.S. Lew, K. Lempinen, and H. Briand. Webcrawling using sketches. In Proc. Visual97: Information Systems, pages 77–84. Knowledge Systems Institute, 1997.
[73] D. Li, I.K. Sethi, N. Dimitrova, and T. McGee. Classification of general audio data for content-based retrieval. Pattern Recognition Letters, 22(5):533–544, 2001.
[74] H. Li, D. Doermann, and O. Kia. Automatic text detection and tracking in digital video. IEEE Transactions on Image Processing, 9(1):147–156, 2000.
[75] R. Lienhart, C. Kuhmünch, and W. Effelsberg. On the detection and recognition of television commercials. In IEEE Conference on Multimedia Computing and Systems, pages 509–516, Ottawa, Canada, 1997.
[76] R. Lienhart, S. Pfeiffer, and W. Effelsberg. Scene determination based on video and audio features. In Proc. of the 6th IEEE Int. Conf. on Multimedia Systems, volume 1, pages 685–690, 1999.
[77] T. Lin and H.-J. Zhang. Automatic video scene extraction by shot grouping. In Proceedings of ICPR 00, Barcelona, Spain, 2000.
[78] G. Lu. Indexing and retrieval of audio: a survey. Multimedia Tools and Applications, 15:269–290, 2001.


[79] C.D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, USA, 1999.
[80] J.M. Martinez, R. Koenen, and F. Pereira. MPEG-7: the generic multimedia content description standard, part 1. IEEE Multimedia, April–June 2002.
[81] C. Meilhac and C. Nastar. Relevance feedback and category search in image databases. In Proc. Int. Conf. on Multimedia Computing and Systems, pages 512–517. IEEE Press, 1999.
[82] K. Minami, A. Akutsu, H. Hamada, and Y. Tomomura. Video handling with music and speech detection. IEEE Multimedia, 5(3):17–25, 1998.
[83] T.P. Minka and R.W. Picard. Interactive learning with a society of models. Pattern Recognition, 30(4):565–582, 1997.
[84] H. Miyamori and S. Iisaku. Video annotation for content-based retrieval using human behavior analysis and domain knowledge. In IEEE International Conference on Automatic Face and Gesture Recognition, pages 26–30, Grenoble, France, 2000.
[85] A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(4):349–361, 2001.
[86] S. Moncrieff, C. Dorai, and S. Venkatesh. Detecting indexical signs in film audio for scene interpretation. In IEEE International Conference on Multimedia & Expo, pages 1192–1195, Tokyo, Japan, 2001.
[87] H. Müller and E. Tan. Movie maps. In International Conference on Information Visualization, London, England, 1999. IEEE.
[88] F. Nack and A.T. Lindsay. Everything you wanted to know about MPEG-7: Part 1. IEEE Multimedia, 6(3):65–77, 1999.
[89] F. Nack and A.T. Lindsay. Everything you wanted to know about MPEG-7: Part 2. IEEE Multimedia, 6(4):64–73, 1999.
[90] J. Nam, M. Alghoniemy, and A.H. Tewfik. Audio-visual content-based violent scene characterization. In IEEE International Conference on Image Processing, volume 1, pages 353–357, Chicago, USA, 1998.
[91] J. Nam, A. Enis Cetin, and A.H. Tewfik. Speaker identification and video analysis for hierarchical video shot classification. In IEEE International Conference on Image Processing, volume 2, Washington DC, USA, 1997.
[92] M.R. Naphade and T.S. Huang. A probabilistic framework for semantic video indexing, filtering, and retrieval. IEEE Transactions on Multimedia, 3(1):141–151, 2001.
[93] H.T. Nguyen, M. Worring, and A. Dev. Detection of moving objects in video using a robust motion similarity measure. IEEE Transactions on Image Processing, 9(1):137–141, 2000.
[94] L. Nigay and J. Coutaz. A design space for multimodal systems: concurrent processing and data fusion. In INTERCHI '93 Proceedings, pages 172–178, Amsterdam, the Netherlands, 1993.
[95] D.W. Oard. The state of the art in text filtering. User Modeling and User-Adapted Interaction, 7(3):141–178, 1997.


[96] V.E. Ogle. CHABOT – retrieval from a relational database of images. IEEE Computer, 28(9):40–48, 1995.
[97] H. Pan, P. Van Beek, and M.I. Sezan. Detection of slow-motion replay segments in sports video for highlights generation. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2001.
[98] M.L. Pao and M. Lee. Concepts of Information Retrieval. Libraries Unlimited, 1989.
[99] N.V. Patel and I.K. Sethi. Audio characterization for video indexing. In Proceedings SPIE on Storage and Retrieval for Still Image and Video Databases, volume 2670, pages 373–384, San Jose, USA, 1996.
[100] N.V. Patel and I.K. Sethi. Video classification using speaker identification. In IS&T SPIE, Proceedings: Storage and Retrieval for Image and Video Databases IV, San Jose, USA, 1997.
[101] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, USA, 1988.
[102] A.K. Peker, A.A. Alatan, and A.N. Akansu. Low-level motion activity features for semantic characterization of video. In IEEE International Conference on Multimedia & Expo, New York City, USA, 2000.
[103] A. Pentland, B. Moghaddam, and T. Starner. View-based and modular eigenspaces for face recognition. In IEEE International Conference on Computer Vision and Pattern Recognition, Seattle, USA, 1994.
[104] S. Pfeiffer, S. Fischer, and W. Effelsberg. Automatic audio content analysis. In ACM Multimedia 1996, pages 21–30, Boston, USA, 1996.
[105] S. Pfeiffer, R. Lienhart, and W. Effelsberg. Scene determination based on video and audio features. Multimedia Tools and Applications, 15(1):59–81, 2001.
[106] T.V. Pham and M. Worring. Face detection methods: A critical evaluation. Technical Report 2000-11, Intelligent Sensory Information Systems, University of Amsterdam, 2000.
[107] R.W. Picard and T.P. Minka. Vision texture for annotation. Multimedia Systems, 3:3–14, 1995.
[108] Praja. http://www.praja.com.
[109] L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[110] H.A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23–38, 1998.
[111] Y. Rui and T.S. Huang. Optimizing learning in image retrieval. In Proc. Computer Vision and Pattern Recognition, pages 236–243. IEEE Press, 2000.
[112] Y. Rui, T.S. Huang, M. Ortega, and S. Mehrotra. Relevance feedback: a power tool for interactive content-based image retrieval. IEEE trans. CSVT, 1998.


[113] Y. Rui, T.S. Huang, and S. Mehrotra. Constructing table-of-content for videos. Multimedia Systems, Special section on Video Libraries, 7(5):359–368, 1999.
[114] E. Sahouria and A. Zakhor. Content analysis of video using principal components. IEEE Transactions on Circuits and Systems for Video Technology, 9(8):1290–1298, 1999.
[115] E. Sahouria and A. Zakhor. Content analysis of video using principal components. IEEE Transactions on Circuits and Systems for Video Technology, 9(8):1290–1298, 1999.
[116] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[117] H. Samet and A. Soffer. MARCO: MAp Retrieval by COntent. IEEE trans. PAMI, 18(8):783–798, 1996.
[118] S. Santini, A. Gupta, and R. Jain. User interfaces for emergent semantics in image databases. In Proc. 8th IFIP Working Conf. on Database Semantics (DS-8), 1999.
[119] C. Saraceno and R. Leonardi. Identification of story units in audio-visual sequences by joint audio and video processing. In IEEE International Conference on Image Processing, Chicago, USA, 1998.
[120] S. Satoh, Y. Nakamura, and T. Kanade. Name-It: Naming and detecting faces in news videos. IEEE Multimedia, 6(1):22–35, 1999.
[121] D.D. Saur, Y.-P. Tan, S.R. Kulkarni, and P.J. Ramadge. Automated analysis and annotation of basketball video. In SPIE's Electronic Imaging Conference on Storage and Retrieval for Image and Video Databases V, volume 3022, pages 176–187, San Jose, USA, 1997.
[122] H. Schneiderman and T. Kanade. A statistical method for 3D object detection applied to faces and cars. In IEEE Computer Vision and Pattern Recognition, Hilton Head, USA, 2000.
[123] S. Sclaroff. Deformable prototypes for encoding shape categories in image databases. Pattern Recognition, 30(4):627–641, 1997.
[124] K. Shearer, C. Dorai, and S. Venkatesh. Incorporating domain knowledge with video and voice data analysis in news broadcasts. In ACM International Conference on Knowledge Discovery and Data Mining, pages 46–53, Boston, USA, 2000.
[125] J. Shim, C. Dorai, and R. Bolle. Automatic text extraction from video for content-based annotation and retrieval. In IEEE International Conference on Pattern Recognition, pages 618–620, 1998.
[126] C.-R. Shyu, C.E. Brodley, A.C. Kak, and A. Kosaka. ASSERT: a physician in the loop content-based retrieval system for HRCT image databases. Image Understanding, 75(1/2):111–132, 1999.
[127] A.W.M. Smeulders, S.D. Olabariagga, R. van den Boomgaard, and M. Worring. Interactive segmentation. In Proc. Visual97: Information Systems, pages 5–12. Knowledge Systems Institute, 1997.


[128] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1349–1380, 2000.
[129] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1349–1380, 2000.
[130] J.R. Smith and S.-F. Chang. Integrated spatial and feature image query. Multimedia Systems, 7(2):129–140, 1999.
[131] C.G.M. Snoek and M. Worring. Multimodal video indexing: a review of the state-of-the-art. Multimedia Tools and Applications, 2002. To appear.
[132] R.K. Srihari. Automatic indexing and content-based retrieval of captioned images. IEEE Computer, 28(9):49–56, 1995.
[133] M. Stricker and M. Orengo. Similarity of color images. In Storage and Retrieval of Image and Video Databases III, pages 381–392. SPIE Press vol. 2420, 1995.
[134] G. Sudhir, J.C.M. Lee, and A.K. Jain. Automatic classification of tennis video for high-level content-based retrieval. In IEEE International Workshop on Content-Based Access of Image and Video Databases, in conjunction with ICCV98, Bombay, India, 1998.
[135] H. Sundaram and S.-F. Chang. Determining computable scenes in films and their structures using audio visual memory models. In Proceedings of the 8th ACM Multimedia Conference, Los Angeles, CA, 2000.
[136] M.J. Swain and D.H. Ballard. Color indexing. Int. Journ. of Computer Vision, 7(1):11–32, 1991.
[137] M. Szummer and R.W. Picard. Indoor-outdoor image classification. In IEEE International Workshop on Content-Based Access of Image and Video Databases, in conjunction with ICCV98, Bombay, India, 1998.
[138] J. Tatemura. Browsing images based on social and content similarity. In Proc. Int. Conf. on Multimedia and Expo. IEEE Press, 2000.
[139] L. Teodosio and W. Bender. Salient video stills: content and context preserved. In Proceedings of the First ACM International Conference on Multimedia, pages 39–46, 1993.
[140] B.T. Truong and S. Venkatesh. Determining dramatic intensification via flashing lights in movies. In IEEE International Conference on Multimedia & Expo, pages 61–64, Tokyo, Japan, 2001.
[141] B.T. Truong, S. Venkatesh, and C. Dorai. Automatic genre identification for content-based video categorization. In IEEE International Conference on Pattern Recognition, Barcelona, Spain, 2000.
[142] S. Tsekeridou and I. Pitas. Content-based video parsing and indexing based on audio-visual interaction. IEEE Transactions on Circuits and Systems for Video Technology, 11(4):522–535, 2001.
[143] A. Vailaya, M. Figueiredo, A. Jain, and H. Zhang. Content-based hierarchical classification of vacation images. In Proc. Int. Conf. on Multimedia Computing and Systems, 1999.


[144] A. Vailaya and A.K. Jain. Detecting sky and vegetation in outdoor images. In Proceedings of SPIE: Storage and Retrieval for Image and Video Databases VIII, volume 3972, San Jose, USA, 2000.
[145] A. Vailaya, A.K. Jain, and H.-J. Zhang. On image classification: City images vs. landscapes. Pattern Recognition, 31:1921–1936, 1998.
[146] J. Vendrig. Interactive exploration of multimedia content. PhD thesis, 2002.
[147] J. Vendrig and M. Worring. Feature driven visualization of video content for interactive indexing. In R. Laurini, editor, Visual Information and Information Systems, volume 1929 of Lecture Notes in Computer Science, pages 338–348, Berlin, 2000. Springer-Verlag.
[148] J. Vendrig and M. Worring. Evaluation of logical story unit segmentation in video sequences. In IEEE International Conference on Multimedia & Expo, pages 1092–1095, Tokyo, Japan, 2001.
[149] J. Vendrig, M. Worring, and A.W.M. Smeulders. Filter image browsing: exploiting interaction in retrieval. In Proc. Visual99: Information and Information Systems, volume 1614 of Lect. Notes in Comp. Sci. Springer Verlag GmbH, 1999.
[150] E. Veneau, R. Ronfard, and P. Bouthemy. From video shot clustering to sequence segmentation. In Proceedings of ICPR 00, volume 4, pages 254–257, Barcelona, Spain, 2000.
[151] Virage. http://www.virage.com.
[152] E.M. Voorhees, editor. Proceedings of the 10th Text Retrieval Conference (TREC). NIST, 2001.
[153] Y. Wang, Z. Liu, and J. Huang. Multimedia content analysis using both audio and visual clues. IEEE Signal Processing Magazine, 17(6):12–36, 2000.
[154] K. Weixin, R. Yao, and L. Hanqing. A new scene breakpoint detection algorithm using slice of video stream. In H.H.S. Ip and A.W.M. Smeulders, editors, MINAR98, pages 175–180, Hong Kong, China, 1998. IAPR.
[155] T. Westerveld. Image retrieval: Content versus context. In Content-Based Multimedia Information Access, RIAO 2000 Conference, pages 276–284, Paris, France, 2000.
[156] I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, 2000.
[157] E. Wold, T. Blum, D. Keislar, and J. Wheaton. Content-based classification, search, and retrieval of audio. IEEE Multimedia, 3(3):27–36, 1996.
[158] M. Worring, A.W.M. Smeulders, and S. Santini. Interaction in content-based retrieval: an evaluation of the state-of-the-art. In R. Laurini, editor, Advances in Visual Information Systems, number 1929 in Lect. Notes in Comp. Sci., pages 26–36, 2000.
[159] L. Wu, J. Benois-Pineau, and D. Barba. Spatio-temporal segmentation of image sequences for object-oriented low bit-rate image coding. Image Communication, 8(6):513–544, 1996.


[160] P. Xu, L. Xie, S.-F. Chang, A. Divakaran, A. Vetro, and H. Sun. Algorithms and systems for segmentation and structure analysis in soccer video. In IEEE International Conference on Multimedia & Expo, pages 928–931, Tokyo, Japan, 2001.
[161] M.-H. Yang, D.J. Kriegman, and N. Ahuja. Detecting faces in images: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1):34–58, 2002.
[162] M. Yeung, B.-L. Yeo, and B. Liu. Segmentation of video by clustering and graph analysis. Computer Vision and Image Understanding, 71(1):94–109, 1998.
[163] M.M. Yeung and B.-L. Yeo. Video content characterization and compaction for digital library applications. In IS&T/SPIE Storage and Retrieval of Image and Video Databases V, volume 3022, pages 45–58, 1997.
[164] H.-J. Zhang, A. Kankanhalli, and S.W. Smoliar. Automatic partitioning of full-motion video. Multimedia Systems, 1(1):10–28, 1993.
[165] H.-J. Zhang, S.Y. Tan, S.W. Smoliar, and G. Yihong. Automatic parsing and indexing of news video. Multimedia Systems, 2(6):256–266, 1995.
[166] T. Zhang and C.-C.J. Kuo. Hierarchical classification of audio data for archiving and retrieving. In IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 6, pages 3001–3004, Phoenix, USA, 1999.
[167] D. Zhong and S.-F. Chang. Structure analysis of sports video using domain models. In IEEE International Conference on Multimedia & Expo, pages 920–923, Tokyo, Japan, 2001.
[168] Y. Zhong, H.-J. Zhang, and A.K. Jain. Automatic caption localization in compressed video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(4):385–392, 2000.
[169] W. Zhou, A. Vellaikal, and C.-C.J. Kuo. Rule-based video classification system for basketball video indexing. In ACM Multimedia 2000, Los Angeles, USA, 2000.
[170] W. Zhu, C. Toklu, and S.-P. Liou. Automatic news video segmentation and categorization based on closed-captioned text. In IEEE International Conference on Multimedia & Expo, pages 1036–1039, Tokyo, Japan, 2001.
