
MULTIMEDIA SYSTEMS

AND TECHNIQUES
THE KLUWER INTERNATIONAL SERIES
IN ENGINEERING AND COMPUTER SCIENCE

MULTIMEDIA SYSTEMS AND APPLICATIONS

Consulting Editor

Borko Furht
Florida Atlantic University

Recently Published Titles:

VIDEO AND IMAGE PROCESSING IN MULTIMEDIA SYSTEMS, by


Borko Furht, Stephen W. Smoliar, HongJiang Zhang
ISBN: 0-7923-9604-9

MULTIMEDIA SYSTEMS AND APPLICATIONS


Advanced Book Series
MULTIMEDIA SYSTEMS
AND TECHNIQUES

edited by

Borko Furht
Florida Atlantic University

"
~.

KLUWER ACADEMIC PUBLISHERS


Boston / Dordrecht / London
Distributors for North America:
Kluwer Academic Publishers
101 Philip Drive
Assinippi Park
Norwell, Massachusetts 02061 USA

Distributors for all other countries:


Kluwer Academic Publishers Group
Distribution Centre
Post Office Box 322
3300 AH Dordrecht, THE NETHERLANDS

Library of Congress Cataloging-in-Publication Data

A C.I.P. Catalogue record for this book is available


from the Library of Congress.

ISBN-13: 978-1-4612-8577-9 e-ISBN-13: 978-1-4613-1341-0


DOI: 10.1007/978-1-4613-1341-0

Copyright © 1996 by Kluwer Academic Publishers


Softcover reprint of the hardcover 1st edition 1996

All rights reserved. No part of this publication may be reproduced, stored in


a retrieval system or transmitted in any form or by any means, mechanical,
photo-copying, recording, or otherwise, without the prior written permission of
the publisher, Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park,
Norwell, Massachusetts 02061

Printed on acid-free paper.


CONTENTS

PREFACE xi

1 MULTIMEDIA OBJECTS
Rei Hamakawa and Atsushi Atarashi 1
1 Introduction 1
2 A Class Hierarchy for Multimedia Objects 4
3 Composite Multimedia Object Model by Gibbs et al. 17
4 MHEG 21
5 Object Composition and Playback Models by Hamakawa et al. 26
6 Conclusion 36
REFERENCES 37

2 COMPRESSION TECHNIQUES AND STANDARDS
Borko Furht 43
1 Introduction to Multimedia Compression 43
2 JPEG Algorithm for Still Image Compression 49
3 PX64 Compression Algorithm for Video Communications 67
4 MPEG Compression for Motion-Intensive Applications 72
5 Conclusion 84

3 MULTIMEDIA INTERFACES: DESIGNING FOR DIVERSITY
Meera Blattner 87
1 Introduction 87
2 What is an Interface? 88
3 Designing Multimedia Interfaces 91

4 Applications and New Technologies 107


5 Conclusion 117
REFERENCES 118

4 MULTIMEDIA STORAGE SYSTEMS


Harrick M. Vin and P. Venkat Rangan 123
1 Introduction 123
2 Multimedia Storage Servers 125
3 Managing the Storage Space Requirement of Digital Multimedia 126
4 Efficient Retrieval of Multimedia Objects 132
5 Commercial Video Servers 140
6 Concluding Remarks 141
REFERENCES 142

5 MULTIMEDIA NETWORKS
Borko Furht and Hari Kalva 145
1 Introduction 145
2 Traditional Networks and Multimedia 149
3 Asynchronous Transfer Mode 156
4 Summary of Network Characteristics 165
5 Comparison of Switching Technologies for Multimedia Communications 166
6 Information Superhighways 167
7 Conclusion 171

6 MULTIMEDIA SYNCHRONIZATION
B. Prabhakaran 177
1 Introduction 177
2 Language Based Synchronization Model 181
3 Petri Nets Based Models 185
4 Fuzzy Synchronization Models 197
5 Content-based Inter-media Synchronization 200
6 Multimedia Synchronization and Database Aspects 204
7 Multimedia Synchronization and Object Retrieval Schedules 206
8 Multimedia Synchronization and Communication Requirements 209
9 Summary and Conclusion 213

REFERENCES 214

7 INFOSCOPES: MULTIMEDIA INFORMATION SYSTEMS
Ramesh Jain 217
1 Introduction 217
2 Information and Data 220
3 Operations in InfoScopes 225
4 InfoScope Architecture 228
5 Knowledge organization 234
6 Interfaces 237
7 Example Systems 239
8 Conclusion and Future Research 250
REFERENCES 251

8 SCHEDULING IN MULTIMEDIA SYSTEMS


A. L. Narasimha Reddy 255
1 Introduction 255
2 Data Organization 258
3 Disk Scheduling 259
4 Network Scheduling 275
5 General Discussion 289
REFERENCES 290

9 VIDEO INDEXING AND RETRIEVAL


Stephen W. Smoliar and HongJiang Zhang 293
1 Introduction 293
2 Parsing 296
3 Representation and Classification 307
4 Indexing and Retrieval 311
5 Interactive Tools 315
6 Conclusion 318

INDEX 323
CONTRIBUTORS

Atsushi Atarashi
C&C Research Laboratories
NEC Corporation
Kawasaki, Kanagawa, Japan

Meera Blattner
University of California, Davis
Lawrence Livermore National Laboratory
Davis, California

Borko Furht
Florida Atlantic University
Boca Raton, Florida

Rei Hamakawa
C&C Research Laboratories
NEC Corporation
Kawasaki, Kanagawa, Japan

Ramesh Jain
University of California at San Diego
San Diego, California

Hari Kalva
Columbia University
New York, New York

B. Prabhakaran
Indian Institute of Technology
Madras, India

P. Venkat Rangan
University of California at San Diego
San Diego, California

Narasimha Reddy
IBM Almaden Research Center
San Jose, California

Stephen W. Smoliar
Institute of Systems Science
National University of Singapore

Harrick M. Vin
University of Texas at Austin
Austin, Texas

HongJiang Zhang
Institute of Systems Science
National University of Singapore
PREFACE

Multimedia computing has emerged in the last few years as a major area of
research. Multimedia computer systems have opened a wide range of applica-
tions by combining a variety of information sources, such as voice, graphics,
animation, images, audio, and full-motion video. Looking at the big picture,
multimedia can be viewed as the merging of three industries: computer, com-
munication, and broadcasting industries.

Research and development efforts in multimedia computing can be divided


into two areas. As the first area of research, much effort has been centered
on the stand-alone multimedia workstation and associated software systems
and tools, such as music composition, computer-aided education and training,
and interactive video. However, the combination of multimedia computing
with distributed systems offers even greater potential. New applications based
on distributed multimedia systems include multimedia information systems,
collaborative and videoconferencing systems, on-demand multimedia services,
and distance learning.

This book is the first of a two-volume set on Multimedia Systems and Ap-
plications. It comprises nine chapters and covers fundamental concepts
and techniques used in multimedia systems. The topics include multimedia ob-
jects and related models, multimedia compression techniques and standards,
multimedia interfaces, multimedia storage techniques, multimedia communica-
tion and networking, multimedia synchronization techniques, multimedia in-
formation systems, scheduling in multimedia systems, and video indexing and
retrieval techniques.

The second book on Multimedia Tools and Applications covers tools applied in
multimedia systems including multimedia application development techniques,
multimedia authoring systems, and tools for content-based retrieval. It also
presents several key multimedia applications including multimedia publishing
systems, distributed collaborative multimedia applications, multimedia-based
education and training, videoconferencing systems, digital libraries, interactive
television systems, and multimedia electronic message systems.

This book is intended for anyone involved in multimedia system design and
applications and can be used as the textbook for a graduate course on multi-
media.

I would like to thank all authors of the chapters for their contributions to this
book. Special thanks for formatting and finalizing the book goes to Donna
Rubinoff from Florida Atlantic University.

Borko Furht
1
MULTIMEDIA OBJECTS
Rei Hamakawa and Atsushi Atarashi
C&C Research Laboratories,
NEC Corporation, 1-1 Miyazaki 4-Chome, Miyamae-ku,
Kawasaki, KANAGAWA 216, Japan

ABSTRACT
This chapter describes multimedia objects. The special suitability to multimedia of
the object-oriented approach has recently become increasingly clear. We first describe
the general concept of multimedia objects, and explain the merits of an object-oriented
approach in multimedia applications. We then summarize recent important research
activities in the field of multimedia objects and briefly discuss those unresolved issues
which are most likely to be subjects of significant future studies.

1 INTRODUCTION
The phrase "multimedia objects" refers to elements of multimedia data, such as
video, audio, animation, images, text, etc., that are used as objects in object-
oriented programming.

In the development of new multimedia systems, it is often far easier to use an


object-oriented approach than to attempt non-object-oriented design. Object-
oriented approaches in multimedia design applications have, until fairly re-
cently, been limited to use with static media, such as text and images, but now
that time-based media (including video and audio) can be computer processed
as digital data, the potential of object-oriented design is significantly enhanced.

In this chapter, we first briefly review object-oriented design, and then go on


to describe its advantages for application to multimedia.

1.1 Concepts in object-oriented design


Among the many object-oriented programming languages currently in common
use (Smalltalk, C++, Objective C, Eiffel, etc.), specifications may differ to
some extent, but the underlying concepts are essentially the same. The most
important concepts are¹:

Objects In programming, an object is composed of a data structure and an
algorithm, i.e., a structure for enclosing a set of data items and a set of
operations². An object has both state and behavior: behavior refers to
how an object acts or reacts; a state represents the cumulative result of
behavior to a given point in time.

Messages Messages are requests sent to objects to get them to perform specific
operations.
Classes A class is a specification of the common structure (the concrete rep-
resentation of state) and behavior of a given set of objects. Objects in the
set covered by a given class may be referred to as instances of that class,
and creating a new object in the set may be referred to as instantiation.
Subclasses and inheritance A class can exist in relationship with a subclass
below it. With respect to its subclass, this original class exists as a su-
perclass. The objects covered by a subclass share state and behavior with
those covered by a superclass. The downward sharing of structure and
behavior is referred to as inheritance.
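To make these four concepts concrete, here is a minimal sketch in standard C++ (the class names MediaItem and VideoItem are invented for this illustration and do not come from any system described in this chapter):

#include <iostream>
#include <string>
#include <utility>

// A class: the common structure (state) and behavior of its instances.
class MediaItem {
public:
    explicit MediaItem(std::string name) : name_(std::move(name)) {}
    virtual ~MediaItem() = default;
    // Sending the "describe" message asks an object to perform this operation.
    virtual void describe() const { std::cout << name_ << ": a media item\n"; }
protected:
    std::string name_;   // part of the object's state
};

// A subclass: it inherits MediaItem's state and behavior and specializes them.
class VideoItem : public MediaItem {
public:
    VideoItem(std::string name, int frames)
        : MediaItem(std::move(name)), frames_(frames) {}
    void describe() const override {   // overridden (specialized) behavior
        std::cout << name_ << ": a video with " << frames_ << " frames\n";
    }
private:
    int frames_;   // additional state introduced by the subclass
};

int main() {
    VideoItem clip("opening", 1500);   // instantiation: creating an instance of a class
    clip.describe();                   // sending a message to the object
}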

1.2 The affinity of the object-oriented approach to multimedia applications

Four fundamental characteristics contribute significantly to the suitability of
object-oriented programming to multimedia applications: 1) abstraction; 2)
modularity/extensibility; 3) information encapsulation; and 4) compatibility
with event-driven programming.
¹ More detailed explanations can be found in such books related to the object-oriented
approach as [3][7][8][36].
² While the term "operation" is often used interchangeably with "method", strictly speak-
ing, a "method" represents not the actual operation itself but rather the code for implementing
an operation.

One of the difficulties in designing multimedia applications, beyond the basic


requirement of handling a large variety of different media, is the need to be
able to deal with an added variety of media formats (MPEG, JPEG, etc.) and
hardware (LD, CD, VCR, etc.). While each of these may differ significantly
from one another, they also overlap in many significant areas, and one
effective way of economizing on the amount of programming required is to
employ inheritance in the form of class hierarchies (i.e., to apply increasing
levels of abstraction).

Because the requirements placed upon multimedia applications are continually


evolving and expanding, it is important to be able to change or add to existing
programs easily. The functions of existing media may have to be modified,
new media or media formats may have to be added, and new devices may have
to be accommodated. The class-hierarchy nature of object-oriented program-
ming allows such changes to be made locally, with minimum disruption to the
overall program - modularity and extensibility increase the ease with which
programs can react to changing requirements.

Some multimedia applications may initially require extremely complex mech-


anisms for controlling media in both spatial and temporal dimensions, and
certain hardware connections may require highly individualized programs. At
the application programming level, then, it will be extremely helpful if the pro-
grammer can be shielded from concerns over the numerous mechanical details
of particular media and hardware. One of the strengths of object-oriented pro-
gramming is its ability to encapsulate such particularized information into
"black boxes" which the programmer can employ without being concerned over
their specific content.

While increasing advances in the power and reach of graphical interfaces have
given users important new freedom of action, they have also made application
programming a significantly more complex task. No longer bound to a specific
order of actions, such as might be found in procedure-oriented interfaces, users
may perform actions in whatever order they please, and the program must be
capable of reacting to this new unpredictability. Object-oriented programming
is particularly well-suited to such event-driven programming because with
it the programmer can simply treat each button or menu item as a separate
object with its own individual behavior.

We should note in passing, however, that no design will be useful simply by


virtue of the fact that it is object-oriented. Without careful attention to the
effective use of class hierarchies, messages, etc., the special strengths of the
object-oriented approach can easily be squandered.

2 A CLASS HIERARCHY FOR MULTIMEDIA OBJECTS
This section describes a sample class hierarchy for multimedia objects. Our
objectives in presenting the hierarchy are as follows: 1) to demonstrate how we
can actually benefit from applying the object-oriented approach to multimedia
programming; 2) to clarify what must be considered in designing classes for
multimedia objects; and 3) to make it easier to understand subsequent discus-
sions dealing with recent research activities in the field of multimedia objects.

Please note that the class hierarchy we present here is a very simple one. De-
tailed issues of implementation, which are very important in real systems, are
unnecessary for our purposes here.

In our descriptions of classes, we use a variation of C++, the most widely


used object-oriented language[39]. In order to concentrate on object status
and behavior, the description lists only instance variables and methods³, since
what we wish to know about these classes are object status and behavior.

2.1 The BaseObject Class


The BaseObject class provides an abstraction of all objects, including such
temporal media objects as video and audio, such discrete media objects as text
and images, and such GUI objects as buttons and scrollbars. This means that
the BaseObject class is the root of the hierarchy and that each subsequent class
in the hierarchy inherits the attributes of the BaseObject class (Figure 1).

Designing the BaseObject class requires careful attention to the proper level of
abstraction. When the level of abstraction has been properly chosen, we are
fully able to enjoy the many advantages of the object-oriented approach.

Let us consider, as an example, a multimedia document editor, capable of


importing various types of multimedia objects into documents. When the de-
signers of the class hierarchy are skillful enough in their object abstraction, it
becomes possible for application programmers in charge of the editor, on the
basis of the methods and instance variables included in the BaseObject class
alone, to create an object importing module for the editor simply as a small
³ In the C++ glossary, instance variables and methods are called, respectively, data members
and member functions.

Figure 1 The basic class hierarchy.

piece of code. They need to know none of the details of specific classes of
multimedia objects. When the designers of the class hierarchy are not skillful
enough in their object abstraction, the application programmers have to use
a number of different pieces of code to create the module, each piece being
designed to import objects of a specific class. This can be achieved only after a
painful process of looking at the class definition and determining how to import
objects into documents.
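The difference can be made concrete with a rough sketch of such an importing module (hypothetical names; the Document class below is an assumption made only for this illustration, not part of the hierarchy presented in this section). Because the module is written against the BaseObject interface alone, a single small piece of code handles every kind of multimedia object:

#include <vector>

// A minimal stand-in for the BaseObject class of Figure 2.
class BaseObject {
public:
    virtual ~BaseObject() = default;
    virtual void draw() {}   // every object knows how to draw itself
};

// Hypothetical editor document: one piece of code imports and draws any
// multimedia object, relying only on the BaseObject interface.
class Document {
public:
    void importObject(BaseObject* obj) { contents_.push_back(obj); }
    void drawAll() { for (BaseObject* obj : contents_) obj->draw(); }
private:
    std::vector<BaseObject*> contents_;   // Video, Audio, Text, Button, ... alike
};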

A sample definition for a BaseObject class is given in Figure 2. It includes


instance variables for the dimensions and location of an object⁴, basic methods
such as draw and move, as well as methods regarding object editing.

class BaseObject {
    // object's dimension and location
    int width;    // width of the object
    int height;   // height of the object
    int xpos;     // x-position on display
    int ypos;     // y-position on display

    // methods for drawing/moving
    void draw();
    void move(int deltax, int deltay);

    // methods for editing
    void cut(...);
    void copy(...);
    void paste(...);
};

Figure 2 Sample BaseObject class definition.

2.2 The TemporalMedia Class


The TemporalMedia class provides an abstraction of all temporal media, e.g.
video, audio etc. The most distinctive feature of temporal media is that their
content depends on time; conventional discrete media, such as text and images,
are independent of time. More specifically, each temporal medium has a temporal
coordinate, and the content of the media varies according to the time value
along that coordinate.

For example, video data is represented as a sequence of frames. Each frame is


assigned a time to start display and a time to finish display on the temporal
⁴ Even though audio objects have neither dimension nor location, treating classes for audio
objects as subclasses of the BaseObject class makes the entire class hierarchy very simple.
It would even be possible to give a default view of audio objects such that they could be
visually manipulated or displayed.

coordinate assigned to the video data. If we start playing the video, frames
are selected and displayed in response to the current value of the temporal
coordinate.

A TemporalMedia object is responsible for continually determining, on the basis


of the current value on the temporal coordinate, what should be done with any
of its data. A video object, for example, continually searches for frames that
are to be played at any given moment⁵.

TemporalMedia objects, then, are active objects[42], i.e., those which sponta-
neously try to detect situations in which they are required to perform opera-
tions.

For example, once you say 'start playback' to a TemporalMedia object, you need
not send any further messages until you want to stop or suspend playback. It
is the responsibility of TemporalMedia objects to find frames or samples for
playback, not that of the application programmer or user.
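One hypothetical way such activeness might be realized is sketched below: the system calls processData() periodically, and a video object decides from the current value of its temporal coordinate whether a frame is due, should be skipped, or is not yet ready. The class and member names are invented for the illustration, and the timing policy shown is only one possible choice, not a prescribed implementation:

#include <cstdio>

// Hypothetical sketch of an "active" video object.
class VideoSketch {
public:
    explicit VideoSketch(double frameRate) : frameRate_(frameRate) {}
    void setPosition(double seconds) { position_ = seconds; }   // driven by a clock
    // Called periodically by the system; the object decides what to do.
    void processData() {
        double due = frame_ / frameRate_;   // when frame_ should appear
        if (position_ >= due + 1.0 / frameRate_) {
            frame_ = static_cast<int>(position_ * frameRate_);   // playback ran ahead: skip
            display(frame_);
        } else if (position_ >= due) {
            display(frame_++);              // the frame is due: show it, advance
        }                                   // otherwise: too early, simply wait
    }
private:
    void display(int f) { std::printf("frame %d\n", f); }
    double frameRate_;
    double position_ = 0.0;   // current value on the temporal coordinate
    int frame_ = 0;           // next frame to present
};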

A sample definition of a TemporalMedia class is given in Figure 3. It includes


information regarding the duration of the data, methods for implementing play-
back operations, instance variables regarding object internal temporal coordi-
nates, and methods for activating objects.

The Video Class


The Video class is a subclass of the TemporalMedia class. A video object
receives video frames from a source, which can be a file, a local process, or a
remote process over the network, and it displays them.

The most critical issue in video data handling is its large size. Although ma-
nipulating uncompressed video data is easy and useful for short lengths, com-
pression is indispensable when we try to handle greater lengths. Compression
schemes include MPEG[11], Motion JPEG[31], and H.261[24].

For each compression scheme, there is, in turn, a variety of decompression


schemes available (both dedicated hardware, i.e., decompression boards, and
software[30]). Subclasses for such decompression schemes are located below
⁵ TemporalMedia objects need to deal with the difference between the playback speed and
the originally intended speed. If, for example, the playback speed is faster than that for
which they were designed, video objects might need to skip frames. If the playback speed is
slower, they might need to wait to play a frame.

class TemporalMedia: public BaseObject {
    // basic property of the media
    float duration;      // length of this media

    // methods for playback control
    void start();        // start playing
    void pause();        // pause playing
    void resume();       // resume playing
    void stop();         // stop playing
    int setSpeed(float speed);            // set speed
    int setPosition(float newPosition);   // set position
    float getSpeed();                     // get speed
    float getPosition();                  // get position

    // status regarding internal time
    float speed;         // current playing speed
    float curClock;      // current playback position

    // method implementing activeness
    void processData();  // called periodically
};

Figure 3 Sample TemporalMedia class definition.

each of the various compression schemes in the class hierarchy illustrated in


Figure 4.

This type of hierarchy has several advantages. First, the application program-
mers can deal with video objects without being concerned about such details
as compression and decompression schemes. They need only write programs to
accord with the definition of the Video class. Secondly, if a new decompression
scheme becomes available, the class hierarchy designers need only define a new
subclass below the appropriate compression scheme class. They do not need
to modify the existing class hierarchy in any other way. Similarly, application
programmers need only add code for the new subclass to the existing program.
They need not modify any part of the existing application programs.
The same principle applies when a new compression scheme becomes available.
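One way to preserve this "add a subclass, touch nothing else" property in application code, sketched here under our own assumptions rather than taken from the book, is a small factory in which each decompression subclass registers itself under the name of the scheme it handles:

#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

class Video { public: virtual ~Video() = default; };   // stand-in for the Video class of Figure 5

// Hypothetical factory: existing code asks for a scheme by name and never
// needs to be edited when a new decompression subclass appears.
class VideoFactory {
public:
    using Maker = std::function<std::unique_ptr<Video>()>;
    static void registerScheme(const std::string& scheme, Maker make) {
        makers()[scheme] = std::move(make);
    }
    static std::unique_ptr<Video> create(const std::string& scheme) {
        auto it = makers().find(scheme);
        return it == makers().end() ? nullptr : it->second();
    }
private:
    static std::map<std::string, Maker>& makers() {
        static std::map<std::string, Maker> m;
        return m;
    }
};

// Supporting a new scheme is then a single registration call, for example:
//   VideoFactory::registerScheme("MPEG-software",
//       [] { return std::unique_ptr<Video>(new MPEGSoftwareVideo()); });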

Figure 4 The Video class hierarchy.

Sample definitions for use in such a hierarchy are given in Figure 5.

The Audio Class


Audio objects receive audio samples from a source and play them. The same
sort of hierarchy is necessary for audio data handling as was used for video⁶
(see Figure 6).

The Composite Class


In general, multimedia applications need to be able to handle a number of
temporal objects at the same time. That is to say, while a simple VCR function
might only require simultaneous playback of audio and video data, a more useful
application would be able to handle other combinations, such as
⁶ A detailed description of audio representation can be found in [34]. The outline of
MPEG-1 audio compression is described in [38].

// a generic class implementing Video playback
class Video: public TemporalMedia {
    // receiving data from a source
    Frame *currentFrame;           // current frame data
    Frame *getFrame();             // get frame data from a source
    // a method for displaying a frame
    void displayFrame(Frame *);    // display a frame on the display
};

// a generic class implementing MPEG Video playback
class MPEGVideo: public Video {
    // MPEG stream information
    MPEGSysInfo sysInfoHdr;        // MPEG system information header
};

// a class implementing software MPEG playback
class MPEGSoftwareVideo: public MPEGVideo {
    // characteristics of the display
    int depth;                     // depth of the display
    ColorMap *colorMap;            // colormap

    // information for software decompression
    Frame *prevFrame;              // previous reference frame
    Frame *nextFrame;              // next reference frame
    Frame *decodeFrame(Frame *);   // decode a frame
};

// a class implementing MPEG data playback
// with hardware board by XX vendor
class MPEGVideoForXXBoard: public MPEGVideo {
    // methods to communicate with board
    int sendData(char *, ...);     // send data to board
    int sendCommand(...);          // send a command to board
};
Figure 5 Sample Video classes definitions.


Figure 6 The Audio class hierarchy.

• playing multiple video objects in a predetermined sequence.

• playing a video object with a selection of audio objects, each of which


represents the narration in a different language.

• simultaneously playing two video objects, each of which represents the


recording of a scene from a different camera angle.

• etc.

Composite objects serve this purpose. A composite object is a combination of


temporal objects. There are two types of composition: spatial and temporal.

Spatial composition defines the spatial relationships among the components


of a composite object (Figure 7), i.e., the layout of objects in a display.

Temporal composition defines the temporal relationships among the com-


ponents of a composite object. In other words, temporal composition is the
process of placing temporal objects on the temporal coordinate of the new
composite object being constructed (Figure 8).

It is often the case that we would like to use only a part of an existing temporal
object as a component, or we would like to include as a component an object
being played at a different speed than that for which it was originally designed.


Figure 7 An example of spatial composition.

One obvious approach would be to edit original objects into new objects satis-

Parallel Composition

Sequential Composition

Figure 8 Examples of temporal composition.

fying the requirements. This would be, however, extremely inefficient in terms
of both processing time and storage space.

A better solution is the clip object, a reference to a temporal object which


includes information regarding content range and a scaling factor for adjusting
playing speed (Figure 9). A sample clip object definition is given in Figure 10.
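A clip therefore only needs to translate a position on its own temporal coordinate into a position on the original object's coordinate. The following small helper is our own illustration of that mapping (the field names follow Figure 10; the treatment of a negative scale is an assumption):

// Hypothetical helper: map a position on the clip's time axis to a position
// on the original object's time axis.
double clipToOriginal(double clipPosition,   // seconds from the start of the clip
                      double clipStart,      // start position within the original
                      double clipEnd,        // end position within the original
                      double scale) {        // 1: same speed, -1: reverse
    if (scale >= 0.0)
        return clipStart + clipPosition * scale;
    else
        return clipEnd + clipPosition * scale;   // negative scale plays backwards from clipEnd
}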

Figure 11 gives sample definitions for objects in the Composite class. It includes
information about components, as well as about methods for manipulating
components.

Classes for Video / Audio Capturing


These classes define objects which capture live video or audio data to be sent
to sinks, which can be local disk files or video/audio objects on the network.
These objects are necessary to implement video conferencing systems[41], as
well as to implement digital video recording and editing systems.

Original Object

Figure 9 An example of a Clip Object.

class Clip: public TemporalMedia {
    // reference to the original object
    TemporalMedia *originalMedia;
    // content range
    float clipStart;   // start position of the clip
    float clipEnd;     // end position of the clip
    // speed scaling factor
    float scale;       // 1: same speed, -1: reverse
};

Figure 10 Sample Clip class definition.

When designing classes for the objects, care must be taken with regard to the
capturing device configuration and the data format.

class Composite: public TemporalMedia {
    // information on components
    int noOfComponents;             // number of components
    TemporalMedia *components[];    // references to components
    float *position[];              // temporal positions of components
    int *isUsed[];                  // is each component used?
    // methods handling components
    int addComponent(TemporalMedia *component, float pos);
    int deleteComponent(TemporalMedia *component, float pos);
    int activateComponent(TemporalMedia *component);
    int deactivateComponent(TemporalMedia *component);
};

Figure 11 Sample Definition for the Composite Class.

2.3 Classes for Discrete Media


Even though temporal media objects play primary roles in multimedia ap-
plications, they cannot be really useful without discrete media objects. The
following is an overview of the classes for discrete media objects:

Text Class Text objects contain a large amount of text information. They
can be used to hold detailed descriptions of multimedia objects, or to hold help
messages.

Image Class Image objects contain two-dimensional images (or bitmap data).
They can be used to implement video browsers.

Graphics Class Graphics objects are used to draw such graphical objects as
lines, rectangles, and ovals.

2.4 Classes for GUI Objects


GUI objects are responsible for determining the appearance of an application
on a display. They are also responsible for receiving users' interaction by key-

board or mouse and for controlling the application. The following are brief
descriptions of most often used GUI objects:

Window class A Window object contains a rectangular region for placing


objects.

Button class A Button object has a rectangular region to display its graphical
view and a 'callback' function attached to it. When the user moves a mouse into
the rectangular region of a button and clicks, the callback function attached to
it is invoked.

Buttons in multimedia applications are typically used to control the temporal


behavior of multimedia objects. They are used, for example, to implement
PLAY, STOP, etc.
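A minimal sketch of such a button (invented members, shown only to illustrate the callback idea; it is not the interface of any particular toolkit) might connect a click directly to the playback methods of a TemporalMedia object:

#include <functional>
#include <string>
#include <utility>

class TemporalMedia { public: virtual void start() {} virtual void stop() {} };

// Hypothetical Button: a label, a screen region (omitted here), and a callback
// invoked when the user clicks inside that region.
class Button {
public:
    Button(std::string label, std::function<void()> callback)
        : label_(std::move(label)), callback_(std::move(callback)) {}
    void click() { if (callback_) callback_(); }   // called by the window system
private:
    std::string label_;
    std::function<void()> callback_;
};

// Wiring PLAY and STOP buttons to a video object:
//   TemporalMedia* video = ...;
//   Button play("PLAY", [video] { video->start(); });
//   Button stop("STOP", [video] { video->stop(); });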

Scrollbar class When an object comprises too large a region to be displayed


at one time, only part of it is displayed. In that case, scrollbars are used to
indicate which part of the object is currently being displayed, as well as to
access different parts of the object.

In multimedia applications, scrollbars are typically used to indicate which part


of a multimedia object is being played. Scrollbars are also used for random
access of multimedia objects.

Menu class A menu object holds multiple items and lets the user choose one
of them. A menu object can be used to choose multimedia data for playback.

Field class A field object contains a few lines of text information, and it can
be used to display a short description of multimedia data or to input keywords
for a multimedia data query.

Dialogbox class A dialogbox object is used when an application is not able


to continue its execution without asking the user how to proceed. It is typically
used to let the user answer yes or no.

3 COMPOSITE MULTIMEDIA OBJECT MODEL BY GIBBS ET AL.
Simon Gibbs et al. have presented a class hierarchy for multimedia objects
and a composite multimedia object model based on it [12][13]. This section
discusses the most distinctive features of their composite object model.

3.1 Multimedia Objects as Active Objects


In the Gibbs model, multimedia objects are defined as active objects which
produce and/or consume multimedia data values via ports. (Multimedia data
values are sequences of such data elements as video frames, audio samples, etc.)
Multimedia objects may be classified into three categories: source, sink and
filter.

• Source objects have only output ports, and they produce multimedia data
values. One example would be an object which records live audio data and
outputs a sequence of digital audio samples.

• Sink objects have only input ports, and they consume multimedia data
values. One example would be an object which receives a sequence of video
frames and outputs them to a display.
• Filter objects have both input and output ports, and they both produce
and consume multimedia data values. Examples include (1) an object
which duplicates its input and then outputs that through two separate
output ports, or (2) an object which converts one format to another.
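The three categories differ only in which ports they expose, as the following rough sketch suggests (the Port, Source, Filter, and Sink types are invented for this illustration and are not Gibbs's actual classes):

#include <deque>

struct Frame { int number; };

// In this sketch a port is simply a queue of media data values.
using Port = std::deque<Frame>;

class Source {        // output port only: produces values
public:
    void produce(Port& out) { out.push_back(Frame{next_++}); }
private:
    int next_ = 0;
};

class Filter {        // input and output ports: consumes and produces
public:
    void transform(Port& in, Port& out) {
        while (!in.empty()) { out.push_back(in.front()); in.pop_front(); }
    }
};

class Sink {          // input port only: consumes values
public:
    void consume(Port& in) { while (!in.empty()) in.pop_front(); /* e.g., display */ }
};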

A graphical notation system is used to facilitate representing dataflow relation-


ships between objects and to provide a basis for a visual editor for composing
composite objects (Figure 12). Multimedia objects are denoted by circles, to
which boxes representing ports are attached. External boxes represent output
ports; internal boxes represent input ports. Dataflow is represented by arrows.


Figure 12 Multimedia objects.

3.2 Temporal Transformations to Multimedia Objects
The model contains two temporal coordinate systems: world time and object
time. World time is the temporal coordinate common to all multimedia objects
in an application, and it dominates their temporal behavior.

Object time is specific to a given multimedia object. Each object can specify:
1) the origin of its object time with respect to world time, 2) its speed for
processing multimedia data values, and 3) the orientation of its object time
with respect to world time. These specifications are implemented with three
temporal transformations: Translate, Scale, and Invert.

• Translate shifts the multimedia object in world time.

• Scale scales the overall duration of the object by a given factor.

• Invert flips the orientation of object time back and forth between "for-
ward" and "reverse".

The effect of applying these temporal transformations to a multimedia object


is illustrated in Figure 13.
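The three transformations can be summarized as a single mapping from world time to object time. The formula and names below are our own reading of the model, given only as an illustration:

// Hypothetical sketch: object time as a function of world time.
struct ObjectClock {
    double origin  = 0.0;   // Translate: world-time instant at which object time 0 occurs
    double speed   = 1.0;   // Scale: values > 1 play faster, < 1 slower
    bool   forward = true;  // Invert toggles this orientation flag

    double objectTime(double worldTime) const {
        double t = (worldTime - origin) * speed;
        return forward ? t : -t;
    }
};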


Figure 13 Temporal transformations.

3.3 Composite Multimedia Objects


A composite multimedia object contains a set of component multimedia objects
and specifications for their temporal and dataflow relationships. Temporal
relationships define the synchronization and temporal sequencing of component
objects. Dataflow relationships define the connections between the input and
output ports of components.

Let us consider the example of creating a new composite object c1, which
performs the following operations:

Presentation of a video object, video1, begins at time t0. At time t1,
a fade is begun from video1 to a second video object, video2. The
transition is completed at time t2 and at time t3 video2 is stopped.

The temporal relationships of the component objects of the new composite
object can be illustrated with a composite timeline diagram such as that seen
in Figure 14.

The dataflow relationships of component objects can be defined with the pre-
viously introduced graphical notation, as seen in Figure 15, which illustrates
the dataflow relationship for object c1 for the time interval [t1, t2]. During the
interval, video frame sequences from video1 and video2 are processed by the


Figure 14 A Composite Timeline.

digital video effect object dve, which produces a new sequence of video frames
and sends it to the video display object⁷.

Figure 15 An example of a dataflow relationship.

The implementation of a composite object editor with a graphical user interface


has been reported in [26].
⁷ In order to implement dataflow relationships of the type illustrated above, it is also
necessary to use connector objects as well as port objects.

4 MHEG
The number of standards, both organized and de-facto, being applied to multi-
media is bewildering, and the MHEG (Multimedia and Hypermedia information
coding Expert Group), operating jointly for the ISO (the International Orga-
nization for Standardization) and the IEC (the International Electrotechnical
Commission), is a working group applying the object-oriented approach to the
development of a standard format for multimedia and hypermedia interchange
[6][20][32]⁸. MHEG is concerned with the composition of time-based media
objects whose encodings are determined by other standards.

A multimedia object is essentially useless on its own, and only gains usefulness
in the context of a multimedia application; for interchange between applica-
tions, we need an application-independent format for representing objects. The
aim of this standard is the coded representation of final form multimedia and
hypermedia information objects that will be interchanged as units within or
across services and applications (Figure 16). The means of interchange may
include storage, local area network, wide area telecommunications, broadcast
telecommunications, etc. MHEG provides this format in the form of standard
"MHEG objects," which are classified as shown in Figure 179 .

An MHEG engine is a process or a set of processes that interprets MHEG objects
encoded/decoded according to the encoding/decoding specifications of MHEG.

MHEG classes are determined solely on the basis of object attributes (data
items), not on the basis of their operations, and thus, the hierarchy is limited
to attribute inheritance.

Content Object: Contains or refers to the coded representation of media infor-


mation. The content object also specifies the original size, duration and volume
of data.

Multiplexed Content Object: Contains or refers to the coded representation of


multiplexed media information. It also provides a description of each multi-
plexed stream.
⁸ MHEG specifications are still in development, but any differences between the concepts
introduced here and the final ISO version will probably not affect the main notions of MHEG.
MHEG object encodings are expected to be available in several different notations, but the
basic notation is ASN.1 (Abstract Syntax Notation)[18][19].
⁹ Each MHEG object belongs to one of the concrete classes; the classes marked abstract
describe common attributes only.


Figure 16 Interchange of MHEG objects.

MH-OBJECT (abstract)
    ACTION
    LINK
    MODEL (abstract)
        SCRIPT
        COMPONENT (abstract)
            CONTENT
                MULTIPLEXED CONTENT
            COMPOSITE
    CONTAINER
    DESCRIPTOR

Figure 17 MHEG inheritance tree.

Composite Object: Provides support for specifying relationships among multi-


media and hypermedia objects, as well as a logical structure for describing the
list of possible interactions offered to the user.

Container Object: Provides a container for regrouping multimedia and hyper-


media data in order to interchange them as a single set.

Descriptor Object: Defines a structure for the interchange of resource informa-


tion about a single object or a set of objects to be interchanged.

Link Object: Specifies a set of relationships between one object and another;
specifically, it determines that when certain conditions for the first object (com-
monly referred to as a "source") are satisfied, certain actions will be performed
on the second object (commonly referred to as a "target").

Action Object: Specifies a synchronized set of elementary actions to be applied


to one or more objects. Action objects are used in link objects in order to
describe the link effect.

Script Object: A vehicle for applying non-MHEG languages to the specification


of source/target relationships which are too complex to be described by a link
object.

Objects received in an interchange are application-independent; in order for


a specific application actually to use the data contained in such an object, it
is necessary to create a run-time (rt) object (rt-content-object, rt-composite-
object etc.) which contains the required data in a format appropriate to that
application. Rt-objects cannot be interchanged between communicating sys-
tems.

4.1 Example
Let us assume the following very simple example of a multimedia system so as
better to understand MHEG objects¹⁰:

When a "GUIDE" button is pressed, a lO-second video and lO-second audio


are played simultaneously (Figure 18).

To accomplish this, the following eight objects are required:

• Three Content Objects


¹⁰ Due to space limitations, object descriptions here are very simple, intuitive, and not based
on the formal MHEG description. We hope they are, nonetheless, generally understandable.
More complex examples of scenarios and explanations can be found in [27].

Figure 18 MHEG example.

ContentObject-class:
    Object-Number: 1             // Button
    Classification: Graphics
    Encoding Type: JPEG
    Original Size: 70pt, 20pt

ContentObject-class:
    Object-Number: 2             // Video
    Classification: Video
    Encoding Type: MPEG
    Original Size: 160pt, 120pt
    Original Duration: 10 sec

ContentObject-class:
    Object-Number: 3             // Audio
    Classification: Audio
    Encoding Type: MIDI
    Original Duration: 10 sec

• Three rt-content-objects
Three rt-content-objects are created from the above three content objects.
rt-content-objects 1.1, 2.1, and 3.1 correspond to content objects 1, 2, and
3, respectively.

• One Link Object

Link-class
Object-Number: 4
Link-condition
Source object number: 1.1
Previous-Condition
Status-Value: not-selected
Current-Condition
Status-Value: selected
Link-Effect Action object Number 5

• One Action Object

Action-class
Object-Number: 5
Target Object Set: 2.1, 3.1
Synchro-Indicator: parallel
Synchronized-Actions
Action
Object2.1 Run
Action
Object3.1 Run

Also, the action "set-button-style" attaches a selection status to the rt-content-


object 1.1 so that it can behave as a button.

Set-Button-Style:
Target object: Obj1.1
Initial-State: selectable
not-selected

All objects described above are created by the MHEG engine. When a user
presses the "GUIDE" button, rt-content object 1.1, link object 4, and action
object 5 are activated, and video (rt-content object 2.1) and audio (rt-content
object 3.1) are played simultaneously.
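In rough, simplified terms (this sketch only illustrates the flow of control; it is not MHEG's actual encoding, class definitions, or engine interface), the engine's work for this example amounts to watching the button's status and firing the action when the link condition becomes true:

#include <string>
#include <vector>

// Hypothetical, simplified view of the example's link and action objects.
struct Action { std::vector<std::string> runTargets; };   // {"2.1", "3.1"}, run in parallel
struct Link {
    std::string source;           // "1.1", the button
    std::string previousStatus;   // "not-selected"
    std::string currentStatus;    // "selected"
    Action effect;                // action object to apply
};

// Called by the (hypothetical) engine whenever an rt-object's status changes.
void onStatusChange(const Link& link, const std::string& object,
                    const std::string& oldStatus, const std::string& newStatus) {
    if (object == link.source &&
        oldStatus == link.previousStatus && newStatus == link.currentStatus) {
        for (const std::string& target : link.effect.runTargets) {
            (void)target;   // run(target): start playback of 2.1 and 3.1 together
        }
    }
}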

5 OBJECT COMPOSITION AND PLAYBACK MODELS BY HAMAKAWA ET AL.
Object composition and playback models (Figure 19) were proposed by Hamakawa
and Rekimoto in [16]. Their object composition model dealt with the static
aspects of such multimedia objects as name, duration time, etc., while their ob-
ject playback model dealt with the dynamic aspects of such multimedia objects
as play, stop, etc.

(Material → Object Composition Model → Object Playback Model → User.
The composition model is used in the construction of multimedia objects; the
playback model is used in the playback of previously constructed multimedia objects.)

Figure 19 Multimedia object models.



5.1 Object Composition Model


The object composition model proposed by Hamakawa and Rekimoto has three
distinctive features:

1. Temporal glue
As in TeX[22], the typesetting system intended for the creation of beautiful
books, glue is an object which can stretch or shrink in two dimensional
positional space. This glue can be extended into temporal space, making
it "temporal glue", and introduced into multimedia object models (see
Figure 20).
Each multimedia object will then have glue attributes (normal, stretch,
and shrink) in three dimensional space (2-dimensional position and time).
It is also possible to provide a special object, called a Glue object, which
does not exist as an entity in itself, but which has only glue attributes.

2. Object hierarchy
The object composition model employs a hierarchical structure composed
of multimedia objects (Figure 21).
The complete layout of composite objects, such as the time length of each
object, is determined when the highest ranking composite object is deter-
mined. When any multimedia object is edited, the attributes of all related
composite objects are automatically recalculated to conform to the change.
3. Relative location
In one common approach to constructing multimedia objects, the timeline
model, individual multimedia objects are located on an absolute timeline
scale (see Figure 22). The object composition model differs from the time-
line model in that it is unnecessary to decide the precise time line location
for each object. Only the relative locations among objects in time and
space need be defined. Once objects are composed, their absolute loca-
tions (in both time and space) are calculated automatically.

Each multimedia object has a number of different attributes. Such attributes


can be divided into the following three types:

Properties
General information about multimedia data, such as data type, location,
etc.

(Stretch: maximum size; Normal: normal size; Shrink: minimum size.)

Figure 20 Temporal glue.


Figure 21 Object hierarchy.

Hierarchy
Information about how objects are combined.


Figure 22 Timeline model.

Glue Attributes
Values of temporal glue (i.e., normal, stretch, and shrink sizes), as well as
spatial glue attributes.

We may note here that the concepts which most lend this model its character-
istic nature are the concepts of relative location and temporal glue.

Constructing Composite Objects


Composite objects are constructed by arranging and modifying multimedia
objects along designated dimensions. Control objects used to help in this con-
struction include the following:

Box This is used to arrange an object group along a designated dimension.


There are three types: TBBox, LRBox, SEBox. They correspond, respectively,
to an arrangement of Top-Bottom (space), Right-Left (space), and Start-End
(time). Figure 23 shows a basic example of Box.

Objnew ← TBBox(Obj1, Obj2, ..., ObjN)
Objnew ← LRBox(Obj1, Obj2, ..., ObjN)
Objnew ← SEBox(Obj1, Obj2, ..., ObjN)

Objx ← SEBox(ObjA, ObjB)
Objy ← TBBox(Objx, ObjC)
Objz ← LRBox(ObjD, Objy)

Figure 23 Box example.

Time-section This is used to create an object which initially has no attribute


values of its own other than its representing a given time-section.

Objnew ← Section(Obj, from, to)

This value-less object can be referenced to any existing object so as to create


a new object which contains the attribute values of the specific time-section of
the object to which it has been referenced.

Overlay This is used to overlay one object with another object in the time-
dimension.

When playing a video object and an audio object simultaneously, the operation
is as follows:

Objnew ← Overlay(VideoObj, AudioObj)

Loop This is a type of glue used to repeat an original object for a designated
length of time.

Objnew ← Loop(Obj, normal, shrink, stretch)



In this model, because such static media as texts, still pictures, etc. do not
contain information regarding the temporal dimension, loop is used to add
temporal glue attributes to their other attributes when they are employed with
dynamic media (audio, video etc.) in composite objects.

Position This is used to locate objects on a specific section of an absolute


time-scale, as it would be if employed in a timeline model.

Objnew ← Position(Obj, StartTime, EndTime)

Additionally, the following two methods are provided to facilitate working with
objects:

Mark This function serves to mark an object at a certain point in time, and to
add to the object a title which indicates some feature of object-content relevant
to that point in time.

Mark(Obj,Time,Title)
Constraint This function attaches constraints to objects and is used primarily
for synchronization, so as, for example, to ensure that a given audio object
always ends at the same instant as a given video object, etc.
Constraint (Condition)
A constraint may be attached to an object with regard to its start, end, or
a point marked on it. For example, Constraint(Obj1.start=Obj2.start),
Constraint(Obj1.mark1=Obj2.end).
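Expressed in hypothetical C++ (the Obj type and the operator functions below are stand-ins written only to illustrate how the operators above combine objects by relative location; they are not an existing API), a small composition might look like the sketch that follows. Note that no absolute times appear anywhere:

#include <string>
#include <vector>

struct Obj {
    std::string kind;
    std::vector<const Obj*> parts;   // relative arrangement only, no absolute times
};

Obj SEBox(const Obj& a, const Obj& b)   { return Obj{"SEBox",   {&a, &b}}; }   // a, then b (time)
Obj TBBox(const Obj& a, const Obj& b)   { return Obj{"TBBox",   {&a, &b}}; }   // a above b (space)
Obj Overlay(const Obj& a, const Obj& b) { return Obj{"Overlay", {&a, &b}}; }   // play together

int main() {
    Obj video{"video", {}}, audio{"audio", {}}, caption{"text", {}};
    Obj movie   = Overlay(video, audio);   // picture and sound played together
    Obj titled  = TBBox(caption, movie);   // caption laid out above the movie
    Obj opening = SEBox(titled, movie);    // titled part first, then the movie again
    (void)opening;   // absolute locations are computed only when the highest
                     // ranking composite object is determined
}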

Glue Calculation and Determination of the Time Length of Each Object
Since each of the different objects comprising a composite object has glue at-
tributes, the composite object itself has glue attributes (Figure 24).

The time length of each object is determined when the highest ranking com-
posite object has been determined (Figure 25).

The time length of this highest ranking composite object is the normal time
length of its glue attributes¹¹.
¹¹ See [16] for a more detailed description of calculation methods.
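As one illustration of how such a calculation might proceed (a sketch under our own simplifying assumptions; [16] gives the actual method), a sequential, SEBox-like composition can obtain its glue values by summing those of its components; the composite's normal length is then distributed back down when the highest ranking object is fixed:

#include <vector>

// Hypothetical temporal glue values, in seconds.
struct Glue { double normal, shrink, stretch; };   // shrink = minimum, stretch = maximum

// Sequential composition: in this simplified sketch the composite's glue is
// the sum of its components' glue values.
Glue sequentialGlue(const std::vector<Glue>& parts) {
    Glue g{0.0, 0.0, 0.0};
    for (const Glue& p : parts) {
        g.normal  += p.normal;
        g.shrink  += p.shrink;
        g.stretch += p.stretch;
    }
    return g;
}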


Figure 24 Glue property propagation.


Figure 25 Determination of actual location.

5.2 Object Playback Model


The object playback model employs two kinds of multimedia classes, the Media
class and the Context class. Additionally, a third class, called Viewer, is
introduced to connect these classes to the screen (Figure 26).


Figure 26 Relationships among three classes.

The Media class represents multimedia data. All objects created in the object
composition model belong to this class, or its subclasses. (Sound is an example
of a Media subclass used to manage digital sound data.)

The Context class keeps track of the playback status of each object, such as
"What is the current video data frame?", "What is the current status of audio
data?", "Where is data displayed on the window?". etc .

The Viewer class has the information required for display, such as the posi-
tion coordinates for a window, etc. It also provides convenient interfaces to
programmers, such as play, stop, and pause. Viewer is a general management
class, implemented to manage both audio and video data. It has no subclasses.

A Context object is generated whenever a multimedia object is played back.


This structure of classes clearly separates multimedia static data from their
temporal states. Normally, each media object has a corresponding context
object, but it is possible that two or more context objects might share one
media object. This means we could play back different portions of the media
simultaneously through different windows (Figure 27).
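A minimal sketch of that separation is given below (the members are invented for the illustration; Xavier's real classes are considerably richer). Because a Context holds only playback state and refers to a shared Media object, creating two Contexts over the same Media is exactly what allows two windows to play different portions of the same data at once:

#include <memory>
#include <utility>

class Media {                 // static data: what is to be played
public:
    double duration = 0.0;
};

class Context {               // dynamic state: where playback currently is
public:
    explicit Context(std::shared_ptr<Media> m) : media_(std::move(m)) {}
    double position = 0.0;    // current playback position within media_
private:
    std::shared_ptr<Media> media_;
};

class Viewer {                // display information and user-level controls
public:
    explicit Viewer(std::shared_ptr<Context> c) : context_(std::move(c)) {}
    void play()  { context_->position += 1.0 / 30.0; }   // advance and redraw (sketch)
    void pause() { /* stop advancing */ }
private:
    std::shared_ptr<Context> context_;
};

// Two contexts sharing one media object, each with its own viewer:
//   auto media = std::make_shared<Media>();
//   Viewer window1(std::make_shared<Context>(media));
//   Viewer window2(std::make_shared<Context>(media));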


Figure 27 Playback example.



It may be easier to understand these relationships if we think of them


in terms of an orchestral performance. Viewer corresponds to the stage upon
which the music is performed. Media corresponds to the musical score, the gen-
eral layout of how the music is to be performed over a period of time. Context
corresponds to the conductor, who may interpret the score and conduct it in
accord with how he feels it should be played at any given moment.

To implement the above model on a computer, Hamakawa and Rekimoto con-


structed an audio and video extension library, called Xavier¹², using InterViews
[23]. Figure 28 shows a multimedia presentation system using Xavier.


Figure 28 Multimedia presentation system.

¹² Xavier can be obtained via anonymous ftp from interviews.stanford.edu in
/pub/contrib.

6 CONCLUSION
We have described the basics of multimedia objects and important research
activities in the area. Among the several commercial products that adopt
an object-oriented approach in handling multimedia data are Quicktime[2]
and CommonPoint[40]. Efforts other than MHEG at defining international
standards for multimedia data handling include PREMO[17], HyTime[33] and
HyperODA[21].

Multimedia object technology appears about to assume significant importance,


and we would like to conclude this chapter with a brief introduction to topics
of recent interest, as well as to some which seem likely to be of future interest.

Object Composition

One particularly important issue for the design of composite objects is the
question of how best to describe or define the temporal relationships among
their component objects. This is analogous to the issue of how to define similar
temporal relationships that must be specified in such other fields as database
design, cognitive science, and computational geometry.

Much research in this area is based on the interval-based temporal logic pro-
posed by Allen[1]. Allen has shown that there are thirteen primitives necessary
to represent any interval relationship¹³. Little et al. extended such interval
relationships to n-ary and reverse temporal relationships[25]. The n-ary model
is particularly elegant in that, without requiring very many levels of hierar-
chy, it captures an arbitrary number of component objects having a common
temporal relationship.
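For reference, the thirteen primitives can be written out as a simple enumeration (the identifier names are ours; the relations themselves are listed in the footnote below):

// Allen's thirteen interval relationships: six basic relations, their six
// converses, and equality.
enum class IntervalRelation {
    Before, Meets, Overlaps, During, Starts, Finishes,             // X r Y
    After, MetBy, OverlappedBy, Contains, StartedBy, FinishedBy,   // converses
    Equals
};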

Weiss[43][44] has proposed a data model called algebraic video for composing,
searching, and playing back digital video presentations, and has demonstrated
a prototype system which can create new video presentations with algebraic
combinations of these segments.

Buchanan[4][5] has described a temporal layout that indicates when events in


the multimedia document occur, and introduced TeX's glue, as well as a model
similar to the object composition model mentioned earlier in this chapter, to
arrive at an "optimal" display for a document.
¹³ That is, X before Y, X meets Y, X overlaps Y, X during Y, X starts Y, X finishes Y,
their six converses, and X equals Y.

Distributed Multimedia Objects

With the recent dramatic progress in networking technology, a number of new


issues have come to the fore with regard to the implementation of multime-
dia objects. These include, for example, how to name and locate distributed
objects[28], and how to transmit multimedia data over networks in realtime[10].

The Berkeley CMT (Continuous Media Toolkit) tries to address these issues[35]¹⁴.
It introduces a simple mechanism for creating distributed objects, a mechanism
for synchronizing distributed objects, and a best-effort protocol for realtime
multimedia data transmission on top of the existing UDP/IP protocol.
¹⁴ See http://www-plateau.cs.berkeley.edu/ for more information on the Berkeley
Continuous Media Toolkit. The documents and software can be obtained from
ftp://mm-ftp.cs.berkeley.edu/pub/multimedia/.

Integrating Multimedia Objects with Multimedia Databases

As we use more and more multimedia data, we need to find more efficient ways
to store and retrieve it, and much research is being conducted in the field of
multimedia databases.

Oomoto and Tanaka[29] have proposed a video-object data model and defined
several operations, such as interval projection, merge, overlap, etc. for com-
posing new video objects. They have also introduced the concept of "interval-
inclusion inheritance," which describes how a video object inherits attribute/value
pairs from another video object.

Gibbs[13] has proposed constructing composite multimedia objects out of a


BLOB (binary large object) stored in databases, and creating multimedia ob-
jects flexibly by applying interpretation, derivation, and temporal composition
to a BLOB.

Further intensive research is necessary with regard to the questions of how best
to construct multimedia databases and how best to integrate multimedia object
technology with them.

REFERENCES
[1] Allen, J.F., "Maintaining Knowledge about Temporal Intervals," Commu-
nications of the ACM, Vol. 26, No. 11, pp. 832-842, 1983.

[2] Apple Computer Inc. "QuickTime," 1991.


[3] Booch, G, "Object-Oriented Analysis and Design," Second Edition, The
Benjamin/Cummings, 1994.
[4] Buchanan M.C., and Zellweger P.T., "Scheduling Multimedia Documents
Using Temporal Constraints," Third International Workshop on Network
and Operating System Support for Digital Audio and Video, Lecture
Notes in Computer Science 712, Springer-Verlag, pp. 237-249, 1992.

[5] Buchanan M.C., and Zellweger P.T., "Automatic Temporal Layout Mech-
anisms," The proceedings of the ACM Multimedia 93, pp. 341-350, 1993.
[6] Buford, J.F.K., ed., "Multimedia Systems," Addison-Wesley, 1994.
[7] Champeaux, D. de, Lea, D., and Faure, P., "Object-Oriented System De-
velopment," Addison-Wesley, 1993.
[8] Coleman, D., et al., "Object-Oriented Development - The Fusion Method,"
Prentice Hall, 1994.
[9] Coplien, J.O., "Advanced C++ Programming Styles and Idioms," Addison-
Wesley, 1992.
[10] Ferrari, D., Banerjea, A., and Zhang, H., "Network Support for Multimedia
- A discussion of the Tenet Approach," Technical Report TR-92-072, Inter-
national Computer Science Institute, University of California at Berkeley,
1992.
[11] Le Gall, D. "MPEG: a video compression standard for multimedia applica-
tions," Communications of th ACM, April 1991, Vol. 34, No.4, pp. 46-58.
[12] Gibbs, S. "Composite Multimedia and Active Objects," The proceedings
of OOPSLA 91 (Conference on Object-Oriented Programming Systems,
Languages, and Applications), pp. 87-112.
[13] Gibbs, S., Breiteneder, C., and Tsichritzis, D., "Data Modeling of Time-Based
Media," the proceedings of the ACM-SIGMOD 1994 Conference on Man-
agement of Data, pp. 91-102.
[14] Gibbs, S., and Tsichritzis, D.C., "Multimedia Programming - Objects,
Environments and Frameworks," Addison-Wesley, 1995.
[15] Hamakawa, R., Sakagami, H., and Rekimoto, J., "Audio and Video Exten-
sions to Graphical User Interface Toolkits," Third International Workshop
on Network and Operationg System Support for Digital Audio and Video,
Lecture Notes in Computer Science 712, Springer-Verlag, pp. 399-404, 1992
[16] Hamakawa, R., and Rekimoto, J., "Object Composition and Playback Models for Handling Multimedia Data," Multimedia Systems, Springer-Verlag, Vol. 2, 1994, pp. 26-35.

[17] Herman, I., Carson, G.S., et al., "PREMO: An ISO Standard for a Presentation Environment for Multimedia Objects," The proceedings of the ACM Multimedia 94, pp. 111-118, 1994.

[18] ISO/IEC, "Specification of Abstract Syntax Notation One (ASN.1)," 2nd ed., IS 8824, 1990.

[19] ISO/IEC, "Specification of Basic Encoding Rules for Abstract Syntax Notation One (ASN.1)," 2nd ed., IS 8825, 1990.

[20] ISO/IEC, "Information Technology - Coding of Multimedia and Hypermedia Information - Part 1: MHEG Object Representation, Base Notation (ASN.1)," ISO/IEC DIS 13522-1, October 14, 1994.

[21] ISO/IEC, "Information Technology - Open Document Architecture (ODA) and Interchange Format - Temporal Relationships and Non-linear Structures," ISO/IEC DIS 8613-14, 1993.

[22] Knuth, D.E., "The TeXbook," Addison-Wesley Publishing Company, 1984.

[23] Linton, M., Vlissides, J., and Calder, P., "Composing User Interfaces with InterViews," Computer, Feb. 1989, pp. 8-22.

[24] Liou, M., "Overview of the px64 kbit/s Video Coding Standard," Communications of the ACM, April 1991, Vol. 34, No. 4, pp. 59-63.

[25] Little, T., and Ghafoor, A., "Interval-Based Conceptual Models for Time-Dependent Multimedia Data," IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 4, pp. 551-563, 1993.

[26] de Mey, V., and Gibbs, S., "A Multimedia Component Kit," The proceedings of ACM Multimedia 93, pp. 291-300.

[27] Meyer-Boudnik, T., and Effelsberg, W., "MHEG Explained," IEEE Multimedia, Spring, pp. 26-38, 1995.

[28] Object Management Group, "The Common Object Request Broker: Architecture and Specification," OMG Document Number 91.12.1, Revision 1.1, 1991.
[29] Oomoto, E., and Tanaka, K., "OVID: Design and Implementation of a Video-Object Database System," IEEE Transactions on Knowledge and Data Engineering, August 1993, pp. 629-643.

[30] Patel, K., Smith, B.C., and Rowe, L.A., "Performance of a Software MPEG Video Decoder," The proceedings of ACM Multimedia 93, pp. 75-82.

[31] Pennebaker, W.B., "JPEG Still Image Data Compression Standard," Van Nostrand Reinhold, New York, 1992.

[32] Price, R., "MHEG: An Introduction to the Future International Standard for Hypermedia Object Interchange," The proceedings of the ACM Multimedia 93, pp. 121-128.

[33] DeRose, S.J., and Durand, D.G., "Making Hypermedia Work - A User's Guide to HyTime," Kluwer Academic Publishers, 1994.

[34] van Rossum, G., "FAQ: Audio File Formats," can be obtained by anonymous ftp from ftp://ftp.cwi.nl/pub/audio as files AudioFormats.part[12], 1995.

[35] Rowe, L.A., Patel, K., et al., "MPEG Video in Software: Representation, Transmission and Playback," The proceedings of IS&T/SPIE 1994 International Symposium on Electronic Imaging: Science and Technology.

[36] Rumbaugh, J., et al., "Object-Oriented Modeling and Design," Prentice Hall, 1991.

[37] Shimojo, S., "Introduction to UNIX Multimedia" (in Japanese), No. 1-23, UNIX Magazine, Vol. 7, No. 11 - Vol. 9, No. 9, 1992-1994.

[38] Shlien, S., "Guide to MPEG-1 Audio Standard," IEEE Transactions on Broadcasting, Vol. 40, No. 4, pp. 206-218, 1994.

[39] Stroustrup, B., "The C++ Programming Language," 2nd edition, Addison-Wesley, 1991.

[40] Taligent Inc., "Taligent's Guide to Designing Programs," Addison-Wesley, 1994.

[41] Watabe, K., Sakata, S., et al., "Distributed Desktop Conferencing System with Multiuser Multimedia Interface," IEEE Journal on Selected Areas in Communications, Vol. 9, No. 4, pp. 531-539, 1991.

[42] Wegner, P., "Concepts and Paradigms of Object Oriented Programming," OOPS Messenger, Vol. 1, No. 1 (Aug. 1990), pp. 7-87.
[43] Weiss, R., Duda, A., and Gifford, D.K., "Content-Based Access to Algebraic Video," The proceedings of the International Conference on Multimedia Computing and Systems, pp. 140-151, 1994.

[44] Weiss, R., Duda, A., and Gifford, D.K., "Composition and Search with a Video Algebra," IEEE Multimedia, pp. 12-25, Spring, 1995.

[45] Woelk, D., Kim, W., and Luther, W., "An Object-Oriented Approach to Multimedia Databases," The proceedings of the ACM-SIGMOD 1986 Conference on Management of Data, pp. 311-325.
2
COMPRESSION TECHNIQUES
AND STANDARDS
Borko Furht
Department of Computer Science and Engineering,
Florida Atlantic University,
Boca Raton, Florida, U.S.A.

ABSTRACT
This chapter covers multimedia compression techniques and standards: (a) JPEG
compression standard for full-color still image applications, (b) H.261 standard for
video-based communications, and (c) MPEG standard for intensive applications of
full-motion video, such as interactive multimedia. We describe all the components
of these standards including their encoder and decoder architectures. Experimental
data are also presented.

1 INTRODUCTION TO MULTIMEDIA
COMPRESSION

1.1 Storage Requirements for Multimedia


Applications
Audio, image, and video signals require a vast amount of data for their rep-
resentation. There are three main reasons why present multimedia systems
require that data must be compressed. These reasons are related to:

• Large storage requirements of multimedia data,

• Relatively slow storage devices which do not allow playing back uncom-
pressed multimedia data (specifically video) in real time, and

• The present network's bandwidth, which does not allow real-time video
data transmission.

To illustrate large storage requirements, consider a typical multimedia application, such as an encyclopedia, which may require:

• 500,000 pages of text (2 KB per page) - total 1 GB,

• 3,000 color pictures (in average 640x480x24 bits = 1 MB/picture) - total


3 GB,

• 500 maps (in average 640x480x16 bits = 0.6 MB/map) - total 0.3 GB,
• 60 minutes of stereo sound (176 KB/second) - total 0.6 GB,

• 30 animations, in average 2 minutes in duration (640x320x16 bits x 16


frames/sec = 6.5 MB/second) - total 23.4 GB,

• 50 digitized movies, in average 1 minute in duration (640x480x24 bits x 30


frames/sec = 27.6 MB/sec) - total 82.8 GB.

The encyclopedia will require a total of 111.1 GB of storage capacity. Assume that compression algorithms are then applied to the various media used in the encyclopedia, and that the following average compression ratios are obtained:

• Text 2:1,

• Color images 15:1,

• Maps 10:1,

• Stereo sound 6:1,

• Animation 50:1,

• Motion video 50:1.

Figure 1 gives the storage requirements for the encyclopedia before and after
compression. When using compression, storage requirements will be reduced
from 111.1 GB to only 2.96 GB, which is much easier to handle.
Figure 1: Storage requirements for the encyclopedia before and after compression (111.1 GB before compression, 2.96 GB after compression).
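For readers who want to check the arithmetic, the short C sketch below recomputes the totals from the item sizes and the assumed compression ratios listed above; the media labels and variable names are ours, chosen only for illustration (the exact sum of the compressed sizes is about 2.95 GB, which rounds to the 2.96 GB quoted above).

#include <stdio.h>

int main(void) {
    /* raw sizes in GB and assumed compression ratios, from the encyclopedia example */
    const char  *media[]  = { "text", "images", "maps", "sound", "animation", "video" };
    const double raw_gb[] = { 1.0, 3.0, 0.3, 0.6, 23.4, 82.8 };
    const double ratio[]  = { 2.0, 15.0, 10.0, 6.0, 50.0, 50.0 };
    double total_raw = 0.0, total_comp = 0.0;

    for (int i = 0; i < 6; i++) {
        total_raw  += raw_gb[i];
        total_comp += raw_gb[i] / ratio[i];
        printf("%-10s %6.1f GB -> %5.2f GB\n", media[i], raw_gb[i], raw_gb[i] / ratio[i]);
    }
    printf("total      %6.1f GB -> %5.2f GB\n", total_raw, total_comp);
    return 0;
}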

To illustrate the second reason for compression of multimedia data, which is


slow transmission rates of current storage devices, consider playing a one-hour
movie from a storage device, such as a CD-ROM. Assuming color video frames
with a resolution of 620x560 pixels and 24 bits/pixel, it would require about 1
MB/frame of storage. For a motion video requiring 30 frames/second, it gives
a total of 30 MB for one second of motion video, or 108 GB for the whole
one-hour movie. Even if there is enough storage capacity available, we won't
be able to play the video in real time due to insufficient transmission rates of
current storage devices.

According to the previous calculation, the required transmission rate of the


storage device should be 30 MB/sec; however, today's technology provides a CD-ROM transfer rate of about 300 KB/sec. Therefore, a one-hour movie would be
played for 100 hours! In summary, at the present state of technology of storage
devices, the only solution is to compress multimedia data before its storage and
decompress it before the playback.
46 CHAPTER 2

The final reason for compression of multimedia data is the limited bandwidth
of present communication networks. The bandwidth of traditional networks
(Ethernet, token ring) is in tens of Mb/sec, which is too low even for the
transfer of only one motion video in uncompressed form. The newer networks,
such as ATM and FDDI, offer a higher bandwidth (in hundreds of Mb/sec
to several Gb/sec), but only few simultaneous multimedia sessions would be
possible if the data is transmitted in uncompressed form.

Modern image and video compression techniques offer a solution to this problem by reducing these tremendous storage requirements. Advanced compression techniques can compress a typical image with ratios ranging from 10:1 to 50:1. Very high compression ratios of up to 2000:1 can be achieved in compressing video signals.

1.2 Classification of Compression Techniques


Compression of digital data is based on various computational algorithms,
which can be implemented either in software or in hardware. Compression tech-
niques are classified into two categories: (a) lossless, and (b) lossy approaches
[Fox91, FSZ95]. Lossless techniques are capable of recovering the original representation perfectly. Lossy techniques involve algorithms which recover a representation that is only similar to the original one. The lossy techniques provide
higher compression ratios, and therefore they are more often applied in image
and video compression than lossless techniques.

The lossy techniques are classified into: (1) prediction based techniques, (2)
frequency oriented techniques, and (3) importance-oriented techniques. Pre-
diction based techniques, such as ADPCM, predict subsequent values by ob-
serving previous values. Frequency oriented techniques apply the Discrete Cosine Transform (DCT), which is related to the fast Fourier transform. Importance-
oriented techniques use other characteristics of images as the basis for compres-
sion. For example, DVI technique uses color lookup tables and data filtering.

The hybrid compression techniques, such as JPEG, MPEG, and px64, com-
bine several approaches, such as DCT and vector quantization or differential
pulse code modulation. Recently, standards for digital multimedia have been
established based on these three techniques, as illustrated in Table 1.
Short Name: JPEG
Official Name: Digital compression and coding of continuous-tone still images
Standards Group: Joint Photographic Experts Group
Compression Ratios: 15:1 (full-color still-frame applications)

Short Name: H.261 (or px64)
Official Name: Video encoder/decoder for audio-visual services at px64 Kbps
Standards Group: Specialist Group on Coding for Visual Telephony
Compression Ratios: 100:1 to 2000:1 (video-based telecommunications)

Short Name: MPEG
Official Name: Coding of moving pictures and associated audio
Standards Group: Moving Pictures Experts Group
Compression Ratios: 200:1 (motion-intensive applications)

Table 1: Multimedia compression standards.

1.3 Image Concepts and Structures


A digital image represents a two-dimensional array of samples, where each
sample is called a pixel. Precision determines how many levels of intensity can
be represented, and is expressed as the number of bits/sample. According to
precision, images can be classified into:

Binary images, represented by 1 bit/sample. Examples include black/white


photographs and facsimile images.

Computer graphics, represented with a lower precision, such as 4 bits/sample.

Grayscale images, represented by 8 bits/sample.

Color images, represented with 16, 24 or more bits/sample.

According to the trichromatic theory, the sensation of color is produced by selectively exciting three classes of receptors in the eye. In an RGB color representation system, shown in Figure 2, a color is produced by adding three


primary colors: red, green and blue (RGB). The straight line, where R = G =
B, specifies the gray values ranging from black to white.

Figure 2: The RGB representation of color images.

Another representation of color images, YUV representation, describes lumi-


nance and chrominance components of an image. The luminance component
provides a grayscale version of the image, while two chrominance components
give additional information that converts the grayscale image to a color image.

The YUV representation is more natural for image and video compression. The
exact RGB to YUV transformation, defined by the CCIR 601 standard, is given
by the following transformations:

Y = 0.299R + 0.587G + 0.114B (2.1)


U = 0.564(B - Y) (2.2)
V = 0.713(R - Y) (2.3)

where Y is the luminance component, and U and V are two chrominance com-
ponents.

Another color format, referred to as the YCbCr format, is intensively used for


image compression. In YCbCr format, Y is the same as in a YUV system,
however U and V components are scaled and zero-shifted to produce Cb and
Cr, respectively, as follows:
Cb = U/2 + 0.5    (2.4)

Cr = V/1.6 + 0.5    (2.5)

In this way, chrominance components Cb and Cr are always in the range [0,1].
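As a concrete illustration of equations (2.1)-(2.5), the following C sketch converts one RGB sample to YUV and then to YCbCr; it assumes R, G, and B have already been normalized to the range [0,1], which is our simplification rather than part of the CCIR 601 standard.

#include <stdio.h>

/* RGB -> YUV -> YCbCr, following equations (2.1)-(2.5); inputs assumed in [0,1] */
void rgb_to_ycbcr(double r, double g, double b,
                  double *y, double *cb, double *cr) {
    double u, v;
    *y = 0.299 * r + 0.587 * g + 0.114 * b;   /* luminance, eq. (2.1) */
    u  = 0.564 * (b - *y);                    /* chrominance, eq. (2.2) */
    v  = 0.713 * (r - *y);                    /* chrominance, eq. (2.3) */
    *cb = u / 2.0 + 0.5;                      /* scaled and shifted, eq. (2.4) */
    *cr = v / 1.6 + 0.5;                      /* scaled and shifted, eq. (2.5) */
}

int main(void) {
    double y, cb, cr;
    rgb_to_ycbcr(1.0, 0.5, 0.25, &y, &cb, &cr);
    printf("Y=%.3f Cb=%.3f Cr=%.3f\n", y, cb, cr);
    return 0;
}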

The resolution of an image system refers to its capability to reproduce fine detail.


Higher resolution requires more complex imaging systems to represent these
images in real time. In computer systems, resolution is characterized with
number of pixels (for example, VGA has a resolution of 640 x 480 pixels). In
video systems, resolution refers to the number of line pairs resolved on the face
of the display screen, expressed in cycles per picture height, or cycles per picture
width. For example, the NTSC broadcast system in North America and Japan,
denoted 525/59.94, has about 483 picture lines (525 denotes the total number
of lines in its raster, and 59.94 is its field rate in Hertz). The HDTV system
will approximately double the number of lines of current broadcast television
at approximately the same field rate. For example, a 1152x900 HDTV system
may have 937 total lines and a frame rate of 65.95 Hz.

The full-motion video is characterized with at least 24 Hz frame rate (or 24


frames/sec), and up to 30, or even 60 frames/sec for HDTV. For animation,
an acceptable frame rate is in the range 15-19 frames/sec, while for video telephony it is 5-10 frames/sec. Videoconferencing and interactive multimedia applications
require a rate of 15-30 frames/sec.

2 JPEG ALGORITHM FOR STILL IMAGE


COMPRESSION
Originally, the JPEG standard was targeted at full-color still-frame applications, achieving a 15:1 average compression ratio [PM93, Wal91, Fur95]. However, JPEG has also been applied in some real-time, full-motion video applications (Motion JPEG - MJPEG). The JPEG standard provides four modes of operation:

• sequential DCT-based encoding, in which each image component is encoded in a single left-to-right, top-to-bottom scan,

• progressive DCT-based encoding, in which the image is encoded in multiple scans, in order to produce a quick, rough decoded image when the transmission time is long,
• lossless encoding, in which the image is encoded to guarantee the exact
reproduction, and
• hierarchical encoding, in which the image is encoded in multiple resolutions.

In this section we describe in detail the sequential DCT-based JPEG compres-


sion, while the other modes are described in [PM93, FSZ95].

2.1 JPEG Codec


In this section we describe both the design of a sequential JPEG encoder and
decoder. The block diagrams of the JPEG sequential encoder and decoder
are shown in Figure 3.

The JPEG encoder consists of three main blocks:

• Forward Discrete Cosine Transform (FDCT) block,


• Quantizer, and
• Entropy encoder.

At the input of the encoder, the original unsigned samples, which are in the range [0, 2^p - 1], are shifted to signed integers in the range [-2^(p-1), 2^(p-1) - 1]. For example, for a grayscale image, where p = 8, the original samples with range [0, 255] are shifted to the range [-128, +127].

Then, the source image is divided into 8x8 blocks, and the samples from each block are transformed into the frequency domain using the Forward Discrete Cosine Transform, defined by the following equation:

F(u,v) = [C(u)/2] [C(v)/2] sum_{x=0}^{7} sum_{y=0}^{7} f(x,y) cos[(2x+1)u*pi/16] cos[(2y+1)v*pi/16]    (2.6)

where:

C(u) = 1/sqrt(2) for u = 0, and C(u) = 1 for u > 0;
C(v) = 1/sqrt(2) for v = 0, and C(v) = 1 for v > 0.

Figure 3: Block diagrams of the sequential JPEG encoder and decoder.
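A direct, unoptimized implementation of equation (2.6) is sketched below for one 8x8 block; practical codecs use fast DCT algorithms such as those analyzed in [PM93, HM94], so this is only meant to make the definition concrete.

#include <math.h>

#define N  8
#define PI 3.14159265358979323846

/* Forward DCT of one 8x8 block of level-shifted samples f, following equation (2.6) */
void fdct_8x8(const double f[N][N], double F[N][N]) {
    for (int u = 0; u < N; u++) {
        for (int v = 0; v < N; v++) {
            double cu = (u == 0) ? 1.0 / sqrt(2.0) : 1.0;
            double cv = (v == 0) ? 1.0 / sqrt(2.0) : 1.0;
            double sum = 0.0;
            for (int x = 0; x < N; x++)
                for (int y = 0; y < N; y++)
                    sum += f[x][y] * cos((2 * x + 1) * u * PI / 16.0)
                                   * cos((2 * y + 1) * v * PI / 16.0);
            /* C(u)/2 * C(v)/2 = cu * cv / 4 */
            F[u][v] = 0.25 * cu * cv * sum;
        }
    }
}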



The transformed 64-point discrete signal is a function of two spatial dimensions


x and y, and its components are called spatial frequencies or DCT coefficients.

The F(0,0) coefficient is called the "DC coefficient," and the remaining 63 coefficients are called the "AC coefficients." For a grayscale image, the obtained DCT coefficients are in the range [-1024, +1023], which requires 3 additional bits for their representation, compared to the original image samples. Several fast DCT algorithms are proposed and analyzed in [PM93, HM94].

For a typical 8x8 image block, most spatial frequencies have zero or near-zero values and need not be encoded. This is illustrated in the JPEG example presented later in this section. This fact is the foundation for achieving data compression.

In the next block, quantizer, all 64 DCT coefficients are quantized using a
64-element quantization table, specified by the application. The quantization
reduces the amplitude of the coefficients which contribute little or nothing to
the quality of the image, with the purpose of increasing the number of zero-
value coefficients. Quantization also discards information which is not visually
significant. The quantization is performed according to the following equation:

Fq(u,v) = Round[ F(u,v) / Q(u,v) ]    (2.7)

where Q(u,v) are quantization coefficients specified by a quantization table.


Each element Q(u,v) is an integer from 1 to 255, which specifies the step size
of the quantizer for its corresponding DCT coefficient.
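A minimal sketch of the quantization step of equation (2.7), together with the inverse operation used later by the decoder (equation (2.8)), could look as follows; the only library call is the standard C lround rounding helper.

#include <math.h>

/* Quantization (eq. 2.7) and dequantization (eq. 2.8) of one 8x8 block of DCT coefficients */
void quantize(const double F[8][8], const int Q[8][8], int Fq[8][8]) {
    for (int u = 0; u < 8; u++)
        for (int v = 0; v < 8; v++)
            Fq[u][v] = (int)lround(F[u][v] / Q[u][v]);
}

void dequantize(const int Fq[8][8], const int Q[8][8], double F[8][8]) {
    for (int u = 0; u < 8; u++)
        for (int v = 0; v < 8; v++)
            F[u][v] = (double)Fq[u][v] * Q[u][v];
}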

A set of four quantization tables are specified by the JPEG standard for
compliance testing of generic encoders and decoders; they are given in Table 2.
In the JPEG example presented later in this section, a quantization formula is used to produce the quantization tables.

After quantization, the 63 AC coefficients are ordered into the "zig-zag" se-
quence, as shown in Figure 4. This zig-zag ordering will help to facilitate
the next phase, entropy encoding, by placing low-frequency coefficients, which
are more likely to be nonzero, before high-frequency coefficients. When the
coefficients are ordered zig-zag, the probability of coefficients being zero is an
increasing monotonic function of the index. The DC coefficients, which represent an average value of the 64 image samples, are encoded using predictive coding techniques, as illustrated in Figure 5.

Table 2: Four quantization tables for compliance testing of generic JPEG encoders and decoders.
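The zig-zag ordering is usually implemented with a fixed index table; the sketch below uses the conventional 8x8 zig-zag pattern, where zigzag[k] gives the row-major position of the k-th coefficient in scan order.

/* Zig-zag scan: zigzag[k] is the row-major index of the k-th coefficient in scan order */
static const int zigzag[64] = {
     0,  1,  8, 16,  9,  2,  3, 10, 17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34, 27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36, 29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46, 53, 60, 61, 54, 47, 55, 62, 63
};

/* Reorder a quantized 8x8 block (stored row by row) into the zig-zag sequence */
void zigzag_scan(const int block[64], int seq[64]) {
    for (int k = 0; k < 64; k++)
        seq[k] = block[zigzag[k]];
}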

The reason for predictive coding of DC coefficients is that there is usually a


strong correlation between DC coefficients of adjacent 8x8 blocks. Adjacent
blocks will very probably have similar average intensities. Therefore, coding
the differences between DC coefficients rather than the coefficients themselves
will give better compression.

Finally, the last block in the JPEG encoder is the entropy coding, which
provides additional compression by encoding the quantized DCT coefficients
into more compact form. The JPEG standard specifies two entropy coding
methods: Huffman coding and arithmetic coding. The baseline sequential
JPEG encoder uses Huffman coding, which is presented next.

Figure 4: Zig-zag ordering of AC coefficients.

Figure 5: Predictive coding of DC coefficients. The difference between the present and the previous DC coefficients (DCi - DCi-1) is calculated and then coded.

The Huffman encoder converts the DCT coefficients after quantization into a compact binary sequence using two steps: (1) forming an intermediate symbol sequence, and (2) converting the intermediate symbol sequence into a binary sequence using Huffman tables.

In the intermediate symbol sequence, each AC coefficient is represented by a


pair of symbols:

• Symbol-1 (RUNLENGTH, SIZE), and



• Symbol-2 (AMPLITUDE).

RUNLENGTH is the number of consecutive zero-valued AC coefficients preceding the nonzero AC coefficient. The value of RUNLENGTH is in the range 0 to 15, which requires 4 bits for its representation.

SIZE is the number of bits used to encode AMPLITUDE. The number of bits for AMPLITUDE is in the range 0 to 10 bits, so 4 bits are needed to code SIZE.

AMPLITUDE is the amplitude of the nonzero AC coefficient, which is in the range [-1024, +1023] and requires 10 bits for its coding. For example, if the sequence of AC coefficients is:

0, 0, 0, 0, 0, 0, 476

(six consecutive zero-valued coefficients followed by 476),

the symbol representation of the AC coefficient 476 is:

(6,9) (476)

where: RUNLENGTH=6, SIZE=9, and AMPLITUDE=476.

If RUNLENGTH is greater than 15, then Symbol-1 (15,0) is interpreted as the


extension symbol with RUNLENGTH=16. There can be up to three consecutive (15,0) extensions.

In the following example:

(15,0) (15,0) (7,4) (12)

RUNLENGTH is equal to 16+16+7=39, SIZE=4, and AMPLITUDE=12.

The symbol (0,0) means 'End of block' (EOB) and terminates each 8x8 block.
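The formation of the intermediate symbols can be sketched as follows; the structure name and helper functions are illustrative only and not part of the JPEG specification.

#include <stdlib.h>

typedef struct { int run, size, amplitude; } ACSymbol;

/* Number of bits needed to represent a nonzero amplitude (the SIZE category) */
static int size_of(int amp) {
    int a = abs(amp), bits = 0;
    while (a) { bits++; a >>= 1; }
    return bits;
}

/* Convert the 63 zig-zag ordered AC coefficients into intermediate symbols.
   Returns the number of symbols written; the sequence ends with the (0,0) EOB symbol. */
int form_ac_symbols(const int ac[63], ACSymbol *out) {
    int n = 0, run = 0;
    for (int k = 0; k < 63; k++) {
        if (ac[k] == 0) { run++; continue; }
        while (run > 15) {                      /* (15,0) extension symbols for long runs */
            out[n++] = (ACSymbol){15, 0, 0};
            run -= 16;
        }
        out[n++] = (ACSymbol){run, size_of(ac[k]), ac[k]};
        run = 0;
    }
    out[n++] = (ACSymbol){0, 0, 0};             /* End of block (EOB) */
    return n;
}

For the sequence in the example above, this produces (6,9)(476), and a run of 39 zeros is encoded as two (15,0) extensions followed by (7, SIZE)(AMPLITUDE), matching the behavior described in the text.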

For DC coefficients, the intermediate symbol representation consists of:

• Symbol-1 (SIZE), and


• Symbol-2 (AMPLITUDE).

Because DC coefficients are differentially encoded, this range is double the


range of AC coefficients, and is [-2048, +2047].

The second step in Huffman coding is converting the intermediate symbol


sequence into a binary sequence. In this phase, symbols are replaced with
variable length codes, beginning with the DC coefficient, and continuing with
the AC coefficients.

Each Symbol-1 (both for DC and AC coefficients) is encoded with a Variable-


Length Code (VLC), obtained from the Huffman table set specified for each
image component. The generation of Huffman tables is discussed in [PM93].
Symbols-2 are encoded using a Variable-Length Integer (VLI) code.

For example, for an AC coefficient represented by the symbols:

(1,4) (12)

the binary representation will be (1111101101100), where (111110110) is the VLC obtained from the Huffman table, and (1100) is the VLI code for 12.

In sequential JPEG decoding, all the steps from the encoding process are inverted and implemented in reverse order, as shown in Figure 3. First, an entropy decoder (such as Huffman) is applied to the compressed image data. The binary sequence is converted to a symbol sequence using Huffman tables (VLC coefficients) and VLI decoding, and then the symbols are converted into DCT coefficients. Then, the dequantization is implemented using the following function:

F'(u,v) = Fq(u,v) x Q(u,v)    (2.8)

where Q(u,v) are quantization coefficients obtained from the quantization ta-
ble.

Then, the Inverse Discrete Cosine Transform (IDCT) is implemented on de-


quantized coefficients to convert the image from frequency domain into spatial
domain. The IDCT equation is defined as:
f(x,y) = (1/4) sum_{u=0}^{7} sum_{v=0}^{7} C(u) C(v) F(u,v) cos[(2x+1)u*pi/16] cos[(2y+1)v*pi/16]    (2.9)

where:

C(u) = 1/sqrt(2) for u = 0, and C(u) = 1 for u > 0;
C(v) = 1/sqrt(2) for v = 0, and C(v) = 1 for v > 0.

The last step consists of shifting the decompressed samples back into the range [0, 2^p - 1].

2.2 Compression Measures


The basic measure for the performance of a compression algorithm is Compres-
sion Ratio (CR), defined as:

CR = Original data size / Compressed data size    (2.10)

There is a trade-off between the compression ratio and the picture quality.
Higher compression ratios will produce lower picture quality and vice versa.
Quality and compression can also vary according to source image characteris-
tics and scene content. A measure for the quality of the picture, proposed in
[Wal91], is the number of bits per pixel in the compressed image (Nb). This

measure is defined as the total number of bits in the compressed image divided
by the number of pixels:

Nb = Encoded number of bits / Number of pixels    (2.11)

According to this measure, four different picture qualities are defined [Wal91],
as shown in Table 3.

Nb [bits/pixel]    Picture Quality
0.25 - 0.5         Moderate to good quality
0.5 - 0.75         Good to very good quality
0.75 - 1.5         Excellent quality
1.5 - 2.0          Usually indistinguishable from the original

Table 3: Picture quality characteristics.

Another statistical measure, that can be used to evaluate various compression


algorithms, is the Root Mean Square (RMS) error, calculated as:

RMS = sqrt[ (1/n) sum_{i=1}^{n} (x_i - x'_i)^2 ]    (2.12)

where:

x_i - original pixel values,

x'_i - pixel values after decompression,

n - total number of pixels in an image.

The RMS shows the statistical difference between the original and decom-
pressed images. In most cases the quality of a decompressed image is better
with lower RMS. However, in some cases it may happen that the quality of a
decompressed image with higher RMS is better than one with lower RMS.
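The three measures translate directly into code; a small sketch (with function names of our choosing) is given below.

#include <math.h>

double compression_ratio(double original_bits, double compressed_bits) {
    return original_bits / compressed_bits;              /* eq. (2.10) */
}

double bits_per_pixel(double encoded_bits, double num_pixels) {
    return encoded_bits / num_pixels;                    /* eq. (2.11) */
}

/* Root mean square error between original and decompressed pixels, eq. (2.12) */
double rms_error(const unsigned char *x, const unsigned char *xd, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double d = (double)x[i] - (double)xd[i];
        sum += d * d;
    }
    return sqrt(sum / n);
}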

2.3 Sequential JPEG Encoding Example


To illustrate all steps in baseline sequential JPEG encoding, we present the
step-by-step procedure and obtained results in encoding an 8x8 block of 8-bit
samples, as illustrated in Figure 6. The original 8x8 block is shown in Figure
6a; this block after shifting is given in Figure 6b. After applying the FDCT,
the obtained DCT coefficients are given in Figure 6c. Note that, except for
low-frequency coefficients, all other coefficients are close to zero.

For the generation of quantization tables, we used the program proposed in


[Nel92]:

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        Q[i][j] = 1 + (1 + i + j) * quality;

The parameter quality specifies the quality factor, and its recommended range
is from 1 to 25. Quality = 1 gives the best quality, but the lowest compression
ratio, and quality = 25 gives the worst quality and the highest compression
ratio. In this example, we used quality = 2, which generates the quantization
table shown in Figure 6d.

After implementing quantization, the obtained quantized coefficients are shown


in Figure 6e. Note that a large number of high-frequency AC coefficients are
equal to zero.

The zig-zag ordered sequence of quantized coefficients is shown in Figure 6f,


and the intermediate symbol sequence in Figure 6g. Finally, after implementing
Huffman codes, the obtained encoded bit sequence is shown in Figure 6h.

Figure 6: Step-by-step procedure in JPEG sequential encoding of an 8x8 block: (a) original 8x8 block; (b) shifted block; (c) block after FDCT; (d) quantization table (quality=2); (e) block after quantization; (f) zig-zag sequence of quantized coefficients; (g) intermediate symbol sequence; (h) encoded bit sequence (98 bits in total).

The Huffman table used in this example is the one proposed in the JPEG standard for luminance AC coefficients [PM93]; the partial table needed to code the symbols from Figure 6g is given in Table 4.

(RUNLENGTH, SIZE)    Code word
(0,0) EOB            1010
(0,1)                00
(0,2)                01
(0,3)                100
(1,2)                11011
(2,1)                11100
(3,1)                111010
(4,1)                111011
(5,2)                11111110111
(6,1)                1111011
(7,1)                11111010

Table 4: Partial Huffman table for luminance AC coefficients.

Note that the DC coefficient is treated as being from the first 8x8 block in the
image, and therefore it is coded directly (not using predictive coding as all the
remaining DC coefficients).

For this block, the compression ratio is calculated as:

CR = Original number of bits / Encoded number of bits = (64 x 8) / 98 = 512 / 98 = 5.22

and the number of bits per pixel in the compressed form is:

Nb = Encoded number of bits / Number of pixels = 98 / 64 = 1.53

2.4 JPEG Compression of Color Images



The described sequential JPEG algorithm can easily be expanded for compression of color images or, in the general case, for compression of multiple-component images. The JPEG source image model consists of 1 to 255 image components [Wal91, Ste94], called color or spectral bands, as illustrated in Figure 7.

Figure 7: JPEG color image model.

For example, both RGB and YUV representations consist of three color components. Each component may have a different number of pixels along the horizontal (Xi) and vertical (Yi) axes. Figure 8 illustrates two cases of a color image with 3 components. In the first case, all three components have the

The color components can be processed in two ways:

(a) Non-interleaved data ordering, in which processing is performed compo-


nent by component from left-to-right and top-to-bottom. In this mode, for
a RGB image with high resolution, the red component will be displayed
first, then the green component, and finally the blue component.
(b) Interleaved data ordering, in which different components are combined
into so-called Minimum Coded Units (MCUs). Interleaved data ordering
is used for applications that need to display or print multiple-component
images in parallel with their decompression.

Block diagrams of the encoder and decoder for color JPEG compression are identical to those for grayscale image compression, shown in Figure 3, except that the first block of the encoder is a color space conversion block (for example, RGB to YUV conversion), and the last block of the decoder is the inverse color conversion, such as YUV to RGB.

Figure 8: A color image with 3 components: (a) with the same resolutions, (b) with different resolutions.

Figure 9 presents results of the full-color JPEG codec designed at the Multimedia Laboratory at Florida Atlantic University [F+95]. The figure shows the original image, the decompressed images for quality factors of 1, 5, 10, and 25, and the decompressed images obtained using only three and one DCT coefficients, respectively. Figures 10 and 11 show the MS Windows user interface for the FAU JPEG codec, including compression results for different quality factors.
Figure 9: An example of JPEG compression using FAU's JPEG codec. The codec was developed by D. Schenker [F+95].

Figure 10: User interface for FAU's JPEG codec and compression results for quality factors 1, 5, and 10.

Figure 11: FAU's JPEG codec results for quality factor 25, and for three and one DCT coefficients.

3 PX64 COMPRESSION ALGORITHM


FOR VIDEO COMMUNICATIONS
The H.261 standard, commonly called px64 Kbps, is optimized to achieve very
high compression ratios for full-color, real-time motion video transmission. The
px64 compression algorithm combines intraframe and interframe coding to pro-
vide fast processing for on-the-fly video compression and decompression. In-
traframe coding refers to the coding of individual frames, while interframe cod-
ing is the coding of a frame in reference to the previous or future frames.

The px64 standard is optimized for applications such as video-based telecom-


munications. Because these applications are usually not motion-intensive, the
algorithm uses limited motion search and estimation strategies to achieve higher
compression ratios. For standard video communication images, compression ra-
tios of 100:1 to over 2000:1 can be achieved.

The px64 Kbps compression standard is intended to cover the entire ISDN
channel capacity (p = 1, 2, ... 30). For p = 1 to 2, due to limited available
bandwidth, only desktop face-to-face visual communications (videophone) can
be implemented using this compression algorithm. However, for p ≥ 6, more complex pictures can be transmitted, and the algorithm is suitable for videoconferencing applications.

3.1 CCITT Video Format


The px64 algorithm operates with two picture formats adopted by the CCITT, the Common Intermediate Format (CIF) and Quarter-CIF (QCIF) [Lio91], as illustrated in Table 5.

The intended applications of this standard are videophone and videoconferencing. The following examples illustrate the need for a video compression algorithm for these applications.

Example 1: Desktop videophone application

For a desktop videophone application, if we assume p=1, the available ISDN network bandwidth is BA = 64 Kbits/sec. If the QCIF format is used, the required number of bits per frame, consisting of one luminance and two chrominance components, is:
                     CIF                          QCIF
                 Lines/frame   Pixels/line    Lines/frame   Pixels/line
Luminance (Y)        288           352            144           176
Chrominance (Cb)     144           176             72            88
Chrominance (Cr)     144           176             72            88

Table 5: Parameters of the CCITT video formats.

Nb = (144 X 176 + 72 X 88 + 72 X 88) X 8 bits = 300 Kbits/frame

If the data is transmitted at 10 frames/sec, the required bandwidth is:

Br = 300 Kbits/frame X 10 frames/sec = 3 Mbits/sec

As a consequence, a video compression algorithm should provide compression


ratio of minimum:

CR = Br / BA = 3 Mbits/sec / 64 Kbits/sec ≈ 47

Example 2: Videoconferencing application

Assuming p=10 for a videoconferencing application, the available ISDN network bandwidth becomes BA = 640 Kbits/sec. If the CIF format is used, the
total number of bits per frame becomes:

Nb = (288 X 352 + 144 x 176 + 144 x 176) x 8 bits = 1.21 Mbits/frame

Assuming a frame rate of 30 frames/sec, the required bandwidth for the trans-
mission of videoconferencing data becomes:

Br = 1.21 Mbits/frame x 30 frames/sec = 36.4 Mbits/sec

Therefore, a video compression algorithm should provide compression ratio of


minimum:

CR = Br / BA = 36.4 Mbits/sec / 640 Kbits/sec ≈ 57
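The two examples can be verified with a short calculation; the sketch below recomputes the required compression ratios from the Table 5 frame sizes (the exact results, about 47.5 and 57, differ slightly from the figures in the examples because the examples round the intermediate bit rates).

#include <stdio.h>

/* Required compression ratio = (uncompressed source bit rate) / (available ISDN bandwidth) */
static double required_ratio(int lines_y, int pels_y, int lines_c, int pels_c,
                             double fps, double p) {
    double bits_per_frame = (lines_y * pels_y + 2.0 * lines_c * pels_c) * 8.0;
    double source_rate  = bits_per_frame * fps;    /* bits per second */
    double channel_rate = p * 64000.0;             /* p x 64 Kbits/sec */
    return source_rate / channel_rate;
}

int main(void) {
    /* Example 1: QCIF at 10 frames/sec over a single 64 Kbits/sec channel (p = 1) */
    printf("QCIF, 10 fps, p=1 : %.1f\n", required_ratio(144, 176, 72, 88, 10.0, 1.0));
    /* Example 2: CIF at 30 frames/sec over p = 10 channels */
    printf("CIF, 30 fps, p=10: %.1f\n", required_ratio(288, 352, 144, 176, 30.0, 10.0));
    return 0;
}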

3.2 Px64 Encoder and Decoder

Algorithm Structure
The px64 video compression algorithm combines intraframe and interframe
coding to provide fast processing for on-the-fly video. The algorithm consists
of:

• DCT-based intraframe compression, which similarly to JPEG, uses DCT,


quantization and entropy coding, and
• Predictive interframe coding based on Differential Pulse Code Modulation
(DPCM) and motion estimation.

The block diagram of the px64 encoder is presented in Figure 12.

The algorithm begins by coding an intraframe block using the DCT transform
coding and quantization (intraframe coding), and then sends it to the video
multiplex coder. The same frame is then decompressed using the inverse quan-
tizer and IDCT, and then stored in the picture memory for interframe coding.

During the interframe coding, the prediction based on the DPCM algorithm
is used to compare every macro block of the actual frame with the available
macro blocks of the previous frame, as illustrated in Figure 13. To reduce the
encoding delay, only the closest previous frame is used for prediction.

Then, the difference is created as error terms, DCT-coded and quantized, and
sent to the video multiplex coder with or without the motion vector. At the
final step, entropy coding (such as a Huffman encoder) is used to produce a more compact code.

Figure 12: Block diagram of the px64 encoder.

Figure 13: The principle of interframe coding in px64 video compression.

For interframe coding, the frames are encoded using one of the
following three techniques:

• DPCM coding with no motion compensation (zero-motion vectors),


• DPCM coding with non-zero motion vectors,
• Blocks are filtered by an optional predefined filter to remove high-frequency
noise.

At least one in every 132 picture frames should be intraframe coded.



A typical px64 decoder, shown in Figure 14, consists of the receiver buffer, the Huffman decoder, the inverse quantizer, the IDCT block, and the motion-compensation predictor, which includes the frame memory.

Figure 14: Block diagram of the px64 decoder.

The motion estimation algorithms are briefly discussed in the next section on
the MPEG algorithm.

Video Data Structure


According to the H.261 standard, a data stream has a hierarchical structure
consisting of Pictures, Groups of Blocks (GOB), Macro Blocks (MB) and Blocks
[Lio91, A+93b]. A Macro Block is composed of four (8 x 8) luminance (Y)
blocks, and two (8 x 8) chrominance (Cr and Cb ) blocks, as illustrated in
Figure 15.

A Group of Blocks is composed of 3 x 11 MBs. A CIF Picture contains 12 GOBs,


while a QCIF Picture consists of 4 GOBs. The hierarchical block structure is
shown in Figure 16.

Each of the layers contains headers, which carry information about the data
that follows. For example, a picture header includes a 20-bit picture start code,
video format (CIF or QCIF), frame number, etc. A detailed structure of the
headers is given in [Lio91].

Figure 15: The composition of a Macro Block: MB = 4Y + Cb + Cr.

4 MPEG COMPRESSION FOR


MOTION-INTENSIVE APPLICATIONS
The MPEG compression algorithm is intended for compression of full-motion
video. The compression method uses interframe compression and can achieve
compression ratios of 200:1 through storing only the differences between suc-
cessive frames. The MPEG approach is optimized for motion-intensive video
applications, and its specification also includes an algorithm for the compres-
sion of audio data at ratios ranging from 5:1 to 10:1.

The MPEG first-phase standard (MPEG-1) is targeted for compression of


320 x 240 full-motion video at rates of 1 to 1.5 Mb/s in applications such as interactive multimedia and broadcast television. The MPEG-2 standard is intended for higher resolutions, similar to the digital video studio standard CCIR 601, EDTV, and further leading to HDTV. It specifies compressed bit streams for high-quality digital video at rates of 2-80 Mb/s. The MPEG-2 standard
supports interlaced video formats and a number of features for HDTV. The
MPEG-2 standard is also addressing scalable video coding for a variety of ap-
plications which need different image resolutions, such as video communications
over ISDN networks using ATM [Ste94, LeG91].

Figure 16: Hierarchical block structure of the px64 data stream.

The MPEG-4 standard is intended for compression of full-motion video consisting of small frames and requiring slow refreshments. The data rate required is 9-40 Kbps, and the target
applications include interactive multimedia and video telephony. This stan-
dard requires the development of new model-based image coding techniques for
human interaction and low-bit-rate speech coding techniques [Ste94].

Table 6 illustrates various motion-video formats and corresponding MPEG pa-


rameters.

The MPEG algorithm is intended for both asymmetric and symmetric applica-
tions. Asymmetric applications are characterized by frequent use of the decom-
pression process, while the compression process is performed once. Examples
include movies-on-demand, electronic publishing, and education and training.
Symmetric applications require equal use of the compression and decompression
processes. Examples include multimedia mail and videoconferencing.

When the MPEG standard was conceived, the following features were identified
as important: random access, fast forward/reverse searches, reverse playback,
Format      Video parameters       Compressed bit rate    Standard
SIF         352x240 at 30 Hz       1.2 - 3 Mbps           MPEG-1
CCIR 601    720x486 at 30 Hz       5 - 10 Mbps            MPEG-2
EDTV        960x486 at 30 Hz       7 - 15 Mbps            MPEG-2
HDTV        1920x1080 at 30 Hz     20 - 40 Mbps           MPEG-2

Table 6: Parameters of MPEG standards.

audio-visual synchronization, robustness to errors, editability, format flexibility,


and cost tradeoff. These features are described in detail in [LeG91].

The MPEG standard consists of three parts: (1) synchronization and multi-
plexing of video and audio, (2) video, and (3) audio.

4.1 MPEG Video Encoder and Decoder

Frame Structures
In the MPEG standard, frames in a sequence are coded using three different
algorithms, as illustrated in Figure 17.

I frames (intra images) are self-contained and coded using a DCT-based technique similar to JPEG. I frames are used as random access points in MPEG streams, and they give the lowest compression ratios within MPEG.

P frames (predicted images) are coded using forward predictive coding, where the actual frame is coded with reference to a previous frame (I or P). This process is similar to H.261 predictive coding, except that the previous frame is not always the closest previous frame, as in H.261 coding (see Figure 13). The compression ratio of P frames is significantly higher than that of I frames.
Figure 17: Types of frames in the MPEG standard.

B frames (bidirectional or interpolated images) are coded using two reference frames, a past and a future frame (which can be I or P frames). Bidirectional, or interpolated, coding provides the highest amount of compression.

Note that in Figure 17, the first three B frames (2, 3 and 4) are bidirectionally coded using the past I frame (frame 1) and the future P frame (frame 5). Therefore, the decoding order will differ from the encoding order. The P frame 5 must be decoded before B frames 2, 3 and 4, and I frame 9 before B frames 6, 7 and 8. If the MPEG sequence is transmitted over the network, the actual transmission order should be {1, 5, 2, 3, 4, 9, 6, 7, 8}.
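The reordering can be expressed as a small routine: each reference frame (I or P) is emitted first, followed by the B frames that were waiting for it. The sketch below is a simplification that assumes a closed group of pictures such as the one in Figure 17; the function and variable names are ours.

#include <stdio.h>

/* Reorder a group of frames from display order to transmission (decode) order:
   each reference frame (I or P) is sent before the B frames that depend on it. */
void display_to_transmission(const char *types, const int *display, int n, int *out) {
    int m = 0;
    int pending[16], npend = 0;            /* B frames waiting for their future reference */
    for (int i = 0; i < n; i++) {
        if (types[i] == 'B') {
            pending[npend++] = display[i];
        } else {                            /* I or P: emit it, then the waiting B frames */
            out[m++] = display[i];
            for (int k = 0; k < npend; k++) out[m++] = pending[k];
            npend = 0;
        }
    }
    for (int k = 0; k < npend; k++) out[m++] = pending[k];
}

int main(void) {
    const char *types = "IBBBPBBBI";
    int display[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    int tx[9];
    display_to_transmission(types, display, 9, tx);
    for (int i = 0; i < 9; i++) printf("%d ", tx[i]);   /* prints 1 5 2 3 4 9 6 7 8 */
    printf("\n");
    return 0;
}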

The MPEG application determines a sequence of I, P and B frames. If there


is a need for fast random access, the best resolution would be achieved by
coding the whole sequence as I frames (MPEG becomes identical to MJPEG).
However, the highest compression ratio can be achieved by incorporating a large
number of B frames . The following sequence has been proven very effective for
a number of practical applications [Ste94]:

(I B B P B B P B B) (I B B P B B P B B) ...

In the case of 25 frames/s, random access will be provided through 9 still


frames (I and P frames), which is about 360 ms [Ste94]. On the other hand,
this sequence will allow a relatively high compression ratio.

Motion Estimation
The coding process for P and B frames includes the motion estimator, which
finds the best matching block in the available reference frames. P frames are
always using forward prediction, while B frames use bidirectional prediction, also called motion-compensated interpolation, as illustrated in Figure 18 [A+93b].

Figure 18: Motion compensated interpolation implemented in MPEG. Each block in the current frame is interpolated using blocks from a previous and a future frame.

B frames can use forward prediction, backward prediction, or interpolation. A block in the current frame (a B frame) can be predicted by another block from the past reference frame (B = A, forward prediction), or from the future reference frame (B = C, backward prediction), or by the average of two blocks (B = (A + C)/2, interpolation).

Motion estimation is used to extract the motion information from the video
sequence. For every 16 x 16 block of P and B frames, one or two motion
vectors are calculated. One motion vector is calculated for P and forward
and backward predicted B frames, while two motion vectors are calculated for
interpolated B frames.

The MPEG standard does not specify the motion estimation technique, however
block-matching techniques are likely to be used . In block-matching techniques,
the goal is to estimate the motion of a block of size (n x m) in the present
frame in relation to the pixels of the previous or the future frames. The block is compared with a corresponding block within a search area of size (m + 2p) x (n + 2p) in the previous (or the future) frame, as illustrated in Figure
19a. In a typical MPEG system, a match block (or a macroblock) is 16 x 16
pixels (n = m = 16), and the parameter p=6 (Figure 19b) .

Figure 19: The search area in block-matching techniques for motion vector estimation: (a) a general case, (b) a typical case for MPEG (n = m = 16, p = 6). F is a macroblock in the current frame, G is the search area in a previous (or a future) frame.

Many block matching techniques for motion vector estimation have been de-
veloped and evaluated in the literature, such as: (a) the exhaustive search (or
brute force) algorithm, (b) the three-step-search algorithm [K +81, L+94], (c)
the 2-D logarithmic search algorithm [JJ81], (d) the conjugate direction search
algorithm [SR85], (e) the parallel hierarchical 1-D search algorithm [C+91],
and (f) the modified pixel-difference classification, layered structure algorithm
[C+94] . These algorithms are described in detail in [FSZ95] .

These block matching techniques for motion estimation obtain the motion vec-
tor by minimizing a cost function. The following cost functions have been
proposed in the literature:

(a) The Mean-Absolute Difference (MAD), defined as:

MAD(dx,dy) = (1/mn) sum_{i=-n/2}^{n/2} sum_{j=-m/2}^{m/2} |F(i,j) - G(i+dx, j+dy)|    (2.13)

where:

F(i, j) - represents a (m x n) macroblock from the current frame,

G(i, j) - represents the same macroblock from a reference frame (past or future),

(dx, dy) - a vector representing the search location.

The search space is specified by: dx = {-p,+p} and dy = {-p,+p}.

For a typical MPEG system, with m = n = 16 and p = 6, the MAD function becomes:

MAD(dx,dy) = (1/256) sum_{i=-8}^{8} sum_{j=-8}^{8} |F(i,j) - G(i+dx, j+dy)|    (2.14)

with dx = {-6,6} and dy = {-6,6}.

(b) The Mean-Squared Difference (MSD) cost function is defined as:

MSD(dx,dy) = (1/mn) sum_{i=-n/2}^{n/2} sum_{j=-m/2}^{m/2} [F(i,j) - G(i+dx, j+dy)]^2    (2.15)

(c) The Cross-Correlation Function (CCF), defined as the normalized cross-correlation between the current macroblock and the displaced macroblock in the reference frame:

CCF(dx,dy) = [sum_i sum_j F(i,j) G(i+dx, j+dy)] / [sqrt(sum_i sum_j F(i,j)^2) * sqrt(sum_i sum_j G(i+dx, j+dy)^2)]    (2.16)

The mean absolute difference (MAD) cost function is considered a good candidate for video applications, because it is easy to implement in hardware. The other two cost functions, MSD and CCF, can be more efficient, but they are too complex for hardware implementations.

To reduce the computational complexity of the MAD, MSD, and CCF cost functions, Gharavi and Mills have proposed a simple block matching criterion, called Pixel Difference Classification (PDC) [GM90]. The PDC criterion is defined as:

PDC(dx,dy) = sum_i sum_j T(dx, dy, i, j)    (2.17)

for (dx, dy) = {-p, p}.

T(dx, dy, i, j) is the binary representation of the pixel difference, defined as:

T(dx, dy, i, j) = 1, if |F(i,j) - G(i+dx, j+dy)| <= t,
T(dx, dy, i, j) = 0, otherwise.    (2.18)

where t is a pre-defined threshold value.

In this way, each pixel in a macroblock is classified as either a matching pixel (T=1) or a mismatching pixel (T=0). The block that maximizes the PDC
function is selected as the best matched block.
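As an illustration of the exhaustive (brute force) search with the MAD criterion of equation (2.13), a C sketch is given below; minimizing the sum of absolute differences is equivalent to minimizing MAD, since the 1/mn factor is constant. The frame layout (one byte per pixel, stored row by row) and all names are our assumptions.

#include <limits.h>
#include <stdlib.h>

#define MB 16   /* macroblock size */
#define P   6   /* maximum displacement in each direction */

/* Sum of absolute differences between a macroblock at (x,y) in the current
   frame and the displaced macroblock at (x+dx, y+dy) in the reference frame. */
static long sad(const unsigned char *cur, const unsigned char *ref,
                int width, int x, int y, int dx, int dy) {
    long s = 0;
    for (int i = 0; i < MB; i++)
        for (int j = 0; j < MB; j++)
            s += labs((long)cur[(y + i) * width + (x + j)] -
                      (long)ref[(y + dy + i) * width + (x + dx + j)]);
    return s;
}

/* Exhaustive search over dx, dy in [-P, P]; returns the best motion vector. */
void motion_search(const unsigned char *cur, const unsigned char *ref,
                   int width, int height, int x, int y,
                   int *best_dx, int *best_dy) {
    long best = LONG_MAX;
    *best_dx = 0;
    *best_dy = 0;
    for (int dy = -P; dy <= P; dy++) {
        for (int dx = -P; dx <= P; dx++) {
            if (x + dx < 0 || y + dy < 0 ||
                x + dx + MB > width || y + dy + MB > height)
                continue;                               /* candidate falls outside the frame */
            long s = sad(cur, ref, width, x, y, dx, dy);
            if (s < best) { best = s; *best_dx = dx; *best_dy = dy; }
        }
    }
}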

Using a block-matching motion estimation technique, the best motion vector(s)


is found, which specifies the space distance between the actual and the reference
macroblocks. The macroblock in the current frame is then predicted based on a
macroblock in a previous frame (forward prediction), a macroblock in a future
frame (backward prediction), or using interpolation between macroblocks in
a previous and a future frame. A macroblock in the current frame F( i, j) is
predicted using the following expression:

Fp(i,j) = G(i + dx, j + dy)    (2.19)

for (i,j) = {-8, 8}.



Fp(i,j) is the predicted current macroblock, G(i,j) is the same macroblock in a previous/future frame, and (dx, dy) is the estimated motion vector.

For interpolated frames, a macroblock in the current frame F(i,j) is predicted using the following formula:

Fp(i,j) = [G1(i + dx1, j + dy1) + G2(i + dx2, j + dy2)] / 2    (2.20)

where (i,j) = {-8, 8}.

G1(i,j) is the same macroblock in a previous frame and (dx1, dy1) is the corresponding motion vector; G2(i,j) is the same macroblock in a future frame and (dx2, dy2) is its corresponding motion vector.

The difference between predicted and actual macroblocks, called the error terms
E(i,j), is then calculated using the following expression:

E(i,j) = F(i,j) - Fp(i,j) (2.21)

for (i,j) = {-8, 8}.
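Once the motion vectors are known, the predicted macroblocks and error terms of equations (2.19)-(2.21) follow directly; the sketch below covers the forward-predicted and the interpolated cases, under the same frame-layout assumptions as the previous sketch.

/* Error terms E = F - Fp for a forward-predicted 16x16 macroblock, eqs. (2.19) and (2.21) */
void forward_error(const unsigned char *cur, const unsigned char *ref, int width,
                   int x, int y, int dx, int dy, int E[16][16]) {
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < 16; j++)
            E[i][j] = cur[(y + i) * width + (x + j)]
                    - ref[(y + dy + i) * width + (x + dx + j)];
}

/* Error terms for an interpolated (B) macroblock, eq. (2.20): Fp = (G1 + G2) / 2 */
void interpolated_error(const unsigned char *cur,
                        const unsigned char *prev, const unsigned char *next, int width,
                        int x, int y, int dx1, int dy1, int dx2, int dy2, int E[16][16]) {
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < 16; j++) {
            int g1 = prev[(y + dy1 + i) * width + (x + dx1 + j)];
            int g2 = next[(y + dy2 + i) * width + (x + dx2 + j)];
            E[i][j] = cur[(y + i) * width + (x + j)] - (g1 + g2) / 2;
        }
}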

Block diagram of a MPEG-1 encoder, which includes motion predictor and


motion estimation, is shown in Figure 20, while a typical MPEG-1 decoder is
given in Figure 21.

I frames are created similarly to JPEG encoded pictures, while P and B frames
are encoded in terms of previous and future frames. The motion vector is
estimated, and the difference between the predicted and actual blocks (error
terms) are calculated. The error terms are then DCT encoded and finally the
entropy encoder is used to produce the compact code.

4.2 Audio Encoder and Decoder


The MPEG standard also covers audio compression. MPEG uses the same
sampling frequencies as compact disc digital audio (CD-DA) and digital audio
tape (DAT). Besides these two frequencies, 44.1 KHz and 48 KHz, 32 KHz is
also supported, all at 16 bits.

Figure 20: Block diagram of a typical MPEG-1 encoder.

Figure 21: Block diagram of a typical MPEG-1 decoder.

The audio data on a compact disc, with 2 channels of audio samples at 44.1 KHz and 16 bits/sample, requires a data rate of about 1.4 Mbits/s [Pen93]. Therefore, there is a need to compress audio data as well.

Existing audio compression techniques include µ-law and Adaptive Differential Pulse Code Modulation (ADPCM), which are both of low complexity, provide low compression ratios, and offer medium audio quality. The MPEG audio com-
pression algorithm is of high complexity, but offers high compression ratios and
high audio quality. It can achieve compression ratios ranging from 5:1 to 10:1.

The MPEG audio compression algorithm comprises the following three operations:

• The audio signal is first transformed into the frequency domain, and the
obtained spectrum is divided into 32 non-interleaved subbands.
• For each subband, the amplitude of the audio signal is calculated, and the noise level is determined using a "psychoacoustic model." The psychoacoustic model is the key component of the MPEG audio encoder; its function is to analyze the input audio signal and determine where in the spectrum the quantization noise should be masked.
• Finally, each subband is quantized according to the audibility of quantiza-
tion noise within that band.

The MPEG audio encoder and decoder are shown in Figure 22 [Pen93, Ste94].

Figure 22: Block diagrams of the MPEG audio encoder and decoder.

The input audio stream simultaneously passes through a filter bank and a psychoacoustic model. The filter bank divides the input into multiple subbands, while the psychoacoustic model determines the signal-to-mask ratio of each


subband. The bit or noise allocation block uses the signal-to-mask ratios to
determine the number of bits for the quantization of the subband signals, with the goal of minimizing the audibility of the quantization noise. The last block performs entropy (Huffman) encoding and formats the data. The decoder performs entropy (Huffman) decoding, then reconstructs the quantized subband values, and transforms the subband values into a time-domain audio signal.

The MPEG audio standard specifies three layers for compression: layer 1 repre-
sents the most basic algorithm and provides the maximum rate of 448 Kbits/sec,
layers 2 and 3 are enhancements to layer 1 and offer 384 Kbits/sec and 320
Kbits/sec, respectively. Each successive layer improves the compression perfor-
mance, but at the cost of greater encoder and decoder complexity.

A detailed description of audio compression principles and techniques can be


found in [Pen93].

4.3 MPEG Data Stream


The MPEG standard specifies a syntax for the interleaved audio and video data
streams. An audio data stream consists of frames, which are divided into audio
access units. Audio access units consist of slots, which can be either four bits
at the lowest complexity layer (layer 1), or one byte at layers 2 and 3. A frame
always consists of a fixed number of samples. Audio access unit specifies the
smallest audio sequence of compressed data that can be independently decoded.
The playing times of the audio access units of one frame are 8 ms at 48 KHz,
8.7 ms at 44.1 KHz. and 12 ms at 32 KHz [Ste94]. A video data stream consists
of six layers, as shown in Table 7.
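The playing times quoted above follow directly from the fixed number of samples per frame. The short sketch below reproduces them, assuming the 384-sample frame length of layer 1 (layers 2 and 3 use 1152 samples per frame):

# Playing time of one layer-1 audio frame at the three MPEG-1 sampling rates.
SAMPLES_PER_FRAME = 384            # layer 1; layers 2 and 3 use 1152 samples per frame

for rate in (48000, 44100, 32000):
    ms = 1000.0 * SAMPLES_PER_FRAME / rate
    print(f"{rate} Hz -> {ms:.1f} ms")   # 8.0 ms, 8.7 ms, and 12.0 ms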

At the beginning of the sequence layer there are two entries: the constant bit
rate of a sequence and the storage capacity that is needed for decoding. These
parameters define the data buffering requirements. A sequence is divided into
a series of GOPs. Each GOP layer has at least one I frame as the first frame in
the GOP, so that random access and fast search are enabled. GOPs can be of
arbitrary structure (I, P, and B frames) and length. The GOP layer is the basic
unit for editing an MPEG video stream.
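Because a B frame is predicted from both a past and a future reference picture, the order in which pictures appear in the bit stream differs from their display order: each I or P reference is transmitted before the B frames that depend on it. The sketch below illustrates this reordering for a hypothetical GOP pattern (a closed GOP is assumed for simplicity):

# Reorder a GOP from display order to decode (bit-stream) order:
# each I or P reference picture must precede the B pictures that refer to it.
def display_to_decode_order(gop):
    decode, pending_b = [], []
    for frame in gop:
        if frame.startswith("B"):
            pending_b.append(frame)     # hold B frames until their future reference arrives
        else:
            decode.append(frame)        # emit the reference first ...
            decode.extend(pending_b)    # ... then the B frames that preceded it in display order
            pending_b = []
    decode.extend(pending_b)
    return decode

gop_display = ["I1", "B2", "B3", "P4", "B5", "B6", "P7"]
print(display_to_decode_order(gop_display))   # ['I1', 'P4', 'B2', 'B3', 'P7', 'B5', 'B6']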

The picture layer contains a whole picture (or a frame). This information
consists of the type of the frame (I, P, or B) and the position of the frame in
display order.

Syntax layer               Functionality
Sequence layer             Context unit
Group of pictures layer    Random access unit: video coding
Picture layer              Primary coding unit
Slice layer                Resynchronization unit
Macroblock layer           Motion compensation unit
Block layer                DCT unit

Table 7 Layers of MPEG video stream syntax.

The bits corresponding to the DCT coefficients and the motion vectors are
contained in the next three layers: the slice, macroblock, and block layers. The
block is an (8x8) DCT unit, the macroblock is a (16x16) motion compensation
unit, and the slice is a string of macroblocks of arbitrary length. The slice layer
is intended to be used for resynchronization during frame decoding when bit
errors occur.
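For a concrete sense of these units, the sketch below counts the macroblocks and blocks in one picture; the 352x240 resolution and the 4:2:0 macroblock structure (four luminance and two chrominance blocks) used here are only an illustrative example:

# Count (16x16) macroblocks and (8x8) blocks in one 352x240 picture (4:2:0 sampling assumed).
width, height = 352, 240

mb_cols, mb_rows = width // 16, height // 16
macroblocks = mb_cols * mb_rows              # motion compensation units
blocks = macroblocks * 6                     # 4 luminance + 2 chrominance DCT blocks per macroblock

print(macroblocks, blocks)                   # 330 macroblocks, 1980 blocks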

5 CONCLUSION
In this chapter we presented three video and image compression standards and
related algorithms applied in multimedia applications: JPEG, H.261 (or px64),
and MPEG. The most popular uses of JPEG image compression technology in-
clude photo ID systems, telecommunications of images, military image systems,
and distributed image management systems.

Four distinct applications of the compressed video, based on H.261 and MPEG
standards, can be summarized as: (a) consumer broadcast television, (b) con-
sumer playback, (c) desktop video, and (d) videoconferencing.

Consumer broadcast television includes home digital video delivery and typi-
cally requires a small number of high-quality compressors and a large number
of low-cost decompressors. Expected compression ratio is about 50:1.

Consumer playback applications, such as CD-ROM libraries and interactive
games, also require a small number of compressors with a large number of low-
cost decompressors. However, the required compression ratio, about 100:1, is
much greater than that required for cable transmission.

Desktop video, which includes systems for authoring and editing video presen-
tations, is a symmetrical application requiring the same number of encoders
and decoders. The expected compression ratio is relatively small in the range
2-50:1.

Videoconferencing application also requires the same number of encoders and


decoders, the compression and decompression must be done in real time, and
the expected compression ratio is about 100:1.

Other promising multimedia compression techniques, which are not yet stan-
dards, include wavelet-based compression, subband coding, and fractal com-
pression [FSZ95].

REFERENCES
[A+93b] R. Aravind, G. L. Cash, D. C. Duttweiler, H.-M. Hang, B. G. Haskell,
and A. Puri, "Image and Video Coding Standards", AT&T Technical Jour-
nal, Vol. 72, January/February 1993, pp. 67-88.

[C+91] L. G. Chen, W. T. Chen, Y. S. Jehng, and T. D. Chiueh, "An Efficient


Parallel Motion Estimation Algorithm for Digital Image Processing" , IEEE
Transactions Circuits Systems, Vol. 1, 1991, pp. 378-385.

[C+94] E. Chan, A. A. Rodriguez, R. Gandhi, and S. Panchanathan, "Ex-


periments on Block-Matching Techniques for Video Coding", Journal of
Multimedia Systems, Vol. 2, No.5, 1994, pp. 228-241.

[Fox91] E. A. Fox, "Advances in Interactive Digital Multimedia Systems",


IEEE Computer, Vol. 24, No. 10, October 1991, pp. 9-21.

[F+95] B. Furht, J. Greenberg, T.W. Gresham, Y. Liao, D. Schenker, and


D. Sommers, "Color JPEG Compression Experiments", Technical Report,
Dept. of Computer Science and Engineering, Florida Atlantic University,
TR-CSE-95-33, August 1995.

[FSZ95] B. Furht, S. W. Smoliar, and HJ. Zhang, "Video and Image Processing
in Multimedia Systems" , Kluwer Academic Publishers, Norwell, MA, 1995.

[Fur95] B. Furht, "A Survey of Multimedia Techniques and Standards. Part


I: JPEG Standard", Journal of Real-Time Imaging, Vol. 1, No. 1, April
1995.

[GM90] H. Gharavi and M. Mills, "Block Matching Motion Estimation Algo-


rithms - New Results", IEEE Transactions Circuits and Systems, Vol. 37,
1990, pp. 649-651.
[HM94] A. C. Hung and T. H.-Y. Meng, "A Comparison of Fast Inverse Dis-
crete Cosine Transform Algorithms", Journal of Multimedia Systems, Vol.
2, No.5, 1994, pp. 204-217.
[JJ81] J. R. Jain and A. K. Jain, "Displacement Measurement and its Appli-
cation in Interframe Image Coding" , IEEE Transacations on Communica-
tions, Vol. 29, 1981, pp. 1799-1808.
[K+81] J. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishiguro, "Motion
Compensated Interframe Coding for Video Conferencing", Proc. of the
National Telecommunications Conference, pp. G5.3.1-5.3.5, 1981.

[L+94] W. Lee, Y. Kim, R. J. Gove, and C. J. Read, "Media Station 5000:


Integrating Video and Audio", IEEE Multimedia, Vol. 1, No.2, Summer
1994, pp. 50-61.
[LeG91] D. LeGall, "MPEG: A Video Compression Standard for Multimedia
Applications", Communications of the ACM, Vol. 34, No. 4, April 1991,
pp. 45-68.
[Lio94] M. Liou, "Overview of the Px64 Kbit/s Video Coding Standard",
Comm. of the ACM, Vol. 34, No. 4, April 1991, pp. 59-63.
[Nel92] M. Nelson, "The Data Compression Book", M&T Books, San Mateo,
California, 1992.
[Pen93] D. Y. Pen, "Digital Audio Compression", Digital Technical Journal,
Vol. 5, No.2, Spring 1993, pp. 28-40.
[PM93] W. B. Pennebaker and J. L. Mitchell, "JPEG Still Image Data Com-
pression Standard", Van Nostrand Reinhold, New York, 1993.
[SR85] R. Srinivasan and K. R. Rao, "Predictive Coding based on Efficient Mo-
tion Estimation" , IEEE Transactions on Communications, Vol. 33, 1985,
pp. 888-896.
[Ste94] R. Steinmetz, "Data Compression in Multimedia Computing - Stan-
dards and Systems", Parts I and II, Journal of Multimedia Systems, Vol. 1,
1994, pp. 166-172 and 187-204.
[Wal91] G. Wallace, "The JPEG Still Picture Compression Standard", Comm.
of the ACM, Vol. 34, No. 4, April 1991, pp. 30-44.
3
MULTIMEDIA INTERFACES:
DESIGNING FOR DIVERSITY
Meera Blattner
University of California, Davis
Lawrence Livermore National Laboratory

1 INTRODUCTION
Every scientist and engineer dreams of reducing a problem to a few basic prin-
ciples. Although our understanding of the human interface will continue to
grow, it is unlikely that we will ever see a time when there is a simple, compre-
hensive model for the user interface. User interfaces are becoming increasingly
diverse in their construction. Computer users, unless they are programmers,
use languages designed for specific applications or languages based on interac-
tion techniques, such as point-and-click, which require very little training to
use. The interfaces commonly used now by computer users, the mouse, key-
board and screen, will be confined to those with desk jobs, and others will use a
variety of input devices, from voice and pen to virtual reality interfaces. In the
future, many of our interfaces will receive input through observing users rather
than waiting for human action as in interfaces for tracking and teleconferencing.

The theme of this chapter is to examine the diversity of multimedia interfaces


in both design and applications - they are found in as many shapes and forms
as there are problems to solve. Section 2 is concerned with what an interface is,
for if we are to design interfaces we must know what they are. When you design
an interface, you design a language, that is, a formal method of communicating
with the computer. The problems of designing interfaces for such a disparate
set of applications are examined in Section 3. Audio interfaces and some of the
problems interface designers are attempting to solve with audio are examined
in Section 4. Last, six unusual applications and their interfaces are examined
in Section 5.

2 WHAT IS AN INTERFACE?

2.1 Formal systems


An interface is generally considered to be where two entities meet. In this chap-
ter, the definition of an interface is restricted to a human-computer interface,
so the entities are a computer and a human. An interface sometimes implies
a physical proximity, a thing, for example, a screen on the computer and what
you see on that screen. We are long past the point, however, when we consider
a computer interface as residing primarily on a screen. Suppose we speak to
our computer or we are immersed in a virtual world; suppose our computer is
tracking us as we walk down the hall. Where is the interface?

Human-computer interfaces enable humans and computers to communicate


with each other. In other words, they are systems composed of a language
together with the devices necessary to send/receive that language so that hu-
mans and computers may communicate. By a language we mean the interface
and the user have a set of conventions, symbols or signals they use in a uniform
way to communicate with each other. To be interpreted by a machine, it must
be a formal system where a syntax is defined. In other words, the symbols or
signals are structured according to very specific rules, such as a grammar, to
define the system. Then we assign meanings or semantics to the system.

It is tempting to compare a computer interface with communication between


people. People have eyes, ears, noses, and hands, in other words, senses or
modalities, to perceive and interact with the world around them. We have
mouths, voices, hands and we use body language for communication with other
people. Some communication requires the use of other physical objects, such
as pens, musical instruments, costumes, etc. On a computer, physical devices
are required for the language to be expressed as output, or to be read as input.
When a device generates output or when it reads input, we identify the inter-
face with a place or device. For example, a microphone and voice recognition
software are part of an interface between the human and computer. But more
important than the devices used is the spoken language that is also part of the
interface, in particular, the semantic content or meaning of the language. This
definition is clouded when the computer is being used to pass information from
one person to another without any comprehension of what the content may be,
such as in electronic mail, teleconferencing or computer-supported cooperative
work. Closer inspection reveals there is communication on many levels: we
may be communicating with another person through the interface; we may be

sending the operating system messages; or, we may be communicating with the
application.

2.2 What Do We Communicate?


If a human-computer interface is for communication, who and what are we
communicating with? Dannenberg and Blattner [15] considered three types
of communication: human-human, computer-computer, and human-computer.
In human-human communication we are facilitating the task of communicating
with other humans by using the computer to structure our tasks or bundling up
information and passing it through the system. In some cases, the computer
will assist us in making information more comprehensible, or even, making
communication between humans possible where it would not be otherwise, for
example, to translate from one language to another. In computer-computer
communication there is a communication standard or protocol to allow vari-
ous computers to translate information between each other. Communication
between humans and computers requires the translation of human expression
to machine language, and conversely. For many years programming languages
typed into computers on keyboards with pointing devices performed this task
successfully.

Multimedia is putting more demands on computer systems by requiring not


only that human forms of expression be translated into machine language, but
also that they can take different forms in multiple media, which are able to
interact with one another. We now leave programming languages to program-
mers and use forms of communication readily grasped by the non-expert. The
interface designer is challenged by new input devices that are constantly being
developed for multimodal input such as speech, gesture, eye-tracking, two- and
three-dimensional tracking, and so on. Later we give an example of a pressure-
sensitive input device. Equally challenging are the variety of output media.
Besides text, graphics, speech and nonspeech audio, and video, there are examples
of such new and unusual interfaces as those designed for touch, gesture and
force-feedback.

Dannenberg's Piano Tutor has an interface that can hear the notes of a piano
and compare them to notes on a page [16]. The output media for the Piano
Tutor are: video, pre-recorded voice, computer graphics display, and synthe-
sized music. Outputs may include text (remediation) appearing over notes on
a scale, diagrams of various types, voice comments, and music. The inputs are

music played by the student and selections made from menus. The inputs and
output are scattered over various physical locations.

In contrast to this, a virtual reality interface must sense different parts of


our bodies and interpret the meaning of gesture and movement. The virtual
reality interface is close to the body and the input is obtained from devices
that interpret our gesture, gaze, position, voice, and possibly, force-feedback.
The output is graphics, force feedback, and audio. The interface is in a helmet,
and goggles, and sometimes in a tightly fitted suit. The semantic content of
the language for the virtual world is designed to fit the application-is the user
shooting monsters or viewing a model kitchen? Tracking devices, microphones,
and loudspeakers may be part of some multimedia interfaces. They may be
located ten feet or more from our bodies and pick up sound, light or motion.
The input language is our voice, gestures and other changes to air pressure
and/or light waves. Telepresence requires our input devices to be located far
from our bodies, but to send signals from these transmissions to us in ways
similar to virtual reality input devices.

2.3 The Computer's Sensory Abilities


People communicate through many channels and sensory modalities. When we
temporarily lose one or more of our senses, we feel frustrated, unable to express
ourselves properly. Yet in spite of our frustration when our senses are limited,
we design computer interfaces with those same limitations. The reason is clear:
if you give someone who has never spoken or heard spoken language a voice,
he or she will be unable to use those new abilities.

The purpose of multimedia interfaces is to give the ability for more expression
to communication between computers and humans. Learning how to use mul-
timedia interfaces will take time. Computers have abilities that people don't
have, of course, or we would never build computers, we would just hire people
to do the job. In addition to speed and endurance, computers have sensory
and computational abilities people don't have. Devices can be fashioned for
computers that give them keener ears, better eyes, etc. In spite of all of this,
we have not been able to design computers that have the human abilities for
abstract thought and basic comprehension.

3 DESIGNING MULTIMEDIA
INTERFACES
User interfaces are hard to build. In this section we will examine some of the
reasons why this is the case. To design and build a user interface, the following
steps are generally executed: 1) Identify the needs of the users; 2) Construct
a scenario; 3) Find a metaphor; 4) Provide a design rationale; 5) Design the
system; 6) Build a prototype; 7) Test it; and 8) Modify the design.

For the multimedia interface, there is the added complexity of more component
parts and a more difficult integration. But the greatest difficulty of all is the
design of a multimedia interface that takes advantage of multiple senses.

We are only just beginning to learn how to do that. Not only is the interface
itself changing with the new technologies brought about by multimedia, but
the way we design interfaces is changing. Before media are selected or a syntax
and semantics for the interface are chosen, what does the designer do?

3.1 Identifying users' needs


More than ever before, multimedia interfaces are able to communicate with peo-
ple by using interaction that is natural and expressive to humans. Whenever we
try to define the word "natural," its meaning slips away from us. To be natural
is to be in accordance with nature, and that doesn't help much. The key to a
good interface is usability, not naturalness. When we speak, what do we say?
When we gaze, where do we look? When we use a stylus, how do we gesture?
Design and evaluation of interactions between people are more complex in mul-
timedia interfaces. For years human factors professionals came to the computer
industry armed with the methodology of experimental psychology. However,
the questions faced by design engineers are not adequately addressed by the
quantitative methods of psychology. The traditional approaches of functional
requirements based on evaluations of existing tools have not provided adequate
understanding of tool design. Participatory design, that is, bringing users into
the design of tools, is one response to this problem. However, participatory
design has not emphasized the role of the social and physical settings of the
user. How we interact with our environment depends a great deal upon our
culture and context. Because of the pioneering work of Winograd and Flores
[48], Suchman [41], Dreyfus and Dreyfus [19], interface designers see the use of
computers as characterized not by rationality, planning, and reflection, but within
a social context that is familiar to us.

The problem of understanding the user's needs and work patterns is partic-
ularly acute in the design of multimedia interfaces where there may not be
similar pre-existing tools to study and the user may be trying to simultane-
ously coordinate several different devices required for input or output. The
techniques now known as contextual design began to emerge in the late 1980's
[49]. The characteristic technique that has evolved from these designers is to
observe the user, perhaps questioning, but never interfering in the context of
a working environment, using the methodology of ethnography, a subdiscipline
of anthropology. Bodker [8] argues that the user interface fully reveals itself to
us when it is in use. Interface designers now refer to artifacts, that is, products
of human activity, usually tools with which we do our work. More formally,
"Artifacts are things that mediate the action of a human being toward another
subject or an object." Springmeyer, Blattner & Max [39] used these techniques
to discover how to design tools for scientific data analysis, while Springmeyer
[40] analyzed the process and designed prototype tools using contextual design
methods for her investigation. The border resources are socially shared envi-
ronments of work and establish a genre or a style for the work. Increasingly,
the border of the work context has become more important [10].

3.2 The Context for Design: Models and Metaphors

Most computer interfaces are designed for those who are not expert computer
users. In most cases, we seek to design an interface that can be quickly grasped
and understood within a short time. Part of this requirement is that it must
be based on some coherent organizational principle-a mental model or concep-
tual understanding of the system. Metaphors have been widely used for this
purpose.

What is a metaphor? We have certain experiences in our daily lives with which
we are very familiar, for example, throwing something into a wastebasket.
We can "throw out" or destroy a file by dragging its icon to an icon of a
wastebasket. The theory is that by understanding the characteristics of real-
world objects with which we are familiar, we can quickly learn a computer
interface by analogy. Text editors were designed to be similar to typewriters;
form interfaces were lifted directly from the forms used in offices. The desktop
metaphor was the basis for the interface developed at Xerox PARC and is now
used widely on the Mac, Sun, and Microsoft Windows. However, metaphors
are like old friends that have become an embarrassment because of their bad
manners; user interface designers are quick to disassociate themselves from

them. Nelson [33] says, "I have never personally seen a desktop where pointing
at a lower piece of paper makes it jump to the top." He believes that the
problem with metaphors is that we wish to design things that are not like
physical objects, and the details of whose behavior may float free.

We must start building the interface from some consistent, coherent model of
how the interface is going to work. Some psychologists believe that learning by
analogy may be the only way humans learn [1]. Some applications are more
obviously associated with a metaphor than others and often the metaphor is
already imbedded in an application. If not, a metaphor is still not a bad place to
start. The problem is coming up with appropriate metaphors that don't mislead
the user [20]. We must examine carefully whether the metaphor has sufficient
structure, if it is suitable to the user's background and level of knowledge, if it
can be easily represented, and if it can be extended if required. Metaphors fail
where there is a mismatch. In order to bridge the gap between a metaphor and
a mismatch, the interface designer often resorts to a composite metaphor [11].
But even this ultimately fails and more mismatches occur. Nardi and Zarmer
[31] discovered through ethnographic studies that "users do not do their work
from full-blown 'mental models,' but instead incrementally develop external,
physical models of their problems. The external models focus cognition and
problem-solving activity."

Are metaphors a particular issue for multimedia interfaces? If metaphors are


the conceptual model for the user, then multimedia interfaces must use multi-
media metaphors. Reality is a metaphor in virtual reality. The metaphor is a
mismatch when we walk through a wall and move by flying through the air.

ClearBoard [24] is both a name and a metaphor for a computer-supported


cooperative work tool, where the users can see their colleagues dimly appearing
on the other side of their workspace (see Figure 1). It was designed by using
a clear board with users placed on either side. The users could write on each
side of the board with colored pens and see each other at the same time. The
metaphor of using a clear board is quickly mismatched, since users cannot write
in a mutually comprehensible way on both sides of a board. Yet the metaphor
is a good one, because the notion of a see-through writing surface remains in
the completed collaborative tool. The Perspective Wall [36] is a metaphor that
is based on a wall that organizes a large linearly ordered information space.
The stretchable wall is flat in front but bent back on two sides with documents
hanging from it. As the user selects a document, it moves to the front and is
enlarged.

Figure 1 ClearBoard-1 in use. A user drawing on a half-mirror polarizing


projection screen with a color paint marker. This makes the whiteboard appear
dark. The collaborator's image appears as if behind the screen.

Related to the question of whether a metaphor should be used is the question


of attributing human qualities to a non-human thing. Designers can use multimedia interfaces
to put animated figures, voices, and faces on the display. Shneiderman [18]
puts forth the argument that artificial intelligence has been held back in a
misdirected pursuit of human-like robots, natural language recognition to sup-
port interaction, and now human-like agents that cleverly anticipate the user's
needs. Laurel [27] believes that we cheat users when our computers assume hu-
man personalities. Instead, with Guides, a type of agent, Laurel puts dramatic
characters on the screen. Agents were created to handle mundane tasks as well
as information searches and must have some understanding of our needs and
goals.

It has been shown that the greater the extent to which a computer possesses
characteristics that are associated with humans, the greater the extent to which
individuals will use rules derived from human interaction when interacting with
computers [32]. Do we want human faces staring out at us in our interface? Do
we want our computers to ask us how we are feeling today? Like the oversim-
plified tunes of singing commercials used in radio years ago, an oversimplified
character in our interface will annoy. This is a sign of poor design, not of
the evils of anthropomorphism. Brennan [18] concludes that we should stop
worrying about anthropomorphism and work on making systems coherent and
usable. Literature, films, and theater all create characters that draw us into
their virtual worlds. The interactivity of computers is just bringing a new
dimension into the simulation of reality and perhaps we find this threatening.

3.3 Form and Continuous Media


Some media are time-varying, unlike graphical point-and-click interfaces with
discrete objects being displayed. Video is a highly dynamic medium in which
material moves in a constantly changing stream with few, if any, logical sepa-
rations or chunks. Music and speech also extend through time in a continu'ous
linear sequence. Animation, as video, has few logical divisions. Gesture, one of
the newest forms of multimedia communication that is being exploited, depends
on continuous motions in time. Once a time-varying event has happened, the
user must either remember the event or replay it to see it again. To create a suc-
cessful multimedia interface for time-varying media, it is important to be able
to identify specific events of interest and provide high level abstractions that
capture an idea, and determine how we interact or navigate through it. Even
navigating through discrete media is still a problem. Certain forms are easier
to grasp than others. Hierarchical structures, networks (nodes and arrows), and

arrays form the underlying structures of many collections of information that


require organization and navigation through them, such as continuous media.
Often both hierarchies and networks are used together to form a navigational
structure. Simplicity, modularity and encapsulation are dominant themes in
design of all types. In multimedia interfaces, much of design for interaction is
based on human expressive forms, such as natural language or drawing [7].

To examine how people interact verbally we have to examine conversation. Hy-


perPhone [30] is an environment for voice documents, that is, a database with
a voice interface. We have all had the experience of asking someone to repeat
what was just said, only to discover the repetition was as incomprehensible
as the original statement. Repetition is not sufficient for comprehension - the
speaker must rephrase the sentence. Researchers discovered that the user must
have control not only of what is said, but when and how it is said. For success-
ful verbal communication people interrupt, query, summarize, rephrase, repeat,
and move backwards and forwards through time. The implications of this are
that utterances must be broken into chunks or pieces that form the basic units
of the conversation. Collections of chunks are summarized, rephrased, etc., in
a hierarchical structure. The user manipulates the navigation process without
the use of names with phrases such as, "Show me more about this."

Rudnicky and Hauptmann [37] have identified six properties of interaction


that support the conversational model of speech interaction: 1) user plasticity
(Users adapt their speaking style to the perceived characteristics of the system
they are dealing with.); 2) appropriate interaction protocols (Communication
between participants is regulated by agreed upon procedures for taking turns,
making corrections, and confirmations.); 3) error correction; 4) system response,
which must not cause too many delays; and 5) task structure, which the sys-
tem can recognize. They also observed that human interaction takes advantage
of different modalities and uses visual, prosodic, and gestural information in
speech understanding. Hence conversation uses multimodality.

Anyone who has tried to search videotapes for particular events is familiar with
the difficulty of identifying and searching for shots or cuts, that is, one or more
contiguous frames representing a continuous action in time and space. Now
magnify this problem to searching archives that have thousands of videos or
films. Some machine-readable annotation of video has been attempted, but
only the SMPTE time code accompanies motion picture images as generated.
Logs are sometimes created after the video or film has been completed. The
problem is to create video annotations that allow the user to search for and to
retrieve items of interest. At this point, the philosophy of how to attack the
problem diverges greatly. Some researchers devise systems where annotation

can be done automatically in machine readable form. Of course, in this case, the
descriptions are limited to changes that can be detected by a machine, such as
time, space, color, and shape. The IMPACT project [45] creates descriptions of
cut separations, motion of cameras and filmed objects, tracks and contour lines
of objects, existence of objects, and periods of existence, recorded automatically.
The basic structure of video information in the IMPACT system is hierarchical,
with scenarios on top, scenes in the middle, and cuts on the bottom. The cuts,
scenes and scenarios are linked by hyperlinks. Teodosio and Bender [43] use
a method of salient stills to extract information that is representative of a
sequence of video frames. Salient still images do not represent one moment in
time, as do photographs or a single video frame, but rather an aggregate of the
temporal changes that occur in a moving image sequence where salient features
are preserved. The resulting image is a composite of images and may not look
like any one particular frame. It is a complicated technique based on optical
flow and signal processing, however, salient stills can be created without user
intervention.

An entirely different approach to the complexity of video is taken by Davis [17],


who has created a language, Media Streams, which enables users to annotate,
browse, retrieve, and repurpose digital video. In Media Streams, annotators
use an iconic visual language to create stream-based annotations of video con-
tent. Media Streams utilizes a hierarchically structured semantic space of iconic
primitives, which are combined to form compound descriptors used to create
multi-layered, temporally indexed annotations of video content. The annota-
tion language is designed to support the representation of what one sees and
hears, rather than what one infers from context. Media Streams' descriptive
categories include: actual recorded space, inferable space, actual recorded time,
inferable time, weather, characters, objects, character actions, object actions,
relative positions of characters and objects, screen positions of characters and
objects, recording medium, camera motions, shot transitions, and subjective
thoughts about the material.

In Media Streams, the annotations using these categories do not describe video
clips, but are themselves temporal extents describing content within a video
stream. As stream-based annotations they support multiple layers of overlap-
ping descriptions which, unlike clip-based annotations, enable video to be dy-
namically resegmented at query time. Video clips change from being fixed
segmentations of the video stream, to being the results of retrieval queries into
a semantic network of stream-based annotations. Unlike keyword-based video
retrieval systems, Media Streams supports query of annotated video according
to its temporal, semantic, and relational structure. A query for a video sequence
will not only search the annotated video streams for a matching sequence but

will compose a video sequence out of various annotated segments in order to


satisfy the query.
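A rough sketch of the data model this implies is given below: annotations are temporal extents (start, end, descriptor) over the stream, and a query returns the time intervals where all requested descriptors overlap, so the video is segmented only at query time. The representation and the simple intersection rule are illustrative assumptions, not Media Streams itself.

# Stream-based annotations: each descriptor covers a temporal extent of the video stream.
# A query intersects the extents of the requested descriptors, so "clips" emerge
# dynamically at query time instead of being fixed in advance.
annotations = [
    ("dog", 0.0, 12.0),
    ("running", 4.0, 9.0),
    ("park", 0.0, 30.0),
    ("dog", 18.0, 25.0),
]

def query(descriptors, annotations):
    def extents(d):
        return [(s, e) for name, s, e in annotations if name == d]
    results = [(0.0, float("inf"))]
    for d in descriptors:                      # intersect one extent per requested descriptor
        results = [(max(s1, s2), min(e1, e2))
                   for s1, e1 in results
                   for s2, e2 in extents(d)
                   if max(s1, s2) < min(e1, e2)]
    return results

print(query(["dog", "running", "park"], annotations))   # [(4.0, 9.0)]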

Media Streams attempts to address two fundamental user interface issues in


video annotation and retrieval: creating and searching the space of descriptors
to be used in annotation and retrieval; and visualizing, annotating, browsing,
and retrieving video shots and sequences. Consequently, the system has two
main interface components: the Icon Space and the Media Time Line. The
Icon Space is the interface for the selection, compounding, and grouping of
the iconic descriptors in Media Streams (see Figure 2). In the Icon Work-
shop portion of the Icon Space (the upper half) these iconic primitives can
be compounded to form compound iconic descriptors. Through compounding,
the base set of primitives can produce millions of unique expressions. In the
Icon Palette portion of the Icon Space (the lower half), users create palettes
of iconic descriptors for use in annotation and search. By querying the space
of descriptors, users dynamically group related iconic descriptors and thus can
reuse the descriptive effort of others. The Media Time Line is the core browser
and viewer in Media Streams. It enables users to visualize video at multiple
timescales simultaneously, to read and write multi-layered iconic annotations,
and provides one consistent interface for annotating, browsing, retrieving, and
repurposing video and audio content (see Figure 3).

3.4 The Experience of Sound


This section is dedicated to sound. How can we use it? Why hasn't sound
been used in the past? Does sound convey the same information as text and
graphics?

The ability to hear sound is one of our basic senses. In spite of this, sound has
been slow to progress as an integral part of the human-computer interface.

There are many historical reasons why this is the case: the letters of the al-
phabet when typed into a keyboard were easily interpreted into binary form
for textual displays upon a screen or printed page. Voice input has many dif-
ficulties when used as an input medium and corresponds more to the difficulties
of using handwritten input, where errors occur because input is not precise. In
the past, nonspeech audio has been associated with music and has not been
used for conveying information, with certain exceptions such as bugle calls, fog
horns, talking drums, etc., which were not universally known and limited in
scope [5].

Figure 2 The Icon Space for Media Streams is the interface for the selection,
compounding, and grouping of iconic descriptors.

Figure 3 The Media Time Line is the core browser and viewer in Media
Streams.

Audio will be an important part of the multimedia interface in the future.


Some of the reasons are: speech is faster than writing or typing for all but the
most skilled typist; audio messages can be relayed quickly to the user without
interfering with the screen display; audio interfaces take very little room and
can be used to impart information with notepad-size computers without the
need of a large display; users with certain types of disabilities can use audio in
place of a video display; and finally, continuous information can be effectively
displayed in audio, as can messages that need to be brought to the user's attention
when he or she is not looking at the screen. Many other reasons for using audio can
be found.

Audio has dimensions or parameters, just as graphical material has. In non-


speech audio these parameters are manipulated to provide the symbols or syn-
tax of messages. The dimensions of sound are [14]:

• harmonic content
- pitch and register (tone, melody, harmony)
- waveshape (sawtooth, square, ... )
- timbre, filters, vibrato, and equalization
- intensity/volume/loudness
- envelope (attack, decay, sustain, release)

• timing
- duration, tempo, repetition rate, duty cycle, rhythm, syncopation

• spatial location
- direction (azimuth, elevation)
- distance/range
• ambience: presence, resonance, reverberance, spaciousness
• representationalism: literal, abstract, mixed.

Sampled sounds are digital recordings of sounds which we can hear. These
sounds have the advantage of immediate recognizability and ease of implemen-
tation into computer interfaces. Synthesized sounds are those sounds which are
created algorithmically on a computer. They can be made to sound similar to

real-world sounds through sound analysis (such as Fourier analysis) and trial-
and-error methods. Since synthesized sounds are created algorithmically, it is
easy to modify such a sound in real time by altering attributes like amplitude
(volume), frequency (pitch), or the basic waveform function (timbre). Further-
more, it is easy to add modulation of amplitude or frequency in real time to
create the effects of vibrato or tremolo without changing the basic sound. It
is for these reasons that sound synthesis is so popular in music creation today.
Synthesized sounds offer a high degree of flexibility with a reasonable amount
of ease. A drawback of synthesized sound is that each algorithm used typi-
cally mimics some sounds very well and others not as well. For instance, bell
sounds can be synthesized very well using a ring modulation algorithm but that
same algorithm cannot produce the waveform necessary to make the sound of
a French horn.
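As a small illustration of how easily a synthesized sound can be reshaped, the sketch below generates a plain sine tone with NumPy and then adds vibrato and tremolo simply by modulating its frequency and amplitude; the rates and depths chosen are arbitrary, and the method is generic additive synthesis rather than any particular synthesizer's algorithm.

import numpy as np

# One second of a 440 Hz tone; vibrato = frequency modulation, tremolo = amplitude modulation.
rate = 44100
t = np.linspace(0.0, 1.0, rate, endpoint=False)

vibrato = 1.0 + 0.01 * np.sin(2 * np.pi * 6 * t)    # +/-1% pitch wobble at 6 Hz
tremolo = 1.0 + 0.30 * np.sin(2 * np.pi * 4 * t)    # +/-30% loudness wobble at 4 Hz

phase = 2 * np.pi * np.cumsum(440.0 * vibrato) / rate
tone = 0.5 * tremolo * np.sin(phase)                # final waveform, ready to send to a sound device

print(tone.shape, float(tone.min()), float(tone.max()))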

Because sampled sounds are digital recordings, any sound that can be heard
can be produced with extremely high accuracy. However, the amount of work
required to attain equal flexibility in modification, compared with synthesized
sounds, is very high. Typically, sampled sounds are modified only in amplitude
(volume) and frequency (pitch). When sampled sounds are altered in frequency,
care must be taken only to modify the sound within a certain range, because
the sound usually loses its unique acoustic characteristics proportional to the
amount of deviation from the originally sampled sound. Other modifications to
the sound, such as modulation, are typically not done because they require too
much computation and they also produce sounds which no longer are identified
with the original source.

Three-dimensional (localized) sound truly immerses the listener in his or her


auditory environment. The basis of the work in three-dimensional acoustic dis-
plays is psychoacoustics. The virtual acoustic environment is part of the NASA
Ames View system [47]. The technology of simulating three-dimensional sound
depends on reconstructing the sounds as they enter the ears. The acoustic sig-
nals are affected by the pinnae (outer ear) and the distance and direction of the
ears. Microphones were placed in the ears of humans or manikins to measure
this effect, called the head-related transfer function. A real-time system, the
Convolvotron, is used to filter incoming sounds using a head-related transfer
function [47].
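A minimal sketch of the underlying idea follows, assuming a pair of head-related impulse responses is already available for the desired direction (the short arrays below are placeholders, not real measurements): the mono source is convolved with the left-ear and right-ear responses to produce the localized two-channel signal.

import numpy as np

# Localize a mono sound by convolving it with left/right head-related impulse responses (HRIRs).
def localize(mono, hrir_left, hrir_right):
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=1)       # binaural (two-channel) output

source = np.random.randn(1000)                   # any mono signal
hrir_l = np.array([0.0, 0.9, 0.3, 0.1, 0.0])     # placeholder: sound reaches the left ear first
hrir_r = np.array([0.0, 0.0, 0.6, 0.2, 0.1])     # placeholder: delayed and attenuated at the right ear
print(localize(source, hrir_l, hrir_r).shape)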
Multimedia Interfaces: Designing for Diversity 103

3.5 Types of sounds


The types of sounds that can be made are so different, they are considered
different media. The interface designer must understand how to use many
different types of audio and be familiar with the strengths and weaknesses of
each.

Music
A powerful use of music is found in film scores. Music comes to bear in helping
to realize the meaning of the film, in stimulating and guiding the emotional
response to the visuals. Music serves as a kind of cohesive, filling in empty
spaces in the action or dialogue. The color and tone of music can give a picture
more richness and vitality and pinpoint emotions and actions [44]. It is the
ability of music to influence an audience subconsciously that makes it truly
valuable to the cinema. Finally, audio can reflect the sounds of the scene in
which the picture is placed. Music specific to particular cultures is used in the
study of history, geography, and anthropology. A scene placed in a geographical
context may be enhanced by local music. Care must be taken when music is
used in programs that are used frequently, because music can be annoying if
the same piece is heard repetitively.

Speech
Speech is required for detailed and specific information. It is through speech
(rather than through other sounds) that we communicate precise and abstract
ideas. Speech may be used as input as well as output in the computer interface.
Recent advances in speech recognition systems have made it possible to use
a natural speech style and to allow casual users to work easily with a speech
system [37]. In spite of this, very little is known about building successful speech
interfaces for two-dimensional displays, let alone three-dimensional interfaces.

Real world sounds


Real world sounds are the natural sounds of the world around us, such as leaves
rustling or birds singing, or man-made sounds such as machine noises or even
a band playing in the background. What about sound in our everyday lives?
Real-world sounds are essential to our sense of presence in a scene that depicts
our world around us. R. Murray Schafer [38] reconstructed "soundscapes,"
historical reconstructions of the sounds that surround people in various
environments. Examples are street criers, automobiles, the crackling of candles,


church bells, etc.

Auditory displays
Auditory displays include the interpretation of data into sound, such as the
association of tones with charts, graphs, and algorithms, or sound in scientific
visualization. These auditory display techniques were used to enable the lis-
tener to picture in his or her mind real-world objects or data. An example of
auditory display is SoundGraphs [28], in which points on an x-y graph were
translated into sonic equivalents, with pitch representing the y-axis and time the x-axis.
(A nonlinear correction factor was used.) Recently Blattner, Greenberg, and
Kamegai [3] enhanced the turbulence of fluids with sound, where audio was
tied to the various aspects of fluid flow and vortices.
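The sketch below gives the flavor of such a display: each value of a data series is mapped to a pitch and the pitches are played in order, so rising data is heard as a rising melody. The linear value-to-frequency mapping and the 220-880 Hz range are arbitrary choices for illustration, not those of any particular system.

# Map a data series to a sequence of pitches: larger values sound higher.
def sonify(values, f_low=220.0, f_high=880.0):
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0                      # avoid division by zero for constant data
    return [f_low + (v - lo) / span * (f_high - f_low) for v in values]

data = [3.0, 4.5, 7.2, 6.1, 9.8, 2.4]
for v, f in zip(data, sonify(data)):
    print(f"value {v:4.1f} -> {f:6.1f} Hz")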

Some of the most interesting examples of auditory display are described by


Kramer [26] as the pioneering efforts in this field. Only a few of these early inno-
vative experiments can be mentioned in this section. Speeth reported the results
of experiments that used audification of seismic data to determine if subjects
could differentiate earthquakes from underground bomb blasts. Bly explored
the classification of non-ordered multivariate data sets. Each n-dimensional
data point was represented by a discrete auditory event with n sound param-
eters controlled by data. Yeung developed a sonification technique for using
audible displays in analytical chemistry. Using two pitch ranges and other pa-
rameters such as loudness, decay time, duration, and stereo location, he found
that his subjects could classify detected levels of metal in a given sample with
98% accuracy. Techniques were also developed for the display of multivariate
data intended for oil well log data using both auditory and visual displays. More
recently, Smith encoded data on scatterplots as icons, which visually have a
texture. The icons emit
characteristic sounds when triggered by a moving mouse. Kramer used audio to
enhance Magellan's view of Venus, where areas had auditory output to indicate
such pieces of information as the chemical make-up of the area. The auditory
output did not disrupt the view of the underlying landscape.

Cues and audio messages


Cues and audio messages tend to convey more abstract information than that
received through auditory displays, which are data encoded into sound. Exam-
ples of messages are: the computer is going down; you have an appointment

at 2 pm; and, a syntax error has occurred. Auditory signals are detected more
quickly than visual signals and produce an alerting or orienting effect [47].

Nonspeech signals are used in warning systems and aircraft cockpits. Alarms
and sirens fall into this category, but these have been used throughout history,
long before the advent of electricity. Examples are military bugle calls, post-
horns, and church bells that pealed out time and announcements of important
events. Work on auditory icons was done by Gaver [22] and earcons by Blattner,
Sumikawa, and Greenberg [2]. Auditory icons use sampled real-world sounds
of objects hitting, breaking, and tearing for messages. Gaver used the term
"everyday listening" to explain our familiarity with the sounds of common
objects around us. Because most messages are abstract, the auditory icons use
analogy or simple association to map to their meaning. Earcons are musical
fragments, called motives, whose musical parameters are varied to obtain a
variety of related sounds. Earcons are described in greater detail in Section
4.3.

Virtual reality, telepresence and teleconferencing


Spatial sound is critical to applications such as virtual reality, telepresence and
teleconferencing. The types of sounds described above, speech, music, real-
world sounds. and audio cues can all be located in a three-dimensional audio
environment. Sound localization by NASA has shown the effectiveness of sep-
arating voices in space to improve their clarity [47]. The three-dimensional
acoustic properties of teleconferencing systems can be used to filter out extra-
neous sounds by the use of "audio windows." [14] The general idea is to permit
multiple simultaneous audio sources, such as in a teleconference, to coexist in
a user-controlled display to easily move though the display and separate the
channels while retaining the clarity and purity of the sounds.

3.6 The Structure of Earcons


Earcons are short, distinctive audio patterns to which arbitrary definitions are
assigned. They can be modified in various ways to assume different but related
meanings. The building blocks for earcons are short sequences of tones called
motives. From motives we can build larger units by varying musical param-
eters. The advantage of these constructions is that the musical parameters
of rhythm, pitch, timbre, dynamics (loudness), and register can be easily ma-
nipulated. The motives can be combined, transformed, or inherited to form
more complex structures. The motives and their compounded forms are called

earcons; however, earcons can be any auditory message, such as real-world
sounds, single notes, or sampled sounds of musical instruments.

A motive may be an earcon or it may be part of a compounded earcon. If A and


B are earcons that represent different messages, A and B can be combined by
juxtaposing A and B to form a third earcon AB, that is, A and B are both heard
in sequence. Earcon A may be transformed into earcon B by a modification
in the construction of A. For example, if A is an earcon, a new earcon can be
formed by changing some parameter in A to obtain B, such as the pitch in
one of its notes. A family of earcons may have an inherited structure, where a
family motive, A, an unpitched rhythm of not more than five notes, is used to
define a family of messages. The family motive is elaborated by the addition of
a musical parameter, such as pitch (A+p = B), and then preceded by the family
motive to form a new earcon, AB (A and B played in sequence). Hence the
earcon has two distinct components, an unpitched motive followed by a pitched
motive with the same rhythm. A third earcon, ABC, can be constructed by
adding a third motive, C, which has both the pitch and rhythm of the second
motive but an easily recognizable timbre (A + p + t = B + t = C).
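To make the compounding and inheritance operations concrete, the sketch below represents a motive as a list of (pitch, duration) notes and builds the earcons A, B = A + p, and AB described above; the data representation and the particular notes are illustrative assumptions, and timbre is only noted in a comment since rendering is not shown.

# A motive is a short list of notes: (pitch, duration) pairs; pitch=None means unpitched.
def compound(*motives):                  # juxtaposition: play the motives one after another
    return [note for m in motives for note in m]

def add_pitch(motive, pitches):          # inherit the rhythm of a motive and add a melodic contour
    return [(p, dur) for (_, dur), p in zip(motive, pitches)]

family = [(None, 0.2), (None, 0.1), (None, 0.1), (None, 0.4)]   # unpitched family motive A
melodic = add_pitch(family, ["C4", "E4", "G4", "C5"])           # B = A + p: same rhythm, now pitched

earcon_AB = compound(family, melodic)             # AB: family motive followed by the pitched motive
earcon_ABC = compound(family, melodic, melodic)   # ABC: the third motive repeats B's pitch and rhythm;
                                                  # a distinct timbre would be assigned when rendering
print(earcon_AB)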

To display more than one earcon, the temporal location of each earcon with
respect to the others has to be identified. Two primary methods are used:
overlaying and sequencing of earcons [4]. Some sort of merging or melding
into a new sound could be considered. For example, the pitch of two notes
can be combined into a third pitch. Programs typically play audio without
regard to the overall auditory system state. As a result, voices may be played
simultaneously or they may occur with several nonspeech messages, making the
auditory display incoherent. An audio server is being constructed that blends
the sounds of voice, earcons, music, and real-world sounds in a way that will
make each auditory output intelligible [35].

3.7 The Advantages of Earcons


Will sounds as abstract as earcons be accepted by the majority of users? The
advantages are very clear: they are easily constructed on almost any type of
workstation or personal computer. The sounds do not have to correspond to the
objects they represent, so objects that either make no sound or an unpleasant
sound still can be represented by earcons without further justification. Auditory
display, that is, data translated directly into sound, requires less explanation
or motivation than abstractions such as earcons. Auditory icons that make
real-world sounds usually can be recognized quickly. However, most messages

do not have appropriate iconic images and the association of the auditory icon
to its message must be learned. Several experiments have shown that earcons
are preferred over many other types of sonification and can be used successfully.
Brewster, Wright, and Edwards (1993) found earcons to be an effective form
of auditory communication. They recommended these six basic changes in
earcon form to make them more easily recognizable by users: 1) Use synthesized
musical timbres; 2) Pitch changes are most effective when used with rhythm
changes; 3) Changes in register should be several octaves; 4) Rhythm changes
must be as different as possible; 5) Intensity levels must be kept close; and 6)
Successive earcons should have gaps between them.

Earcons are necessarily short because they must be learned and understood
quickly. Earcons were designed to take advantage of chunking mechanisms and
hierarchical structures that favor retention in human memory. Furthermore,
they use recognition rather than recall. The tests run by Brewster, Wright,
and Edwards had no training period associated with them. Tested subjects
heard earcons only once before the test. In spite of this, the subjects could use
them effectively. If earcons are to be used by the majority of computer users,
they must be learned and understood as quickly as possible, taking advantage
of all techniques that may help the user recognize them.

4 APPLICATIONS AND NEW TECHNOLOGIES

This section is for those who still envision the interface as something that
appears on a screen and is manipulated by pointing devices. None of the ap-
plications below have screens as their primary input/output devices. These
applications may use visual displays, but equally important are body position-
ing and gesture for input or output in a variety of different ways.

4.1 Virtual Reality


Virtual reality (VR) is total immersion into and interaction with a simulated
environment. There is a stereoscopic graphics display before the user's eyes and
audio is fed into the user's ears. Positioning devices are attached to the user's
head and hands and when the user moves, the visual and auditory scene before
him or her moves accordingly. There is the related concept of telepresence or
teleoperated environments, where a camera, usually attached to a robot, and

other sensors, such as microphones, interact at a remote site to give the user
the experience of being in another location. Several technologies are related to
VR and telepresence: see-through environments, where the view of the scene is
overlaid with a drawing or other graphics. Desktop VR is displayed on a desktop
and gives the impression of looking at a virtual scene through a window. VR
is not new. The original ideas go back to 1965 or earlier, but only recently
has technology developed to a point where these ideas can be implemented for
practical use [42]. VR is the ultimate multimedia interface and that is why
it is included in this overview; however, because of its distinctive hardware
and interaction styles, VR is often put in a category of its own, apart from
multimedia.

It is difficult to discuss the potential of an interface which is known primarily in


the world of entertainment. I include it in this chapter because I believe it is a
critical new technology that will flourish and develop into our primary interface
for scientific visualization. Progress in the development of virtual reality and
telepresence interfaces for scientific applications will be slow for a number of
reasons: 1) The potential of this new type of interface is not yet appreciated as
a research area by much of the academic community; 2) VR systems still lack
the rapid interactivity and high resolution required to make them usable for
many applications; 3) VR is complex and many interacting component parts
are required. each of which is a research area in itself; and, consequently, 4)
The cost of research and equipment in this area is very high.

Some of the most interesting work in VR is being conducted in Japan, where


VR research is being conducted both at universities and in industry [46]. Two
applications, both within the field of medicine, are briefly described here. One
uses teleoperation and the other uses a synthetic environment.

A teleoperated microsurgical robot is being used for eye surgery in the MSR-1
system [23]. Bi-directional visual, auditory, and mechanical information is
relayed between the master unit and the slave. Images are
relayed to a surgeon wearing a helmet and to an adjacent screen. There is also a
stereo tone with an amplitude and/or frequency that is a function of the forces
experienced at the tool-tissue interface. The surgeon holds pseudotools, shaft-
shaped like microsurgical scalpels, that project from the left and right limbs of
a mechanical master. The pseudotools control the behavior of the microsurgical
tools that perform the actual surgery. The movements of the left and right limbs
of the pseudotools are scaled down by 1 to 100 times in the microsurgical tool,
which is exerted by a microsurgical robot that performs the surgery. The master
and slave subsystems communicate through a computer system that enhances
and augments images, filters hand tremor, performs coordinate transformation

of the surgeon's hand motions, and makes safety checks. The limbs of the
master and robot slave are in one-to-one correspondence. The operator feels
not only the magnitude, but also the direction of the micromotion and the resistance
experienced by the slave during operation. The surgeon is able to feel through
the pseudotools the forces experienced by the microtools. Students can train
using a mannequin (see Figure 4).
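A toy sketch of the two signal-processing steps mentioned above, motion scaling and hand-tremor filtering, is given below; the one-dimensional positions, the scale factor, and the simple exponential smoothing are illustrative assumptions rather than the actual MSR-1 control loop.

# Scale the surgeon's hand motion down and smooth out tremor before passing it to the slave.
def slave_positions(hand_positions, scale=0.02, smoothing=0.2):
    filtered, out = None, []
    for p in hand_positions:
        filtered = p if filtered is None else (1 - smoothing) * filtered + smoothing * p
        out.append(scale * filtered)         # scaled-down, tremor-filtered position for the microtool
    return out

hand = [0.0, 1.0, 0.8, 1.2, 1.0, 5.0, 1.0]   # millimetres of hand motion, including one jerk
print(slave_positions(hand))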

The outstanding advantages of such a teleoperated environment are the control


of scale, the filtering of hand tremor, enhancement of the image, and safety precautions to
preclude a tool slipping or other inadvertent hand or tool movements. There is
another outstanding advantage of a telepresence interface: surgery can be done
at a distance. Specialists in rare surgical procedures will be able to perform
operations anywhere in the world where the equipment can be placed. The
ability to train physicians in a simulated environment is a dream held by every
medical school. Anatomy can be studied in a variety of ways, from exploring
a magnified bloodstream by traveling through it, to simulating surgical pro-
cedures. This technology depends on the recreation of human bodies through
graphics and force feedback. Animation is far from the point where human
bodies can be realistically rendered, particularly when it must be done in real-
time. Surgical procedures require not only the image of the human body, but
the feel as well. Virtual organs must be programmed to behave in ways that
real organs behave and the organs must be programmed to interact with each
other and the surgical instruments. Even if the completely realistic VR surgical
process is many years away from realization, we can still manage a fairly good
approximation [29]. Many techniques can be taught by practice on simulated
bodies. Another advantage would be to spare the lives of the myriads of lit-
tle animals subjected to dissection in high school biology classes all over the
country.

4.2 Distance Learning: Access to The World


Education will be one of the primary beneficiaries of new computer and
communication technologies. The cost was minimal when the educational use of
the computer was first realized and computer companies donated personal
computers to schools. In those early days, the primary obstacle to the use of
computers in schools was integration into the curriculum and teacher education.
Presently, computer use is well integrated into school curricula. Multimedia
projects to assemble educational materials using sophisticated authoring systems
are found everywhere. However, there is another technology that is rapidly
becoming the focus of computer education in the United States: distance learning.
Figure 4  The microsurgical robot MSR-1 and the associated virtual environment
for eye surgery.

Unlike the personal computers that were donated to schools or made available
at low prices, distance learning is pricey indeed. While the use of computers
in the classroom required teachers to learn computer skills, distance learning
does not necessitate a great deal of technical training on the part of those
using the systems. On the university level, distance learning can provide
specialized courses and resources unavailable at a single educational institution.
For those of us actively involved with bringing distance learning to full-time
students in higher education, the obstacles are fierce: faculty apathy, academic
rules hindering the creation of intercampus programs, coordination difficulties,
and most of all, the enormous expense of obtaining equipment for two-way
video transmission, that is, teleconferencing equipment. But this situation will
change rapidly as communication costs drop and new high-speed networks are
installed. The benefits of this new technology will be magnified even more
as we move distance education into high schools, junior high schools, and
elementary schools. The one-room schoolhouse can become the one-world
schoolhouse. With a teleconferencing center instead of traditional schoolrooms,
classes can be taught statewide, and materials can be drawn from all over the
world. The local teacher will become a coordinator and a tutor, selecting
materials, evaluating student progress, and assigning classes for students. The
first basic requirement for all of this is high-speed networking at reasonable cost.

The second is the two-way video and high-resolution teleconferencing equipment
that allows completely natural interaction between sites. Textbooks will
disappear and all material will be available on computers. Students will take
their disks home in the evening to use them on their computers to do their
homework.

4.3 Assistive Technologies


In a survey made in 1989 by the National Center for Health Statistics, more
than 34 million people identified themselves as having "a degree of activity
limitation" because of injury or chronic illness. This does not include people
who are institutionalized or mentally ill. There are generally three types of
physical impairments: hearing, vision, and motor. Within each of these there
is a large number of variations. For example, in the vision-impaired class, a user
may have low vision or be completely blind. If vision is impaired, but the person
is not blind, type may have to be larger, the user may not be able to see the
cursor, blind spots may occur, the image may be warped, and so on. Another
reading difficulty is dyslexia, where there is nothing wrong with the person's
vision as such, but there are problems in reading text that come from symbol
formation. As many as 30 percent of young male engineering students may


have some reading or spelling difficulties of this type. For those who are blind
as well as deaf, the usual solution of audio output for the blind doesn't work.
Technology will be able to assist the elderly in a variety of ways when they
experience impairments at an advanced age. The interface designer concerned
with computers for those with disabilities may have to work with hundreds
of different kinds of disabilities. Problems that arise in providing computers
for the disabled largely stem from: 1) lack of funds to support this type of
research; 2) the large number of disabilities and their disparate solutions; 3)
methods of making new technology available to disabled people; and 4) bringing
these products to market. Yet it is generally agreed that there will be greater
concern and increased funding for those with disabilities. The advances made
in computer interfaces for those with disabilities also have provided new ideas
for the more general market.

Assistive computer products are virtually all multimedia. There is a great di-
versity among input/output devices used for these purposes. Much of the work
for assistive technologies comes from university and government research lab-
oratories and uses new research in multimedia, artificial intelligence, wireless
communications, and of course, psychology, physiology, simulation and mod-
eling of physical systems. In the remainder of this section, a few examples of
some highly ingenious interfaces for the disabled are described. The emphasis
here is on novelty to show the diverse problems associated with providing tech-
nical assistance to the disabled. One of the products below has touch output,
while the other responds to weight as input and emits sounds as output.

Certainly one of the most interesting devices is a computer-controlled mechan-


ical hand called Ralph (Robotic ALPHabet) that facilitates communication
with people who are deaf/blind. Ralph was designed and developed at the
Rehabilitation Research and Development Center at the VA Medical Center
in Palo Alto, CA [25]. Many of the estimated 17,000 deaf/blind individuals
in the U.S. communicate using fingerspelling, a system in which each letter of
the alphabet has a representation made with a gesture of one hand. To com-
municate, an interpreter fingerspells the letters of the message into the hand
of the person who is deaf/blind. The recipient deciphers the movements of
the interpreter's hand back into the letters of the message. Ralph translates
any serial ASCII stream of information into fingerspelling through movements
of its mechanical fingers (see Figure 5). Examples of devices that Ralph can
use for input include a keyboard (for person-to-person conversations), a mo-
dem (for telephone interactions), an OCR scanner (for conversion from printed
text), a voice recognition system (for translation from speech), a stenography
machine (for participation in conference, classroom or courtroom situations),

and closed-captioned television (for entertainment and news). Ralph can give
a person who is deaf/blind more options for receiving information, enhance the
privacy of their conversations, and remove the total dependence on an inter-
preter for communication.

Another unusual product is the Baby-Babble-Blanket (BBB) for infants with
motor problems [21]. Infants with severe motor problems frequently become
children with limited or non-existent speech. The BBB was developed to allow
young infants and the most severely disabled children to have some control of
their environment and to communicate in a lying position. Another goal was to
improve the motor abilities of these infants. An interdisciplinary team consisting
of a speech-language pathologist, a computer scientist, and a physical therapist
developed the BBB. The BBB (patent no. 5260869) is a multiple microswitch-
activated pad serving as input to a Macintosh computer and software. The
software provides a customized sound selection of digitized speech, babbles,
words, sentences, music or environmental sounds accessed by switches. The
mother's voice is also included with some of the customized sounds. A data
collection system is incorporated to evaluate infant responses and print out
summary statistics.

The BBB is an input device: a plastic, waterproof blanket that a baby lies on.
The BBB has embedded in it 12 switches regularly spaced over a 3 foot by 4
foot grid. The blanket has two layers of polymer with urethane sponge material
between the layers. The switches are connected to a cable that is hooked up to
a Macintosh. As the baby moves, the BBB makes various sounds in response to
the movements (see Figure 6). Normal infants gradually develop the concept of
cause and effect through exploration of their environment. Physically disabled
infants are limited in their ability to explore their environments and to vocalize
and are slow to deduce the nature of interactions with objects. The BBB
enables motor-impaired infants to experiment with their environments and to
develop cause-effect skills found in normal infants.

4.4 Gesture and Speech Technology


For many thousands of years, people used only speech and gesture to commu-
nicate with each other. After the relatively recent introduction of writing, pens
(quills), pencils (chalk), and brushes were the instruments used to communicate
written words because of the incredible precision and deftness they afford [12].
Punched cards and, later, keyboards with a cathode ray tube and display were
used as input devices to computers because voice recognition and other more
natural input devices were technically beyond our ability to produce. (There
is a subtle difference between voice and speech: voice is the ability to vocalize,
which enables us to speak; speech is a system of sounds that we interpret for
the purpose of communication.) Times have changed and a wide variety of
input devices are now available that take advantage of human senses. Certainly
the success of the mouse was partly responsible for an examination of the use
of gesture in the computer interface. Mouse input is difficult to control, and
nearly impossible to use for handwriting. The pen is considered a natural and
ergonomic computer input device; it is small, unobtrusive, flexible, and can
be manipulated easily by casual users. However, current pen interfaces are
awkward and difficult to use. Oviatt [34] makes the case that pen alone may
never be the universal interface that replaces the keyboard and mouse as the
primary input device. Handwriting is slower than typing, and recognition of
all pen-written symbols is error-prone and ambiguous [13].

Figure 5  RALPH, the fingerspelling hand used to communicate with deaf-blind
individuals, who feel and interpret the motion and positions of the hand.

Figure 6  The Baby-Babble-Blanket (BBB) for infants with severe motor
problems. The system allows even the youngest infant or the most severely
disabled child to control the environment and communicate while in a lying
position.

Many believe that speech interfaces will be the primary means of communicating
with a computer. However, speech understanding has recognition problems
that make speech a difficult modality for input. Also, speech is not an effective
technology for the input and manipulation of graphical objects. When speech
and gesture are combined through the use of pen and voice, something surprising
happens: they complement each other and overcome the problems experienced
by each modality. Oviatt [34] recommends that the following strategy
be adopted in connection with pen and voice interfaces: "Combine naturally
complementary modalities in a manner that optimizes the individual strengths
of each, while simultaneously overcoming each of their weaknesses."

The integration of the two modalities of speech and gesture has been studied in
the context of collaborative work, where people communicate with each other
with the assistance of a computer. The integration of speech and gesture
when the problem is limited to a person communicating with a computer is not
understood. Issues in speech input are tied to those of natural language
understanding, which is plagued with problems such as resolving anaphoric
reference, word sense disambiguation, the attachment of prepositional phrases,
and the lack of ability to reason [13]. If the language used as input is constrained
to a non-natural language, the user may have to go through a learning procedure
similar to learning command lines in a text-based interface. Nevertheless,
speaking to a computer has many advantages: speech is faster even than
typewriter input except for the most proficient typists; speech can be precise;
and speech is spontaneous.

PenPoint [12], an operating system for pen-based computers, has a gesture-


based interface. PenPoint uses the pen for pointing, data entry, and gesture
commands. The designers of PenPoint did not recommend the use of voice
input, except as an adjunct tool to PenPoint. Some very early work on
intelligent interfaces combined natural language and pointing, using a touch
screen to specify graphics and to accept input from touch. A well-known
interface called "Put-That-There" manipulated the display with voice and
gesture input [9]. The whole question of manipulating displays with voice and
gesture is being re-examined 15 years after this seminal article was published.
There are some preliminary investigations into a unified view of language that
communicates through verbal, visual, tactile, or gestural information. Written
and spoken natural language is often accompanied by pictures, gestures, and
sounds other than spoken words. Blattner and Milota [6] are investigating a
language that integrates pen and speech input. To make the system truly
useful, it is usable with only pen input, only voice input, or a combination of
both. Pen and voice input may be the primary way we communicate with
computers in the future, but the issues are so poorly understood that it will be
many years before usable pen-voice systems are developed.

5 CONCLUSION
Pandora was left with a box she was instructed not to open. But her curiosity
overcame her good sense and she finally opened the box. Malevolent forces
emerged from the box to plague the world and cause mischief; only hope was
left in the box.

A computer, like a box, is a device for storing information; when one examines
it, a thousand demonic problems come flying out. In this chapter some of the
difficult problems of how to construct interfaces and how we might cope with
them were considered. Whether they will turn malevolent and plague us, or
bring solutions that make our lives easier, happier, and richer remains to be
seen. Hope remains with us.

Acknowledgements
This work was performed with partial support of NSF Grant IRI-9213823 and
under the auspices of the U.S. Department of Energy by Lawrence Livermore
National Laboratory under Contract No. W-7405-Eng-48.

Meera Blattner is also with The Department of Biomathematics, M.D. Ander-


son Cancer Research Hospital, University of Texas Medical Center, Houston,
Texas 77030.

REFERENCES
[1] John R. Anderson, "Cognitive Psychology and Its Implications." Second
Edition, New York: W. H. Freeman and Company, 1985.
[2] M.M. Blattner, D.A. Sumikawa, and R.M. Greenberg, "Earcons and Icons:
Their Structure and Common Design Principles," Human-Computer In-
teraction, Vol. 4, No.1, 1989, pp 11-44.
[3] Meera M. Blattner, Robert M. Greenberg, and Minao Kamegai, "Listen-
ing to Turbulence: An Example of Scientific Audiolization," Multimedia
Interface Design, (M. Blattner and R. Dannenberg, eds), ACM Press, New
York and Addison-Wesley, Reading, Massachusetts. 1992, pp 87-102.
[4] Meera M. Blattner, Albert L. Papp III, and Ephraim P. Glinert. "Sonic
Enhancement of Two-Dimensional Graphic Displays," Auditory Displays:
The Proceeding of the First International Conference on Auditory Display
(G. Kramer, eds), Addison-Wesley, Santa Fe Institute Series, Reading,
Massachusetts, 1994.
[5] M. M. Blattner and R. M. Greenberg, "Communicating and Learning
Through Non-Speech Audio," Multimedia Interface Design in Education,
A. Edwards and S. Holland (Eds), Springer-Verlag, NATO ASI Series F,
1992, pp 133-143.
[6] Meera M. Blattner and Andre D. Milota, "Multimodal Interfaces using
Pen and Voice Input," submitted for publication.
[7] Meera M. Blattner, "In Our Image: Interface Design in the 1990s," IEEE
Multimedia, Vol. 1, No.1, IEEE Press, 1994, pp 25-36.
[8] Susanne Bodker, "A Human Activity Approach to User Interfaces,"
Human-Computer Interaction, Vol. 4, Lawrence Erlbaum, pp 171-195.
[9] R. A. Bolt, "Put-That-There: Voice and Gesture at the Graphics Inter-
face." ACM Computer Graphics. 14(3), 1980. pp 262-270.
[10] John Seely Brown and Paul Duguid, "Borderline Issues: Social and Material
Aspects of Design," Human-Computer Interaction, Vol. 9, No. 1, Lawrence
Erlbaum, pp 3-36.

[11] John M. Carroll, Robert L. Mack, and Wendy A. Kellogg, "Interface
Metaphors and User Interface Design," Handbook of Human-Computer In-
teraction (M. Helander, ed), Elsevier Science Publishers (North-Holland),
Amsterdam, The Netherlands, 1988, pp. 67-85.
[12] Robert Carr and Dan Shafer, "The Power of PenPoint." Addison-Wesley,
Reading, MA, 1991.
[13] Philip R. Cohen, "The Role of Natural Language in a Multimodal In-
terface," Proceedings of the ACM Conf. on User Interface Systems and
Technology (UIST), Nov. 15-18, 1992, pp 143-149.

[14] Michael Cohen and Elizabeth M. Wenzel, "The Design of Multidimensional


Sound Interfaces, Virtual Environments and Advanced Interface Design,"
(Barfield and Furness, Eds.), Oxford University Press, 1995. Also available
as Tech. Report 95-1-004, University of Aizu, Japan.

[15] Roger Dannenberg and Meera Blattner, "Introduction to the book," Mul-
timedia Interface Design. (M. Blattner and R. Dannenberg, Eds), ACM
Press. New York and Addison-Wesley. Reading, Massachusetts, 1992, pp
xvii-xxv.

[16] Roger B. Dannenberg and Robert L. Joseph, "Human-Computer Inter-
action in The Piano Tutor," Multimedia Interface Design, (M. Blattner
and R. Dannenberg, eds), ACM Press, New York and Addison-Wesley,
Reading, Massachusetts, 1992, pp 65-78.
[17] Marc Davis. "Media Streams: Representing Video for Retrieval and Re-
purposing." Ph.D. thesis, Massachusetts Institute of Technology, 1995.
[18] Abbe Don, Susan Brennan, Brenda Laurel, and Ben Shneiderman, "An-
thropomorphism: From Eliza to Terminator 2," (Panel) Human Factors
in Computing Systems. ACM CHI '92. Monterey, CA, May 3-7, 1992, pp
67-70.

[19] Herbert Dreyfus and Stuart Dreyfus, "Mind over Machine - The Power of
Human Intuition and Expertise in The Era of The Computer." The Free
Press, New York. 1986.

[20] Thomas D. Erikson. "Working with Interface Metaphors." The Art of


Human-Computer Interface Design, Addison-Wesley, (Ed. Brenda Laurel),
Reading, MA, 1990, pp 57-64.

[21] Harriet J. Fell. Hariklia Delta, Regina Peterson, Linda J. Ferrier, Zehra
Mooraj. and Megan Valleau. "Using the Baby-Babble-Blanket for Infants

with Motor Problems," Proceedings of ACM ASSETS '94, October 31-


November 1, 1994, Marina del Rey, CA, pp 77-84.

[22] William W. Gaver, "Auditory Icons: Using Sound in Computer Interfaces,"


Human-Computer Interaction, Vol 2, No.2, 1986, pp 167-177.

[23] Ian W. Hunter, Tilemachos D. Doukoglou, Serge R. Lafontaine, Paul G.


Charette. Lynette A. Jones, Mark A. Sagar, Gordon D. Mallinson, and
Peter J. Hunter. "A Teleoperated Microsurgical Robot and Associated
Virtual Environment for Eye Surgery," Presence, Vol. 2, No.4, Fall 1993,
pp 265-280.

[24] Hiroshi Ishii, Minoru Kobayashi, and Arita, K., "Iterative Design of Seam-
less Collaboration Media," Communications of the ACM (CACM), Special
Issue on Internet Technology, ACM. Vol. 37. No.8, August 1994, pp 83-97.

[25] David L. Jaffe, "An Overview of Programs and Projects at the Rehabili-
tation Research and Development Center," ACM ASSETS '94, October
31-November 1, 1994, Marina del Rey, CA, pp 69-76.

[26] Gregory Kramer, editor, "Auditory Display: the Proceedings of the First
International Conference on Auditory Display," Addison-Wesley, Santa Fe
Institute Series, Reading, MA, 1994.
[27] Brenda Laurel, Tim Oren, and Abbe Don, "Issues in Multimedia Interface
Design: Media Integration and Interface Agents," Multimedia Interface
Design, (M. Blattner and R. Dannenberg, eds), ACM Press, New York
and Addison-Wesley, Reading, Massachusetts, 1992, pp 53-64.

[28] D.L. Mansur, M.M. Blattner, and K.I. Joy, "Sound-Graphs: A Numerical
Data Analysis Method for the Blind," Journal of Medical Systems, Vol. 9,
1985, pp. 163-174.

[29] Jonathan R. Merril, "Surgery on the Cutting-Edge," Virtual Reality


World, November/December 1993, pp 34-38.

[30] Michael J. Muller, Robert F. Farrell, Kathleen D. Cebulka, and John G.


Smith, "Issues in the Usability of Time-Varying Multimedia," in Multime-
dia Interface Design, (M. Blattner and R. Dannenberg, Eds), ACM Press,
New York and Addison-Wesley, Reading, Massachusetts, 1992, pp 7-38.

[31] B. A. Nardi and C. L. Zarmer, "Beyond Models and Metaphors:
Visual Formalisms in User Interface Design," Journal of Visual Languages
and Computing, Vol. 4, No. 1, March 1993, pp. 5-34.

[32] C. Nass, J. Steuer, and H. Reeder, "Anthropomorphism, Agency, &
Ethopoeia: Computers as Social Actors," INTERCHI '93 Adjunct Pro-
ceedings, April 24-29, Amsterdam, The Netherlands, ACM/SIGCHI, pp.
111-112.
[33] Theodor Holm Nelson, "The Right Way to Think About Software," The
Art of Human-Computer Interface Design, Addison-Wesley, (Ed. Brenda
Laurel), Reading, MA, 1990, pp 229-235.

[34] Sharon L. Oviatt, "PEN/VOICE: Complementary Multimodal Communi-


cation," Speech Technology, 1994, pp 22-25.

[35] Albert L. Papp and Meera M. Blattner, "A Centralized Audio Presenta-
tion Manager," Auditory Displays: Proceedings of the 2nd International
Conference on Auditory Display, 1995, Addison-Wesley, Santa Fe Series,
In Press.

[36] George G. Robertson, Stuart K. Card, and Jock D. Mackinlay, "Informa-


tion Visualization Using 3D Interactive Animation," Communications of
the ACM, Vol. 36, No.4, April 1993, pp. 56-71.

[37] Alexander I. Rudnicky and Alexander C. Hauptmann, "Multimodal In-


teraction in Speech Systems," Multimedia Interface Design, (M. Blattner
and R. Dannenberg, eds), ACM Press, New York and Addison-Wesley,
Reading, Massachusetts, 1992, pp 147-172.
[38] R. Murray Schafer, "The Tuning of the World," Alfred Knopf, New York,
1977.
[39] Rebecca R. Springmeyer, Meera M. Blattner, and Nelson L. Max, "Devel-
oping A Broader Basis for Scientific Data Analysis Interfaces," Proceedings
of Visualization '92, Boston, October, pp 235-242.

[40] Rebecca R. Springmeyer, "Designing for Scientific Data Analysis: From


Practice to Prototype," Ph.D. Thesis, University of California, Davis,
published as UCRL-LR-111809, Lawrence Livermore National Laboratory
Tech Report, 1992.

[41] Lucy Suchman, "Plans and Situated Actions: The Problem of Human-
Machine Communication," Cambridge: Cambridge University Press, 1987.

[42] I. Sutherland, "The Ultimate Display," In Proceedings of IFIP Congress,


Vol. 2, pp 506-508.

[43] Laura Teodosio and Walter Bender, "Salient Video Stills: Content and
Context Preserved," ACM Multimedia '93, August 1-6, 1993, Anaheim,
California, pp. 39-46.

[44] Tony Thomas, "Music for the Movies," A.S. Barnes and Company, New
York, 1973.
[45] Hirotada Ueda, Takafumi Miyatake, Shiegeo Sumino and Akio Nagasaka,
"Automatic Structure Visualization for Video Editing," ACM INTERCHI
'93, April 24-29, 1993, Amsterdam, pp 137-141.

[46] Benjamin Watson, "A Survey of Virtual Reality in Japan," Presence, Vol.
3, No. 1, Winter 1994, MIT Press, pp 1-18.

[47] Elizabeth M. Wenzel, "Three-Dimensional Virtual Acoustic Environ-


ments," Multimedia Interface Design, (M. Blattner and R. Dannenberg,
Eds), ACM Press, New York and Addison-Wesley, Reading, Massachusetts.
1992, pp 257-288.

[48] Terry Winograd and Fernando Flores, "Understanding Computers and


Cognition," Ablex, 1986; Addison-Wesley, Reading, MA, 1990.

[49] Dennis Wixon, Karen Holtzblatt, and Stephen Knox, "Contextual Design:
An Emergent View of System Design," Human Factors in Computing Sys-
tems, ACM CHI '90, Seattle, WA, April 1-5, 1990, pp 329-336.
4
MULTIMEDIA STORAGE SYSTEMS
Harrick M. Vin* and P. Venkat Rangan**
*Department of Computer Sciences, University of Texas at Austin, USA

** Department of Computer Science and Engineering, UC San Diego, USA

ABSTRACT
Multimedia storage servers provide access to multimedia objects including text, im-
ages, audio, and video. Due to the real-time storage and retrieval requirements and
the large storage space and data transfer rate requirements of digital multimedia,
however, the design of such servers fundamentally differs from conventional storage
servers. Architectures and algorithms required for designing digital multimedia stor-
age servers are the subject matter of this chapter.

1 INTRODUCTION
Recent advances in computing and communication technologies have made it
feasible and economically viable to provide on-line access to a variety of infor-
mation sources such as books, periodicals, images, video clips, and scientific
data. The architecture of such services consists of multimedia storage servers
that are connected to client sites via high-speed networks [13]. Clients of such
a service are permitted to retrieve multimedia objects from the server for real-
time playback at their respective sites. Furthermore, the retrieval may be
interactive, in the sense that clients may stop, pause, resume, and even record
and edit the media information if they have permission to do so.

The design of such multimedia servers differs significantly from traditional


text/numeric storage servers due to two fundamental characteristics of digi-
tal audio and video:
Media Type (specifications)                                     Data Rate
--------------------------------------------------------------------------
Voice quality audio (1 channel, 8 bit samples at 8 kHz)         64 Kbits/sec
MPEG encoded audio (equivalent to CD quality)                   384 Kbits/sec
CD quality audio (2 channels, 16 bit samples at 44.1 kHz)       1.4 Mbits/sec
MPEG-2 encoded video                                            0.42 MBytes/sec
NTSC quality video (640 x 480 pixels, 24 bits/pixel)            27 MBytes/sec
HDTV quality video (1280 x 720 pixels/frame, 24 bits/pixel)     81 MBytes/sec

Table 1  Storage space requirements for uncompressed digital multimedia data.

• Large data transfer rate and storage space requirement: Playback of digital
video and audio consumes data at a very high rate (see Table 1).
Thus, a multimedia service must provide efficient mechanisms for storing,
retrieving and manipulating data in large quantities at high speeds.
• Real-time storage and retrieval: Digital audio and video (often referred
to as "continuous" media) consist of a sequence of media quanta (such as
video frames or audio samples) which convey meaning only when presented
continuously in time. This is in contrast to a textual object, for which spa-
tial continuity is sufficient. Furthermore, a multimedia object, in general,
may consist of several media components whose playback is required to be
temporally coordinated.

The main goal of this chapter is to provide an overview of the various issues in-
volved in designing a digital multimedia storage server, and present algorithms
for addressing the specific storage and retrieval requirements of digital multime-
dia. Specifically, to manage the large storage space requirements of multimedia
data, we examine techniques for efficient placement of media information on
individual disks, large disk arrays, as well as hierarchies of storage devices
(Section 3). To address the real-time recording and playback requirements, we
discuss a set of admission control algorithms which a multimedia server may

employ to determine whether a new client can be admitted without violating


the real-time requirements of the clients already being serviced (Section 4). We
will then briefly describe some of the existing commercial multimedia servers,
and then present some concluding remarks.

2 MULTIMEDIA STORAGE SERVERS


Digitization of video yields a sequence of frames and that of audio yields a
sequence of samples. We refer to a sequence of continuously recorded video
frames or audio samples as a stream. Since media quanta, such as video frames
or audio samples, convey meaning only when presented continuously in time, a
multimedia server must ensure that the recording and playback of each media
stream proceeds at its real-time rate. Specifically, during recording, a multi-
media server must continuously store the data produced by an input device
(e.g., microphone, camera, etc.) so as to prevent buffer overruns at the device.
During playback, on the other hand, the server must retrieve data from the
disk at a rate which ensures that an output device (e.g., speaker, video display)
consuming the data does not starve. Although semantically different, both of
these operations have been shown to be mathematically equivalent with respect
to their real-time performance requirements [3]. Consequently, for the sake of
clarity, we will only discuss techniques for retrieving media information from
disk for real-time playback. Analysis for real-time recording can be carried out
similarly.

Continuous playback of a media stream consists of a sequence of periodic tasks


with deadlines, where tasks correspond to retrievals of media blocks from disk
and deadlines correspond to the scheduled playback times. Although it is pos-
sible to conceive of systems that would fetch media quanta from the storage
system just in time to be played, in practice the retrieval is likely to be bursty.
Consequently, information retrieved from the disk may have to be buffered prior
to playback.

Therefore, the challenge for the server is to keep enough data in stream buffers
at all times so as to ensure that the playback processes do not starve [8]. In
the simplest case, since the data transfer rates of disks are significantly higher
than the real-time data rate of a single stream (e.g., the maximum throughput
of modern disks is of the order of 3-4 MBytes/s, while that of an MPEG-2
encoded video stream is 0.42 MBytes/s, and uncompressed CD-quality stereo
audio is about 0.2 MBytes/s), employing a modest amount of buffering will enable
conventional file and operating systems to support continuous storage and
retrieval of isolated media streams.

In practice, however, a multimedia server has to process requests from several


clients simultaneously. In the best scenario, all the clients will request the
retrieval of the same media stream, in which case, the multimedia server needs
only to retrieve the stream once from the disk and then multicast it to all the
clients. However, more often than not, different clients will request the retrieval
of different streams; and even when the same stream is being requested by
multiple clients (such as a popular movie), requests may arrive at arbitrary
intervals while the stream is already being serviced. Thus, each client may be
viewing a different part of the movie at the same time.

A simple mechanism to guarantee that the real-time requirements of all the


clients are met is to dedicate a disk head to each stream, and then treat each
disk head as a single stream system. This, however, limits the total number of
streams to the number of disk heads. In general, since the data transfer rates
of disks are significantly higher than the real-time data rate of a single stream,
the number of streams that can be serviced simultaneously can be significantly
increased by multiplexing a disk head among several streams. However, in doing
so, the server must ensure that the continuous playback requirements of all the
streams are met. The number of clients that can be simultaneously serviced by
a multimedia server is dependent on the placement of multimedia streams on
disk as well as the servicing algorithm. In what follows, we first outline methods
for managing the storage space requirements of digital multimedia, and then
present algorithms for servicing a large number of clients simultaneously.

3 MANAGING THE STORAGE SPACE


REQUIREMENT OF DIGITAL
MULTIMEDIA
A storage server must divide video and audio streams into blocks while storing
them on disks. Since media quanta, such as video frames or audio samples,
convey meaning only when presented continuously in time, a multimedia server
must organize their storage on disk so as to ensure that the playback of each
media stream proceeds at its real-time rate. Moreover, due to the large storage
space requirements of digital audio and video, an interactive, read-write storage

server must employ flexible placement strategies that minimize copying of media
information during editing.

In order to explore the viability of various placement models for storing digital
continuous media on conventional magnetic disks, let us first briefly review
some of the fundamental characteristics of magnetic disks. Generally, magnetic
disks consist of a collection of platters, each of which is composed of a number
of circular recording tracks (see Figure 1). Platters spin at a constant rate.
Moreover, the amount of data recorded on tracks may increase from the inner-
most track to the outer-most track (e.g., zoned disks). The storage space of
each track is divided into several disk blocks, each consisting of a sequence of
physically contiguous sectors. Each platter is associated with a read/write head
that is attached to a common actuator. A cylinder is a stack of tracks at one
actuator position.

Figure 1  Architectural model of a conventional magnetic disk: platters with
circular recording tracks, read/write heads attached to a common actuator, and
the direction of rotation.

In such an environment, the access time of a disk block consists of three com-
ponents: seek time, rotational latency, and data transfer time. Seek time is the
time needed to position the disk head on the track containing the desired data,
and is a function of the initial start-up cost to accelerate the disk head as well
as the number of tracks that must be traversed. Rotational latency, on the
other hand, is the time for the desired data to rotate under the head before it
can be read or written, and is a function of the angular distance between the
current position of the disk head and the location of the desired data, as well as
the rate at which platters spin. Once the disk head is positioned at the desired

disk block, the time to retrieve its contents is referred to as the data transfer
time, and is a function of the disk block size and data transfer rate of the disk.
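As a back-of-the-envelope illustration of this access-time model, the short sketch below (in Python; the parameter names and the example numbers are illustrative assumptions, not measurements of any particular disk) estimates the worst-case time to read one block as the sum of seek time, a full-rotation latency, and transfer time, using the linear seek model a + b * |c1 - c2| adopted later in this chapter.

    # A minimal sketch of the disk access-time model: seek + rotational latency
    # + transfer time. Parameter names and values are illustrative assumptions.

    def seek_time(current_cyl, target_cyl, a, b):
        # Linear seek model: start-up cost plus a per-cylinder component.
        if current_cyl == target_cyl:
            return 0.0
        return a + b * abs(current_cyl - target_cyl)

    def block_access_time(current_cyl, target_cyl, block_bytes,
                          a, b, rotation_period, transfer_rate):
        # Worst case: assume a full revolution of rotational latency.
        transfer = block_bytes / transfer_rate
        return seek_time(current_cyl, target_cyl, a, b) + rotation_period + transfer

    # Example: 64 KB block, 10 ms seek start-up, 10 microseconds per cylinder,
    # 8.3 ms per revolution, 4 MB/s sustained transfer rate.
    print(block_access_time(100, 5000, 64 * 1024, 0.010, 0.00001, 0.0083, 4_000_000))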

Assuming a multimedia server consisting of such magnetic disks, in this section,


we will first describe models for storing digital continuous media on individual
disks, and then discuss the effects of utilizing disk arrays as well as storage
hierarchies.

3.1 Placement of Data Blocks on Individual


Disks
The placement of data blocks on disks in storage servers is generally governed
by either contiguous, random, or constrained placement policy. Traditionally,
high performance storage servers have employed contiguous placement of media
blocks on disk [18]. In this case, once the disk head is positioned at the begin-
ning of a stream, all the media blocks constituting the stream can be retrieved
without incurring any seek or rotational latency, thereby defining a lower bound
on the retrieval time of a media stream from disk. However, in highly inter-
active, read-write file system environments, contiguous placement of blocks of
a stream is fraught with inherent problems of fragmentation, and can entail
enormous copying overheads during insertions and deletions for maintaining
the contiguous nature of media streams on disk. Thus, the contiguous placement
model, although suitable for read-only systems (such as compact discs, CLVs,
etc.), is not viable for flexible, read-write storage systems.

Storage servers for read-write systems have traditionally employed random


placement of blocks on disk [12, 21]. Since this placement model does not
impose any restrictions on the separation between successive media blocks,
editing operations do not incur any overhead for restructuring the storage of
media streams on disk. However, the fundamental limitation of such random
placement policy, from the perspective of designing a multimedia storage server,
is the overhead (seek time and rotational latency) incurred while accessing suc-
cessive blocks of a stream. Although the effect of higher separation between
successive media blocks on disk can be circumvented by increasing the size of
media blocks, the seek time and rotational latency resulting from the random
separation between successive media blocks on disk may yield very low effective
data transfer rates, thereby limiting the number of clients that can be serviced
simultaneously.

Clearly, the contiguous and random placement models represent two ends of a
spectrum: Whereas the former does not permit any separation between succes-
sive media blocks of a stream on disk, the latter does not impose any constraints
at all. Recently, an efficient generalization of these two extremes, referred to
as the constrained placement policy, has also been proposed [16]. The main
objective of constrained placement policy is to ensure continuity of retrieval,
as well as reduce the average seek time and rotational latency incurred while
accessing successive media blocks of a stream by bounding the size of each me-
dia block as well as the separation between successive media blocks on disk.
Such a placement policy is particularly attractive when the block size must be
small (e.g., when utilizing a conventional file system with block sizes tailored
for text). However, implementation of such a system may require elaborate al-
gorithms to ensure that the separation between blocks conforms to the required
constraints. Furthermore, for constrained placement to yield its full benefits, the
scheduling algorithm must retrieve all the blocks for a given stream at once
before switching to any other stream.

3.2 Efficient Placement in Multi-disk


Multimedia Servers
Due to the immensity of the sizes and the data transfer requirements of mul-
timedia objects, multimedia servers will undeniably be founded on large disk
arrays. Disk arrays achieve high performance by servicing multiple I/O requests
concurrently, as well as by utilizing several disks to service a single request in
parallel. The performance of a disk array, however, is critically dependent on
the distribution of the workload (i.e., the number of blocks to be retrieved
from the array) among the disks. The higher the imbalance in the workload
distribution, the lower is the throughput of the disk array.

To effectively utilize a disk array, a multimedia server must interleave the stor-
age of each media stream among the disks in the array. The unit of data
interleaving, referred to as a media block, denotes the maximum amount of log-
ically contiguous data that is stored on a single disk (this has also been referred
to as the striping unit in the literature [7]). Successive media blocks of a stream
are placed on consecutive disks using a round-robin allocation algorithm.
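Concretely, the round-robin allocation just described amounts to a modular mapping from a stream's block number to a disk. The sketch below (Python; the function and parameter names are hypothetical) shows how a server might locate the disk that holds a given media block of a stream.

    # A minimal sketch of round-robin (striped) placement on a disk array.
    # 'start_disk' is the disk holding the stream's first media block.

    def disk_for_block(block_index, start_disk, num_disks):
        # Successive media blocks of a stream land on consecutive disks.
        return (start_disk + block_index) % num_disks

    # A stream whose first block is on disk 2 of an 8-disk array:
    print([disk_for_block(j, start_disk=2, num_disks=8) for j in range(10)])
    # -> [2, 3, 4, 5, 6, 7, 0, 1, 2, 3]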

Each media block may contain either a fixed number of media units or a fixed
number of storage units (e.g., bytes) [5, 10,26]. If each media stream stored on
the array is encoded using a variable bit rate (VBR) compression technique, the
storage space requirement may vary from one media unit to another. Hence, a

server that composes a media block by accumulating a fixed number of media


units will be required to store variable-size media blocks on the array. On the
other hand, if media blocks are assumed to be of fixed size, they may contain
varying number of media units. Thus, depending on the placement policy,
accessing a fixed number of media units may require the server to retrieve
either a fixed number of variable-size blocks or a variable number of fixed-size
blocks from the array.

The most appealing feature of the variable-size block placement policy is that,
regardless of the playback rate requirements, if we assume that a server services
clients by proceeding in terms of periodic rounds during which it accesses a fixed
number of media units for each client, then the retrieval of each individual video
stream from disk proceeds in lock-step. That is, each client accesses exactly one
disk during each round, and consecutive disks in the array are accessed by
the same set of clients during successive rounds. In such a scenario, the server
can partition the set of clients into D logical groups (where D is the number
of disks in the array), and then admit a new client by assigning it to the most
lightly loaded group. Such a simple policy balances load across the disks in
the array, and thereby maximizes the number of clients that can be serviced
simultaneously by the server. However, the key limitations of the variable-
size block placement policy include: (1) the inherent complexity of allocating
and deallocating variable-size media blocks, and (2) the higher implementation
overheads. Thus, although the variable-size block placement policy is highly
attractive for designing multimedia storage servers for predominantly read-only
environments (e.g., video on-demand), it may not be viable for the design of
integrated multimedia file systems (in which multimedia documents are created,
edited, and destroyed very frequently).
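The load-balancing rule mentioned above, assigning a newly admitted client to the most lightly loaded of the D logical groups, is simple enough to sketch directly (Python; the group-load representation is an illustrative assumption, not the authors' data structure).

    # A small sketch of the group-assignment policy for variable-size block
    # placement: one logical group per disk, and a new client goes to the group
    # with the fewest clients. 'group_loads[i]' counts the clients in group i.

    def assign_to_lightest_group(group_loads):
        group = min(range(len(group_loads)), key=lambda i: group_loads[i])
        group_loads[group] += 1
        return group

    group_loads = [4, 2, 3, 2]                    # D = 4 disks / groups
    print(assign_to_lightest_group(group_loads))  # -> 1 (ties broken by lowest index)
    print(group_loads)                            # -> [4, 3, 3, 2]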

In fixed-size block placement policy, on the other hand, a multimedia server


partitions video streams into fixed size blocks for storing them on the array.
Thus, to access a fixed number of frames of VBR encoded video streams during
each round, the server will be required to access varying number of blocks for
each client. Since the set of disks accessed by a client may be unrelated to
those accessed by other clients, the number of media blocks to be accessed
during a round may vary from one disk to another. Due to this variation,
the time spent in accessing the required blocks from the most heavily loaded
disk may occasionally exceed the duration of a round, resulting in playback
discontinuities for clients. To reduce the occurrence of such discontinuities, the
server must select a media block size that minimizes the expected service time
of the most heavily loaded disk [26]. Additionally, the server may be required
to exploit the sequential nature of video and audio playback to precompute
the set of blocks to be accessed from disks during future rounds, and prefetch
a subset of them on detecting an overflow in the future round. Although


such dynamic load balancing schemes yield improved quality of service for the
clients being serviced simultaneously, it is at the expense of an increase in buffer
space requirement. By judiciously choosing the set of blocks to be read ahead in
underflow rounds, the increase in the buffer space requirement can be minimized
[22].

3.3 Utilizing Storage Hierarchies


The preceding discussion has focused on utilizing fixed disks as the storage
medium for the multimedia server. Although sufficient for providing efficient
access to a small number of video streams (e.g., 25-50 most popular titles
maintained by a video rental store), the high cost per gigabyte of storage makes
such magnetic disk-based server architectures ineffective for large-scale servers.
In fact, the desire for sharing and providing on-line access to a wide variety of
video sequences indicates that large-scale multimedia servers must utilize very
large tertiary storage devices (e.g., tape and optical jukeboxes; see Table 2).
These devices are highly cost-effective and provide very large storage capacities
by utilizing robotic arms to serve a large number of removable tapes or disks to
a small number of reading devices. Because of their long seek and swap times,
however, they are poor at performing random access within a video stream.
Moreover, they can support only a single playback at a time on each reader.
Consequently, they are inappropriate for direct video playback. Thus, a large-
scale, cost-effective multimedia server will be required to utilize tertiary storage
devices (such as tape jukeboxes) to maintain a large number of video streams,
and then achieve high performance and scalability through magnetic disk-based
servers.

In the simplest case, such a hierarchical storage manager may utilize fast mag-
netic disks to cache frequently accessed data. In such a scenario, there are
several alternatives for managing the disk system. It may be used as a staging
area (cache) for the tertiary storage devices, with entire media streams being
moved from the tertiary storage to the disks when they need to be played back.
On the other hand, it is also possible to use the disks to only provide storage
for the beginning segments of the multimedia streams. These segments may
be used to reduce the startup latency and to ensure smooth transitions in the
playback [14].

A distributed hierarchical storage management extends this idea by allowing


multiple magnetic disk-based caches to be distributed across a network. In

                 Magnetic Disk   Optical Disk   Low-End Tape   High-End Tape
Capacity         9 GB            200 GB         500 GB         10 TB
Mount Time       0 sec           20 sec         60 sec         90 sec
Transfer Rate    2 MB/s          300 KB/s       100 KB/s       1,000 KB/s
Cost/GB          $555/GB         $125/GB        $100/GB        $50/GB

Table 2  Tertiary storage devices.

such a scenario, if a high percentage of clients access data stored in a local (or
nearby) cache, the perceived performance will be sufficient to meet the demands
of continuous media. On the other hand, if the user accesses are unpredictable
or have poor reference locality, then most accesses will require retrieval of in-
formation from tertiary storage devices, thereby significantly degrading the
performance [6, 11, 19].

Having discussed the techniques for efficient placement of multimedia objects


on disk arrays and tertiary storage devices, we will now describe the techniques
for meeting the real-time playback requirements of multimedia streams from a
disk-based server.

4 EFFICIENT RETRIEVAL OF
MULTIMEDIA OBJECTS
Due to the periodic nature of multimedia playback, a multimedia server can
service multiple clients simultaneously by proceeding in terms of rounds. Dur-
ing each round, the multimedia server retrieves a sequence of media blocks for
each stream, and the rounds are repeatedly executed until the completion of
all the requests. The number of blocks of a media stream retrieved during a
round is dependent on its playback rate requirement, as well as the buffer space
availability at the client [25]. Ensuring continuous retrieval of each stream re-
quires that the service time (i.e., the total time spent in retrieving media blocks
during a round) does not exceed the minimum of the playback durations of the
blocks retrieved for each stream during a round. Hence, before admitting a
new client, a multimedia server must employ admission control algorithms to


decide whether a new client can be admitted without violating the continuous
playback requirements of the clients already being serviced.

To precisely formulate the admission control criteria, consider a multimedia


server that is servicing n clients, each retrieving a different media stream (say,
$S_1, S_2, \ldots, S_n$, respectively). Let $f_1, f_2, \ldots, f_n$ denote the number of frames of
streams $S_1, S_2, \ldots, S_n$ retrieved during each round. Then, assuming that $R_i^{pl}$
denotes the playback rate (expressed in terms of frames/sec) of stream $S_i$, the
duration of a round, defined as the minimum of the playback durations of the
frames accessed during a round, is given by:

$$\mathcal{R} = \min_{i \in [1,n]} \left( \frac{f_i}{R_i^{pl}} \right)$$

Additionally, let us assume that media blocks of each stream are placed on disk
using random placement policy, and that the multimedia server is employing
the SCAN disk scheduling policy, in which the disk head moves back and forth
across the platter and retrieves media blocks in either increasing or decreasing
order of their track numbers.
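Under the SCAN policy assumed here, the media blocks requested within a round are simply serviced in track order, with the sweep direction reversing from one pass to the next; a minimal ordering sketch (Python, with hypothetical request tuples) follows.

    # A minimal sketch of SCAN ordering for one round of block requests.
    # Each request is a (track_number, block_id) pair; names are illustrative.

    def scan_order(requests, ascending=True):
        # Serve the round's requests in a single sweep across the platter.
        return sorted(requests, key=lambda r: r[0], reverse=not ascending)

    round_requests = [(310, "S1.b7"), (12, "S3.b2"), (145, "S2.b9"), (301, "S1.b8")]
    print(scan_order(round_requests))                    # inward sweep
    print(scan_order(round_requests, ascending=False))   # return sweep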

4.1 Deterministic Admission Control


Algorithm
To ensure that the continuous playback requirements of all the clients are
strictly met throughout the duration of service (i.e., to provide deterministic
service guarantees), the server must ensure that the playback rate requirements
of all the clients are met even in the worst case. Specifically, if $k_1, k_2, \ldots, k_n$
denote the maximum number of blocks of streams $S_1, S_2, \ldots, S_n$ that may need to
be retrieved during a round, then, in the worst case, each of the $(k_1 + k_2 + \cdots + k_n)$
blocks may be stored on a separate track. Thus, the disk head may have to
be repositioned onto a new track at most $(k_1 + k_2 + \cdots + k_n)$ times during
each round. Hence, using the symbols for disk parameters presented in
Table 3, the total seek time incurred during each round can be computed as
$(a * \sum_{i=1}^{n} k_i + b * C)$. Similarly, retrieval of each media block may, in the worst
case, incur a rotational latency of $l_{rot}^{max}$, yielding that the total rotational
latency incurred during each round is bounded by $(l_{rot}^{max} * \sum_{i=1}^{n} k_i)$. Hence, the
total service time for each round is bounded by:

$$\tau = b * C + (a + l_{rot}^{max}) * \sum_{i=1}^{n} k_i \qquad (4.1)$$

Symbol              Explanation                          Units
C                   Number of cylinders on disk          -
a, b                Constants (seek time parameters)     sec
l_seek(c1, c2)      Seek time: a + b * |c1 - c2|         sec
l_rot^max           Maximum rotational latency           sec

Table 3  Summary of disk parameters used in this chapter.

Ensuring continuous retrieval of each stream requires that the total service time
per round does not exceed the minimum of the playback durations of the $k_1$, $k_2$,
..., or $k_n$ blocks [3, 8, 17, 21, 25, 27]. We refer to this as the deterministic
admission control principle, which can be formally stated as:

$$b * C + (a + l_{rot}^{max}) * \sum_{i=1}^{n} k_i \le \mathcal{R} \qquad (4.2)$$
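To make the deterministic test concrete, the following sketch (Python; the dictionary fields and parameter names are assumptions for illustration, not the authors' interface) evaluates Equation (4.2): it recomputes the round duration and the worst-case service time for the already admitted clients plus the candidate, and admits the candidate only if the inequality still holds.

    # A minimal sketch of the deterministic admission control test (Eq. 4.2).
    # Each client is described by its frames per round, playback rate
    # (frames/sec), and worst-case blocks per round k_i; a, b, C, and l_rot_max
    # are the disk parameters of Table 3. All field names are illustrative.

    def round_duration(clients):
        # R = min over clients of (frames retrieved per round / playback rate).
        return min(c["frames_per_round"] / c["playback_rate"] for c in clients)

    def worst_case_service_time(clients, a, b, C, l_rot_max):
        # Worst case under SCAN: b*C + (a + l_rot_max) * sum of k_i  (Eq. 4.1).
        total_blocks = sum(c["max_blocks_per_round"] for c in clients)
        return b * C + (a + l_rot_max) * total_blocks

    def admit_deterministic(admitted, candidate, a, b, C, l_rot_max):
        # Admit only if Eq. (4.2) still holds with the candidate included.
        clients = admitted + [candidate]
        service = worst_case_service_time(clients, a, b, C, l_rot_max)
        return service <= round_duration(clients)

A server would run such a check once per admission request, against the set of clients currently being serviced.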

Notice, however, that due to the human perceptual tolerances as well as the
inherent redundancy in continuous media streams, most clients of a multime-
dia server are tolerant to brief distortions in playback continuity as well as
occasional loss of media information. Therefore, providing deterministic ser-
vice guarantees to all the clients is superfluous. Furthermore, the worst-case
assumptions that characterize deterministic admission control algorithms need-
lessly constrain the number of clients that can be serviced simultaneously, and
hence, lead to severe under-utilization of server resources.

4.2 Statistical Admission Control Algorithm


To exploit the human perceptual tolerances and the differences between the
average and the worst-case performance characteristics of the multimedia server,
a statistical admission control algorithm has also been proposed in the literature
[24]. This algorithm utilizes precise distributions of access times and playback
rates, rather than their corresponding worst-case values, and provides statistical
service guarantees to each client (i.e., the continuity requirements of at
least a fixed percentage of media units are ensured to be met).

To clearly explain this algorithm, let us assume that the service requirements of
client $i$ are specified as a percentage $p_i$ of the total number of frames that must
be retrieved on time. Moreover, let us assume that each media stream may be
encoded using a variable bit rate compression technique (e.g., JPEG, MPEG,
etc.). In such a scenario, the number of media blocks that contain $f_i$ frames of
stream $S_i$ may vary from one round to another. This difference, when coupled
with the variation in the relative separation between blocks, yields different
service times across rounds. In fact, while servicing a large number of clients,
the service time may occasionally exceed the round duration (i.e., $\tau > \mathcal{R}$).
We refer to such rounds as overflow rounds. Given that each client may have
requested a different quality of service (i.e., different values of $p_i$), meeting
all of their service requirements will require the server to delay the retrieval
of or discard (i.e., not retrieve) media blocks of some of the more tolerant
clients during overflow rounds.^1 Consequently, to ensure that the statistical
quality of service requirements of clients are not violated, a multimedia server
must employ admission control algorithms that restrict the occurrence of such
overflow rounds by limiting the number of clients admitted for service.

To precisely derive an admission control criterion that meets the above require-
ment, observe that for rounds in which $\tau \le \mathcal{R}$, none of the media blocks need
to be discarded. Therefore, the total number of frames retrieved during such
rounds is given by $\sum_{i=1}^{n} f_i$. During overflow rounds, however, since a few media
blocks may have to be discarded or delayed to yield $\tau \le \mathcal{R}$, the total number
of frames retrieved will be smaller than $\sum_{i=1}^{n} f_i$. Given that $p_i$ denotes the
percentage of frames of stream $S_i$ that must be retrieved on time to satisfy
the service requirements of client $i$, the average number of frames that must be
retrieved during each round is given by $p_i * f_i$. Hence, assuming that $q$ denotes
the overflow probability (i.e., $P(\tau > \mathcal{R}) = q$), the service requirements of the
clients will be satisfied if:

$$q * \mathcal{F}_O + (1 - q) \sum_{i=1}^{n} f_i \ge \sum_{i=1}^{n} p_i * f_i \qquad (4.3)$$

where $\mathcal{F}_O$ denotes the number of frames that are guaranteed to be retrieved
during overflow rounds. The left hand side of Equation (4.3) represents the
lower bound on the expected number of frames retrieved during a round and
the right hand side denotes the average number of frames that must be accessed
during each round so as to meet the service requirements of all clients. Clearly,
the effectiveness of this admission control criterion, measured in terms of the
number of clients that can be admitted, is dependent on the values of $q$ and
$\mathcal{F}_O$. In what follows, we present techniques for accurately determining their
values.

^1 The choice between delaying or discarding media blocks during overflow rounds is
application dependent. Since both of these policies are mathematically equivalent, in this
paper, we will analyze only the discarding policy.

Computing the Overflow Probability


While servicing multiple clients simultaneously, an overflow is said to occur
when the service time exceeds the playback duration of a round. Whereas the
playback duration $\mathcal{R}$ of a round is fixed (since the server is accessing a fixed
number of frames for each client), the service time varies from round to round.
Let the random variable $\tau_k$ denote the service time for accessing $k$ media blocks
from disk. Then the overflow probability $q$ can be computed as:

$$q = P(\tau > \mathcal{R}) = \sum_{k=k_{min}}^{k_{max}} P(\tau > \mathcal{R} \mid B = k) \, P(B = k) = \sum_{k=k_{min}}^{k_{max}} P(\tau_k > \mathcal{R}) \, P(B = k) \qquad (4.4)$$

where $B$ is the random variable representing the number of blocks to be re-
trieved in a round, and $k_{min}$ and $k_{max}$, respectively, denote its minimum and
maximum values. Hence, computing the overflow probability $q$ requires the
determination of probability distribution functions for $\tau_k$ and $B$, as well as the
values of $k_{min}$ and $k_{max}$, techniques for which are described below.

• Service time characterization:


Given the number of blocks to be accessed during a round, since the service
time is dependent only on the relative placement of media blocks on disk
and the disk scheduling algorithm, and is completely independent of the
client characteristics, service time distributions are required to be com-
puted only once during the lifetime of a multimedia server, possibly at the
time of its installation.
The server can derive a distribution function for $\tau_k$ by empirically measuring
the variation in service times yielded by different placements of $k$ blocks
on disk. The larger the number of such measurements, the greater is the
accuracy of the distribution function. Starting with the minimum number
of blocks that are guaranteed to be accessed during a round (i.e., the value
of $k_d$ derived in Section 4.2), the procedure for determining the distribution
function for $\tau_k$ should be repeated for $k = k_d, k_d + 1, \ldots, k_{end}$, where
$k_{end}$ is the minimum value of $k$ for which $P(\tau_{k_{end}} > \mathcal{R}) \approx 1$. Using these
empirically derived distribution functions, the probability $P(\tau_k > \mathcal{R})$, for
various values of $k$, can be easily computed.
• Client load characterization:

Since $f_i$ frames of stream $S_i$ are retrieved during each round, the total number of blocks $B$ required to be accessed is dependent on the frame size distributions of each stream. Specifically, if the random variable $B_i$ denotes the number of media blocks that contain $f_i$ frames of stream $S_i$, then the total number of blocks to be accessed during each round is given by:

$$B = \sum_{i=1}^{n} B_i$$

Since $B_i$ is only dependent on the frame size variations within stream $S_i$, the $B_i$'s denote a set of $n$ independent random variables. Therefore, using the central limit theorem, we conclude that the distribution function $G_B(b)$ of $B$ approaches a normal distribution [15]. Furthermore, if $\eta_{B_i}$ and $\sigma_{B_i}$ denote the mean and standard deviation of random variable $B_i$, respectively, then the mean and standard deviation of $B$ are given by:

$$\eta_B = \sum_{i=1}^{n} \eta_{B_i}, \qquad \sigma_B^2 = \sum_{i=1}^{n} \sigma_{B_i}^2 \qquad (4.5)$$

Consequently,

$$G_B(b) \approx N\left(\frac{b - \eta_B}{\sigma_B}\right) \qquad (4.6)$$

where $N$ is the standard normal distribution function. Additionally, since the $B_i$'s denote discrete random variables that take only integral values, they can be categorized as lattice-type random variables [15]. Hence, using the central limit theorem, the point probabilities $P(B = k)$ can be derived as:

$$P(B = k) \approx \frac{1}{\sigma_B \sqrt{2\pi}} \, e^{-\frac{(k - \eta_B)^2}{2\sigma_B^2}} \qquad (4.7)$$
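As a rough illustration of Equations (4.5) and (4.7), the sketch below combines per-stream block-count statistics into the normal approximation for $B$; the per-stream means and deviations are assumed to be precomputed from the stored frame-size distributions, and all names are illustrative.

    import math

    def block_count_stats(per_stream_means, per_stream_stds):
        """Aggregate per-stream block-count statistics, Equation (4.5)."""
        eta_B = sum(per_stream_means)
        sigma_B = math.sqrt(sum(s * s for s in per_stream_stds))
        return eta_B, sigma_B

    def prob_B_equals(k, eta_B, sigma_B):
        """Point probability P(B = k) from the normal approximation, Equation (4.7)."""
        return math.exp(-(k - eta_B) ** 2 / (2 * sigma_B ** 2)) / (sigma_B * math.sqrt(2 * math.pi))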

Finally, computing the overflow probability $q$ using Equation (4.4) requires the values of $k_{min}$ and $k_{max}$. If $b_i^{min}$ and $b_i^{max}$, respectively, denote the minimum and the maximum number of media blocks that may contain $f_i$ frames of stream $S_i$, then the values of $k_{min}$ and $k_{max}$ can be derived as:

$$k_{min} = \sum_{i=1}^{n} b_i^{min}, \qquad k_{max} = \sum_{i=1}^{n} b_i^{max} \qquad (4.8)$$

Thus, by substituting the values of $k_{min}$, $k_{max}$, $P(\tau_k > \mathcal{R})$, and $P(B = k)$ in Equation (4.4), the overflow probability $q$ can be computed.
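Putting the pieces together, a minimal sketch of the summation in Equation (4.4) follows; it assumes the exceedance probabilities $P(\tau_k > \mathcal{R})$ have already been measured empirically (here a dictionary keyed by $k$) and that $P(B = k)$ is supplied as a callable, for example the normal approximation sketched above.

    def overflow_probability(k_min, k_max, p_service_exceeds, pmf_B):
        """Equation (4.4): q = sum over k of P(tau_k > R) * P(B = k).

        p_service_exceeds -- mapping k -> empirically measured P(tau_k > R)
        pmf_B             -- callable k -> P(B = k), e.g. the normal
                             approximation of Equation (4.7)
        """
        return sum(p_service_exceeds[k] * pmf_B(k) for k in range(k_min, k_max + 1))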

Determination of $\mathcal{F}_0$

The maximum number of frames $\mathcal{F}_0$ that are guaranteed to be retrieved during an overflow round is dependent on: (1) the number of media blocks that are guaranteed to be accessed from disk within the round duration $\mathcal{R}$, and (2) the relationship between the media block size and the maximum frame sizes.

To compute the number of media blocks that are guaranteed to be accessed during each round, worst-case assumptions (similar to those employed by deterministic admission control algorithms) regarding the access times of media blocks from disk may need to be employed. Specifically, if $k$ denotes the number of media blocks that are to be retrieved during a round, and if the server employs the SCAN disk scheduling algorithm, then as per Equation (4.1), the worst-case service time can be computed as:

$$\tau = b * C + (a + l^{max}) * k$$

Since $\tau \leq \mathcal{R}$, the number of media blocks, $k_d$, that are guaranteed to be retrieved during each round is bounded by:

$$k_d \leq \frac{\mathcal{R} - b * C}{a + l^{max}} \qquad (4.9)$$

Now, assuming that $f(S_i)$ denotes the minimum number of frames that may be contained in a block of stream $S_i$, the lower bound on the number of frames accessed during an overflow round is given by:

$$\mathcal{F}_0 = k_d * \min_{i \in [1,n]} f(S_i) \qquad (4.10)$$
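A small sketch of Equations (4.9) and (4.10) follows, under the assumption that the per-round fixed overhead is $b * C$ and the per-block worst-case cost is $(a + l^{max})$, as in the reconstructed form of the worst-case bound above; all parameter names are illustrative.

    def guaranteed_overflow_frames(R, b, C, a, l_max, min_frames_per_block):
        """F0 = k_d * min_i f(S_i), with k_d from the worst-case bound (4.9).

        min_frames_per_block -- list of f(S_i), the minimum number of frames
                                per block for each admitted stream
        """
        k_d = int((R - b * C) / (a + l_max))          # Equation (4.9)
        return k_d * min(min_frames_per_block)        # Equation (4.10)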

Admitting a New Client

Consider the scenario that a multimedia server receives a new client request for the retrieval of stream $S_{n+1}$. In order to validate that the admission of the new client will not violate the service requirements of the clients already being serviced, the server must first compute the overflow probability assuming that the new client has been admitted. In order to do so, the server must determine:

1. The mean and the standard deviation of the number of media blocks that may contain $f_{n+1}$ frames of stream $S_{n+1}$ (denoted by $\eta_{B_{n+1}}$ and $\sigma_{B_{n+1}}$, respectively), to be used in Equations (4.5) and (4.6);

2. The minimum and the maximum number of media blocks that may contain $f_{n+1}$ frames of stream $S_{n+1}$ (denoted by $b_{n+1}^{min}$ and $b_{n+1}^{max}$, respectively), to be used in Equation (4.8); and

3. The minimum number of frames contained in a media block of stream $S_{n+1}$ (denoted by $f(S_{n+1})$), to be used in Equation (4.10).

Since all of these parameters are dependent on the distribution of frame sizes
in stream $S_{n+1}$, the server can simplify the processing requirements at the time
of admission by precomputing these parameters while storing the media stream
on disk.

These values, when coupled with the corresponding values for all the clients already being serviced, as well as the predetermined service time distributions, will yield new values for $q$ and $\mathcal{F}_0$. The new client is then admitted for service if the newly derived values for $q$ and $\mathcal{F}_0$ satisfy the admission control criterion:

$$q * \mathcal{F}_0 + (1 - q) \sum_{i=1}^{n+1} f_i \geq \sum_{i=1}^{n+1} p_i * f_i$$
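Combining the sketches above, a hypothetical end-to-end admission test might look as follows; it reuses the helper functions defined in the preceding sketches and assumes each stream is described by a record of precomputed parameters, which is consistent with the precomputation suggested in the text but not an interface defined there.

    def admit_new_client(streams, new_stream, p_service_exceeds, R, b, C, a, l_max):
        """Tentatively add new_stream, recompute q and F0, and test the criterion.

        Each stream is a dict with keys: f, p, eta_B, sigma_B, b_min, b_max, f_min.
        """
        candidate = streams + [new_stream]
        eta_B, sigma_B = block_count_stats([s["eta_B"] for s in candidate],
                                           [s["sigma_B"] for s in candidate])
        k_min = sum(s["b_min"] for s in candidate)     # Equation (4.8)
        k_max = sum(s["b_max"] for s in candidate)
        q = overflow_probability(k_min, k_max, p_service_exceeds,
                                 lambda k: prob_B_equals(k, eta_B, sigma_B))
        F0 = guaranteed_overflow_frames(R, b, C, a, l_max,
                                        [s["f_min"] for s in candidate])
        return satisfies_statistical_criterion(q, F0,
                                               [s["f"] for s in candidate],
                                               [s["p"] for s in candidate])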

4.3 Discussion
In addition to the deterministic algorithms (which provide strict performance guarantees by making worst-case assumptions regarding the performance requirements) and the statistical admission control algorithms (which utilize precise distributions of access times and playback rates), other admission control algorithms have been proposed in the literature. One such algorithm is the adaptive admission control algorithm proposed in [22, 23]. As per this algorithm, a new client is admitted for service only if the prediction from the status quo measurements of the server performance characteristics indicates that the service requirements of all the clients can be met satisfactorily. It is based on the assumption that the average amount of time spent for the retrieval of each media block (denoted by $\eta$) does not change significantly even after a new client is admitted by the server. In fact, to enable the multimedia server to accurately predict the amount of time expected to be spent retrieving media blocks during a future round, a history of the values of $\eta$ observed during the most recent $W$ rounds (referred to as the averaging window) may be maintained. If $\eta_{avg}$ and

$\sigma$ denote the average and the standard deviation of $\eta$ over the $W$ rounds, respectively, then the time required to retrieve a block in future rounds ($\hat{\eta}$) can be estimated as:

$$\hat{\eta} = \eta_{avg} + \epsilon * \sigma \qquad (4.11)$$

where $\epsilon$ is an empirically determined constant. Clearly, a positive value of $\epsilon$ enables the estimation process to take into account the second moment of the random variable $\eta$, and hence, make the estimate reasonably conservative. Thus, if $k_i$ and $\alpha_i$ denote the average number of blocks accessed during a round for stream $S_i$, and the percentage of frames of stream $S_i$ that must be retrieved on time so as to meet the requirements of client $i$, respectively, then the average number of blocks of stream $S_i$ that must be retrieved by the multimedia server during each round can be approximated by $k_i * \alpha_i$. Consequently, given the empirically estimated average access time of a media block from disk, the requirements of tolerant clients will not be violated if:

$$\hat{\eta} * \sum_{i=1}^{n} k_i * \alpha_i \leq \mathcal{R} \qquad (4.12)$$

This is referred to as the adaptive admission control criterion. Notice that since the estimation of the service time of a round is based on the measured characteristics of the current load on the server, rather than on theoretically derived values, the key function of such an admission control algorithm is to accept enough clients to efficiently utilize the server resources, while not accepting clients whose admission may lead to the violation of the service requirements.
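A minimal sketch of the adaptive criterion of Equations (4.11) and (4.12) follows, assuming a recorded history of per-block retrieval times from the last W rounds; the parameter names are illustrative.

    import statistics

    def adaptive_admission_ok(eta_history, epsilon, blocks_per_round, alpha, R):
        """Adaptive admission control check.

        eta_history      -- per-block retrieval times observed in the last W rounds
        epsilon          -- empirically chosen safety constant (Equation 4.11)
        blocks_per_round -- list of k_i, average blocks accessed per round per stream
        alpha            -- list of alpha_i, required on-time percentages per stream
        R                -- round duration
        """
        eta_hat = statistics.mean(eta_history) + epsilon * statistics.pstdev(eta_history)
        expected_service_time = eta_hat * sum(k * a for k, a in zip(blocks_per_round, alpha))
        return expected_service_time <= R          # Equation (4.12)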

5 COMMERCIAL VIDEO SERVERS


There has been significant work in developing multimedia servers for a wide
variety of commercial applications. These products range from low-end PC
based multimedia servers designed to serve small work groups to high-end large
scale servers that can serve thousands of video-on-demand users.

The low-end servers are targeted for a local area network environment and
their clients are personal computers, equipped with video-processing hardware,
connected on a LAN. They are designed for applications such as on-site training,
information kiosks, etc., and the multimedia files generally consist of short video
clips. An example of such a low-end server is the IBM LANServer Ultimedia
product, which can serve 40 clients at MPEG-1 rates [4]. Other systems in this
class include FluentLinks, ProtoComm, and Starworks [21]. As the computing

power of personal computers increases, the number of clients that these servers
can support will also increase.

High end servers are targeted for applications such as video-on-demand, in


which the number of simultaneous streams is expected to be in the 1000s,
and the distribution system is expected to be cable based, or telephone-wire
based. Since the distribution area is large, network connectivity is an impor-
tant aspect of these systems. In order to provide a large collection of videos
in a cost-effective solution, such servers employ a hierarchy of storage devices.
Additionally, admission control mechanisms are extended to the distribution
network, including allocation of bandwidth on the backbone network and TV
"channels" on the cable plant. Finally, in such servers, the control mecha-
nisms must also interact with large transaction processing systems to handle
bookkeeping operations such as authorization and customer billing.

High-end video servers are based on collections of powerful workstations (IBM,


DEC, Silicon Graphics, Oracle/NCube) or mainframe computers. For instance,
the SHARK multimedia server is implemented on IBM RS/6000, and uses its
own file system to ensure continuous throughput from the disk subsystem [9].
Microsoft's TIGER video server uses a collection of PCs to construct a scalable
server [1]. It uses striping to distribute segments of a movie across the collection
of servers to balance the access load across the servers. It also uses replication
at the segment level as a mechanism for fault-tolerance. Oracle's Media Server
is based on the NCube massively parallel computer. It exploits the large I/O
capability of the NCube and is slated to deliver approximately 25,000 video
streams.

6 CONCLUDING REMARKS
Multimedia storage servers differ from conventional storage servers to the extent
that significant changes in design must be effected. These changes are wide in
scope, influencing everything from the selection of storage hardware to the
choice of disk scheduling algorithms. This chapter provides an overview of
the problems involved in multimedia storage server design and of the various approaches to solving these problems.

REFERENCES
[1] Microsoft Unveils Video Software. AP News, May 17, 1994.
[2] Small Computer System Interface (SCSI-II). ANSI Draft Standard
X3T9.2/86-109, November 1991.
[3] D. Anderson, Y. Osawa, and R. Govindan. A File System for Continuous
Media. ACM Transactions on Computer Systems, 10(4):311-337, Novem-
ber 1992.
[4] M. Baugher et al. A multimedia client to the IBM LAN Server. ACM
Multimedia '93, pp. 105-112, August 1993.
[5] E. Chang and A. Zakhor. Scalable Video Placement on Parallel Disk Ar-
rays. In Proceedings of IS&T/SPIE International Symposium on Electronic
Imaging: Science and Technology, San Jose, February 1994.

[6] C. Federighi and L.A. Rowe. The Design and Implementation of the UCB
Distributed Video On-Demand System. In Proceedings of the IS&T/SPIE
1994 International Symposium on Electronic Imaging: Science and Tech-
nology, San Jose, pages 185-197, February 1994.

[7] H. Garcia-Molina and K. Salem. Disk Striping. International Conference


on Data Engineering, pages 336-342, February 1986.

[8] J. Gemmell and S. Christodoulakis. Principles of Delay Sensitive Mul-


timedia Data Storage and Retrieval. ACM Transactions on Information
Systems, 10(1):51-90, 1992.

[9] R. Haskin. The SHARK continuous media file server. Proc. CompCon,
pp. 12-15, 1993.
[10] K. Keeton and R. Katz. The Evaluation of Video Layout Strategies on
a High-Bandwidth File Server. In Proceedings of International Workshop
on Network and Operating System Support for Digital Audio and Video
(NOSSDAV'93), Lancaster, UK, November 1993.

[11] T.D.C. Little, G. Ahanger, R.J. Folz, J.F. Gibbon, F.W. Reeves, D.H.
Schelleng, and D. Venkatesh. A Digital On-Demand Video Service Sup-
porting Content-Based Queries. In Proceedings of the ACM Multimedia '93,
Anaheim, CA, pages 427-436. October 1993.

[12] M. K. McKusick, W. N. Joy, S. J. Leffler, and R. S. Fabry. A Fast File


System for UNIX. ACM Transactions on Computer Systems, 2(3):181-197,
August 1984.

[13] G. Miller, G. Baber, and M. Gilliland. News On-Demand for Multimedia


Networks. In Proceedings of ACM Multimedia '93, Anaheim, CA, pages
383-392, August 1993.
[14] T. Mori, K. Nishimura, H. Nakano, and Y. Ishibashi. Video-on-Demand
System using Optical Mass Storage System. Japanese Journal of Applied
Physics, 1(11B):5433-5438, November 1993.

[15] A. Papoulis. Probability, Random Variables, and Stochastic Processes.


McGraw Hill, 1991.
[16] P. Venkat Rangan and H. M. Vin. Designing File Systems for Digital Video
and Audio. In Proceedings of the 13th Symposium on Operating Systems
Principles (SOSP'91), Operating Systems Review, Vol. 25, No.5, pages
81-94, October 1991.
[17] A.L. Narasimha Reddy and J. Wyllie. Disk Scheduling in Multimedia
I/O System. In Proceedings of ACM Multimedia '93, Anaheim, CA, pages
225-234, August 1993.

[18] R. Van Renesse, A. Tanenbaum, and A. Wilschut. The Design of a High-


Performance File Server. IEEE Transactions on Knowledge and Data En-
gineering, 1(2):22-27, June 1989.

[19] L.A. Rowe, J. Boreczky, and C. Eads. Indexes for User Access to Large
Video Databases. In Proceedings of the IS&T/SPIE 1994 International
Symposium on Electronic Imaging: Science and Technology, San Jose,
pages 150-161, February 1994.
[20] T. Teorey and T. B. Pinkerton. A Comparative Analysis of Disk Scheduling
Policies. Communications of the ACM, 15(3):177-184, March 1972.
[21] F.A. Tobagi, J. Pang, R. Baird, and M. Gang. Streaming RAID: A Disk
Storage System for Video and Audio Files. In Proceedings of ACM Multi-
media '93, Anaheim, CA, pages 393-400, August 1993.

[22] H. M. Vin, A. Goyal, and P. Goyal. Algorithms for Designing Large-Scale


Multimedia Servers. Computer Communications, 18(3):192-203, March
1995.

[23] H. M. Vin, A. Goyal, A. Goyal, and P. Goyal. An Observation-Based


Admission Control Algorithm for Multimedia Servers. In Proceedings of
the IEEE International Conference on Multimedia Computing and Systems
(ICMCS'94), Boston, May 1994.

[24] H. M. Vin, P. Goyal, A. Goyal, and A. Goyal. A Statistical Admission Control Algorithm for Multimedia Servers. In Proceedings of the ACM Multimedia '94, San Francisco, October 1994.
[25] H. M. Vin and P. Venkat Rangan. Designing a Multi-User HDTV Storage
Server. IEEE Journal on Selected Areas in Communications, 11(1):153-
164, January 1993.
[26] H.M. Vin, S.S. Rao, and P. Goyal. Optimizing the Placement of Multi-
media Objects on Disk Arrays. In Proceedings of the Second IEEE Inter-
national Conference on Multimedia Computing and Systems, Washington,
D.C., pages 158-165, May 1995.
[27] P. Yu, M.S. Chen, and D.D. Kandlur. Design and Analysis of a Grouped
Sweeping Scheme for Multimedia Storage Management. Proceedings of
Third International Workshop on Network and Operating System Support
for Digital Audio and Video, San Diego, pages 38-49, November 1992.
5
MULTIMEDIA NETWORKS
Borko Furht* and Hari Kalva**
* Department of Computer Science and Engineering,
Florida Atlantic University, Boca Raton, Florida, U.S.A.

** Center for Telecommunications Research,


Columbia University, New York, U.S.A.

ABSTRACT
In a typical distributed multimedia application, multimedia data must be compressed, transmitted over the network to its destination, and decompressed and synchronized for playout at the receiving site. In addition, a multimedia information system must allow a user to retrieve, store, and manage a variety of data types including images, audio, and video. In this chapter we present fundamental concepts and techniques in the area of multimedia networks. We first analyze network requirements for transmitting multimedia data and evaluate traditional data communications versus multimedia communications. Then, we present traditional networks (such as Ethernet, token ring, FDDI, and ISDN) and how they can be adapted for multimedia applications. We also describe the ATM network, which is well suited for transferring multimedia data. Finally, we discuss the network architectures for current and future information superhighways.

1 INTRODUCTION
In today's communication market, there are two distinct types of networks: local-area networks (LANs) and wide-area networks (WANs). LANs run on a premises and interconnect desktop and server resources, while WANs are generally supported by public carrier services or leased private lines which link geographically separate computing system elements. Figure 1 illustrates network evolution from the 1980s to the present as a function of transmission speed.

Many multimedia applications, such as on-demand multimedia services, video-


conferencing, collaborative work systems, and video mail require networked

Figure 1 Network evolution and typical applications. [Plot of transmission speed (bits/s, from 1 Kbps to 1 Gbps) versus year (1980-2000), indicating Ethernet, private networks, and public networks.]

multimedia [FM95]. In these applications, multimedia objects are stored at a server and played back at the clients' sites. Such applications might require broadcasting multimedia data to various remote locations or accessing large repositories of multimedia sources. Required transmission rates for various types of media (data, text, graphics, images, video, and audio) are shown in Table 1 [Roy94].

Traditional LAN environments, in which data sources are locally available, cannot support access to remote multimedia data sources for a number of reasons. Table 2 contrasts traditional data transfer and multimedia transfer [Fur94].

Multimedia networks require a very high transfer rate or bandwidth even when the data is compressed. For example, an MPEG-1 session requires a bandwidth

INFORMATION TYPE   BIT RATE                        QUALITY AND REMARKS
DATA               Wide range of bit rates         Continuous, burst and packet-oriented data
TEXT               Several Kbps                    Higher bit rates for downloading of large volumes
GRAPHICS           Relatively low bit rates        Depending on transfer time required
                   Higher bit rates (100 Mbps      Exchange of complex 3D computer models
                   and higher)
IMAGE              64 Kbps                         Group-4 telefax
                   Various                         Corresponds to JPEG standard
                   Up to 30 Mbps                   High-quality professional images
VIDEO              64-128 Kbps                     Video telephony (H.261)
                   384 Kbps - 2 Mbps               Videoconferencing (H.261)
                   1.5 Mbps                        MPEG-1
                   5-10 Mbps                       TV quality (MPEG-2)
                   34/45 Mbps                      TV distribution
                   50 Mbps or less                 HDTV quality
                   100 Mbps or more                Studio-to-studio HDTV video downloading
AUDIO              n x 64 Kbps                     3.1 kHz, or 7.5 kHz, or hi-fi baseband signals

Table 1 Typical transmission rates of various information types in multimedia communication.

of about 1.5 Mbps, while an MPEG-2 session including HDTV takes 3 to 40 Mbps. Besides being high, the transfer rate must also be predictable.

The traffic pattern of multimedia data transfer is stream-oriented, typically highly bursty, and the network load is long and continuous. Figure 2 shows the ranges of the maximum bit-rate and the utilization of a channel at this rate for some service categories [WK90]. Multimedia networks carry a heterogeneous mix of traffic, which could range from narrowband to broadband, and from continuous to bursty.

Traditional networks are used to provide error-free transmission. However,


most multimedia applications can tolerate errors in transmission due to corrup-

CHARACTERISTICS            DATA TRANSFER      MULTIMEDIA TRANSFER
DATA RATE                  low                high
TRAFFIC PATTERN            bursty             stream-oriented, highly bursty
RELIABILITY REQUIREMENTS   no loss            some loss
LATENCY REQUIREMENTS       none               low, e.g. 20 msec
MODE OF COMMUNICATION      point-to-point     multipoint
TEMPORAL RELATIONSHIP      none               synchronized

Table 2 Traditional communications versus multimedia communications.

tion or packet loss without retransmission or correction. In some cases, to meet real-time delivery requirements or to achieve synchronization, some packets are even discarded. As a result, we can apply lightweight transmission protocols to multimedia networks. These protocols do not perform retransmission, since that might introduce unacceptable delays.

Multimedia networks must provide the low latency required for interactive operation. Since multimedia data must be synchronized when it arrives at the destination site, networks should provide synchronized transmission with low jitter.

In multimedia networks, most communications are multipoint, as opposed to traditional point-to-point communication. For example, conferences involving more than two participants need to distribute information in different media to each participant. Conference networks use multicasting and bridging distribution methods. Multicasting replicates a single input signal and delivers it to multiple destinations. Bridging combines multiple input signals into one or more output signals, which are then delivered to the participants [AE92].

Figure 2 The characteristics of multimedia traffic. [Plot of channel utilization (from 0.001 up to continuous) versus peak source bit rate (1 Kbps to 100 Mbps) for various service categories.]

2 TRADITIONAL NETWORKS AND


MULTIMEDIA
In this section we describe traditional networks, such as Ethernet, token ring, Fiber Distributed Data Interface (FDDI), and Integrated Services Digital Network (ISDN), and their suitability for transferring multimedia data. Improved traditional networks, such as isochronous, switched, fast, and priority Ethernets, and priority token ring, are also presented. These are modified traditional networks whose main purpose is to provide transfer of multimedia data.

2.1 Ethernet
Ethernet (IEEE 802.3 10Base-T standard) is a local-area network running at 10 Mbps. In its original form, Ethernet uses heavy coaxial cable that forms a single bus or open path to which all stations are connected; in 10Base-T, stations connect to a central hub, as illustrated in Figure 3.

Figure 3 Topology of Ethernet. [Stations connected to a central hub.]

Because stations on Ethernet all share the same medium, each station listens before sending data, and during sending, in order to prevent interference with other stations. Each station hears the traffic that is broadcast on the network and copies the data that is addressed to it. If another signal is heard, the station will either delay its sending, or stop it if sending is already in progress. This strategy is called "Carrier Sense Multiple Access with Collision Detection", or CSMA/CD.

Ethernet is probabilistic; i.e., there is only a high probability that a data packet will be delivered to its destination. Therefore, the end-to-end delay is not deterministic, the access time is not bounded, and latency and jitter are unpredictable. Because of these characteristics, Ethernet is not suitable for transmission of multimedia data. The capability of Ethernet to carry video, audio, and data traffic was analyzed in [DCT94]. Ethernet was evaluated based on the number of streams that can be supported given the stream rate, stream delay and loss constraints, data traffic load, and data burst size.

Isochronous Ethernet
Isochronous Ethernet (IEEE 802.9 standard) is an extension of traditional Ethernet capable of supporting multimedia applications. In addition to the 10 Mbps packet mode service, isochronous Ethernet offers a 6.144 Mbps circuit switched isochronous data service. Each node thus has two channels: a 6 Mbps circuit switched channel and a 10 Mbps packet switched channel. The circuit switched channel is suitable for multimedia applications, because it exhibits minimal jitter and latency. Even though physical resources are shared, this is equivalent to two networks, one circuit-switched and another packet-switched.

Switched Ethernet
Switched Ethernet is a modification of traditional Ethernet, in which stations are connected to the network medium through a communication hub instead of a conventional tapped connection. Each station is connected to the hub by means of an Ethernet transmission line. The hub reads the destination address of the packets on an incoming line and switches them to the appropriate outgoing line. The hub also contains a buffer for the case when the outgoing line is not free. This arrangement is equivalent to having a dedicated 10 Mbps connection per station. Since the connections are switched, each station can communicate as if it were the only station on the network. In such a way, switched Ethernet provides the bandwidth needed for multimedia applications.

Fast Ethernet
Another approach to making Ethernet more suitable for multimedia traffic is to increase its bandwidth to 100 Mbps (IEEE 802.3 100Base-T standard). Fast Ethernet uses the same CSMA/CD access protocol, and therefore has the same drawbacks as standard 10 Mbps Ethernet: unpredictable transmission delay and latency. However, due to the improved bandwidth, fast Ethernet can support a much larger number of multimedia streams than standard Ethernet, and therefore can be used for small or medium configurations of multimedia stations [Stu95].

Priority Ethernet
Providing higher priorities to stations transferring multimedia data is another way of adapting Ethernet for multimedia applications [AS94]. In the priority Ethernet network, nodes are assigned priorities and each node maintains a

priority queue. The priority queue consists of node names and the priority of
data it has to transmit. The priority information is exchanged during the pri-
ority exchange phase. There are three ways of exchanging priority information:
synchronous, modified synchronous, and asynchronous.

2.2 Token Ring


Token ring (IEEE 802.5 standard) is a LAN with a ring topology and a data rate of 16 Mbps. The advantages of a ring structure include small delays, simple implementation, absence of collisions, and a deterministic worst case delay. In a token ring, access to the transmission medium is controlled by a token. Only a station that has the token can transmit data. When the network is activated, a free token is circulated around the ring. A station ready to transmit reads the token, makes it busy, and retransmits it as a part of the header of the data. When transmission is completed, the station regenerates an idle token. The other stations on the ring read the busy token and refrain from transmitting, and therefore there is no chance of packet collision. Each station has an active interface to the ring and a latency associated with it. The network latency increases with the number of stations on the network.

Token ring is deterministic: the worst case network delay can be determined. If there are n stations connected to a ring, the worst case delay is equal to (n - 1) x longest_packet_transmission_time.
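As a quick illustration of this bound, with purely illustrative numbers, the following sketch computes the worst-case access delay for a ring.

    def token_ring_worst_case_delay(num_stations, longest_packet_time_ms):
        """Worst-case delay before a station may transmit: every other
        station transmits its longest packet first."""
        return (num_stations - 1) * longest_packet_time_ms

    # Example: 32 stations, 1 ms longest packet time -> 31 ms worst-case delay
    print(token_ring_worst_case_delay(32, 1.0))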

Since the delay is deterministic, token ring can be used for multimedia applications. Due to its relatively low bandwidth (16 Mbps), one way of adapting token ring for multimedia applications is to use configuration control. In this method, a restriction is placed on the number of stations that can be connected to a ring. For example, a 16 Mbps ring can support up to 8 stations with an average bandwidth of 2 Mbps per station. This solution is not practical, since no more than 8 multimedia stations can be connected to a ring.

Priority Token Ring


In a priority token ring (PTR) network, the multimedia traffic is separated from regular traffic by priority, as illustrated in Figure 4. The bandwidth manager plays a crucial role by tracking sessions, determining the priority ratio, and registering multimedia sessions. Priority token ring supports 8 levels of priority: priority levels 0-3 are used for user traffic, 4 is used for bridge and router traffic, and 5 and 6 are reserved for multimedia data (5 for video and 6

for audio). Priority token ring works on existing networks and does not require
configuration control [FM95].

Figure 4 Priority token ring can be used in multimedia applications. Five stations with the highest priorities are performing multimedia data transfer, while the remaining 27 stations are performing regular data transfer. [The figure shows a server and 32 stations on a priority token ring.]

The admission control in PTR guarantees bandwidth to multimedia sessions; however, regular traffic can experience delays. For example, assume a priority token ring network at 16 Mbps that connects 32 nodes (Figure 4). When no priority scheme is set, each node gets an average of 0.5 Mbps of bandwidth. When half the bandwidth (8 Mbps) is dedicated to multimedia, the network can handle about 5 MPEG-1 sessions (at 1.5 Mbps each). In this case, the remaining 27 nodes can expect about 8 Mbps divided by 27, or 0.296 Mbps, about half of what they would get without priority enabled.
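The bandwidth split in this example can be reproduced with a few lines of arithmetic; the numbers below simply restate the scenario in the text.

    def ptr_bandwidth_split(ring_mbps=16.0, nodes=32, multimedia_share=0.5,
                            session_rate_mbps=1.5, multimedia_nodes=5):
        """Back-of-the-envelope split of a priority token ring's bandwidth."""
        fair_share = ring_mbps / nodes                       # 0.5 Mbps per node without priorities
        multimedia_bw = ring_mbps * multimedia_share         # 8 Mbps reserved for multimedia
        sessions = int(multimedia_bw // session_rate_mbps)   # about 5 MPEG-1 sessions
        remaining = (ring_mbps - multimedia_bw) / (nodes - multimedia_nodes)  # ~0.296 Mbps
        return fair_share, sessions, remaining

    print(ptr_bandwidth_split())   # (0.5, 5, 0.2962962962962963)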

Crimmins [Cri93] evaluated three priority ring schemes for their applicability to
multimedia applications: (1) equal priority for video and asynchronous packets,
(2) permanent high priority for video packets and permanent low priority for
asynchronous packets, and (3) time-adjusted high-priority for video packets
(based on their ages) and permanent low priority for asynchronous packets.

The first scheme, which entails direct competition between video conference
and asynchronous stations, achieves the lowest network delay for asynchronous
traffic. However, it reduces the video conference quality. The second scheme,
154 CHAPTER 5

in which video conference stations have the permanent high priority, produces
no degradation in conference quality, but increases the asynchronous network
delay. Finally, the time-adjusted priority system provides a trade-off between the first two schemes. The quality of this scheme is better than that of the first scheme, while the asynchronous network delays are shorter than in the second scheme [Cri93].

2.3 FDDI
The Fiber Distributed Data Interface (ANSI X3T9.5 standard) provides 100
Mbps bandwidth, which may be sufficient for multimedia applications [Ros86,
Jai93]. The FDDI topology consists of two counter rotating independent fiber
optic rings, as shown in Figure 5. The ring can have a perimeter up to 100 km.
Stations can be connected to both rings (class A stations), or only to one ring
(class B stations).

Figure 5 Topology of FDDI network consisting of two rings. [Class A stations attach to both rings; class B stations attach to only one ring.]

FDDI supports both synchronous and asynchronous modes of traffic. Synchronous traffic consists of delay-sensitive traffic (such as voice packets), which needs to be transmitted at regular time intervals. Asynchronous traffic consists of data packets produced by various computer communication applications,

such as the File Transfer Protocol and mail. One of the main characteristics of FDDI is its fault recovery mechanism: when a component in the network fails, other components can reorganize and continue to work.

In the synchronous mode, FDDI has low access latency and low jitter. FDDI also guarantees a bounded access delay and a predictable average bandwidth for synchronous traffic. However, due to the high cost, FDDI networks are used primarily as backbone networks, rather than as networks of workstations.

FDDI cannot support isochronous traffic, which is highly desirable in interactive multimedia applications. Isochronous traffic allows a fixed number of packets of data to be delivered in a fixed time interval. To support isochronous traffic, the basic FDDI has been extended to FDDI-II. In addition to isochronous traffic, FDDI-II also supports synchronous and asynchronous traffic. The capabilities of FDDI-II in handling multimedia traffic are evaluated in [Kri90].

2.4 ISDN
Integrated Services Digital Network (ISDN) is an access and signaling expansion of the basic technology of the public switched telephone network, with the main purpose of supporting non-voice communications [Lea88, Tha93]. The local loop connection between subscriber and switch is made in digital form, with multiple multiplexed information channels supported per access line.

Present optical network technology can support the Broadband Integrated Services Digital Network (B-ISDN) standard, expected to become the key network for multimedia applications [Cla92, KJ91, Sak93]. B-ISDN access can be basic or primary. Basic ISDN access supports 2B + D channels, where the transfer rate of a B channel is 64 Kbps and that of a D channel is 16 Kbps. Primary ISDN access supports 23B + D channels in the US (1.544 Mbps), and 30B + D channels in Europe (2.048 Mbps).

The two B channels of the ISDN basic access provide 2 x 64 Kbps, or 128 Kbps, of composite bandwidth. Three types of connections can be set up over a B channel: (a) circuit switched, (b) packet switched, and (c) semipermanent. The semipermanent connection is set up by prior arrangement and is equivalent to a leased line. The D channel is used for common channel signaling.

ISDN is well suited for high-rate applications, which include both data applications and videoconferencing. Videoconferencing applications

can use part of ISDN capacity for wideband speech, saving the remainder for
purposes such as control, meeting data, and compressed video. Figure 6 shows
the composition of two B channels for multimedia applications [Cla92].

Figure 6 Composition of two B channels of the ISDN network for multimedia applications. [The figure shows several allocations of the 2 x 64 Kbps capacity among audio (48-56 Kbps), H.261 video (62.4 Kbps), control, and low- or high-speed data.]

3 ASYNCHRONOUS TRANSFER MODE


Asynchronous Transfer Mode (ATM) [Bou92, SVH91, KMF94, DCH94] is a packet-oriented transport mechanism proposed independently by Bellcore and several telecommunication companies in Europe. ATM is the transfer mode recommended for implementing B-ISDN. ATM is considered the network of the future and is meant to support applications with varying data rates.

The ATM network provides the following benefits for multimedia communications:

• It can carry all kinds of traffic, and

• It can operate at very high speeds.

The ATM network can carry integrated traffic because it uses small fixed-size cells, while traditional networks use variable-length packets, which can be several KB in size. The ATM network uses a connection-oriented technology, which means that before data traffic can occur between two points, a connection needs to be established between these end points using a signaling protocol. An ATM network architecture is shown in Figure 7.

Figure 7 The architecture of an ATM network consists of a set of terminal nodes and a set of intermediate nodes (switches). [Multimedia stations (MS) attach to the network across the User-to-Network Interface (UNI); switches interconnect across the Network-to-Network Interface (NNI).]

The two major types of interfaces in ATM networks are the User-to-Network Interface (UNI) and the Network-to-Network Interface, or Network-to-Node Interface (NNI).

An ATM network comprises a set of terminals and a set of intermediate nodes (switches), all linked by a set of point-to-point ATM links, as illustrated in Figure 7. The ATM standard defines the protocols needed to connect the terminals and the nodes; however, it does not specify how the switches are to be implemented.

3.1 ATM Cells

The basic transport unit in an ATM network is the cell. Cells are fixed-length packets of 53 bytes. Each cell has a 5-byte header and a 48-byte payload, as shown in Figure 8a. The header consists of information necessary for routing, but it does not contain a complete destination address. The cells are switched by a switching node using routing tables which are set up when the network is initiated. The header consists of several fields, as shown in Figure 8b.

Figure 8 Components of (a) ATM cell, (b) ATM header. [An ATM cell consists of a 5-byte header followed by a 48-byte payload. The header bytes carry: byte 0, GFC and VPI; byte 1, VPI and VCI; byte 2, VCI; byte 3, VCI, PT, and CLP; byte 4, HEC.]

The Generic Flow Control (GFC) field is used for congestion control at the User-to-Network Interface to avoid overloading. The Virtual Path Identifier/Virtual Channel Identifier (VPI/VCI) fields contain the routing information of the cell. The Payload Type (PT) field represents the type of information carried by the cell. The CLP field indicates the cell loss priority, i.e., whether a cell can be dropped or not in case of congestion. The Header Error Control (HEC) field is used to detect and correct errors in the header. ATM does not have an error correction mechanism for the payload.
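To make the header layout concrete, the sketch below unpacks the five header bytes of a UNI cell into its fields, following the byte layout summarized for Figure 8; the function name and return format are illustrative.

    def parse_atm_uni_header(hdr: bytes) -> dict:
        """Decode the 5-byte ATM UNI cell header into its fields."""
        assert len(hdr) == 5, "ATM cell header is exactly 5 bytes"
        gfc = hdr[0] >> 4                                              # 4-bit Generic Flow Control
        vpi = ((hdr[0] & 0x0F) << 4) | (hdr[1] >> 4)                   # 8-bit Virtual Path Identifier
        vci = ((hdr[1] & 0x0F) << 12) | (hdr[2] << 4) | (hdr[3] >> 4)  # 16-bit Virtual Channel Identifier
        pt = (hdr[3] >> 1) & 0x07                                      # 3-bit Payload Type
        clp = hdr[3] & 0x01                                            # 1-bit Cell Loss Priority
        hec = hdr[4]                                                   # 8-bit Header Error Control
        return {"GFC": gfc, "VPI": vpi, "VCI": vci, "PT": pt, "CLP": clp, "HEC": hec}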

3.2 ATM Connections

In an ATM network, a connection has to be established between two end points before data can be transmitted. An end terminal requests a connection to another end terminal by transmitting a signaling request across the UNI to the network. This request is passed across the network to the destination. If the destination agrees to form a connection, a virtual circuit is set up across the ATM network between these two end points.

These connections are made of Virtual Channel (VC) and Virtual Path (VP) connections. These connections can be either point-to-point or point-to-multipoint. The basic type of connection in an ATM network is the VC. It is a logical connection between two switching points. A VP is a group of VCs with the same VPI value. The VP and VC switches are shown in Figure 9.

Figure 9 Virtual connections in ATM: (a) Virtual Path (VP) switch, and (b) Virtual Channel (VC) switch.

Let us consider an example of setting up a VC connection between two nodes N1 and N2, as shown in Figure 10. The process of establishing the VC connection consists of the following steps:

• The end terminal N1 sends the request to the UNI.

• The UNI forwards the request to the network.

• Assume that the network selects the following route A-B-C-D. Each of the four nodes will use an unused VC value for the connection. For example, the intermediate node A chooses VC1, and therefore all the cells from N1 will have the label VC1 at the output from A.

• The node A sends the cell to the node B. B will change VC1 to VC2 and will send the cell to the node C.

• At the node C, VC2 is associated with VC3 and the cell is sent to the node D.

• At the node D, VC3 is associated with VC4. The node D checks if the UNI at the terminal node N2 is free. If the UNI is free, the cell with the label VC4 is given to N2.

• The terminal node N2 now uses VC4 for its connection to the node D.

• D sends this cell to C, which associates VC4 with VC3, and sends it to B. B associates VC3 with VC2 and sends it to A. A associates VC2 with VC1 and delivers the cell to the terminal node N1.

• The connection between terminal nodes N1 and N2 is thus established.

Figure 10 ATM connection between two terminal nodes N1 and N2.
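A toy simulation of this label swapping is sketched below; the per-switch translation tables mirror the example above (VC1 to VC2 to VC3 to VC4 along the route A-B-C-D), and all names are illustrative.

    def forward_cell(path_tables, initial_vci):
        """Follow a cell's VCI as each switch swaps the label (as in Figure 10)."""
        vci = initial_vci
        trace = [vci]
        for switch_name, table in path_tables:
            vci = table[vci]          # label swap performed by this switch
            trace.append(vci)
        return trace

    # Node A assigns VC1 on its outgoing link; B, C, and D then swap labels.
    path = [("B", {1: 2}), ("C", {2: 3}), ("D", {3: 4})]
    print(forward_cell(path, initial_vci=1))   # [1, 2, 3, 4]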

When the transmission of data is completed, N1 sends a message to tear down the connection, and the VCI/VPI values are free to be used by any other connection. Once the connection is established, all the cells travel over the same virtual channel connection. The cells within a VC are not switched out of order and no retransmissions are done in ATM. This guarantees that all the cells are delivered in sequence to the destination. But, because of buffering at the source, the end-to-end delay is not predictable. ATM also does not guarantee that a cell will be delivered to the destination. This is because a cell could be dropped in the case of congestion.

During the call setup phase, a user negotiates network resources such as the peak data rate and parameters such as cell loss rate, cell delay, cell delay variation, etc. These parameters are called the Quality of Service (QOS) parameters. A connection is established only if the network guarantees to support the QOS requested by the user. A resource allocation mechanism for establishing connections is shown in Figure 11 [SVH91]. Once the QOS parameters for a connection are set up, both the user and the network adhere to the agreement.

Figure 11 A resource allocation mechanism for establishing connections [SVH91]. [The figure shows a call request passing through a call-level network load control phase (with possible call rejection and alternate paths), a connection-level link allocation phase, and a congestion control phase before connection establishment.]

Even when every user/terminal adheres to the QOS parameters, congestion may occur. The main cause of congestion is statistical multiplexing of bursty connections. Two modes of traffic operation are defined for ATM: (a) statistical and (b) non-statistical multiplexing [WK90]. In the general ATM node model shown in Figure 12, if the sum of the peak rates of the input links does not exceed the output link rate ($\sum P_i \leq L$), the mode of operation is called non-statistical multiplexing; if $\sum P_i$ exceeds $L$, it is called statistical multiplexing. During connection setup, each connection is allocated the average data rate of that channel instead of the peak data rate. Several such channels are multiplexed in the expectation that all the channels will not burst at the same time. If several connections of a channel burst simultaneously, congestion might occur.

Figure 12 General ATM node model. [N input ports with peak rates P1, ..., PN feed a node whose output link rate is L.]
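A trivial sketch of this classification follows; it simply compares the sum of the input peak rates against the output link rate, with illustrative values in Mbps.

    def multiplexing_mode(peak_rates_mbps, link_rate_mbps):
        """Classify an ATM node's operation per the model of Figure 12."""
        if sum(peak_rates_mbps) <= link_rate_mbps:
            return "non-statistical multiplexing"
        return "statistical multiplexing"

    print(multiplexing_mode([40, 60, 30], 155))   # non-statistical multiplexing
    print(multiplexing_mode([80, 90, 70], 155))   # statistical multiplexing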

3.3 ATM Protocols

The ATM protocol reference model, shown in Figure 13, is similar to the OSI layered model [DCH94]. Communication from higher layers occurs through three layers: the ATM layer, the Physical layer, and the ATM adaptation layer (AAL). The portion of the layered architecture used for end-to-end or user-to-user data transfer is known as the User Plane (U-Plane). Similarly, higher layer protocols are defined across the ATM layers to support switching; this is referred to as the Control Plane (C-Plane). The control of a node is performed by a Management Plane (M-Plane), which is further divided into Layer Management, which manages each of the ATM layers, and Plane Management, for the management of all the other planes.

Figure 13 The structure of the ATM protocol. [The reference model stacks the Physical layer, the ATM layer, the ATM adaptation layer, and the higher layers, cut across by the User Plane, the Control Plane, and the Management Plane.]

ATM Adaptation Layer

The ATM Adaptation Layer (AAL) protocols generate the traffic that is carried in the ATM cells. The AAL layers provide a service to the higher layers that corresponds to the four classes of traffic. These four classes are:

• AAL 1 - Constant bit rate traffic, connection oriented, where there is a strong timing relation between the source and the destination. Examples of such traffic include constant bit rate video, PCM encoded voice traffic, and emulation of T-carrier public network circuits [DCH94].

• AAL 2 - Connection oriented, where a strong timing relation between source and destination is required, but where the bit rate may be variable. An example of such traffic is variable bit rate video.

• AAL 3/4 - Connectionless, variable data rate, without a timing relationship between source and destination. An example of such traffic is the connectionless packet data carried by LANs.

• AAL 5 - Connection oriented, variable data rate, where there is no timing relationship between source and destination. Examples include X.25, Frame Relay, and TCP/IP.

There are five AAL protocol layers, each of which was designed to optimally
carry one of the four classes of traffic, as illustrated in Figure 14.

Figure 14 Classes of traffic and associated AAL layers. [The classes are distinguished by the timing relation between source and destination (required or not required), the bit rate (constant or variable), and the connection mode (connection oriented or connectionless); they map to AAL types 1, 2, 5, and 3/4, respectively.]

3.4 SONET

The Synchronous Optical Network (SONET) is often associated with ATM. SONET is a set of physical layers, originally proposed by Bellcore in the mid-1980s as a standard for optical fiber based transmission line equipment for the telephone system. SONET defines a set of framing standards which dictate how data is transmitted across links, together with ways of multiplexing existing transmission line frames (e.g., DS-1, DS-3) into SONET.

The lowest SONET frame rate, known as STS-1, defines an 8 KHz frame consisting of 9 rows of 90 bytes each. Its rate is 51.84 Mbps. The next highest rate is STS-3, with a 155.52 Mbps rate, and so on. STS-24 gives a rate of 1.244 Gbps. Table 3 shows various SONET frame rates.
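The STS-1 rate follows directly from the frame geometry, as the short sketch below shows; higher STS-N rates are simple multiples.

    def sts_rate_mbps(n, rows=9, bytes_per_row=90, frames_per_second=8000):
        """STS-N line rate: N * (9 rows * 90 bytes * 8 bits * 8000 frames/s)."""
        sts1_bits_per_second = rows * bytes_per_row * 8 * frames_per_second
        return n * sts1_bits_per_second / 1e6

    print(sts_rate_mbps(1))    # 51.84 Mbps
    print(sts_rate_mbps(3))    # 155.52 Mbps
    print(sts_rate_mbps(24))   # 1244.16 Mbps, i.e. about 1.244 Gbps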

Data Rate        European Interface   US Interface   Optical
51.84 Mbps       -                    STS-1          OC-1
155.52 Mbps      STM-1                STS-3          OC-3
622.08 Mbps      STM-4                STS-12         OC-12
1.244 Gbps       STM-12               STS-24         OC-24
2.488 Gbps       STM-24               STS-48         OC-48
N x 51.84 Mbps   STM-M                STS-N          OC-N

Table 3 SONET hierarchy.

Corresponding to each frame rate are optical fiber medium standards. STS-1 corresponds to the OC-1 fiber standard, STS-3 corresponds to OC-3, and so on (see Table 3). These standards define fiber types and optical power levels. The SONET standards were designed to scale to the very high speeds required for ATM.

4 SUMMARY OF NETWORK
CHARACTERISTICS
Table 4, adapted from [Stu95], evaluates key characteristics of the networks described in Sections 2 and 3. It can be concluded that several networks provide support for multimedia traffic; however, the ATM network is superior due to its high bandwidth and low transmission delay.

In the next section, we compare the ATM technology with several other switching technologies (such as STM, SMDS, and Frame Relay) for multimedia applications.

NETWORK                BANDWIDTH   DEDICATED    TRANSMISSION              BROADCAST    SUITABILITY FOR
                       [Mbps]      or SHARED    DELAY                     CAPABILITY   MULTIMEDIA
Ethernet               10          Shared       Unpredictable             Yes          Low
Isochronous Ethernet   10 + 6      Shared       Fixed < 1 ms              No           Average
Switched Ethernet      10          Dedicated    Predictable               Yes          Average
Fast Ethernet          100         Shared       Unpredictable             Yes          Average
Priority Ethernet      10          Shared       Predictable (priorities)  No           Average
Token Ring             16          Shared       Predictable (high)        Yes          No
Priority Token Ring    16          Shared       Predictable (priorities)  No           Average (low)
FDDI                   2 x 100     Shared       Configuration dependent   Yes          Average (high cost)
FDDI-II                2 x 100     Shared       Fixed < 1 ms              Yes          Average (not available)
ISDN                   n x 0.064   Dedicated    Fixed < 10 ms             Yes          Good (videoconf.)
ATM                    25-1000     Dedicated    Fixed < 10 ms             Yes          Excellent

Table 4 Summary of network characteristics. Adapted from [Stu95].

5 COMPARISON OF SWITCHING
TECHNOLOGIES FOR MULTIMEDIA
COMMUNICATIONS
Besides the ATM technology, other switching technologies considered for mul-
timedia communications include:

• Synchronous Transfer Mode (STM),

• Switched Multi-Megabit Data Service (SMDS), also known as IEEE 802.6, and

• Frame Relay.

STM is a circuit switched network mechanism used to transport packetized voice and data over networks. When a connection is established using STM, the bandwidth allocated for the connection is reserved for the entire duration of the connection, even though there may not be data transmission over the entire duration. This is a significant waste of bandwidth which could otherwise have been used by a terminal node waiting for network bandwidth, so this mode limits the maximum number of connections that can be established simultaneously. The ATM network overcomes this limitation by using statistical multiplexing with hardware switching.

SMDS uses a telecommunication network to connect LANs into Metropolitan


Area Networks (MANs) and Wide-Area Networks (WANs).

Frame Relay is a network similar to packet switching networks, which consists of multiple virtual circuits on a single access line. It operates at higher data rates than most packet networks, and also gives lower propagation delays.

Table 5 compares the properties of these four switching technologies for multimedia communications [RKK94]. The properties compared include support for multimedia traffic, connectivity, performance guarantee, bandwidth, support for media synchronization, and congestion control.

According to Table 5, it is clear that ATM is the superior switching technology


for multimedia communications. ATM is capable of supporting multimedia
traffic, it provides various types of connectivity, it guarantees performance, its
bandwidth is very high (up to several Gbps), its end-to-end delay is relatively
low, and it guarantees media synchronization.

6 INFORMATION SUPERHIGHWAYS
Information superhighways are large consumer information networks that al-
low millions of users to communicate, interact among themselves, and receive
various services. Today there are very few networks that can be classified as information superhighways: the plain old telephone network, the worldwide Internet, and Teletel, the French videotext network.

PARAMETERS                           ATM             STM             SMDS (802.6)    FRAME RELAY
Support of multimedia  Data          Yes             Yes             Yes             Yes
traffic transfer       Audio         Yes             Yes             -               -
                       Video         Yes             Yes             -               -
Connectivity           One-to-many   Yes             Limited         Yes             Yes
                       Many-to-one   Yes             Limited         Yes             Yes
                       Many-to-many  Yes             -               Yes             Yes
Performance            Delay for     Yes             Yes             -               -
guarantee              audio/video
                       Packet loss   Yes             Not applicable  Yes             Yes
                       for data
Bandwidth                            1.5 Mb/s to     Few bits/s to   1.5 Mb/s to     56 Kb/s to
                                     multi Gb/s      multi Gb/s      45 Mb/s         1.5 Mb/s
Guarantee of media synchronization   Yes             Yes             -               -
End-to-end delay                     Low             Lower than ATM  Higher than ATM Higher than ATM
Congestion control                   Required        Not required    Required        Required

Table 5 Comparison of switching technologies for multimedia communications. Adapted from [RKK94].

6.1 Internet
Internet is a loose connection of thousands of networks, and the estimated number of Internet users today is about 30 million. Internet was developed by researchers, and there is no global network administration. The following devices provide the internetworking: repeaters, bridges, routers, and gateways.

Repeaters provide the cheapest and simplest solution for interconnecting LANs. A repeater links identical LANs, for example two Ethernets, by simply amplifying the signal received on one cable segment and then retransmitting the same signal to another cable segment.

Bridges provide a more intelligent connection solution. A bridge makes two or more LANs behave as one LAN. Media access control is provided separately for each port to provide decoupling from the physical layers.

Routers, like bridges, can effectively extend the size of a network. They provide an even more intelligent solution than bridges. Routers can be connected to each other via private leased lines, or can be connected to a switch (e.g., an SMDS switch). The Internet routers serve as intermediate store-and-forward devices which relay messages from source to destination.

Gateways provide the most intelligent, but slowest, connection service. They provide translation services between different computer protocols, such as SNA, TCP/IP, and DECnet. Gateways allow devices on a network to communicate with devices on a dissimilar network.

Figure 15 shows how six LANs (both Ethernet and token ring networks) can be linked together by five routers, and how three Ethernet LANs can be connected by two bridges.

A major cultural change began in 1992 with the advent of the World Wide Web, so today Internet can be regarded as the major information superhighway. There are about 6.6 million Internet host computers in 106 countries, with 100 million Internet hosts projected by the end of this decade. Internet is growing at an average rate of over 40 percent in most regions. However, Internet has many drawbacks, including poor support for video transmission. Due to high load, Internet often experiences slowed performance. Users trying to access sites on the Web and other services are often frustrated by long connect times or lost data.

In order to overcome these drawbacks, which are caused by the tremendous increase in users and traffic sent over Internet, the Internet backbones must be upgraded. Internet could be improved by using larger and more powerful routers and switches, as well as fiber-optic lines to link them. Owners of networks connected to Internet must cooperate to get their respective systems to communicate smoothly. That gets more difficult as more networks link together.

Figure 15 An example of internetworking using routers and bridges. Six LANs are connected by five routers and three Ethernets are connected by two bridges. Routers and bridges create a network of networks by providing access to previously unavailable devices or services. Adapted from [DCH94].

6.2 Full Service Network


Realizing that the future is in multimedia and information superhighways, cable and telephone companies have entered the battle of building information superhighways capable of transporting video in real time. Many experimental video-on-demand and interactive TV systems have recently been built in the U.S.A., Canada, Europe, and Asia [F+95, B+95]. These systems use current cable and telephone networks; however, new technologies have been developed to enable video transmission and user interactivity. The new access technologies, such as ADSL (Asymmetric Digital Subscriber Line)

[KMS95], HFC (Hybrid Fiber-Coax) [Paf95], and Fiber-to-the-Curb (FTTC), provide the high bandwidth needed for delivery of interactive broadband services over existing twisted-pair copper lines or coaxial cable.

Today, thousands of households are already receiving (experimental) video-on-demand, interactive games, electronic newspapers, and shopping services. With the tremendous growth of wide area networks, such as ATM, this number will soon reach millions. Then, these networks will also become information superhighways. At that time, Internet will experience very strong competition, and perhaps its importance will be significantly reduced. We envision that Internet will then become one of the services on these new information superhighways, referred to as Full Service Networks (FSN). Other potential technologies for delivering broadband services, which may also be used for information superhighways, include terrestrial and satellite radio transmission systems.

We envision that a future global information superhighway will consist of a hierarchy of interconnected ATM subnetworks covering various geographical areas [F+95, Vec95], as illustrated in Figure 16. The system consists of information providers at various levels, such as entertainment houses, television stations, educational institutions, digital libraries, and many others, that offer various services; network providers that transport media over integrated networks; and several levels of multimedia servers that manage data storage and contain network switches.

A global ATM backbone network will be designed to provide international connectivity by connecting together many national ATM networks. In the example in Figure 16, a national ATM backbone network will likely support up to 100 ATM metropolitan area networks (MANs). Each MAN will be linked to an average of 1,000 broadband access networks (cable head ends or telephone central offices), and each broadband access network (BAN) will support on average 1,000 users or subscribers. In summary, the presented national information superhighway infrastructure will link together about 100 million users. We envision that several technologies (ADSL, HFC, and FTTC) will coexist in providing the final access to users.

Figure 16 Information superhighway structure of the 21st century: one national
ATM backbone network (N-ATM) supports about 100 MANs, each MAN serves
about 1,000 BANs, and each BAN serves about 1,000 users, for a total of about
100 million users.

7 CONCLUSION

In this chapter, we presented basic concepts in multimedia networking and
communication. It is clear that current local area networks, such as Ethernet
and token ring, do not provide either sufficient bandwidth or the transmission
times required for multimedia applications. Therefore, in order to protect
investments in existing networks, they have been adapted to support multimedia
traffic. Adapted traditional networks, such as switched and fast Ethernet
and priority token ring, are capable of transmitting multimedia data; however,
they still have many limitations and drawbacks.

New networking technologies, such as ATM, are much better suited to support
multimedia traffic. They provide both high bandwidth and low and predictable
transmission delay, needed for multimedia applications. Future information su-
perhighways, with the ambitious goal of transporting multimedia data throughout
the world, will be built using these new technologies.

There are still many challenges in the field of multimedia communication and
networking. Once multimedia networks are widely established, the Quality of
Service requirements of multimedia applications will become evident and existing
service models must be refined [Lie95]. Future research will concentrate
on various aspects of multimedia communications, including: (a) developing
efficient network management techniques necessary for high-speed networks, (b)
developing models and techniques for traffic management, (c) developing admission
control algorithms to guarantee QOS requirements, and (d) developing
intelligent network switches to satisfy these QOS requirements. Network
traffic is becoming more complex, and therefore a great challenge is to
develop new models for traffic characterization, particularly the traffic characterization
of compressed video sources [Lie95].

REFERENCES
[AE92] S. R. Ahuja and J. R. Ensor, "Coordination and Control of Multime-
dia Conferencing". IEEE Communications Magazine, Vol. 30, No.5, May
1992, pp. 38-43.
[AS94] F. Adlestein and M. Singhal, "Priority Ethernets: Multimedia Support
on Local Area Networks", Proceedings of the Int. Conference on Distributed
Multimedia Systems and Applications, Honolulu, Hawaii, August 1994, pp.
45-48.
[B+95] D.E. Blahut, T.E. Nichols, W.M. Schell, G.A. Story, and E.S.
Szurkowski, "Interactive Television", Proceedings of the IEEE, Vol. 83,
No.7, July 1995, pp. 1071-1088.
[Bou92] J. Y. L. Boudec, "The ATM: A Tutorial", Computer Networks and
ISDN Systems, Vol. 24, 1992, pp. 279-309.

[Cla92] W. J. Clark, "Multiport Multimedia Conferencing", IEEE Communi-


cations, Vol. 30, No.5, May 1992, pp. 44-50.

[Cri93] M. Crimmins, "Analysis of Video Conferencing on a Token Ring Local


Area Network", Proc. ACM Multimedia 93, ACM Press, New York, 1993,
pp. 301-310.
[DCH94] "Data Communication Handbook", Siemens Stromberg-Carlson,
Boca Raton, Florida, 1994.

[DCT94] I. Dalgic, W. Chien, and F.A. Tobagi, "Evaluation of 10Base-T and


100Base-T Ethernets Carrying Video, Audio, and Data Traffic", Proceed-
ings of the IEEE Infocom, Toronto, Canada, June 1994.

[F+95] B. Furht, D. Kalra, F. Kitson, A. Rodriguez, and W. Wall, "Design Issues


for Interactive Television Systems", IEEE Computer, Vol. 28, No.5, May
1995, pp. 25-39.
[FM95] B. Furht and M. Milenkovic, "A Guided Tour of Multimedia Systems
and Applications", IEEE Tutorial Text, IEEE Computer Society Press,
1995.

[Fur94] B. Furht, "Multimedia Systems: An Overview", IEEE Multimedia, Vol.
1, No. 1, Spring 1994, pp. 47-59.

[Jai93] R. Jain, "FDDI: Current Issues and Future Plans", IEEE Communica-
tions Magazine, September 1993, pp. 98-105.
[KJ91] M. Kawarasaki and B. Jabbari, "B-ISDN Architecture and Protocol",
IEEE Journal on Selected Areas in Communications, Vol. 9, No.9, De-
cember 1991, pp. 1405-1415.

[KMF94] H. Kalva, R. Mohammad, and B. Furht, "A Survey of Multimedia


Networks", Technical Report TR-CSE-94-42, Florida Atlantic University,
Dept. of Computer Science and Engineering, Boca Raton, Florida, Septem-
ber 1994.

[KMS95] P.J. Kyees, R.C. McConnell, and K. Sistanizadeh, "ADSL: A New


Twisted-Pair Access to the Information Highway", IEEE Communications
Magazine, Vol. 33, No.4, April 1995, pp. 52-59.
[Kri90] B. Krishnamurthy, "FDDI-II: Capacity Analysis", Proceedings of the
IEEE Int. Conference on Computers and Communications, 1990, pp. 546-
550.

[Lea88] D.M. Leakey, "Integrated Services Digital Networks: Some Possible


Ongoing Evolutionary Trends", Computer Networks and ISDN Systems,
Vol. 15, No.5, October 1988, pp. 303-312.

[Lie95] J. Liebeherr, "Multimedia Networks: Issues and Challenges", IEEE


Computer, Hot Topics, Vol. 28, No.4, April 1995, pp. 68-69.

[Paf95] A. Paff, "Hybrid Fiber/Coax in the Public Telecommunications Infras-


tructure" ,IEEE Communications Magazine, Vol. 33, No.4, April 1995, pp.
40-45.

[RKK94] R.R. Roy, A.K. Kuthyar, and V. Katkar, "An Analysis of Universal
Multimedia Switching Architectures", AT&T Technical Journal, Vol. 73,
No.6, November/December 1994, pp. 81-92.

[Ros86] F.E. Ross, "FDDI Tutorial", IEEE Communications Magazine, Vol.


24, No.5, May 1986, pp. 10-17.

[Roy94] R.R. Roy, "Networking Constraints in Multimedia Conferencing and


the Role of ATM Networks", AT&T Technical Journal, Vol. 73, No.4,
July/August 1994, pp. 97-108.

[Sak93] S. Sakata, "B-ISDN Multimedia Workstation Architecture", IEEE


Communications Magazine, August 1993, pp. 64-67.

[Stu95] H.J. Stuettgen, "Network Evolution and Multimedia Communication",


IEEE Multimedia, Vol. 2, No.3, Fall 1995, pp. 42-59.

[SVH91] E. D. Sykas, K. M. Vlakos, and M. J. Hillyard, "Overview of ATM


Networks: Functions and Procedures", Computer Communications, Vol.
14, No. 10, December 1991, pp. 615-626.
[Tha93] C.S. Thachenkary, "Integrated Services Digital Networks (ISDN): Six
Case Study Assessment of a Commercial Implementation", Computer Net-
works and ISDN Systems, Vol. 25, No.8, March 1993, pp. 921-932.

[Vec95] M.P. Vecchi, "Broadband Networks and Services: Architecture and


Control", IEEE Communications Magazine, Vol. 33, No.8, August 1995,
pp. 24-32.

[WK90] G. M. Woodruff and R. Kositpaiboon, "Multimedia Traffic Manage-


ment Principles for Guaranteed ATM Network Performance", IEEE Jour-
nal on Selected Areas in Communications, Vol. 8, No.3, April 1990, pp.
437-446.
6
MULTIMEDIA
SYNCHRONIZATION
B. Prabhakaran
Department of Computer Science & Engineering,
Indian Institute of Technology, Madras, India

1 INTRODUCTION
Multimedia information comprises different media streams such as text, images,
audio, and video. The presentation of multimedia information to the
user involves spatial organization, temporal organization, delivery of the components
composing the multimedia objects, and allowing the user to interact
with the presentation sequence. The presentation of multimedia information
can be either live or orchestrated. In a live presentation, multimedia objects are
acquired in real-time from devices such as video cameras and microphones. In
an orchestrated presentation, the multimedia objects are typically acquired from
stored databases. The presentation of objects in the various media streams has
to be ordered in time. Multimedia synchronization refers to the task of coordinating
this ordering of presentation of various objects in the time domain. In
a live multimedia presentation, the ordering of objects in the time domain is
implied and is dynamically formulated. In an orchestrated presentation, this
ordering is explicitly formulated and stored along with the multimedia objects.

Synchronization can be applied to concurrent or sequential presentation of objects
in the various media streams composing the multimedia presentation.
Figure 1 shows an example of synchronization during sequential and concurrent
presentation. Consider the concurrent presentation of audio and images
shown in Figure 1. Presentation of the objects in the individual media streams,
audio and image, is sequential. The points of synchronization of the presentation
correspond to the change of an image and the beginning of a new audio
clip. Multimedia presentation applications must ensure that these points
of synchronization are preserved.

Figure 1 Synchronization in multimedia presentation: (i) sequential synchronization;
(ii) concurrent synchronization.

The points of synchronization in a multimedia presentation can be modified by
the user going through the presentation. In an orchestrated presentation, for
example, a user may interact by giving inputs such as skip event(s), reverse
presentation, navigate in time, scale the speed of presentation, scale the spatial
requirements, handle spatial clash, and freeze and restart of a presentation. User
inputs such as skip, reverse presentation, and navigate in time modify the sequence
of objects that are being presented. User inputs such as scaling the speed of
presentation modify the presentation duration of the objects. User inputs such
as handling spatial clash on the display screen make the corresponding media
stream active or passive depending on whether the window is in the foreground
or background. Similarly, while the freeze user input suspends the activities
on all streams, the restart input resumes the activities on all streams. In the
presence of non-sequential storage of multimedia objects, data compression,
data distribution, and random communication delays, supporting these user
operations can be very difficult.

1.1 Modeling Temporal Relations


The points of synchronization in a multimedia presentation are described by
the corresponding temporal information. Temporal information can be represented
by time points or time instants and by time intervals. Time instants
are characterized by specifications such as at 9.00 AM, and time intervals by
specifications such as 1 hour or 9.00 AM to 5.00 PM. A time interval is defined
by two time instants: the start and the end. A time instant is a zero-length

moment in time, whereas a time interval has a duration associated with it. A
time interval can be defined as an ordered pair of instants with the first instant
less than the second [5]. Time intervals can be formally defined as follows [13].
Let (S, ≤) be a partially ordered set, and let a, b be any two elements of S such
that a ≤ b. The set {x | a ≤ x ≤ b} is called an interval of S, denoted by [a, b].
Temporal relations can then be defined based on the start and end time instants
of the involved intervals. Given any two time intervals, there are thirteen ways
in which they can relate in time [5], depending on whether they overlap, meet,
precede, etc. Figure 2 shows a timeline representation of the temporal relations.
Six of the thirteen relations are inverses of the seven relations that are shown in Figure 2.

Figure 2 Temporal relations between two events: (i) a before b; (ii) a meets b;
(iii) a overlaps b; (iv) b finishes a; (v) a starts b; (vi) b during a; (vii) a equals b.
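The thirteen relations can be determined purely from endpoint comparisons of the
two intervals. The sketch below (Python; the function and the spelled-out relation
names are our own illustration, not taken from [5] or [13]) classifies a pair of
intervals:

    def allen_relation(a_start, a_end, b_start, b_end):
        # Classify the relation of interval a with respect to interval b.
        if a_end < b_start:   return "before"
        if a_end == b_start:  return "meets"
        if b_end < a_start:   return "before (inverse)"
        if b_end == a_start:  return "meets (inverse)"
        if (a_start, a_end) == (b_start, b_end):  return "equals"
        if a_start == b_start:
            return "starts" if a_end < b_end else "starts (inverse)"
        if a_end == b_end:
            return "finishes" if a_start > b_start else "finishes (inverse)"
        if b_start < a_start and a_end < b_end:   return "during"
        if a_start < b_start and b_end < a_end:   return "during (inverse)"
        return "overlaps" if a_start < b_start else "overlaps (inverse)"

    # Example: an audio clip over [0, 4] and an image shown over [0, 10].
    print(allen_relation(0, 4, 0, 10))   # "starts"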

Multimedia presentation can be modeled by temporal intervals, with the time
and duration of presentation of the multimedia objects being represented by
individual time intervals. The time dependencies of the multimedia objects
composing the presentation are brought out by their temporal relationships. In
some cases, it may not be possible to formulate precise temporal relationships
among the multimedia objects. However, relative temporal relationships can be
specified among the objects to be presented. This mode of fuzzy representation,
or relative temporal specification, is to be contrasted with absolute
specification. As an example, the specification present object A at time T is an
absolute temporal relation, whereas present object A with (or after) object B is

a relative temporal specification. Consider a scenario where n objects are to be


presented and the exact temporal requirements of the composing objects are not
available. Using fuzzy temporal specification, one can formulate a relationship
saying that all the n - 1 objects are to be presented during the presentation of
the object that has the longest duration.

1.2 Synchronization Specification Methodologies
Multimedia synchronization specification in terms of the temporal relationships
can be absolute or relative. Even in the case of absolute temporal relations,
the time deadlines are soft. For example, nothing untoward happens if the
object is not available for presentation at the scheduled time instant. Method-
ologies used for synchronization specification should capture all the temporal
interdependencies of the objects composing the multimedia presentation and at
the same time should have the flexibility for indicating soft deadlines. A syn-
chronization methodology should also allow the user to modify the sequence
of presentation dynamically. The temporal relationships in orchestrated mul-
timedia presentations have to be captured in the form of a database and a
specification methodology must be suitable for developing such a database.

One approach to the specification of multimedia synchronization is to use parallel
language paradigms that can be parsed during presentation to the user.
Here, the presentation of each individual media stream is assigned to an executable
process and these processes synchronize via inherent language constructs.
Language based schemes for multimedia synchronization have been proposed
in [9, 11, 14]. Another approach is to use graphical models. Graphical models
have the additional advantage of providing a visual representation of the
synchronization specification. Graphical models for synchronization specification
have been based on Petri nets [1] and Time-flow Graphs [19]. Another
new approach to synchronization specification has been proposed in [21]. This
proposal, called content-based synchronization, is based on the semantic
structure of the objects composing the multimedia presentation.

Outline of this Chapter: In this chapter, we discuss the various method-


ologies used for describing multimedia synchronization in an orchestrated mul-
timedia presentation. In Section 2, we examine language based models for
multimedia synchronization. In Sections 3 and 4, we discuss graphical mod-
els based on Petri nets and Time-flow Graphs. In Section 5, we study the
content-based approach for synchronization specification. After discussing the

synchronization methodologies, we present the database storage aspects of multimedia
synchronization in Section 6. We then consider distributed orchestrated
multimedia presentations where objects have to be retrieved over a computer
network before their delivery to the user. In Sections 7 and 8, we describe how
a multimedia synchronization specification can be applied to derive schedules
for object retrieval over computer networks and to derive the communication
requirements of an orchestrated presentation.

2 LANGUAGE BASED
SYNCHRONIZATION MODEL
Language based synchronization models view multimedia synchronization in
the context of operating systems and concurrent programming languages. Based
on an object model for a multimedia presentation, the synchronization characteristics
of the presentation are described using concurrent programming languages.
As discussed in [9], the basic synchronization characteristics of a multimedia
presentation, as viewed from the point of view of programming languages,
are:

• number of multimedia objects involved


• blocking behavior of individual multimedia objects while waiting for syn-
chronization

• allowing more than two multimedia objects to combine many synchronization
events

• ordering of synchronization events


• coverage of real-time aspects

The above characteristics influence the basic features required from a concurrent
programming language in order to describe a multimedia presentation. For
example, the number of involved objects describes the communication among
the multimedia objects that should be described by the language. Blocking
behavior of a multimedia object describes the nature of synchronization mechanism
that is required, viz., blocked mode, non-blocked mode, and restricted
blocked mode. The non-blocked mode would never force synchronization but
allows objects to exchange messages at specified states of execution. Blocked

mode forces an object to wait till the cooperating object(s) reach the specified
state(s). Restricted blocked mode is a concept introduced in [9]. In the context
of multimedia presentation, restricted blocking allows an activity to be repeated
by an object till the synchronization is achieved with the cooperating object(s).
The activity can be presenting the last displayed image for an image object or
playing out pleasant music in the case of an audio object. Restricted blocking
basically allows the multimedia presentation to be carried out in a more user
friendly manner.

More than two objects may have to be allowed to combine many synchronizing
events. This is especially true in presentations where many streams of information
are presented concurrently. A programming language should be able
to describe this scenario. The ordering of synchronization events becomes necessary
when two or more synchronization events are combined by some kind
of a relation. The ordering can be pre-defined, prioritized, or based on certain
conditions. As multimedia presentations deal with data streams conveying information
to be passed in real-time, synchronization of the events composing
the presentation incorporates real-time features.

2.1 Enhancements For Concurrent Programming Languages
Following the above discussions, additional features are necessary in a concur-
rent programming language for describing real-time features and for describing
restricted blocking. In this section, we discuss these enhancements for a con-
current programming language suggested in [9].

Real-Time Synchronization Semantics: In a multimedia presentation,


events have to be synchronized at specified times or within a time interval. As
an example, video images and the corresponding audio have to be synchronized
within 150 milliseconds. In [9], the following terms for multimedia relations are
introduced for a concurrent programming language.

• Timemin is the minimum acceptable delay between two synchronization


events. The usual parameter is "0".

• Timeave is the ideal delay between synchronization events, i.e., the syn-
chronization should be as close to timeave as possible, though it might
occur before or after timeave. The usual timeave parameter is also "0".

• Timemax is the time-out value and refers to the maximum acceptable
delay between synchronization events. The parameter for timemax varies
depending on the type of media objects involved in the synchronization.

Expression                              Meaning
timemin(....)                           Minimum acceptable delay between two
                                        synchronizing events.
timeave(....)                           Ideal delay.
timemax(....)                           Maximum acceptable delay.
timemin(....) AND timemax(....)         Synchronization between timemin and timemax.
timemin(....) AND timeave(....)         Synchronization closer to timeave, never
                                        before timemin.
timemax(....) AND timeave(....)         Synchronization closer to timeave, never
                                        after timemax.
timemin(....) AND timemax(....)         Synchronization closer to timeave, between
  AND timeave(....)                     timemin and timemax.

Table 1 Semantics For Basic Real-Time Synchronization.

The proposal in [9] also allows the above specified temporal operands to be
combined. The operands timemin and timemax are not commutative, thereby
allowing specification of a different delay depending on which multimedia object
executes its synchronizing operation first. The temporal operands can be
combined using the logical operator AND. Table 1 describes the possible temporal
expressions and their associated meaning.
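As a rough illustration of how such combined constraints might be checked at
presentation time, the following sketch (Python; the function name and return
convention are our own, not part of the language extensions in [9]) validates an
observed synchronization delay against timemin, timemax, and timeave:

    def check_sync_delay(delay, timemin=None, timemax=None, timeave=None):
        # Operands left as None are treated as unspecified; combining several
        # operands mirrors the AND composition of Table 1.
        if timemin is not None and delay < timemin:
            return False, "synchronized too early (before timemin)"
        if timemax is not None and delay > timemax:
            return False, "synchronized too late (after timemax)"
        deviation = abs(delay - timeave) if timeave is not None else None
        return True, "acceptable (deviation from timeave: %s)" % deviation

    # Audio/video example: target delay 0, at most 0.150 s of skew tolerated.
    print(check_sync_delay(0.120, timemin=0, timemax=0.150, timeave=0))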

Restricted Blocking Semantics : Figure 3 shows a possible multime-


dia presentation scenario with restricted blocking. Here, activity A presents a
full motion video while the activity B presents the corresponding audio. Ac-
tivity A, if it first waits for synchronization with activity B, keeps displaying
the last presented image. In a similar manner, activity B, if it first waits for
synchronization with activity A, plays out pleasant music. The proposal in [9]
describes a WHILE-WAITING clause for specifying restricted blocking. The
program segment in Table 2 shows the synchronization between the activities
A and B (shown in Figure 3) with restricted blocking and real-time restrictions.
The synchronization delay between the two objects should be between 1 second
of the audio object ahead of the video object and 2 seconds of the video object

Figure 3 Restricted blocking in multimedia presentation: activity A (video) waits
for synchronization by re-displaying its last image until activity B (audio) is ready
for synchronization, after which both activities proceed further.

program of Object_A :              program of Object_B :
  -- video                            -- audio
  -- from Source A                    -- from Source B
  display(Object_A)                   display(Object_B)
  SYNCHRONIZE WITH                    SYNCHRONIZE WITH
    Object_B AT end                     Object_A AT end
  MODE restricted_blocking            MODE restricted_blocking
  WHILE_WAITING                       WHILE_WAITING
    display(last_image)                 play(music)
  TIMEMIN 0                           TIMEMIN 0
  TIMEMAX 1s                          TIMEMAX 2s
  TIMEAVE 0                           TIMEAVE 0
  EXCEPTION display(last_image)       EXCEPTION play(music)

Table 2 Program With Restricted Blocking & Time Constraints.

ahead of the audio object. However, the target is no delay (with timemin being
0).
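A minimal sketch of restricted blocking using operating system threads (Python;
our own illustration of the idea, not the concurrent-language constructs of [9]):
each activity finishes its object, signals readiness, and then repeats a filler action
until the partner is ready or its timemax expires.

    import threading, time

    def activity(name, work_time, filler, my_ready, peer_ready, timemax):
        time.sleep(work_time)               # present the object (video or audio)
        my_ready.set()                      # signal "ready for synchronization"
        waited = 0.0
        while not peer_ready.is_set() and waited < timemax:
            filler()                        # restricted blocking: repeat the filler activity
            time.sleep(0.1)
            waited += 0.1
        print(name, "proceeding further")

    a_ready, b_ready = threading.Event(), threading.Event()
    video = threading.Thread(target=activity, args=(
        "A (video)", 0.5, lambda: print("A: re-display last image"),
        a_ready, b_ready, 1.0))             # TIMEMAX 1 s for the video activity
    audio = threading.Thread(target=activity, args=(
        "B (audio)", 1.0, lambda: print("B: play pleasant music"),
        b_ready, a_ready, 2.0))             # TIMEMAX 2 s for the audio activity
    video.start(); audio.start(); video.join(); audio.join()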

Summary: Language based specification of the multimedia synchronization


characteristics uses concurrent programming language features. Enhancements

to the facilities offered by the programming languages are needed for describing
certain features of multimedia presentation such as real-time specification and
restricted blocking. However, language based specifications can become com-
plex when a multimedia presentation involves multiple concurrent activities
that synchronize at arbitrary time instants. Also, modifying the presentation
characteristics might even involve rewriting the program based specification.

3 PETRI NETS BASED MODELS


Petri Nets, as a modeling tool, have the ability to describe real-time process
requirements as well as inter process timing relationships, as required for multi-
media presentations [1, 2]. A Petri net is a bipartite graph consisting of 'place'
nodes and 'transition' nodes. Places, represented by circles, are used to rep-
resent conditions; transitions, drawn as bars, are used to represent events. A
Petri net structure P is a triple.

P = { T, P, A } where

T = { t1, t2, ... } represents a set of transitions (bars),

P = { p1, p2, ... } represents a set of places (circles),

A : { T * P } ∪ { P * T } → I, I = { 1, 2, ... }, representing a set of integers.

The arcs represented by A describe the pre- and post-relations for places and
transitions. The set of places that are incident on a transition t is termed the
input places of the transition. The set of places that follow a transition t is
termed the output places of the transition. A Marked Petri Net is one in which
a set of 'tokens' is assigned to the places of the net. Tokens, represented by
small dots inside the places, are used to define the execution of the Petri net,
and their number and position change during execution. The marking (M) of a
Petri net is a function from the set of places P to the set of nonnegative integers
I, M : P → I. A marking is generally written as a vector (m1, m2, ..., mn) in
which mi = M(pi). Each integer in a marking indicates the number of tokens
residing in the corresponding place (say pi).

Execution of a Petri net implies firing of a transition. In a Marked Petri net,
a transition t is enabled for firing iff each of its input places contains at least one

token, i.e.,

∀ p ∈ InputPlace(t) : M(p) ≥ 1.

Firing t consists of removing one token from each of its input places and adding
one token to each of its output places, and this operation defines a new marking.
An execution of a Petri net is described by a sequence of markings, beginning
with an initial marking M0 and ending in some final marking Mf. The marking of
a Petri net defines the state of the net. During execution, the Petri net moves
from a marked state Mi to another state Mj by firing any transition t that is
enabled in the state Mi.
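To make the firing rule concrete, a minimal marked Petri net sketch (Python; the
dictionaries and the tiny net are our own illustration) shows how firing an enabled
transition produces the next marking:

    # Each transition lists its input and output places; a marking maps places to token counts.
    transitions = {
        "t1": {"in": ["p1", "p2"], "out": ["p3"]},
        "t2": {"in": ["p3"], "out": ["p1", "p2"]},
    }
    marking = {"p1": 1, "p2": 1, "p3": 0}      # initial marking M0

    def enabled(t, m):
        # Enabled iff every input place holds at least one token.
        return all(m[p] >= 1 for p in transitions[t]["in"])

    def fire(t, m):
        # Remove one token from each input place, add one to each output place.
        new = dict(m)
        for p in transitions[t]["in"]:
            new[p] -= 1
        for p in transitions[t]["out"]:
            new[p] += 1
        return new

    print(enabled("t1", marking))   # True
    print(fire("t1", marking))      # {'p1': 0, 'p2': 0, 'p3': 1}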

For the purpose of modeling time-driven systems, the notion of time was introduced
in Petri nets, calling them Timed Petri Nets (TPN) [4]. In TPN
models, the basic Petri net model is augmented by attaching an execution time
variable to each node in the net. The node in the Petri net can be either a
place or a transition. In a marked TPN, a set of tokens is assigned to places.
A TPN N is defined as

N = { T, P, A, D, M } where

M : P → I represents the marking M, which assigns tokens (dots) to each place in
the net,

D : P → R represents durations as a mapping from the set of places to the
set of real numbers,

and the other descriptions T, P, A are the same as those of the normal Petri net.

A transition is enabled for execution iff each of its input places contains at least
one token. The firing of a transition causes a token to be held in a locked state
for a specified time interval in the output place of the transition, if the time
intervals are assigned to places. If time intervals are assigned to transitions, the
token is in a transition state for the assigned duration. The execution of a Petri
net process might have to be interrupted in order to carry out another higher-priority
activity. Pre-emption of the on-going execution of a Petri net process
can be modeled using escape arcs to interrupt the execution of a process [6]. In
this section, we examine Petri net based models for the purpose of describing
the synchronization characteristics of the multimedia components.

3.1 Hypertext Models Using Timed Petri Nets

The TPN concept has been used to specify the browsing semantics in the Trellis Hypertext
System [7, 10]. This Petri net based Hypertext model allows browsing
events to be initiated by the reader or by the document itself. A synchronous
hypertext H is defined as a 7-tuple H = < N, M0, C, W, B, Pl, Pd > in which

N is a timed Petri net structure

M0 is an initial state (or initial marking) for N

C is a set of document contents


W is a set of windows

B is a set of buttons

Pl is a logical projection for the document

Pd is a display projection for the document.

In the above definition, the structure of the timed Petri net specifies the struc-
ture of the hypertext document. A marking in the hypertext hence represents
the possible paths through a hyperdocument from the browsing point it rep-
resents. The initial marking of a synchronous hypertext therefore describes a
particular browsing pattern. The definition also includes several sets of com-
ponents (contents, windows, and buttons) to be presented to the user going
through the document. Two collections of mappings or projections are also
defined: one from the Petri net to the user components and another from the
user components to the display mechanism.

The content elements from the set C can be text, graphics, still image, motion
video, audio information or another hypertext. A button is an action selected
from the set B. A window is defined as a logically distinct locus of information
and is selected from the set W. Pl, the logical projection, provides a mapping
from the components of a Petri net (place and transitions) to the human-
consumable portions of a hypertext (contents, windows and buttons). A content
element from the set C and a window element for the abstract display of the
content from the set W, are mapped to each place in the Petri net. A logical
button from the set B is mapped to each transition in the Petri net. Pd,
the display projection, provides a mapping from the logical components of

the hypertext to the tangible representations, such as screen layouts, sound


generation, video, etc. The layout of the windows and the way text information
and buttons are displayed is determined by Pd.

Execution Rules For Trellis Hypertext


The net execution rule in Trellis hypertext is a hybrid one, with a combination
of a singleton transition firing rule (representing the actions of the hyperdocu-
ment reader) and a maximal transition firing rule (to allow simultaneous auto-
matic or system-generated events). The execution rule works as follows.

• All the enabled transitions are identified. The enabled transitions set con-
sists of both timed out ones (i.e., have been enabled for their maximum
latency) and active ones (i.e., have not timed out).

• One transition from the group of active ones and a maximal subset from
the group of timed-out transitions will be chosen for firing.

This two-step identification of transitions to be fired is basically for modeling
both reader and timer based activity in the hypertext. A reader may select a
button (fire a transition) during the same time slot in which several system-generated
events occur. In this case, the Trellis model gives priority to the
reader's action by choosing the reader-fired transition first. When a transition
is fired, a token is removed from each of its input places and a token is added
to each of its output places.

The implication of the firing rules is such that the document content elements
are displayed as soon as tokens arrive in places. The display is maintained for a
particular time interval before the next set of outgoing transitions is enabled.
After a transition t becomes enabled, its logical button is made visible on the
screen after a specified period of time. If the button remains enabled without
being selected by the reader, the transition t will fire automatically at its point
of maximum latency. The way the hypertext Petri net is structured, along with
its timings, describes the browsing actions that can be carried out. In effect,
windows displaying document contents can be created and destroyed without
explicit reader actions, control buttons can appear and disappear after periods
of inactivity, and at the same time user-interactive applications can be created
with a set of active nodes. Petri nets being a concurrent modeling tool, multiple
tokens are allowed to exist in the net. These tokens can be used to effect the

action of multiple users browsing one document or to represent multiple content


elements being visible together.

Trellis Hypertext: An Example

Figure 4 Guided tour using trellis hypertext.

We shall consider the guided tour example discussed in [7]. In a guided tour,
a set of related display windows is created by an author. All the windows in
a set are displayed concurrently. A tour is constructed by linking such sets
to form a directed graph. The graph can be cyclic as well. From anyone set
of display windows, there may be several alternative paths. Figure 4 shows
a representation of guided tour using the Trellis hypertext. Here, the set of
windows to be displayed at any instant of time is described by a Petri net
place. A place is connected by a transition to as many places as there are sets
of windows to be displayed. A token is placed in the place(s) representing the
first set of windows to be displayed. The actual path of browsing is determined
by the user going through the information. For example, when the token is in
p1, the information contents associated with p1 are displayed and the buttons
for the transitions t1 and t2 are selectable by the user.

Summary: The synchronous hypertext models both the logically linked


structure of a hyperdocument and a timed browsing semantics in which the
reader can influence the browsing pattern. The Trellis Hypertext model uses
Petri net structure for representing user interaction with a database (basically
for browsing actions). However, in the Trellis model, the user's inputs such as

freeze and restart actions, scaling the speed of presentation (fast forward or
slow motion playout) and scaling the spatial requirements cannot be modeled.
Another aspect in the Trellis model is that the user can interact with the
hypertext only when the application allows him to do so (i.e., when the buttons
can be selected by the user). Random user behavior where one can possibly
initiate operations such as skip or reverse presentation at any point in time,
are not considered in Trellis hypertext.

3.2 The Object Composition Petri Nets


The Object Composition Petri Net (OCPN) model has been proposed in [8,
12] for describing multimedia synchronization. The OCPN is an augmented
Timed Petri Nets (TPN) model with values of time represented as durations
and resource utilization being associated with each place in the net. An OCPN
is defined as

OCPN = { T, P, A, D, R, M } where

R : P → { r1, r2, ... } represents the mapping from the set of places to a set of
resources,

with the other descriptions being the same as those in the TPN model.

The execution of the OCPN is similar to that of TPNs where the transition
firing is assumed to occur instantaneously and the places are assumed to have
states. The firing rules for the OCPN are as follows:

1. A transition ti fires immediately when each of its input places contains an
unlocked token.

2. Upon firing the transition ti, a token is removed from each of its input
places and a token is added to each of its output places.

3. A place pi remains in an active state for a specified duration τi associated
with the place, after receiving a token. The token is considered to be in a
locked state for this duration. After the duration τi, the place pi becomes
inactive and the token becomes unlocked.
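The locked-token rule can be simulated with a small event-driven loop. The sketch
below (Python; the durations and the single-transition net are our own illustration
of the rules above, not code from [8, 12]) keeps a token locked for the duration of
its place and fires a transition only when all its input tokens are unlocked:

    import heapq

    duration = {"a1": 3.0, "v1": 3.0, "p_sync": 0.0}   # hypothetical playout durations (s)
    transitions = {"t1": {"in": ["a1", "v1"], "out": ["p_sync"]}}

    unlocked = {p: False for p in duration}            # token state per place
    events = []                                        # (time, place) unlock events
    for p in ("a1", "v1"):                             # initial marking: audio and video start at t = 0
        heapq.heappush(events, (duration[p], p))

    while events:
        now, place = heapq.heappop(events)
        unlocked[place] = True                         # place becomes inactive, token unlocked (rule 3)
        for t, arcs in transitions.items():
            if all(unlocked[p] for p in arcs["in"]):   # rule 1: all input tokens unlocked
                print("t = %.1f s: firing %s" % (now, t))
                for p in arcs["in"]:                   # rule 2: consume input tokens
                    unlocked[p] = False
                for p in arcs["out"]:                  # output places stay active for their durations
                    heapq.heappush(events, (now + duration[p], p))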

It has been shown in [8, 12] that the OCPN can represent all the possible
thirteen different temporal relationships between any two object presentations.
Petri nets have a hierarchical modeling property which states that subnets of
a Petri net can be replaced by equivalent abstract places, and this property is
applicable to the OCPN as well. Using the subnet replacement, an arbitrarily
complex process model composed of temporal relations can be constructed with
the OCPN by choosing pairwise temporal relationships between process entities [8].
The OCPNs are also proved to be deterministic since no conflicts are
modeled, transitions are instantaneous, and tokens remain at places for known,
finite durations. OCPNs are also demonstrated to be live and safe, following
the Petri net definitions of these properties.

Synchronization Representation Using OCPN: An Example
We can consider a specific example of an orchestrated presentation and use the example
to describe its OCPN representation. Figure 5 illustrates the media streams
representation of the example, comprising four streams of information: audio
(A), motion video (V), and two image (I1, I2) streams. The synchronization
characteristics of the presentation are such that audio and video objects are synchronized
for every object. Image stream I1 synchronizes with audio and video
streams once in three object intervals and the stream I2 synchronizes at the
start and the end of the presentation.

Figure 6 illustrates the OCPN model of the multimedia presentation example
shown in Figure 5. The transitions shown in the net represent the points of
synchronization and the places represent the processing of information. When
the audio object represented by the place a1 is presented completely, the token
gets unlocked. Since the video object represented by the place v1 is presented
for the same duration as a1, the place v1 unlocks its token synchronously with
place a1. Now the common transition is fired immediately, thereby allowing the
next set of video and audio objects to be presented. In a similar manner, the
presentation of image object x1 synchronizes with a2 and v2, and x2 synchronizes
with a4 and v4.

Summary: The OCPN represents a graph-based specification scheme for


describing the temporal relationships in an orchestrated multimedia presenta-
tion. Comparing the OCPN with the Trellis Hypertext model, the Trellis model
can have multiple outgoing arcs from places, and therefore can represent non-
deterministic and cyclic browsing, with user interactions. The OCPN specifies

Figure 5 An example of multimedia presentation.

Figure 6 OCPN model for the example in Figure 5.

exact presentation-time play-out semantics and is hence useful in presentation
orchestration.

The OCPN model ignores the spatial considerations required for composition of
multimedia objects. The spatial characteristics can be assigned to the resource
component of the OCPN model. Modeling the occurrence of spatial clashes
(when two processes require the same window on the screen), however, cannot
be done using the OCPN. Also, the OCPN model does not provide many
facilities for describing user inputs to modify the presentation sequence. For
instance, the user's wish to stop a presentation, reverse it, or skip a few frames
cannot be specified in the existing OCPN architecture. Also, user inputs for
freezing and resuming an on-going presentation, or scaling the speed or spatial
requirements of a presentation, cannot be described by the OCPN model.

3.3 Dynamic Timed Petri Nets


A Dynamic Timed Petri Nets (DTPN) model has been suggested in [16] by
allowing user defined 'interrupts' to pre-empt the Petri net execution sequence
and modify the time duration associated with the pre-empted Petri net process.
In the DTPN model, nonnegative execution times are assigned to each place
in the net and the notion of instantaneous firing of transitions is preserved.
Basically, the following types of modifications of execution time after pre-emption
are allowed:

1. Deference of execution :
For the operation of pre-emption with deference of execution, the remain-
ing duration associated with the pre-empted Petri net place is changed
considering the time spent till its pre-emption.
2. Termination of execution :
The operation of pre-emption with termination is like premature ending
of the execution.
3. Temporary modification of remaining time duration :
The operation of pre-emption with modification of execution time is like
'setting' the time duration associated with the pre-empted Petri net place
to a 'new value', as appropriately determined by the type of user input.
For temporary modification of execution time, the remaining time duration
associated with the place is modified.
4. Permanent modification of execution time duration :
For permanent modification, the execution time duration associated with
the place is modified.
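A rough sketch of these four kinds of modification (Python; the class and attribute
names are our own reading of the model, not code from [16]) keeps, for one place,
the assigned duration D and the remaining duration E:

    class DTPNPlace:
        # Toy place with an assigned duration D and a remaining duration E.
        def __init__(self, duration):
            self.D = duration      # assigned execution time
            self.E = None          # remaining time; None plays the role of INVALID

        def activate(self):
            self.E = self.D

        def preempt(self, mode, elapsed=0.0, new_value=None):
            if mode == "deference":        # keep the unexpired part for later resumption
                self.E = self.E - elapsed
            elif mode == "termination":    # premature end of execution
                self.E = None
            elif mode == "temporary":      # modify only the remaining duration
                self.E = new_value
            elif mode == "permanent":      # modify the assigned duration itself
                self.D = new_value

    p = DTPNPlace(10.0)
    p.activate()
    p.preempt("deference", elapsed=4.0)
    print(p.E)    # 6.0 seconds remain when the place is reactivated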

DTPN Structure
In the DTPN model, pre-emption is modeled by using escape arcs. Escape
arcs are marked by dots instead of arrow heads and they can interrupt an
active Petri net place. Modification of the execution time duration (temporary or
permanent) associated with a place is modeled by modifier arcs. Modifier arcs
are denoted by double-lined arcs with arrow heads. A Dynamic Timed Petri
Net is defined as

DTPN = { T, P, A, D, M, E, C, Cm } where

C represents a set of escape arcs; C is a subset of P * T, and A and C are
disjoint sets,

Cm represents a set of modifier arcs; Cm is a subset of P * T, and A, C, and
Cm are disjoint sets,

E : P → R represents the remaining duration as a mapping from the set of places
to the set of real numbers. Initially, E is set to INVALID before the
activation of a place. After activation, E is set to a value equal to D. In
case of pre-emption, E maintains the remaining time duration for which
execution is to be carried out if the place is made active again by the firing
of an input transition. After completion of the execution duration, E is
set to INVALID again.

The other descriptions are the same as those in the TPN model.

A place p in a DTPN is an escape place of a transition t iff (p, t) is a member
of the set C. The set of all escape places of t is denoted by Esc(t), and the
set of transitions connected by escape arcs with a place p is denoted by Esc(p),
Esc(p) = {t | p ∈ Esc(t)}. A transition is called a modifier transition if it has
a modifier arc as an output arc. This notation is generalized in the usual way
for sets of places and transitions. A typical DTPN structure with escape and
modifier arcs is shown in Figure 7. The execution rules of the DTPN are
discussed in [16].

Synchronization Models Using DTPN


The proposed DTPN can be effectively used in synchronization models for
flexible multimedia presentation with user participation. DTPN constructions

Figure 7 Dynamic Timed Petri Nets structure (legend: escape arc, modifier arc,
modifier transition).

for describing the handling of user inputs such as reverse presentation, freeze, and
restart for a single object presentation are the simplest case. These DTPN constructions
for handling user inputs on single object presentations can be used
in a full multimedia presentation.

Figure 8 Reverse operation (P & T: pre-emption and termination).

The reverse operation can be modeled as shown in Figure 8. Execution of pi
is pre-empted and terminated, and the transition to enable reverse presentation
is fired. In a similar manner, the freeze and restart operation can be modeled
as shown in Figure 9. The type of interrupt to be applied to the execution
sequence is pre-emption with deference of execution. The remaining time duration
associated with pi is modified to reflect the time spent till its pre-emption
and the locked token in pi is removed. A token is added to the output place pj
of the pre-empting transition. A large time duration is associated with pj and
its execution is pre-empted only on the receipt of a restart signal from the user.
Pre-emption of pj causes the token to be 'returned' to the place pi.

Figure 9 Freeze and restart operation (P & T: pre-emption and termination;
P & D: pre-emption and deference).

Figure 10 Skip operation for the example in Figure 5.

In the orchestrated presentation example discussed in Figure 5, consider a skip
operation on the object presentation x1, when the objects a1, v1, and y1 are
being executed in parallel. The DTPN model for the skip operation is shown in
Figure 10. Here, objects a1, v1, a2, and v2 must be skipped in the audio (A) and
video (V) streams. Hence, the input of pre-emption followed by termination
of execution (P&T) is applied for these objects. Object x1 is also given the
same input. For the object y1 in the stream I2, the execution duration must be
modified to reflect the skip operation. This is effected by applying the input

pre-emption followed by temporary modification of execution time (P&TM) for
y1.

Summary: The DTPN model, its structure, and the associated execution
rules can be adopted by the OCPN, where resource utilizations are also
specified. This augmented model can describe multimedia presentations with
dynamic user participation.

4 FUZZY SYNCHRONIZATION MODELS


In [19, 20], multimedia synchronization involving independent sources for teleorchestra
applications (remote multimedia presentation) has been considered.
In the teleorchestra application, a user creates multimedia presentations using
data objects stored in a distributed system. The distributed nature of the
data objects may result in the non-availability of precise temporal relationships
among the objects. However, relative or fuzzy temporal relationships can be
specified among the objects to be presented. A Time-flow Graph (TFG) model
for describing fuzzy temporal relationships has been proposed in [19, 20], where
the temporal interval is taken as a primitive. The terms object X and interval X are
considered synonymous. One temporal interval may contain several objects in
different media, i.e., several concurrent multimedia intervals.

4.1 Time-flow Graph Model


There are thirteen relations (seven relations and their inverses, where 'equal' is
its own inverse) that are considered between any two temporal intervals [5]. The
relations are 'meets', 'before', 'equal', 'overlap', 'during', 'start', and 'finish'. The
other relations are their inverses, e.g., 'di' is the inverse of the relation 'd' (during).
The following set R describes all the thirteen possible temporal relations:

R = {b, e, m, o, d, s, f, bi, mi, di, oi, si, fi}.

The temporal relationships among the involved objects can be parallel or sequential.

Sequential Relations in TFG: The sequential relation between any two
intervals can be either 'meets' (m) or 'before' (b). In teleorchestra applications,
multiple intervals can be involved in a sequential relation requirement. Hence,
the sequential relation specifications provided in TFG are:

1. {A}B : Interval(s) in B will start after all the intervals in A are finished.

2. < A > B : Interval(s) in B will start when one of the intervals (the first)
in A is finished.

Parallel Relations in TFG: A subset Rj (Rj ⊂ R) is defined as

Rj = {e, o, d, s, f, oi, di, si, fi}.

Considering two object intervals X and Y, X(r)Y describes the temporal relationship
between the two intervals. When r = 's', the relation X(s)Y specifies
that objects X and Y are to be displayed with the same start time. In a teleorchestra
application scenario, the temporal relationships have to be specified
despite the lack of duration information of the involved intervals. The presentation
semantics are hence defined in TFG models as follows [19, 20].

1. X(d)Y(d)Z: All the other objects are displayed during the presentation of
the object with the longest presentation duration.

2. X(e)Y(e)Z : The display of all the objects are started simultaneously. The
presentation of all the other objects will be cut off when one of the objects
(first) is finished.

3. ch(X,Y,Z) = Y : The presentation duration of all the involved objects
should equal the one chosen. Objects are displayed according to the relations
r ∈ Rj specified between every two of them. Presentation of some of
the objects might be cut off to equal the chosen duration.

Hence, three duration specifications, r ∈ Rm = {d, e, ch}, can be applied to
presentations involving concurrent multiple intervals.
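A small sketch of these duration semantics (Python; our own reading of the 'd',
'e', and 'ch' specifications, not code from [19, 20]) computes the effective
presentation duration of each object under a given specification:

    def effective_durations(durations, spec, chosen=None):
        # durations: object -> nominal duration; spec: 'd', 'e', or 'ch'.
        if spec == "d":        # all objects displayed during the longest one
            target = max(durations.values())
        elif spec == "e":      # everything cut off when the first (shortest) object finishes
            target = min(durations.values())
        elif spec == "ch":     # everything cut off at the chosen object's duration
            target = durations[chosen]
        # In this sketch an object is never stretched beyond its own nominal duration.
        return {obj: min(d, target) for obj, d in durations.items()}

    clips = {"X": 12.0, "Y": 8.0, "Z": 5.0}
    print(effective_durations(clips, "e"))               # all cut to 5.0
    print(effective_durations(clips, "ch", chosen="Y"))  # cut to at most 8.0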

Using the above sequential and parallel relationship specifications, multimedia
presentation scenarios can be described. In TFG, the notation N is used to
denote all the intervals contained in a scenario. The temporal relationship of an
object interval Nx with other intervals is represented by the interval vector nx.
The interval vector nx is defined as nx = (Sr, Oid, Fs, Fe), where Sr denotes the
source (or the owner) of the interval, Oid the object identifier, and Fs and Fe the
presentation semantics chosen for the start and the end of the display of the object.
An interval vector can represent an object presentation or an intermission. The
intermission vector, represented by nT, describes a presentation interval that is
not mapped to any multimedia object. Intermissions are used to specify gaps
in the multimedia presentation.

For describing multimedia presentations, the model of interval vectors and the
involved temporal information are maintained in a Time-flow Graph (TFG). In a TFG,
intervals are described by nodes. A TFG is defined by the tuple TFG =
{ΔN, Nt, Ed}, where ΔN is a set of nodes for the interval vectors, Nt is a set of
transit nodes, and Ed is a set of directed edges. The model Δx of an interval
vector nx is composed of an interval node Nx, representing the interval of nx,
and δ node(s), representing its parallel relations to other intervals. Nx may
associate with none, one, or two δ nodes. An intermission node has no δ nodes. The
sequential specifications {A}B and < A > B are represented by the transit
nodes in Nt, the sq-node and the tri-node respectively.

4.2 Synchronization Specification Using TFG: An Example

The TFG specifications for the relations A equals B and A overlaps B are shown in
Figure 11(i) and (ii). The square nodes Ns and Ne are the sq-nodes representing
the sequential relation {A}B. Figure 11(ii) has an intermission node nT to describe
the time lag between the events A and B. The δ nodes are used in Figure 11(ii)
to signify the end of the concurrent presentation. The TFG model for the
orchestrated presentation example (discussed in Figure 5) is shown in Figure
11(iii).

Summary: The TFG model for multimedia synchronization can handle relative
and imprecise temporal requirements in teleorchestra applications. The
advantage of the TFG model is that no accurate temporal information, such as
duration values or occurrence points, is required. The TFG can represent all
the possible temporal relations involved in a multimedia presentation scenario.
Comparing the TFG model with the earlier Petri nets based ones, the Petri
nets based models rely on values for the duration of presentation in formulating
the temporal specification. Hence, the Petri nets based models cannot
represent relative synchronization requirements. However, the TFG model
does not address the issue of dynamic user participation during a multimedia
presentation.

Figure 11 TFG synchronization models: (i) X equals Y; (ii) X overlaps Y;
(iii) TFG model of the presentation in Figure 5.

5 CONTENT-BASED INTER-MEDIA
SYNCHRONIZATION
Content-based inter-media synchronization is based on the semantic structure
of the multimedia data objects [21]. Here, a media stream is viewed as a hierarchical
composition of smaller media objects which are logically structured based
on their contents. The temporal relationships are then established among the
logical units of the media objects. The logical units of a media stream are the

semantic events that are either manually identified a priori or automatically


detected by analysis of the media contents.

Traditional approaches for synchronization discussed in the earlier sections consider
each media stream as being composed of a set of objects to be presented
at specified time intervals. Here, the synchronization specifications are given
for describing concurrent relationships among the presentation of objects belonging
to different media streams. Following these approaches, a segment of
a media stream cannot be manipulated without re-establishing the temporal
relationships among all other related media streams. Content-based synchronization,
however, allows such manipulations to be done more easily. As an
example, an audio stream can be dubbed onto a video stream in movie editing
very easily following the content-based synchronization approach.

In this section, we describe the content-based synchronization approach proposed
in [21].

5.1 Hierarchical Classification of Multimedia Objects

The top-most member in the hierarchy is an individual media stream that is
considered as a composite-object. Let us consider a multimedia lecture on the
topic Distributed Multimedia Systems. The video source of the lecture is a
composite-object. A composite-object consists of segment objects. A segment
is defined as an episode. The lecture presentation on Distributed Multimedia
Systems can consist of the following segments: Media Characteristics (compression,
coding, etc.), File System requirements, Device driver requirements,
Network requirements, and Distributed multimedia applications. A segment
object in turn is composed of event objects. An event object is identified by
theme changes in a segment object. As an example, the segment on network
requirements can be composed of Quality of Service (QoS) requirements, network
protocol features, and network access methods. An event is composed of a
set of shot objects. A shot is defined as a sequence of video frames that contains
pictures taken under the same angle of a camera. The hierarchical classification
of the lecture presentation on Distributed Multimedia Systems is illustrated in
Figure 12. Each node in this hierarchical structure is an abstraction of
component objects where information such as component object IDs, temporal
relationships among them, etc., is stored.

Figure 12 Hierarchical classification of objects for the Distributed Multimedia
Systems lecture: the composite object is decomposed into segments (Media
Characteristics, File System, Device Drivers, Network requirements, Distributed
MM Applications), which are in turn decomposed into events and shots (e.g.,
throughput requirements, delay requirements, QoS negotiation, multicasting,
synchronization, ATM, FDDI, Frame Relay).

5.2 Temporal Relations


The temporal relationship among media objects in the content-based approach is
specified with respect to their component media objects only. The temporal
relationships among component objects are defined by the term synchronization
schedule. A synchronization schedule statement consists of Node IDs and operators.
A Node ID describes the object to be presented. An operator is defined
as an infix symbol that dictates the temporal relation between two objects. A
sequential operator (;) and a simultaneous operator (||) are used for describing
the synchronization relationships. Considering two nodes A and B, A;B implies
that the activity B starts at the end of the activity A, and A||B implies that
the two activities go on in parallel. Using a combination of these two operators,
the thirteen different possible temporal relations between media objects can be
described.

The synchronization schedule can be created by the following algorithm given


in [21]. This algorithm works on the hierarchical structure of the multimedia
object.

1. Do a depth-first search on each subtree whose root node is a direct child of
the ROOT node and create an operation schedule for each.

2. Choose a subtree A whose begin[nodeID] = self.

3. Choose another subtree B.

4. If (B's begin[nodeID] = self) then
{ add B's first object ID to A's first object ID with a || operator, B's second
to A's second with a || operator, and so on until whichever runs out of its
IDs first.
If A runs out of IDs first, then just add each of B's remaining object IDs to
the end of A's schedule with a ; operator. }
else (B's begin[nodeID] ≠ self)
{ identify the segment that contains the object ID which is given as the
value of B's begin[nodeID];
if it is identified, then from the identified object ID do the same adding
operation as in the first part of step 4 (the If clause), i.e., the synchronization
schedule is created at this point; }
else (it fails to be identified - this means that a user tries to bring a media
object from outside the current object hierarchy and to compose it
with the current objects) { error. }

5. Repeat steps 3, 4, and 5 until all the segments are chosen and operated on.

We can use the above algorithm to form the synchronization schedule for the
example lecture on Distributed Multimedia Systems shown in Figure 12. For
the segment on network requirements of distributed multimedia systems, the
synchronization schedule will be:

throughput requirements; delay requirements; QoS negotiation; multicasting;
synchronization; ATM; FDDI; Frame Relay.
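A rough sketch of how such a schedule string could be assembled from the
hierarchy (Python; the tree fragment, node names, and the purely sequential
traversal are our own simplification of the algorithm in [21]; the || case for
simultaneous streams is omitted):

    # Hypothetical fragment of the lecture hierarchy: each node maps to its children.
    hierarchy = {
        "Network requirements": ["QoS requirements", "Protocol features", "Access methods"],
        "QoS requirements": ["throughput requirements", "delay requirements", "QoS negotiation"],
        "Protocol features": ["multicasting", "synchronization"],
        "Access methods": ["ATM", "FDDI", "Frame Relay"],
    }

    def schedule(node):
        # Depth-first traversal joining leaf objects with the sequential operator ';'.
        children = hierarchy.get(node)
        if not children:               # a leaf object is presented as-is
            return node
        return "; ".join(schedule(c) for c in children)

    print(schedule("Network requirements"))
    # throughput requirements; delay requirements; QoS negotiation; multicasting;
    # synchronization; ATM; FDDI; Frame Relay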

Summary: The content-based synchronization model is well suited for content-based
multimedia information retrieval. The model has several advantages. It
can be used to artificially modify or create synchronization between the involved

media. For example, the asynchrony in video and audio due to the differences
in speeds of light and sound can be modified using the content-based synchro-
nization scheme. In a similar manner, the scheme can be used to create a new
synchronization specification between video and a dubbed audio track during
movie editing.

6 MULTIMEDIA SYNCHRONIZATION
AND DATABASE ASPECTS
The approaches discussed so far describe effective ways of modeling the tem-
poral requirements of an orchestrated multimedia application. The multime-
dia objects have to be logically structured in a multimedia database and the
structure should reflect the synchronization characteristics of the multimedia
presentation. Also, multimedia data has to be delivered from a storage medium
based on a predefined retrieval schedule. The size and the real-time character-
istics of the multimedia objects necessitate different storage architectures. In
this section, we describe the database schema representation and physical stor-
age representation of multimedia objects with respect to the synchronization
characteristics.

6.1 Database Schema Representation


Modeling of the synchronization characteristics of a multimedia presentation
using the OCPN, the DTPN, or the TFG approach provides a convenient visual
technique for capturing the temporal relationships. However, a conceptual
database schema is needed which preserves the semantics of the synchronization
representation model to facilitate reproduction, communication, and storage of
a multimedia presentation. Synchronization models basically identify and group
the temporally related multimedia objects into structures of increasing complexity.

In [12, 8], a hierarchical synchronization schema has been proposed for multimedia
database representation. Two types of nodes, terminal and nonterminal
nodes, have been defined in this approach. The terminal nodes in this model
indicate base multimedia objects (audio, image, text, etc.), and these nodes point
to the location of the data for presentation. The nonterminal nodes have additional
attributes defined for facilitating database access. The attributes include
timing information and node types (sequential and parallel), allowing the assembly
of multimedia objects during the presentation. The timing information

in the nonterminal node includes a time reference, playout time units, temporal
relationships and required time offsets for specific multimedia objects. Figure
13 shows the hierarchical database schema for the multimedia presentation ex-
ample described in Figure 5.
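The terminal/nonterminal structure of such a schema can be sketched with two simple record types. The field names below (media_type, data_location, node_type, playout_duration, offsets) are illustrative placeholders rather than the attribute set defined in [12, 8].

from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class TerminalNode:
    """Base multimedia object; points to the stored data used for presentation."""
    media_type: str           # "audio", "image", "text", ...
    data_location: str        # reference to where the object's data resides

@dataclass
class NonterminalNode:
    """Groups children and carries timing attributes that aid database access."""
    node_type: str            # "sequential" or "parallel"
    playout_duration: float   # playout time units for this subtree
    offsets: List[float] = field(default_factory=list)   # per-child time offsets
    children: List[Union["NonterminalNode", TerminalNode]] = field(default_factory=list)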

Figure 13 Hierarchical database schema for Figure 6.5.

6.2 Physical Storage Requirements


The physical medium used for storing multimedia objects should be able to
meet the synchronization requirements of a multimedia application, both in
terms of the storage capacity and the bounds on retrieval time. Multimedia objects
such as audio and video require very large amounts of storage space. For
example, a data rate of about 2 Mbytes/second is required for HDTV video.
Apart from the storage technology requirements, the storage architectures adopted
by the operating system also become important. The reason is that data must

be retrieved at a very high rate, e.g., 2 Mbytes/second for HDTV video objects,
from the disks.

Hence, the file system organization has to be modified for handling digital video
and audio files. The aim is to handle multiple huge files as well as simultaneous
access to different files, given the real-time constraint of data rates up to 2
Mbytes/second. This problem has been studied in detail in [18]. Most of the
existing storage architectures allow unconstrained allocation of blocks on disks.
Since there is no constraint on the separation between disk blocks storing a
chunk of digital video or audio, bounds on access and latency times cannot
be guaranteed. Contiguous allocation of blocks can guarantee continuous access,
but has the familiar disadvantage of fragmentation of useful disk space.
Constrained block allocation can help in guaranteeing bounds on access times
without encountering the above problems. For constrained allocation, factors
like the size of the blocks (granularity) and the separation between successive blocks
(scattering parameter) have to be determined to ensure guaranteed bounds
[18]. The retrieval schedule of data for multimedia objects can be affected when
the system becomes busy with other tasks. The allocation or data
placement schemes therefore also have to take into consideration contention
with other processes in the system.
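A rough way to see how the granularity and scattering parameters interact with the playback rate is the continuity check below. The disk parameters and the exact form of the test are assumptions made here for illustration; the actual bounds are derived in [18].

def placement_is_feasible(granularity_bytes, gap_time_s,
                          playback_rate=2e6,   # bytes/s consumed, e.g. HDTV video
                          disk_rate=4e6):      # assumed sustained disk transfer rate
    """Continuity check for constrained block allocation: skipping the gap
    between blocks (determined by the scattering parameter) and reading the
    next block must not take longer than playing back the current block."""
    consume_time = granularity_bytes / playback_rate
    fetch_time = granularity_bytes / disk_rate + gap_time_s
    return fetch_time <= consume_time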

Summary: Multimedia objects composing orchestrated presentations need
to be stored in database(s). The logical storage structure or the schema
should reflect the synchronization characteristics of the orchestrated presen-
tation. Also, the storage of the multimedia objects should satisfy the synchro-
nization requirements of the orchestrated presentation in terms of the bounds
on object retrieval time.

7 MULTIMEDIA SYNCHRONIZATION
AND OBJECT RETRIEVAL
SCHEDULES
Orchestrated multimedia presentations might be carried out over a computer
network thereby rendering the application distributed. In such distributed
presentations, the required multimedia objects have to be retrieved from the
server(s) and transferred over the computer network to the client. The commu-
nication network can introduce delays in transferring the required multimedia
objects. Other conditions such as congestion of a database server at a given

time and locking of the data objects by some other application also have to
be considered. The retrieval of the multimedia objects has to be made keeping
in mind the delays that might be introduced during the presentation. A
retrieval scheduling algorithm has to be designed based on the synchronization
characteristics of the orchestrated presentation, incorporating allowances for
the delays that might be encountered during the presentation.

In [12, 15], a multimedia object retrieval scheduling algorithm has been presented
based on the synchronization characteristics represented by the OCPN.
Characterizing the properties of the multimedia objects and the communication
channel, the total end-to-end delay for a packet can consist of the following
components.

• Propagation delay, D_p
• Transfer delay, D_t, proportional to the packet size
• Variable delay, D_v, a function of the end-to-end network traffic

The multimedia objects can be very large and hence can consist of many packets.
If an object consists of r packets, the end-to-end delay for the object is:

D_e = D_p + r D_t + Σ_{j=1..r} D_{v_j}

Control time T_i is defined as the skew between putting an object i onto the
communication channel and playing it out. Considering the end-to-end delay
D_e, the control time T_i should be greater than D_e. The various timing parameters,
the playout time (π), the control time (T), and the retrieval time (φ), are
as shown in Figure 14. The retrieval time for an object i (φ_i), or the object
production time at the server, is defined as φ_i = π_i − T_i.

The above retrieval schedule is for a single object in a media stream. When
multiple objects are retrieved, the timing interactions of the objects have to be
considered. An optimal schedule for multiple objects is defined as one which
minimizes their control times [12, 15]. The constraints associated with the
determination of the minimum control time for a set of objects are as follows.

• An object cannot be played out before its arrival, i.e., π_i ≥ φ_i + T_i.
• The minimum spacing between the retrieval times of successive objects, i.e.,
φ_{i−1} ≤ φ_i − T_{i−1} + D_p.

Following these constraints, an optimal retrieval schedule can be worked out.
An optimal retrieval time for the final, mth object is φ_m = π_m − T_m. The

Figure 14 Delay factors in object retrieval.

retrieval times for the other objects can be worked out backwards based on the
schedule <Pm for the final object. The scheduling algorithm given in [12, 15] is
summarized as follows.

φ_m = π_m − T_m
for i = 0 to m−2
    if (φ_{m−i} < π_{m−i−1} − D_p) then φ_{m−i−1} = φ_{m−i} − T_{m−i−1} + D_p
    else φ_{m−i−1} = π_{m−i−1} − T_{m−i−1}
end.
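The backward computation can be written directly from the two constraints. The sketch below is an illustration of the algorithm of [12, 15] under the assumption that the playout instants pi, control times T, and propagation delay Dp are given; the function and variable names are chosen here only for readability.

def retrieval_schedule(pi, T, Dp):
    """Compute retrieval (production) times phi[0..m-1], working backwards
    from the final object, so that every object arrives before its playout
    instant and successive retrievals do not overlap on the channel.

    pi : playout instants pi_1..pi_m (seconds)
    T  : control times T_1..T_m (each at least the object's end-to-end delay)
    Dp : propagation delay (seconds)
    """
    m = len(pi)
    phi = [0.0] * m
    phi[m - 1] = pi[m - 1] - T[m - 1]            # final object
    for i in range(m - 2, -1, -1):               # work backwards
        if phi[i + 1] < pi[i] - Dp:
            phi[i] = phi[i + 1] - T[i] + Dp      # spacing constraint binds
        else:
            phi[i] = pi[i] - T[i]                # playout constraint binds
    return phi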

Retrieval Schedule Example: Consider the orchestrated presentation
shown in Figure 5. We can apply the retrieval schedule to the image stream
I1 with the following assumptions. Let the channel capacity, C, be 10 Mbps. Let
the size of the image objects (assuming a color image with 1024 * 1024 pixels
and 24 bits/pixel of color information) be 24 Mbits. Let D_p, the propagation
delay, and D_v, the variable delay, be 100 milliseconds each. Then the transfer
delay D_t is 2.4 seconds, and the retrieval schedule for the I1 stream looks as
follows.

Schedule = {−2.6, 0}

This implies that the first object X1 is to be retrieved 2.6 seconds before the
start of the presentation.
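Using the retrieval_schedule sketch given earlier with these numbers, and taking the control time equal to its lower bound (the end-to-end delay), reproduces the same figure:

Dp, Dv = 0.1, 0.1               # propagation and variable delay (seconds)
Dt = 24e6 / 10e6                # 24 Mbit image over a 10 Mbps channel = 2.4 s
De = Dp + Dt + Dv               # end-to-end delay = 2.6 s
print(retrieval_schedule(pi=[0.0], T=[De], Dp=Dp))   # -> [-2.6]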

Summary: In this section, we discussed how an optimal retrieval schedule
can be designed based on the synchronization characteristics of multimedia
objects. The synchronization characteristics, as defined by the OCPN or the
DTPN, specify the object playout time instants π. The optimal retrieval

scheduling algorithm works on the fact that these playout time instants are to
be satisfied given the end-to-end delay D_e involved in the transfer of multimedia
objects. In the TFG model, the playout instants (π) for the multimedia
objects are described in a fuzzy or relative manner. These fuzzy temporal relations
have to be converted into absolute schedules for retrieving the objects
from a server. In [20], retrieval scheduling algorithms have been designed for
the TFG specification of the synchronization characteristics.

8 MULTIMEDIA SYNCHRONIZATION
AND COMMUNICATION
REQUIREMENTS
In a distributed orchestrated multimedia presentation, the communication network
provides support for transfer of data in real-time. The synchronization
characteristics of an orchestrated presentation define a predictable sequence of
events that happen in the time domain. This sequence of events can be used by
the network service provider to understand the behavior of the orchestrated application
in terms of the Quality of Service (QoS) requirements and the network
load that might be generated [17, 22, 23]. The QoS is characterized by a set of
parameters such as the end-to-end throughput, the end-to-end delay, and packet
loss probabilities. The offered network load is characterized by the traffic that
might be generated by the application.

8.1 QoS Requirements of an Orchestrated Presentation

In a distributed orchestrated presentation, it is essential that an object is available
at the client before its presentation. In general, the client can adopt two
different strategies for buffering the multimedia objects: a minimum buffering
strategy (B1) and a maximum buffering strategy (B2). The B1 strategy tries to
minimize the buffer requirements by buffering at most one object or one frame, as
the case may be, before its presentation. In the B2 strategy, the approach is to
buffer up to a certain limit (say Bmax) for each media stream on the client side,
before the presentation. This implies that more than one object in a media
stream will be buffered by the client during the presentation sequence.

The B1 strategy minimizes the buffer space requirements of the client, and the
QoS derived based on this strategy gives the preferred values for the client.
The QoS requirements derived based on the B2 strategy specify the acceptable
values for the client. In a distributed orchestrated presentation, multimedia
objects composing the presentation are retrieved from the server. The
retrieval of objects does not impose any real-time demands on the network
service provider. Hence, for orchestrated presentations, it is sufficient if the
network service provider guarantees the required throughput.

Computation of Preferred Throughput:

For objects such as still images, we can determine the average throughput required
by the application with the B1 buffering strategy, assuming that at most
one object in each multimedia stream is buffered by the application on the
client system before its presentation. For objects such as video, which consist
of a set of frames to be presented at regular intervals, an application following
the B1 strategy can buffer at most one frame of the video object. Let us
consider a multimedia stream with an object O_i at a playout time instant t_i,
and let the object size be Z_{O_i}. Since only one object is assumed to be buffered
before the playout time instant for the stream, the retrieval of the object can be
started only after the immediately preceding playout time instant t_{i−1} and has
to be completed before t_i. Hence, the preferred throughput C_app^pref is [23]:

C_app^pref[t_{i−1}, t_i] ≥ Z_{O_i} / (t_i − t_{i−1})

Computation of Acceptable Throughput:

For calculating the acceptable QoS parameters, we should find the minimum
values that are required by the application. A minimum value of throughput
(C_app^acc) implies that the application will be following the B2 buffering strategy
of buffering more than one object (in the stream for which the throughput is being
calculated) before its presentation. When more than one object in a stream is
buffered by the application, the start of the presentation is delayed by the time
required for retrieving Bmax bits using the acceptable throughput C_app^acc. If
Bmax is the maximum buffer space available and Σ_{i=1..n} Z_{O_i} is the total size of
all objects to be retrieved, then the acceptable throughput is [23]:

C_app^acc ≥ (Σ_{i=1..n} Z_{O_i} − Bmax) / (t_final − t_initial)

QoS Computation - An Example

Let us consider the orchestrated presentation example discussed in Figure 5.
In the video stream V in Figure 5, we can assume a uniform frame rate of
10 frames/second and a frame size of 2 Mbits for the objects V1, V2, ..., V4.
Let us assume a uniform time interval of 10 seconds for the presentation of the
video objects (V1, V2, ...). The preferred QoS requirement of the application
is then 20 Mbps. For calculating the acceptable QoS, let us assume the maximum
buffer space (Bmax) available is 200 Mbits. The acceptable QoS will then be 15
Mbps. The preferred and acceptable QoS are shown in Figure 15. The preferred
QoS values are required from the start of the presentation (with sufficient lead time
for retrieving the first object) till the end of the presentation. The acceptable QoS
values are required from a time τ seconds before the start of the presentation,
for retrieving Bmax bits (τ = Bmax/C_app^acc). In this example, τ is 13.33 seconds.
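The two throughput figures and the prefetch lead time can be checked numerically under the stated assumptions (four 10-second video objects, 10 frames/second, 2 Mbit frames, Bmax of 200 Mbits); the fragment below simply restates the formulas of this section.

frame_bits, frame_period = 2e6, 0.1              # 2 Mbit frames, 10 frames/second
n_objects, obj_duration = 4, 10.0                # V1..V4, 10 seconds each

# Preferred throughput (B1: at most one frame buffered ahead of its playout)
c_pref = frame_bits / frame_period               # 20 Mbps

# Acceptable throughput (B2: up to Bmax bits buffered before the start)
Bmax = 200e6
total_bits = n_objects * (obj_duration / frame_period) * frame_bits   # 800 Mbits
presentation_time = n_objects * obj_duration                          # 40 seconds
c_acc = (total_bits - Bmax) / presentation_time                       # 15 Mbps
tau = Bmax / c_acc                                                     # 13.33 seconds
print(c_pref / 1e6, c_acc / 1e6, tau)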

Figure 15 Throughput (preferred and acceptable) for stream V in Figure 6.5.

8.2 Traffic Source Modeling


The synchronization characteristics of an orchestrated presentation describe a
sequence of objects, with associated durations of presentation, to be presented
at different time instants. This sequence gives an implicit description of the
traffic associated with the orchestrated presentation [23]. However, the traffic
generated by an orchestrated presentation server also depends on the object
retrieval schedule adopted by the client going through the presentation. This
retrieval schedule basically determines the time instant(s) at which the client
wants to receive the object(s). This schedule depends on the buffering that
can be done at the client side. With the minimum buffering strategy (B1),
the delivery time of an object (or a frame, in the case of video objects) is the
presentation time of the previous object. As an example, object O_i will be
delivered from the server at t_{i−1}, just after the presentation time instant t_{i−1} of
the object O_{i−1}. With the maximum buffering strategy (B2), the delivery schedule
at the server is such that Bmax bits will be delivered from the server every τ
seconds, where τ = Bmax/C_app^acc, and C_app^acc is the minimum acceptable throughput
that should be offered by the network for a multimedia stream with the B2
buffering strategy.

Traffic Source Model Example: Consider the orchestrated presentation
example shown in Figure 5. The assumptions made in Section 8.1 for computing
the QoS requirements of the video stream V can be used for deriving
the stream's traffic characteristics. In addition, let the network packet size, p,
be 100 bits. With the B1 strategy, the minimum throughput required from the
network (C_app^pref) will be 20 Mbps. Figure 16 shows the traffic generated for the
video stream using the minimum buffering strategy B1. Delivery of the first
frame will start 0.1 seconds before the start of the presentation. Each frame is of
size 2 * 10^6 bits, and hence 2 * 10^4 packets (frame size / p) will be generated every
0.1 seconds. For the B2 strategy, let us assume a maximum buffer space (Bmax)
of 200 Mbits for the video stream. The minimum throughput needed from the
network (C_app^acc) will be 15 Mbps. With these assumptions, 2 * 10^6 packets
(Bmax/p) will be generated by the server every τ = Bmax/C_app^acc = 13.33 seconds.
The initial chunk will be delivered 13.33 seconds before the presentation. Figure
17 shows the traffic generated for the video stream V in Figure 5 with the
maximum buffering strategy B2. It should be noted that the origin of the
X-axis corresponds to the presentation start time.
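The two delivery patterns can be sketched as lists of (time, packets) pairs. Treating each frame (B1) or each Bmax chunk (B2) as generated instantaneously at its delivery deadline is a simplifying assumption made here for illustration.

def b1_traffic(n_frames, frame_bits=2e6, frame_period=0.1, pkt_bits=100):
    """B1: each frame is delivered one frame period before its playout."""
    pkts = int(frame_bits // pkt_bits)                    # 2 * 10^4 packets per frame
    return [(i * frame_period - frame_period, pkts) for i in range(n_frames)]

def b2_traffic(total_bits, Bmax=200e6, c_acc=15e6, pkt_bits=100):
    """B2: Bmax bits are delivered every tau = Bmax / c_acc seconds,
    starting tau seconds before the presentation."""
    tau = Bmax / c_acc                                    # 13.33 seconds
    pkts = int(Bmax // pkt_bits)                          # 2 * 10^6 packets per chunk
    n_chunks = int(-(-total_bits // Bmax))                # ceiling division
    return [(k * tau - tau, pkts) for k in range(n_chunks)]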

Figure 16 Traffic for stream V in Figure 6.5, with B1 strategy.

9 SUMMARY AND CONCLUSION


Distributed orchestrated multimedia presentation applications, such as multimedia
lecture presentations, multimedia databases, and on-demand multimedia
servers, are becoming increasingly popular. An orchestrated application has
a synchronization characteristic that specifies the time instants at which the
presentation of objects in the media streams must synchronize. This synchronization
characteristic is specified in the form of pre-defined temporal relations
and is stored along with the multimedia database. In this chapter, we presented
the problem of specifying the synchronization characteristics of an orchestrated
multimedia presentation. We discussed the methodologies that have
been proposed in the literature for describing the synchronization characteristics.
The methodologies use a concurrent programming language approach, a
graphical approach, or a content-based approach. We then studied the database
representation of the synchronization characteristics.

The synchronization characteristics of the presentation describe a predictable
pattern of events that happen in the time domain. In a distributed orchestrated
presentation, this pattern of events can be used to determine the object
retrieval schedule that is to be adopted by a client. It can also be used by the

Figure 17 Traffic for stream V in Figure 6.5, with B2 strategy.

network service provider to understand the QoS requirements of orchestrated
applications and their network traffic generation characteristics. In [22, 23], an
approach based on the Probabilistic Attributed Context Free Grammar (PACFG)
has been proposed for deriving the QoS requirements and the network traffic
generation characteristics of an orchestrated multimedia presentation based on its
synchronization characteristics. Based on these studies, protocols have been
specified for multimedia synchronization [12] and for handling QoS negotiations
in orchestrated presentations [23]. The traffic source model of an orchestrated
presentation, based on its synchronization characteristics, can be used
in simulation studies for assessment of network performance.

REFERENCES
[1] J.L. Peterson, Petri Net Theory and The Modeling of Systems, Prentice-
Hall Inc., 1981.

[2] W. Reisig, Petri Nets: An Introduction, Springer-Verlag Publication, 1982.



[3] K.S. Fu, Syntactic Pattern Recognition and Applications, Prentice-Hall
Inc., Englewood Cliffs, New Jersey, 1982.

[4] J.E. Coolahan, Jr., and N. Roussopoulos, 'Timing Requirements for Time-
Driven Systems Using Augmented Petri Nets', IEEE Trans. Software Eng.,
vol. SE-9, Sept. 1983, pp. 603-616.
[5] J.F. Allen, 'Maintaining Knowledge about Temporal Intervals', Commu-
nications of the ACM, November 1983, vol. 26, no. 11, pp. 832-843.
[6] W.M. Zuberek, 'M-Timed Petri nets, Priorities, Pre-Emptions and Perfor-
mance Evaluation of Systems', Advances in Petri nets 1985, Lecture Notes
in Computer Science (LNCS 222), Springer-Verlag, 1985.
[7] P.D. Stotts and R. Furuta, 'Petri-Net-Based Hypertext: Document Structure
With Browsing Semantics', ACM Trans. on Office Information Systems,
vol. 7, no. 1, Jan 1989, pp. 3-29.
[8] T.D.C. Little and A. Ghafoor, 'Synchronization and Storage Models for
Multimedia Objects', IEEE Journal on Selected Areas in Communications,
vol. 8, no. 3, April 1990, pp. 413-427.
[9] R. Steinmetz, 'Synchronization Properties in Multimedia Systems', IEEE
J. on Selected Areas of Communication, vol. 8, no. 3, April 1990, pp. 401-
412.
[10] P.D. Stotts and R. Furuta, 'Temporal Hyperprogramming', Journal of Visual
Languages and Computing, Sept. 1990, pp. 237-253.
[11] S. Gibbs, 'Composite Multimedia and Active Objects', Proc. OOPSLA
'91, pp. 97-112.
[12] T.D.C. Little, Synchronization For Distributed Multimedia Database Sys-
tems, PhD Dissertation, Syracuse University, August 1991.
[13] T.D.C. Little, A. Ghafoor, C.Y.R. Yen, C.S. Chen and P.B. Berra, 'Mul-
timedia Synchronization', IEEE Data Engineering Bulletin, vol. 14, no. 3,
September 1991, pp. 26-35.
[14] J. Stefani, L. Hazard and F. Horn, 'Computational model for distributed
multimedia applications based on a synchronous programming language',
Computer Communication, Butterworth-Heinemann Ltd., vol. 15, no. 2,
March 1992, pp. 114-128.
[15] T.D.C. Little and A. Ghafoor, 'Scheduling of Bandwidth-Constrained
MultiMedia Traffic', Computer Communication, Butterworth-Heinemann,
July/August 1992, pp. 381-388.

[16] B. Prabhakaran and S.V. Raghavan, 'Synchronization Models For Multimedia
Presentation With User Participation', ACM/Springer-Verlag Journal
of Multimedia Systems, vol. 2, no. 2, August 1994, pp. 53-62. Also in
the Proceedings of the First ACM Conference on Multimedia Systems,
Anaheim, California, August 1993, pp. 157-166.

[17] S.V. Raghavan, B. Prabhakaran and S.K. Tripathi, 'Quality of Service
Negotiation For Orchestrated MultiMedia Presentation', Proceedings of
High Performance Networking Conference HPN 94, Grenoble, France,
June 1994, pp. 217-238. Also available as Technical Report CS-TR-3167,
UMIACS-TR-93-113, University of Maryland, College Park, USA, October
1993.
[18] H.M. Vin and P. Venkat Rangan, 'Designing a Multi-User HDTV Storage
Server', IEEE Journal on Selected Areas in Communications, January 1993.

[19] L. Li, A. Karmouch and N.D. Georganas, 'Multimedia Teleorchestra With
Independent Sources: Part 1 - Temporal Modeling of Collaborative Multimedia
Scenarios', ACM/Springer-Verlag Journal of Multimedia Systems,
vol. 1, no. 4, February 1994, pp. 143-153.

[20] L. Li, A. Karmouch and N.D. Georganas, 'Multimedia Teleorchestra
With Independent Sources: Part 2 - Synchronization Algorithms',
ACM/Springer-Verlag Journal of Multimedia Systems, vol. 1, no. 4, February
1994, pp. 153-165.
[21] Dong-Yong Oh, Arun Katkere, Srihari Sampathkumar, P. Venkat Rangan
and Ramesh Jain, 'Content-based Inter-media Synchronization', Proc. of
SPIE '95, High Speed Networking and Multimedia Systems II Conference,
San Jose, CA, February 1995.

[22] S.V. Raghavan, B. Prabhakaran and S.K. Tripathi, 'Synchronization Representation
and Traffic Source Modeling in Orchestrated Presentation', to
be published in IEEE Journal on Selected Areas in Communications, special
issue on Multimedia Synchronization.

[23] S.V. Raghavan, B. Prabhakaran and S.K. Tripathi, 'Handling QoS Negotiations
In Orchestrated Multimedia Presentation', to be published in the
Journal of High Speed Networking.
7
INFOSCOPES: MULTIMEDIA
INFORMATION SYSTEMS
Ramesh Jain
Virage Inc.,
San Diego, CA 92121 and

Electrical and Computer Engineering,


University of California at San Diego,
La Jolla, CA 92093-0407

ABSTRACT
Infoscopes will be the microscopes and telescopes of the information systems of the future.
The emergence of information highways and multimedia computing has resulted
in exponential growth in the availability of multimedia data. Most information in
computers used to be alphanumeric. Increasingly, information has been appearing in
graphic, image, audio, and video forms. Many approaches are being proposed for storing,
retrieving, assimilating, harvesting, and prospecting information from disparate
data sources. Infoscopes will allow users to access information independent of the locations
and types of data sources and will provide a unified picture of information to
a user. Because they represent information at different levels of abstraction,
these systems must recover and assimilate information from disparate sources. In this
chapter, we discuss requirements of these emerging information systems. We discuss
basic architecture and data models for these systems. Finally, we briefly present a
few examples of early infoscopes.

1 INTRODUCTION
Most information on computers used to be alphanumeric. Technological changes
have brought a major change in the nature of computer-based information sys-
tems in the last few years. Increasingly the information has started appearing
in graphic, image, audio, and video forms. The increased capability of com-
puters to deal with multimedia is resulting in an exponential growth in the
availability of data in every form. The World Wide Web clearly demonstrates
what advances in communications, networking, and computing power can do.

Just a few years ago, it was difficult to imagine that net surfing would become
so common and it would be possible to navigate through so many places so
easily while sitting at your own desk! Now the WWW is the fastest growing
aspect of computing. It is clear that the WWW has become a model of modern
information systems.

We commonly mistake data for information. Information starts with data, but
it must be recovered. Data is NOT information; it is a source of information.
Data represents facts or observations. Data comes in many forms. The form,
or type, of data depends on the source used to acquire facts or observations.
Text, images, video, and sound, are all examples of data. Information is task
dependent and is derived from the data in a particular context using knowledge.
Multimedia computing is the ability of computers to deal with data of many
disparate types.

Better tools to produce and manage data, combined with the human desire to use
information, have resulted in a tremendous data explosion. This data explosion
has resulted in high information anxiety in modern society. In most cases,
including while surfing on the WWW, we suffer from data overload and become
confused, disoriented, and inefficient. Commonly people think that they have
information overload; what they really have is data overload. When there is
too much data, the human cognitive processes to recover information out of
that ocean of data become overloaded. In the last decade, advanced techniques
that help in production, communication, storage, and even display of data have
seen significant advances, but progress in techniques for recovering information
out of data has been slow.

A picture is worth a thousand words. This statement is true for all forms
of data. The information extracted from the data depends on the observer
and the context in addition to the data itself. The same data can provide
different, sometimes conflicting, information. Current database systems have
mechanisms that result in very rigid semantics. The semantics in current
database systems come from the database designers and the users. In both
cases, the tools used in current database systems to associate semantics with the
data are more or less fixed at the time of database design. Such databases were
satisfactory in the early days of the information revolution. Now tools must be
developed to provide a rich semantic environment for users.

It would be impossible to cope with the explosion of multimedia data, unless the
data is organized in such a way that we can retrieve the information rapidly on
demand. A similar situation occurred for numeric and other structured data,
and led to the creation of computerized database management systems. In

databases, large amounts of data are organized into fields and key fields are used
to index the databases, making searches very efficient. These database systems
have revolutionized modern society. These systems, however, are limited by the
fact that they work well only with numeric data and short alphanumeric strings.
Since so much information is in non-alphanumeric form (such as images, video,
and speech), as a natural extension to the ideas in databases, researchers started
exploring the design and implementation of image databases. But creation of
mere image repositories, as practised in current image databases, is of little
value unless there are methods for rapid retrieval of images based on their
content, ideally with an efficiency that we find in today's databases. We should
be able to search image databases with image-based queries, in addition to
alphanumeric queries. The fundamental problem is that images, video, and
other similar data differ from numeric data and text in format, and hence
they require totally different techniques of organization, indexing, and query
processing. We need to consider the issues in visual information management,
rather than simply extending the existing database technology to deal with
images. We must treat images as one of the central sources of information
rather than as an appendix to the main database.

A few researchers have addressed problems in image databases, some of them


very persistently [12]. Most of these efforts in image databases, however, have
focused either on only a small aspect of the problem, such as data structures or
pictorial queries, or on a very narrow application, such as databases recogniz-
ing similar faces in controlled environments. Other researchers have developed
image processing shells which use image databases. Clearly, visual informa-
tion management systems encompass not only central aspects of databases,
but aspects of image processing and image understanding, advanced inter-
faces, knowledge-based systems, compression and decompression of images,
and object-oriented systems. Moreover, memory management and organiza-
tion issues start becoming much more serious than in the largest alphanumeric
databases. In failing to address most of these topics, one may either address
only theoretical issues, or may work in a microcosm that will, at best, be ex-
tremely narrow in its utility and extensibility.

In this paper, we discuss some of these issues and then present the basic architecture,
data model, and interaction environment for the information systems
that will be essential in the coming information age.¹ We call these systems
infoscopes. These systems allow a closer and more detailed view of the data to an
observer who wants to extract information. Like telescopes, microscopes, and now
¹ We demonstrate these in the context of visual information, but they apply to other forms of data also.

scanning probe microscopes in their application areas, infoscopes are becoming


essential tools in the information society.

2 INFORMATION AND DATA


In dealing with information in multimedia systems, a clear distinction between
data and information is essential. Multimedia information systems must allow
access to information at different levels of abstraction, from data to high-level
concepts. This is possible by dealing with the semantics explicitly. In this
section we discuss semantics in multimedia information systems.

2.1 Semantics in Multimedia


The power of multimedia systems originates in the fact that disparate infor-
mation can be represented as a bit stream. This is a big advantage because
every form of representation, from video to text, can be stored, processed, and
communicated using the same device: a computer. This uniformity of represen-
tation is the main reason behind the popularity and excitement of multimedia.
All computers in the next few years will handle multimedia. This trend is obvious.
Everyone, even those who were computerphobes, is now being seduced
by multimedia computers and is using them in many interesting applications.
By reducing all forms of information to bit streams, we can start focusing
on information rather than the sensor used to acquire it and the communica-
tion channel used to transport it. We can also use an appropriate presentation
method to supply information to a user.

A representation is always task dependent. The bit stream is good for storage
and communication of data because our present day computers and communi-
cation devices are truly adept at dealing with bits. A bit stream without the
knowledge of its structure is just a bit stream, however. It is useless to hu-
mans. For our use, the bit stream must be converted to a form that one of our
senses can understand. A picture displayed as a sequence of bits, a sequence
of digits, or a wave form is usually useless to us. It must be displayed as a
picture. The same is true with any other form of information. Most of us have
'seen' sound waves; few, if any, can make sense out of those waves that, in an
appropriate form, are obvious to most. Thus, to make sense of a data stream,
it should be presented in the proper format on a proper device. To accomplish
this, information or metadata related to the source and format must be tagged

with every bit stream. For interfacing with humans, the computer must know how
to send a stream to the proper device and adjust the parameters of the device
for 'impedance matching' with human senses. In other words, with every bit
stream there should be information that enables the computer to interpret the
bit stream in the context of interfacing with humans.

Abstraction of information plays a very important role in understanding it. In


its raw form, data acquired using a sensor is converted to abstract symbolic
form for understanding it. The abstraction process is complex and task de-
pendent. Usually several levels of abstractions are involved in this process.
Commonly the representations of data closer to sensor level are called low-level
and the higher symbolic levels are called higher-level. Alphanumeric informa-
tion is high level information. The amount of data decreases with an increasing
level of abstraction. What is usually not obvious is that each level of abstrac-
tion involves models and models are always task and domain dependent. The
process of abstraction from sensory data to alphanumeric, or any other highly
symbolic form, is the process of systematically introducing semantics.

Most information in computers used to be alphanumeric and was already at a


higher level. In multimedia systems, different types of information are used.
If the role of the computer is to add minimum value to the information and
act mostly as a communication channel, then computers can manage it with
little or no understanding of semantics. To combine and compare two informa-
tion sources, it is essential that both information sources are understood, and
this understanding should be at a level where we can compare and contrast
information independent of the original representation medium. This is true
for us, and if we want computers to seamlessly deal with disparate information
sources, then we will have to do this for computers also. Bit strings are too low
level to allow us an understanding of the content they carry. Semantics must
be associated with them to make them more understandable.

If we want to use multimedia computers more than just as sophisticated com-


munication channels, then computers should have understanding of each form
of information. Strictly syntactic knowledge about a video, audio, graphic, or
any other form of data will make them just a communication channel. Most
current applications of multimedia require very little semantic processing of
information. The most attractive feature of current multimedia systems is that
even with very little semantic information they make different forms of infor-
mation available in one environment. This facility is an enormous step in the
right direction. By bringing all this information in computing environments,
now we are developing systems that allow us to deal with this information in a
very flexible way.

Images, video, audio, and other information representations involve large volumes
of data. Technology is progressing rapidly to deal with the required storage and
bandwidth problems. These information sources represent low-level informa-
tion. When considered as a bit stream with the meta information, the explicit
semantic information content in these sources is very low. This poses a serious
problem in accessing these information sources. Humans are very efficient in
abstracting information and then interacting with humans and other devices
at a high level. This allows high bandwidth interactions among humans and
between human and machines. Multimedia systems currently have this seman-
tic bottleneck. Traditionally, most attention has been focused on storage and
communication of multimedia information. We are soon reaching the point
where multimedia systems will add a great deal to the data overload on peo-
ple. Techniques must be developed to add semantics to the data acquired from
disparate sources in disparate forms.

2.2 Semantics in Databases


Databases form the core of current information systems (Figure 1). Databases,
as the name indicates, are designed to organize large volumes of data so that the
required data, or information, can be retrieved very rapidly. In databases, the
microcosm represented in the computer is relatively simple. It is assumed that
the data is at a very high level and is already available in abstract form, such as
a name or a social security number, and it can be represented in alphanumeric
form. When multimedia images are also part of the information system, images
can be considered as blobs or similar data items, but that will be a very limited
use of the potential they provide. Images contain a wealth of information.
Efforts to assign keywords to images and then store those keywords in the
database are very limited. Keywords often provide a better description of the
person who assigns them than they do the image. Images are a projection of a
real world and most people refer to them as if they may be referring to the real
world. An information system must be able to distinguish between the real
world and its projections. As we will see later, to make this distinction, the
system will have to use a powerful knowledge base.

A database can only be useful if it provides required information. The recovery


of information from data requires inclusion of semantics, either explicitly or im-
plicitly. Explicit semantics can be introduced by declarative knowledge repre-
sentation techniques. Such techniques are being actively explored in databases.
One may also specify procedures to assign semantics to data. Such procedures
in turn may use models that may be represented in declarative forms. The

Figure 1 A conventional database contains an abstraction of the real world
in it. An information system dealing with multimedia data must add one more
level of semantics by adding the projection of the world in addition to the high-level
abstractions.

important point is that if information is to be extracted from a database, then


there has to be a semantic component in a system. This semantic component
may require declarative as well as procedural representations.

Commonly there are two different places in database systems where semantics
is introduced. While designing the database, the designer includes semantics
in the form of relationships and attributes. These relationships form the first
level of semantics in databases. The second, and possibly more important, level
of semantics is provided by the user, or the application programmer. Based on
the knowledge of the semantics introduced by the designer and the knowledge
of the domain, a user or application programmer develops procedures to get
the information required from the database.

A user is an important component of a database system and plays a key role


by providing a major component of semantics, both in declarative as well as
procedural form. The success of a database is determined by the combination
of these two semantic agents, the designer and the user. The designer is supposed
to anticipate all important queries and use these queries in defining the
relationships among entities and their attributes in the database. The missing
semantics are provided by the user.

This scheme of introducing semantics is satisfactory in simple business related


applications. As the complexity of the microcosm modeled in databases in-
creases, the difficulties in capturing semantics using limited data models, such
as a relational model, become difficult. Another difficulty arises due to the fact
that in many applications, a database is used by people with little, if any,
knowledge of the domain. To add to the complexity, the type of information
is non-alphanumeric. The entities in the databases have become so complex
that relational approaches are becoming increasingly difficult to use in these
applications.

Inspired by research in knowledge-based systems, databases have started intro-


ducing more declarative knowledge. In fact, databases have started becoming
more and more like knowledge bases. For dealing with multimedia information
systems, explicit representation of knowledge is essential. Moreover, this knowl-
edge should not be an addendum to the system; it should be an integral part of
the system. Traditional databases abstracted a microcosm into computer data
structures. In multimedia databases, an extra level of complexity is added to
the problem. The multimedia data, like images, is itself an abstraction of the
real world. When refering to data in an image, we often refer to them as if
they are entities in the real world, which they are not. An image of an object is
not the real object. In multimedia information systems, a user interchangeably

refers to entities in images as if they were real entities. This means that these
information systems must deal with one added level of data abstraction, and
this must be done transparently to users.

3 OPERATIONS IN INFOSCOPES
An information system should allow storage, communication, organization, processing,
and envisioning of information. It should facilitate natural interactions,
which include multimedia input and output devices and the use of high-level domain
knowledge by a user.

Domain knowledge should be so much a part of the system that a user
feels the system is an intelligent aide. A user should be able to articulate
queries using terminology commonly used in his field and should not have to
worry about the organization of information in the system.

The system should provide powerful navigation tools. Unlike early database users,
users of infoscopes will not articulate their queries in well-defined, crisp languages
like SQL. These users will use vague natural language, and that should
be understood by the system to let a user navigate through the system. The
nature of queries will be fuzzy not due to the laziness of the user, but due to the
nature of the information and the size of the database. A general query environment
will be like the one shown in Figure 2. A user looking for certain information,
say about a person whom he vaguely recalls, will go to an infoscope and specify
whatever important things he remembers about the person. This specification
may be that she has big eyes, a wide mouth, long hair, and a small forehead. Based
on this information, candidate people's pictures may be shown. The user can
then select the closest person and modify his query by modifying the photo,
either by specifying features or by using graphical and image editing tools. The
refined query is sent to the system to provide new candidates to satisfy
it. Thus, a query is incrementally formulated starting with the original
vague idea. This process will terminate when the user is satisfied.

Due to the nature of data, several levels of abstraction in the data, and temporal
changes in the data, the types and nature of interactions in such systems will
be richer than those in a database or image processing system. We loosely refer
to all interactions initiated by a user as queries. The types of queries in such
systems can be defined in the following classes:

Figure 2 This figure shows that queries in infoscopes will be incremental
in nature. These queries will facilitate navigation and browsing of data.

1. Search. Search queries in some cases may be similar to traditional database


systems. As discussed later, meta features are used to provide some in-
formation about images. In many applications, queries can be formulated
to search specific images using only meta data. These queries can be an-
swered, in most cases, using conventional database queries. In fact, many
early image databases and browsers were designed using this approach.
In more difficult cases, one may be interested in image attributes. To
answer these queries, one may have to use visual attributes of images to

search. A major difference in these queries will be the fact that similarity
will become a central operation, rather than conventional matching. Tech-
niques to evaluate similarity are an active research topic in many fields
of science and technology [22]. Many approaches have been proposed to
compare several attributes to evaluate similarity of two objects. In addi-
tion to the decision on what attributes to select, a very difficult decision
is how to combine those attributes. Methods to combine attributes are
domain dependent and subjective. It is clear, however, that in dealing
with images, and similar data sets, similarity rather than matching will be
a key function in searching.

2. Browse. A user may have a vague idea about the attributes of an entity,
relationships among entities in an image, or overall impression of an image.
Such ideas are formed due to the overall appearance of the image rather
than very specific objects and relations among them. In such cases, the user
may be interested in browsing the database based on an overall impression
or appearance of images, rather than searching for a specific entity. The
system should allow formulation of fuzzy queries to browse through the
database. In browsing mode, there is no specific entity that a user is
looking for. The system should provide datasets that are representative
of all data in the system. The system should also keep track of what has
been shown to the user. Some mechanism to judge the interest of the user in
the data displayed should be developed, and this interest should be logged
to determine what to display next.
3. Construct Solutions or Design. Most databases and information sys-
tems are designed assuming that a user will articulate his queries in one
attempt. Many applications require an environment in which a user can
incrementally introduce constraints and use the sequence of constraints to
articulate his query. Each constraint reduces the search space and the user
can browse through this reduced space to decide what constraint to intro-
duce next. The system interface should facilitate sequential introduction
of constraints and evaluation of results of each constraint introduction.
These constraints may be symbolic or pictorial. In some cases, user may
want to select an image from the database and modify it to specify his
requirements. Many design problems, including artistic design, are based
on constructing solutions by introducing constraints sequentially.

4. Statistical Information. Images may be retrieved based on some statistical
aspect of their contents. Simple queries like "show me images containing
several animals" or queries like "show me images of densely populated areas
in the United States" fall in this class. Such queries are very common
in scientific and other exploratory analysis studies. Some of these queries

may be based on image attributes like pixel colors or edge density. It is


also possible that these queries may require attributes of objects. The sys-
tem should either store these attributes or should provide mechanisms to
compute them rapidly.

5. Temporal Events. In video sequences one may want to retrieve images


based on some events taking place in the sequence. A typical query of this
type may be show me all sequences in which player X was blocked by player
Y. These queries will require temporal analysis of video sequences in terms
of the events of interest. Some primitive spatio-temporal features must
be computed and stored in the database to answer questions concerning
events of interest to users.
Abstractions in spatio-temporal space are not yet understood well enough
to automatically extract them from video sequences. Though some tech-
niques have been developed to represent relative time ordering of two
events, representations for abstraction of events need to be developed to
allow users to articulate questions related to temporal events.

3.1 Relevance Feedback

In text retrieval, a user is allowed to provide feedback to the system by evaluating
the responses of the system. This feedback helps the system select appropriate
parameters to satisfy user needs. Infoscopes use the concept of relevance feedback
in many different forms. During browsing, this concept is used to steer the
direction of browsing. Incremental search also allows a user to provide feedback
to the system. In many cases this may be implemented in the form of a query
refinement without any parameter adjustment in similarity measurements. It
is possible to implement learning in this environment. We are not aware of
any system that has implemented parameter adjustments using learning for an
image or video database system.
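A minimal form of this feedback loop can be sketched as follows. The flat feature-vector representation, the distance-based similarity, and the way the query is nudged toward the example the user selects are all assumptions made for illustration, not the design of any particular system.

import numpy as np

def search(query_vec, features, k=5):
    """Rank stored feature vectors by closeness to the query (smallest distance first)."""
    dists = np.linalg.norm(features - query_vec, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]

def refine(query_vec, chosen_vec, alpha=0.5):
    """Relevance feedback: move the query toward the example the user picked."""
    return (1 - alpha) * query_vec + alpha * chosen_vec

# Incremental loop: search, let the user pick the closest candidate, refine, repeat.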

4 INFOSCOPE ARCHITECTURE
As is clear from the queries and the nature of the data and information, info-
scopes must combine features of databases, image understanding systems, and
knowledge-based systems. The interfaces for these systems will require careful
considerations and will depend on the type of data and queries. In many cases,
information from multiple disparate sources must be combined to synthesize

the answer and then adequate visualization methods must be used to present
the answer.

Figure 3 High-level architecture of an infoscope, comprising an interactive query module, a data (image) processing module, and a knowledge module. This architecture shows the importance of the feature database and knowledge base in infoscopes.

Considering the needs of infoscopes, we believe that the high-level architecture
for these systems should have four explicit modules. The system architecture,
shown in Figure 3, provides the necessary functionality for an infoscope. This
will allow different applications to be developed and implemented in a consistent
way. This architecture and some of the specific operations discussed in
the following sections are strongly motivated by the data model, which allows
representing information at several levels of abstraction to support interactions
with infoscopes at different levels. The VIMSYS data model [7] described a

hierarchical representation of the data using various levels of semantic interpretation
that may satisfy the needs of infoscopes. This data model is shown in
Figure 4. At the image representation (IR) level, the actual image data is
stored. Image objects (such as lines and regions) are extracted from the image
and stored in the image object (IO) layer, with no domain interpretation. Each
of these objects may be associated with a domain object in the DO layer. The
semantic interpretation is incorporated in these objects. The domain event
(DE) layer can then associate objects of the DO layer with each other, providing
the semantic representation of spatial or temporal relationships. This
hierarchy provides a mechanism for translating high-level semantic concepts
into content-based queries using the corresponding image data. This allows
queries based on object similarity to be generated, without requiring the user
to specify the low-level image structure and attributes of the objects. Another
very important aspect of this representation is that the first two levels, IR
and IO, are domain-independent, while the other two, DO and DE, are
domain-dependent. We do not yet know of any system in which the
domain-dependent and domain-independent components are cleanly partitioned
and implemented. We believe, however, that it is a worthwhile
target. The architecture discussed below is motivated by this desire.
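The four layers and the domain split can be expressed as a small set of record types. The attribute names below are placeholders chosen for illustration; the actual model is defined in the VIMSYS work [7].

from dataclasses import dataclass, field
from typing import List

# Domain-independent layers
@dataclass
class ImageRepresentation:        # IR: the stored image data itself
    pixels: bytes

@dataclass
class ImageObject:                # IO: lines, regions, ... with no domain interpretation
    kind: str                     # e.g. "region", "line"
    attributes: dict = field(default_factory=dict)

# Domain-dependent layers
@dataclass
class DomainObject:               # DO: an image object interpreted in the domain
    label: str                    # e.g. "hurricane", "face"
    image_object: ImageObject = None

@dataclass
class DomainEvent:                # DE: spatial or temporal relations among domain objects
    relation: str                 # e.g. "north_of", "before"
    participants: List[DomainObject] = field(default_factory=list)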

The system is comprised of four main functional components. The components
are the Database, Insertion Module, Interface, and Knowledge Base. The
first three components interact with the Knowledge Base, which contains the
domain-specific knowledge that is required for each specific application. Each
of these components is necessary to provide the desired functionality. This system
uses domain-specific knowledge at every step of processing. This knowledge
allows the system to function efficiently, and to interpret and perform the
actions requested by the user. The knowledge module ensures a separation of
the general architecture from the domain-specific knowledge. This allows the
architecture to serve as a general platform for development of other applications.
This architecture may be viewed like the architecture used in knowledge-based
systems, as it emphasizes separation of domain-dependent knowledge from
other processes in the system. The details of each block are beyond the scope
of this paper.

These ideas have been used to develop several systems for the retrieval of images
and video information in our group [7, 3, 24, 25, 10]. Here we discuss each of
the system components briefly. In the following discussion, we will discuss this
architecture in the context of images and video, but our concepts are applicable
to any kind of data.

Figure 4 A four-level data model to capture different levels of abstraction in
infoscopes is shown here. The image levels are domain-independent; the other
two levels depend on the domain.

4.1 Database
This component provides the storage mechanism for the actual data as well as
the features of the data. Features evaluated at the time of insertion as well
as meta features are stored in the database with their value and a reference
to the image containing it. Similarly, every image in the database references
the features which it contains. This corresponds to the VIMSYS data model
discussed in [7]. Data is represented in the database at different levels of the
hierarchy, which allows well-defined relationships between the actual images,
image objects, and the real-world domain objects which they represent. In

addition to storing the actual image data, segmented regions of the image
which pertain to domain objects are also identified and stored. This provides an
effective mechanism for the computation of an additional feature of a domain
object, since it is not necessary to relocate the object in the image. Issues
related to compression, storage management, indexing, and other database
aspects are relevant in the design of the database.

The database is a very important component of infoscopes. But it is only one


component and must work seamlessly with other components. The organization
of the database may require careful analysis of current systems and the needs
of multimedia information systems. It is not clear how far one can go with
conventional database management. It is commonly believed that a relational
model is very limited for such applications and one must use object-oriented
databases. It is not clear, however, whether object-oriented databases will allow
all facilities required in such systems. On considering the VIMSYS data model
it becomes clear that each plane can be captured in an object-oriented system,
but it is not clear how to represent inter-plane relationships efficiently.

4.2 Insertion module


This component allows the insertion and evaluation of images into the database. In conventional databases, the insertion component was trivial because the data was inserted directly into the system. In infoscopes, data must be analyzed and appropriate features extracted for insertion into the database. It is during the input process that values are computed for all the important features in the image. Features in the image are examined to determine which domain object they correspond to (if any). When domain objects are identified, their values are computed and stored in the database. Although it is desirable for this insertion process to be entirely automatic, there are several limitations to this approach. Many domains consist of poorly defined objects and features which are difficult to evaluate accurately. Allowing the user to alter or override the results when necessary provides much more accurate data than a fully automatic system, and much more convenience than a totally manual approach.
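
This semi-automatic insertion flow, automatic extraction followed by an optional user override, might look like the following sketch. The function, the extractor dictionary, and the review hook are hypothetical names chosen for illustration.

    # Sketch of the semi-automatic insertion process (all names illustrative).

    def insert_image(image_id, pixels, extractors, database, review=None):
        """Run the automatic extractors, allow a manual override, store the results.

        extractors: {object_name: function(pixels) -> (value, region) or None}
        database:   {image_id: {object_name: (value, region)}}
        review:     optional callback letting the user alter/override results
        """
        extracted = {}
        for name, extractor in extractors.items():
            result = extractor(pixels)      # may fail for poorly defined objects
            if result is not None:
                extracted[name] = result

        if review is not None:              # semi-automatic: user corrections
            extracted = review(image_id, extracted)

        database[image_id] = extracted
        return extracted

    # Usage with a trivial extractor and a pass-through review step.
    db = {}
    insert_image("img001", pixels=[[0]],
                 extractors={"mean_intensity":
                             lambda p: (sum(map(sum, p)) / (len(p) * len(p[0])), None)},
                 database=db,
                 review=lambda _id, feats: feats)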

Computer vision techniques are required to analyze the data and input it into the system. Unfortunately, computer vision techniques can automatically extract features only in certain limited cases. In many applications, it may be necessary to develop semiautomatic techniques for extracting features. Domain knowledge plays a very important role in defining the processes used for automatic feature extraction. Most research in computer vision has been concerned with the development of general-purpose, automatic techniques. In infoscope applications, one may require techniques tuned to particular applications. These techniques may be based on some basic image processing tools, but should use domain knowledge to define and compute domain-dependent features. Also, these techniques may be designed to assist a user rather than do everything automatically.

4.3 Interface
This module is used interactively by the user to retrieve information from the
database. A user will articulate his request using symbolic and visual tools
provided by the system. Also, the system must decide the best display meth-
ods. In these systems, the role of the system is not just to provide requested
information, but this information must be provided in the most appropriate
format.

Queries may either be completely user-specified, or generated based on the


results of previous queries. The latter type consists of feature values derived
from an actual image which has been retrieved for the user. This allows the
user to query based on the contents of other images, without forcing him to
know the actual feature values. This is very important since most feature values
will be in terms of attributes such as pixel intensities, numbers of pixels, etc.
Users must be provided tools to cut, paste, and scan visual information from
disparate sources. Also, these visual specifications should be combined with
symbolic specifications, such as keywords.

During the retrieval process, a similarity value is assigned to data which satisfies the constraints of the generated query. This value can then be used to rank the results to be displayed to the user. Several factors can be incorporated into the calculation of this similarity value. After the results of the query are displayed, the user can generate a new query by using either the contents of these images, newly specified feature values, or both.

A query is usually formulated by specifying several features and their relative weights. The system must use this weighted feature distance to judge the similarity of the example image with the target images in the database. A serious problem in evaluating similarity distances is the very subjective and application-dependent nature of similarity functions. Many studies have been performed to find the nature of similarity functions, and many alternatives exist. It is not clear which function should be used in which situation [22].
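
One common realization of such a weighted feature distance, sketched below, is a weighted sum of per-feature differences, which can then be used to rank the database. The specific formula, names, and toy data are assumptions; as noted above, the appropriate similarity function is subjective and domain dependent.

    # Sketch of a weighted feature distance and similarity ranking.

    def weighted_distance(query_features, image_features, weights):
        d = 0.0
        for name, weight in weights.items():
            d += weight * abs(query_features[name] - image_features.get(name, 0.0))
        return d

    def rank_by_similarity(query_features, database, weights, top_k=10):
        scored = [(weighted_distance(query_features, feats, weights), image_id)
                  for image_id, feats in database.items()]
        scored.sort()                      # smaller distance = more similar
        return scored[:top_k]

    db = {"img001": {"eye_width": 14.0, "face_area": 5200.0},
          "img002": {"eye_width": 11.0, "face_area": 4800.0}}
    print(rank_by_similarity({"eye_width": 13.0, "face_area": 5000.0}, db,
                             weights={"eye_width": 1.0, "face_area": 0.01}))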

4.4 Knowledge Base


This component maintains all the domain-specific information that is required for each specific application. This information is used at every step of processing in the system. Domain object descriptions are used to locate and evaluate important features in the image. These features are then stored in the database, using the specified representational scheme. During query processing, user-specified descriptions must be mapped into relevant feature values and value ranges. The knowledge module also maintains data describing how to create and alter these feature values, and how to evaluate the similarity of images and individual features. We discuss these issues in more detail in the following section.

5 KNOWLEDGE ORGANIZATION
Although specific processes and information will differ greatly between differ-
ent domains, the types of information and the tasks required will be similar in
many applications of infoscopes. The implementation of the knowledge module
provides a consistent method for incorporating this common information into
the system, while allowing a more general architecture to be developed. By
recognizing this separation of knowledge, we will develop a system architecture
which will not be limited to a specific domain.

In any application, the domain objects are those objects (possibly unrelated)
that can be considered to be a unique entity type in the real world. Each of
these objects may be composed of other objects (a composite object), or itself
make up another object. These are the elements which the system and the user
must be able to identify and manipulate in order to represent the information
portrayed in the image. For example, in a face-identification application, the most obvious object is a FACE, which is comprised of several other objects. The objects that comprise a FACE are LEFT_EYE, RIGHT_EYE, NOSE, etc. This relationship of objects corresponds to the Domain Event layer of the VIMSYS model. Regardless of the domain or the objects chosen to represent the domain, certain aspects of how to process these objects must be maintained. Domain objects and events must be related to image objects. Image objects are determined by considering what can be computed from images and what is required for the given application. The set of image objects represents the alphabet used by the system.

The functions that make use of this knowledge can be divided into the three categories listed below; a sketch combining the three follows the list:

• Segmentation and Insertion Knowledge. This involves the type of processing that is required to locate each object in an image. For each domain object, the necessary functions and attributes will be maintained to allow this object to be identified and segmented in an image. Segmentation of an image can only be performed by using some basic image objects and knowledge of domain objects. Image objects must be defined based on their computability from images. Another factor in the definition of image objects is the fact that the set of image objects represents the alphabet of the language used to define objects and their relationships. We should select this alphabet carefully. It is desirable to make this alphabet set as domain independent as possible, but efficiency reasons may dictate use of knowledge of the domain in selecting this alphabet.

• Representation Knowledge. This determines how each domain object


is represented in the database. The representation of the domain object
must be one that incorporates all the necessary information to distinguish
this object from others of its type. The representation scheme for each
object will provide a mapping between the different layers of the VIM-
SYS data model. The representation will determine which types of image
objects each domain object may be associated with. This mapping then
provides the ability to determine which images contain a certain domain
object, and which domain objects are contained in a certain image. The
chosen representation will also be used for determining the similarity of
objects, during the object retrieval process. This calculation would be
difficult if only actual image data were available to represent the objects.
These representational schemes will be stored in a common library to allow
consistent processing for any application that is implemented.

• Query Knowledge. The query knowledge is used during the processes of


formulating and submitting queries, and evaluating the results. For each
feature, initial values must be specified to determine the range of values
that are acceptable for this feature. Information must also be provided to
determine how these range values are affected by factors such as the user's
confidence in assigning a value to this feature. Each feature value may also
change in a unique manner when altered by the user. These changes will
be based on the global statistics of each individual feature. Once images
are retrieved, additional knowledge must be used to rank the images by
order of similarity. This makes use of information about the importance
of each individual feature in differentiating between objects and images.
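
One way to hold these three kinds of knowledge together, sketched with hypothetical names below, is a per-domain-object record that bundles a segmentation routine, a representation schema, and query-related defaults. The dataclass, its fields, and the example values are illustrative assumptions only.

    # Sketch of a knowledge-module entry for one domain object (illustrative only).

    from dataclasses import dataclass, field
    from typing import Callable, Dict, Any

    @dataclass
    class DomainObjectKnowledge:
        name: str
        segment: Callable[..., Any]          # segmentation and insertion knowledge
        representation: Dict[str, str]       # representation knowledge: feature -> image-object type
        default_ranges: Dict[str, tuple]     # query knowledge: acceptable value ranges
        feature_weights: Dict[str, float] = field(default_factory=dict)  # importance for ranking

    # Example entry for a face-identification application.
    left_eye = DomainObjectKnowledge(
        name="LEFT_EYE",
        segment=lambda pixels: None,                      # placeholder segmenter
        representation={"width": "region_extent", "area": "region_area"},
        default_ranges={"width": (8.0, 30.0), "area": (60.0, 600.0)},
        feature_weights={"width": 1.0, "area": 0.5},
    )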

5.1 Types of Features


Features must be extracted from input images and stored in the database. As
is well known [13], different applications may require different features. Since
the features must be stored at the time of data entry, one must carefully decide
the features that will be used in a system. We consider that all features must
be classified in one of the following classes:

1. F_m. This set contains the features which are commonly referred to as meta-features. Some of these features can be automatically acquired from the information associated with images. These features may include the size of the image, the photographer, the date taken, the resolution, and similar information. This group also contains other features that can be called user-specified; values are assigned to these features by the user at the time of insertion. Many of these features can be read by the system from the header, the filename, or other similar sources. These features cannot be directly extracted from images.

2. F_d. This set contains the features which are derived directly from the image data at the time of insertion of the images into the database. Values are automatically calculated for these features using automatic or semiautomatic functions. These features are called derived features and include features that are commonly required in answering queries. These features are stored in the database.

3. F_c. This set contains the features whose values are not calculated until they are needed. Routines must be provided to calculate these values when they become necessary. These features may be computed from the data at query time. They are called query-only features or computed features.

The first two types of features are actually stored in the database. Metadata can frequently be read from other sources, or must be entered manually. Which features should be in F_d and which should be in F_c is an engineering decision. One must study frequently asked queries and determine the frequently required features. This determines the set to which a particular feature should belong.

The system interface encourages users to formulate their queries using metadata and derived features as much as possible. It reluctantly allows the use of computed features. To access data, the system can prune the search space significantly using metadata and derived features, and then apply computed features to only this reduced set of images. This strategy allows flexibility while maintaining a reasonable response time. The system may be able to predict the wait time using the number of images from which computed features must be extracted.
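
The staged strategy just described, prune with stored metadata and derived features and then run the expensive computed features only on the survivors, might be sketched as follows. The function names, record layout, and toy predicates are assumptions for illustration.

    # Sketch of staged query evaluation over the three feature classes.

    def answer_query(db, meta_predicate, derived_predicate, computed_feature, threshold):
        """db: {image_id: {"meta": {...}, "derived": {...}, "pixels": ...}}

        Stage 1 uses only stored metadata and derived features (cheap).
        Stage 2 computes F_c features for the reduced candidate set (expensive).
        """
        candidates = [i for i, rec in db.items()
                      if meta_predicate(rec["meta"]) and derived_predicate(rec["derived"])]

        # A predicted wait time could be estimated from len(candidates) here.
        results = []
        for image_id in candidates:
            value = computed_feature(db[image_id]["pixels"])   # computed at query time
            if value >= threshold:
                results.append((image_id, value))
        return sorted(results, key=lambda r: -r[1])

    db = {"img001": {"meta": {"year": 1994}, "derived": {"mean_color": 0.4}, "pixels": [0.2, 0.9]},
          "img002": {"meta": {"year": 1990}, "derived": {"mean_color": 0.7}, "pixels": [0.1, 0.3]}}
    print(answer_query(db,
                       meta_predicate=lambda m: m["year"] > 1992,
                       derived_predicate=lambda d: d["mean_color"] < 0.6,
                       computed_feature=max,
                       threshold=0.5))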

6 INTERFACES
We discussed above the types of operations that will be required in an infoscope. Many of these operations cannot be performed conveniently using traditional interfaces. Here we discuss some general issues in designing interfaces for infoscopes. Interactions with infoscopes are likely to be multimedia. Due to the nature of the data and the several abstraction levels, it is expected that users will require multimodal interface mechanisms.

Infoscopes must provide facilities to formulate the following interactions:

• Symbolic Queries: Though much information in infoscopes is likely to be in non-alphanumeric form, it is expected that many queries will still be symbolic. In general, there will be two modes of navigation in infoscopes: locating and browsing. In the location mode, a user knows what he or she wants, and his queries will be to get precisely that information. In the location mode, many queries may be symbolic, because what is required can be articulated using metadata. Some location queries may require visual data.
Symbolic queries will deal mostly with metadata. For these queries some query language, like SQL, may be used.

• Query by Pictorial Example (QPE): A very powerful expression of a query is to point to a picture and expect that the system will show all pictures similar to the example. This approach is easy to use, but very complex to implement. The system must use certain features and some similarity measures to evaluate other pictures that are similar to the example. Effectively, the system must rank all data with respect to the example and then display the pictures that are closest to it. Interestingly, this has been a very popular approach in the image databases that are being designed [17].
In QPE, features and similarity measures must be clearly defined for use in retrieving images. Similarity judgement has been a difficult problem and continues to attract the attention of several researchers [22]. The most interesting fact about similarity measures is that they are domain dependent and very subjective. Assuming that we have identified a measure that is acceptable to a user for his or her domain, we face some interesting problems in QPE. All images are compared to the example to evaluate their similarity. This is possible in those cases where the size of the database is such that the computations can be done in reasonable time. When the size of the database grows such that it is not possible to accommodate all data in main memory and such computations become impractical, one must resort to indexing techniques.
Indexing techniques for spatial data have been developed [11, 17, 21]. These techniques are very limited when it comes to addressing the problem of similarity indexing. Techniques like TV-trees [15] are a good step in the right direction but lack several important features [26].
• Query Canvas: Queries may be formulated by starting with an exist-
ing picture, scanning a new picture and modifying these by using visual
and graphical tools available in common picture editing programs, such as
Adobe Photoshop. One may cut-and-paste from several images to artic-
ulate a query in the form of an image. It is also possible to start from a
clean image and then draw an image using different tools. The basic idea
in this approach is to provide a tool to define a picture that may be used
in a QPE. This approach allows a user to define a picture that they are
looking for.
• Image Part Queries: In many cases, a user may point to an object, or circle an area in an image, and request all images that contain similar regions. These queries would be easy to answer if complete segmentation of images were performed and all region properties stored. Most image database systems, however, store only global characteristics of images. In these cases, one is looking for all images whose attributes are a superset of the region attributes. Once all such images are retrieved, some other filtering techniques could be developed to solve this problem.
• Semantic Queries: All the above queries were based on image attributes. In most applications, an image database is likely to be prepared for a specific domain-dependent application, such as human faces, icefloe images, or retinal images. It is important that users can then interact using domain-dependent terms. It is common that people may describe a person using terms like big eyes, wide mouth, and small ears, rather than the corresponding image objects. An infoscope must be able to respond to these queries.
Semantic queries require extensive use of domain knowledge. Domain knowledge is required both in defining the features that will be used by the system and in interpreting user queries. Most image database systems either considered domain knowledge implicitly by defining features or ignored it [5]. The role of explicit knowledge in image databases is discussed in [7, 24, 25].
• Object Related Queries: These queries are semantic queries that ask for the presence of an object. They may deal with three-dimensional objects. Since three-dimensional objects are difficult to recognize using automated techniques, these queries may become very complex. Three-dimensional object recognition is a very active research area in machine vision. Queries based on recognizing objects in a query image may therefore be very difficult to execute.
• Spatio-temporal Queries: In video sequences, and in many other applications where pictures are obtained over a long period, a user may want answers about spatio-temporal events and concepts. Answers to such questions may require complete analysis of all video sequences and storage of some important features from them. Considering the fact that methods to represent temporal events are not yet well developed, this area requires much research before one can design a system that deals with spatio-temporal queries at the natural language level.

7 EXAMPLE SYSTEMS
In this section we discuss different levels in infoscopes and point to some existing
systems that have been implemented.

7.1 Image Databases


One may design a powerful system just by considering basic image features. Some very basic image features are color, texture, and shape. Grayscale images are considered here as a special case of color. At first sight, it appears that one should consider attributes of objects in images. From the machine vision literature [13], it is clear that segmentation is a difficult problem. When dealing with a diverse set of images acquired under varying conditions, segmentation may be very difficult. In such cases one may want to ignore domain knowledge completely and build a database using only image attributes. These attributes may be computed for complete images or for predefined areas of them.

Many systems have been designed using image-only attributes. QBIC from
IBM [5] uses color, texture, and manually segmented shapes. QBIC was the first
complete system to demonstrate the efficacy of simple attributes in appearance-
based retrieval of images from a reasonably sized database. The use of shape
in QBIC is problematic, however. Shape is defined for individual segments
which must be obtained manually. This also creates an artificial situation in
the database because for each manually obtained segment, one must consider a
separate record in the database. Thus if an image has N objects, the database
must contain N+1 records; one for the image, and one each for N segments.
Shape measures on complete images are not satisfactory because shape is de-
fined for an image region. Some heuristics have been proposed, but much
remains to be done in this area.

Color is considered a global characteristic. Most systems rely on color histograms. Some kind of histogram matching is done to determine the similarity of two images. Histogram-based approaches clearly ignore the spatial proximity of colors and hence may produce erroneous results. In most cases, however, histogram-based matching is quite effective.
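
As a concrete example of such histogram matching, the sketch below compares normalized color histograms bin by bin using histogram intersection. This is only one of several possible matching functions, and the toy histograms are illustrative; as noted above, the spatial arrangement of colors is ignored.

    # Sketch of color-histogram similarity using histogram intersection.

    def normalize(hist):
        total = float(sum(hist))
        return [h / total for h in hist]

    def histogram_intersection(h1, h2):
        """Returns a value in [0, 1]; 1 means identical normalized histograms."""
        return sum(min(a, b) for a, b in zip(normalize(h1), normalize(h2)))

    # Two toy 4-bin color histograms.
    query_hist = [10, 40, 30, 20]
    candidate_hist = [12, 35, 33, 20]
    print(histogram_intersection(query_hist, candidate_hist))  # close to 1.0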

Texture poses a somewhat more difficult problem. Most systems use global measures of texture and try to assign some texture attribute to images. These attributes are then used for evaluating the similarity of texture in images. Most images contain different types of texture in different parts of the image. Global texture attributes, therefore, can be misleading. These systems use only the first two levels of the VIMSYS data model. Since both of these levels, IR and IO, are domain independent, these image databases are domain independent. Users of these systems must supply the semantics themselves. These semantics can be provided by using color and texture attributes of objects of interest. One may filter using these attributes and then use domain-dependent features on the remaining images to retrieve the desired information.

An example of an image database that provides tools to organize and retrieve information using image-level information is the PinPoint system developed at Virage. This system extracts features to characterize images using color, texture, structure, and composition. These features can be combined using distance functions. The system also treats keywords like features, by using a thesaurus to compute distances between keywords in the query and stored images. The weights of the features can be changed to retrieve similar images using different similarity functions. We show a screen shot of this system in Figure 5. This shot shows all images retrieved as similar to the example image; the example itself is the best-matching image and hence appears first among the similar images. If a query image is created using the query canvas, shown in Figure 6, then one can articulate a query by cutting and pasting and other image manipulation operations. It must be mentioned that this system has no domain-level knowledge.

Interestingly, even without any domain knowledge in this system, users very quickly learn to retrieve images of their choice by using an example image and appropriate weights for the features provided in the system. The system uses color, texture, and structure as features of an image. For color, both global colors and automatically segmented regions and their locations, defined as composition, are used. For texture, several properties are computed using standard texture features and are combined to represent an overall measure of the texture. Structure addresses the shapes and locations of edge segments. It is interesting to see that these purely image-based features, when combined with hand-drawn queries on a canvas or an image selected for QPE, perform quite effectively in retrieving semantically relevant objects. This strongly suggests that by defining a pictorial alphabet and suitable rules to use this alphabet, it may be possible to develop powerful domain-dependent systems.

7.2 Semantic Knowledge: Xenomania System


One can use domain knowledge to extract features at insertion time and interpret user queries using domain knowledge and the statistical characteristics of the information in the database. Many projects in academia and industry address these issues. Here we demonstrate some of these ideas using a face retrieval system, called Xenomania, implemented at the University of Michigan [3].

Xenomania was an interactive system for the retrieval of face images and information. It allows a user to locate a specific person in the database, and retrieve the person's image and other information. The user can describe the target face in general terms, like the shape of the eyes, the nose, or the length of the hair, to begin the location process and retrieve the initial results. After that, the target face may be described using these general terms, or by using the actual image contents of the retrieved faces. All aspects of the architecture described above are
incorporated into this system. This system is described in [3].

Figure 5  A screen shot of PinPoint showing the query window and all images retrieved using QPE. Notice that a user can adjust the weights of features, and the feedback to the user is instantaneous.

Figure 6  The query canvas allows a user to articulate a query using visual means. One can cut and paste from images and use image manipulation programs to articulate a query.

We chose the interactive face identification problem because of the lack of well-defined image objects and features, and the heavy dependence on both predefined domain knowledge and extensive user participation. Although much work has been done on modeling facial features, it is still very difficult to extract and evaluate these features accurately over a variety of faces and situations, and even more difficult to assign semantic attributes to these features which are meaningful to human users. This application demands both extensive, predefined domain knowledge and user-incorporated knowledge at every step of processing.

Xenomania relied very heavily on previous research in the field of face recognition. Much work has been done on the psychological aspects, which provided a basis for our initial implementation. Many automatic face recognition systems have also been developed. The Xenomania project is not a face recognition system, but rather an image database system used for interactive face retrieval. Some face-recognition systems have approached the problem from strictly an image processing point of view, with little or no emphasis on descriptive representation of faces. These systems do not incorporate the user for describing the face, or for guiding the query refinement once the recognition process has been initiated. The most successful face recognition system is based on eigenfaces [19, 20]. This system is also influenced by image recognition approaches. In an eigenface-based system, one can specify an image and the system will retrieve all images that are similar to it. It may be interesting to combine eigenfaces with the descriptive approach used in Xenomania.

Domain Knowledge
As in any image management application, we are faced with the difficulty of determining which attributes are important for each domain object, and how to represent these attributes accurately in the system. However, this is an attractive problem from our point of view, because it gives us the opportunity to investigate different types of object and feature representations. For instance, there are several attributes of an eye that may be important. Individual eye attributes such as area and width will be necessary, as will relative attributes such as the width of the eye compared to its height. Spatial attributes, such as the distance between the left eye and the right eye, are also important and must be incorporated into the system. Other objects, such as eyebrows, may require that entirely different attributes than those for eyes be maintained in the system. We have based much of our initial implementation on research that has been done to evaluate which facial features and attributes are best suited for face identification and differentiation.
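
The individual, relative, and spatial attributes mentioned above could be derived from segmented eye regions roughly as in the sketch below. The bounding-box representation, attribute names, and sample values are assumptions for illustration, not the attributes actually used in Xenomania.

    # Sketch of individual, relative, and spatial eye attributes from bounding boxes.
    # Boxes are (x, y, width, height); attribute names are illustrative.

    def eye_attributes(left_eye_box, right_eye_box):
        lx, ly, lw, lh = left_eye_box
        rx, ry, rw, rh = right_eye_box
        return {
            "left_eye_area": lw * lh,                    # individual attribute
            "left_eye_width": lw,
            "left_eye_aspect": lw / float(lh),           # relative attribute (width vs. height)
            "inter_eye_distance": (rx + rw / 2.0) - (lx + lw / 2.0),  # spatial attribute
        }

    print(eye_attributes(left_eye_box=(40, 52, 18, 9), right_eye_box=(78, 51, 17, 9)))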

Many image databases are likely to be for specific applications and hence will
require strong domain knowledge. The domain objects should be described
using the image alphabet or image objects in the VIMSYS model. This task will
require close interactions among database designers, image processing experts,
and domain experts.

7.3 Video Databases: TV News on Demand


Video is rapidly becoming the preferred mode of receiving information; it is certainly the most vivid medium for conveying it, and it has gained tremendous popularity since it appeared on the scene. As is well known, television has been one of the most influential inventions of this century. The last decade has seen growth in the use of camcorders in all aspects of human activities.

Video is the most impressive medium for communicating and recording events in our lives. Its use is limited, however, by its basically sequential nature. To access a particular segment of interest on a tape, one must spend significant time searching for the segment. Video databases have the potential to change the way we access and use video.

By storing each individual shot in the database, one can access any individual frame based on the content of the shot. Each shot can be analyzed to find what it contains, and the frames within each shot can be analyzed to find events. By segmenting videos into shots and analyzing those shots, one can extract information that can be put into a database. This database can then be searched to find sequences of interest.
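
A simple shot-segmentation step of the kind assumed here compares consecutive frame histograms and declares a cut when their difference exceeds a threshold, in the spirit of the automatic partitioning work in [27]. The concrete code, the difference measure, and the threshold below are only an illustrative sketch.

    # Sketch of histogram-difference shot-boundary detection.
    # A cut is declared when consecutive frame histograms differ by more than a threshold.

    def histogram_difference(h1, h2):
        return sum(abs(a - b) for a, b in zip(h1, h2)) / float(sum(h1) + sum(h2))

    def detect_shot_boundaries(frame_histograms, threshold=0.3):
        boundaries = []
        for i in range(1, len(frame_histograms)):
            if histogram_difference(frame_histograms[i - 1], frame_histograms[i]) > threshold:
                boundaries.append(i)          # frame i starts a new shot
        return boundaries

    # Toy 2-bin histograms: a cut between frames 2 and 3.
    frames = [[90, 10], [88, 12], [20, 80], [22, 78]]
    print(detect_shot_boundaries(frames))     # -> [2]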

Video databases can be useful in many applications. One application is news on demand. Suppose that each sequence is analyzed and the information in it is stored in a database with pointers to the relevant frames. This database can then be used to view the news of choice, to the depth desired by a user, in the sequence desired. We are implementing such a system in our laboratory [23, 24, 9, 10]. Details of the segmentation of the sequence, the architecture of the system, the role of knowledge in such a system, and all other aspects have been presented in [23, 24, 9, 10]. It must be mentioned here that many other systems of this type are being implemented elsewhere.

The architecture for the video database is composed of the four major components discussed above. The input module is further divided into two major components: a sequence segmentation subsystem and a feature detection subsystem. The knowledge module has a video object schema definition subsystem to help a user enter knowledge into the system for a specific application. The video object schema definition subsystem provides tools to model the video object schema for an application based on the operators available in the input and query processing systems. Based on the video object schema, the feature detection subsystem analyzes a video frame sequence to extract structural and semantic information about each object of interest in the video. The extracted objects and related semantic information are then stored in the feature database. According to the video object schema definition, a user query interface is automatically customized. A user can also navigate the video object schema defined by the video object schema definition subsystem, as well as its associated video object data, through the user query interface.

7.4 Multiple Perspective Interactive Video


The traditional model of television and video is based on a single video stream transmitted to a passive viewer. A viewer has the option to watch, and to re-watch in the case of recorded video, but little else. Due to the emergence of the information highways and other related information infrastructure, there has been much talk about concepts like video-on-demand, interactive movies, interactive TV, and virtual presence. Some of these concepts are very exciting and do suggest many dramatic changes in society due to the dawning information age. By combining some of these concepts, we can design a novel form of video and television which will provide true interactivity for viewers. A viewer could view an event from multiple perspectives, even based on the contents of the events.

In this project, we are developing an approach to Multiple Perspective Interactive (MPI) video. The interactive video will overcome several limitations of conventional video and provide the interactivity essential in activities ranging from scientific investigation to entertainment. In conventional video, viewers are passive; all they can do is control the flow of video by pressing buttons such as play, pause, fast forward, or fast reverse. These controls essentially provide only one choice for a particular segment of video: you can see it or skip it. In the case of TV broadcast, viewers have essentially no control. Even in those events where multiple cameras are used, a viewer has no choice except the obvious one of viewing the channel or using the remote control and going channel surfing. We believe that with increased bandwidth, and advances in several areas of technology, the time has come to address the issues involved in providing real interactive video and TV systems. Incidentally, the most serious limitation of modern television pointed out by George Gilder [6] is that the viewers really have no choice. We believe that MPI video goes in the direction of liberating video and TV from the traditional single-source broadcast model and puts the viewer in the driver's seat.

We demonstrate our concepts in the context of a sporting event [14]. Our model allows viewers to be active; they may request their preferred camera positions and angles, or they may ask questions about content described in the video. Our system will automatically determine the best camera and view to satisfy the demands of the viewer. We believe these new functions are the key to making MPI video a revolutionary new medium. To make such functions possible, however, much advancement of technology is required in the fields of computer vision, multimedia databases, and human interface design; see [23, 24, 9, 8] and [27, 28, 16, 2, 4, 1].

Architecture of the MPI Video


A physical phenomenon or an event can usually be viewed from multiple perspectives. The ability to view from multiple perspectives is essential in many applications. Current video allows viewing only from one perspective, that of the director. A viewer has no choice. Yet even this much has been very attractive and has influenced our society in many ways. Now the technology has advanced to the state where we can give viewers the choice of viewing from whatever perspective they want and of interactively selecting what they want to view.

Let us assume that an episode is being recorded, or being viewed in real time. In the simplest and most obvious case, the episode can be recorded using multiple cameras strategically located at different points. These cameras may provide different perspectives of the episode. One camera view is very limited. Using computer vision and related techniques, it may be possible to take individual camera views and reconstruct the scene. These individual camera scenes can then be assimilated into a model that represents the complete episode. We call this model an environment model. The environment model has a global view of the episode and also knows where the individual cameras are. The environment model can be used by the system to allow a user to view what they want and from where they want.

Now let us assume that a viewer is interested in one of the following:

1. Specific Perspective: One may want to view the episode from a specific
perspective. The user may even specify the individual camera, or may
specify a general location for the camera.

2. Specific Object: There may be several objects in the episode. A viewer


may want to always view a particular object independent of its situation
in the episode.

3. Specific Event: A viewer may specify characteristics of an event and may


want to view the episode from the best perspective for that event.

4. Virtual Camera: It is possible that the viewer may request to view the event from a perspective that is not provided by any real camera situated to acquire the episode. In such cases, using the environment model and simulation techniques, one may create a virtual camera to generate the episode from this specified perspective.

The high level architecture for this system is shown in Figure 7. Each camera
perspective is converted to its camera scene. Multiple camera scenes are then
assimilated into the environment model. A viewer can select his perspective
and that perspective is communicated to the environment model. The reasoning
system in the environment model decides what to send to the display of the
user.

An MPI video system implementation requires advances in several technology areas. A system with limited features can, however, be implemented using existing technology. The exact architecture of an MPI video system will depend on the application area and the type and level of interactivity allowed. This system is described in [14].
Figure 7  The high-level MPI video architecture: video streams 1 through n are processed by video data analysers, assimilated and annotated into the environment model, and presented to the user through the visualizer and virtual view builder and the display manager.

8 CONCLUSION AND FUTURE RESEARCH
In this paper, we presented our ideas on infoscopes. Infoscopes provide microscopes as well as telescopes for the information systems of the future. These systems use databases for multimedia data, and provide very rich semantics and techniques to recover information at different levels of abstraction from the data in multimedia information systems. Navigation, browsing, and content-based search will be common modes of interaction with these systems. The interaction environment provides multimedia tools to interact with the system and to articulate queries. The system uses multimedia tools to provide information in appropriate forms to the user. In infoscopes, similarity replaces matching for search operations.

Research in techniques related to infoscopes is in its infancy. Techniques from databases, computer vision, knowledge-based systems, and related areas are being developed to solve problems unique to this area. In this paper, we presented unique requirements, some ideas about the directions that these systems are taking, and an overview of some approaches being developed. We are studying the requirements of these systems by implementing different systems. In this paper, we presented an overview of our approach and discussed briefly the systems that were developed based on the ideas presented here. Due to space limitations, we did not provide all the details; these are available in the published literature. Our goal here was to present the concepts behind the techniques presented earlier in detailed papers on individual systems and data models.

These projects are currently very active. We believe that we have only started scratching the surface of these systems. Research on all aspects of these systems continues in our laboratory and at many other laboratories. We believe that the next few years will see exponential growth in infoscopes. In this paper, our focus was on image and video systems. We expect most infoscopes to be able to deal seamlessly with different modes of information, ranging from alphanumeric to sound and others. Due to their ability to represent information at different levels of abstraction, these systems can recover and assimilate information from disparate sources. We did not discuss the assimilation and display of information here, but those will be equally important topics.

In summary, infoscopes will allow access to information independent of the locations and types of data sources, and will provide a unified picture of information to a user.
InfoScopes 251

Acknowledgements
The research and ideas presented in this paper evolved during collaborations
with several people in the InfoScope project. I am thankful to everyone who
actively participated in the project. I want to particularly thank Jeff Bach,
Shankar Chatterjee, Amarnath Gupta, Arun Hampapur, Bradley Horowitz,
Arun Katkere, Don Kuramura, Saied Moezzi, Simone Santini, Chiao-Fe Shu,
Bo Smorhay, Deborah Swanberg, and Terry Weymouth for collaboration in
different aspects of this work.

REFERENCES
[1] A. Akutsu, Y. Tonomura, H. Hashimoto, and Y. Ohba, "Video indexing
using motion vectors", Proceedings of SPIE: Visual Communications and
Image Processing 92, November 1992.

[2] F. Arman, A. Hsu, and M.-Y. Chiu, "Image processing on compressed data
for large video databases", Proceedings of the ACM Multimedia, pages 267-
272, California, USA, June 1993.
[3] J. Bach, S. Paul, and R. Jain, "An interactive image management system
for face information retrieval", IEEE Transactions on Knowledge and Data
Engineering, Special Section on Multimedia Information Systems, June
1992.
[4] G. Davenport, T. A. Smith, and N. Pincever, "Cinematic primitives for
multimedia", IEEE Computer Graphics & Applications, pages 67-74, July
1991.
[5] C. Faloutsos, R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petkovic,
and W. Equitz, "Efficient and effective querying by image content", J. of
Intelligent Information Systems, Vol. 3, No. 3/4, pages 231-262, July 1994.

[6] G. Gilder, "Life After Television: The Coming Transformation of Media
and American Life", W. W. Norton & Co., 1994.
[7] A. Gupta, T. Weymouth, and R. Jain, "Semantic queries with pictures:
the VIMSYS model", Proceedings of the 17th International Conference on
Very Large Data Bases, September 1991.

[8] A. Hampapur, R. Jain, and T. Weymouth. "Digital video indexing in mul-


timedia systems". Proceedings of the Workshop on Indexing and Reuse in
Multimedia Systems, American Association of Artificial Intelligence, 1994.
252 CHAPTER 7

[9] A. Hampapur, R. Jain, and T. Weymouth, "Digital video segmentation",


Proceedings of the ACM Conference on Multimedia, San Francisco, Cali-
fornia, October 1994.

[10] A. Hampapur, R. Jain, and T. Weymouth, "Production model based dig-


ital video segmentation", Journal of Multimedia Tools and Applications,
Vol. 1, No.1, pages 9-46, March 1995.

[11] H. V. Jagadish. "A retrieval technique for similar shapes", Proc. ACM
SIGMOD Conference, pages 208-217, May 1991.
[12] R. Jain, "NSF workshop on visual information management systems", SIG-
MOD Record, 22(3):57-75, December 1993.

[13] R. Jain, R. Kasturi, and B. Schunck, "Introduction to Machine Vision",


McGraw Hill Publishing, 1995.

[14] R. Jain and K. Wakimoto, "Multiple perspective interactive video", IEEE
Multimedia Computing Systems, pages 202-211, May 1995.

[15] K.-I. Lin, H. V. Jagadish, and C. Faloutsos, "The TV-tree: an index struc-
ture for high-dimensional data", VLDB Journal, 3:517-542, October 1994.
[16] A. Nagasaka and Y. Tanaka, "Automatic video indexing and full-video
search for object appearances", 2nd Working Conference on Visual
Database Systems, pages 119-133, Budapest, Hungary, October 1991.
[17] W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, D. Petkovic,
P. Yanker, C. Faloutsos, and G. Taubin. "The QBIC project: Querying
images by content using color, texture and shape", SPIE 1993 Intl. Sym-
posium on Electronic Imaging: Science and Technology, Storage and Re-
trieval for Image and Video Databases, February 1993. Also available as
IBM Research Report RJ 9203 (81511), February 1993.

[18] E. Oomoto and K. Tanaka, "OVID: Design and implementation of a
video-object database system", IEEE Transactions on Knowledge and
Data Engineering, Vol. 5, No. 4, pages 629-643, August 1993.

[19] A. Pentland, R. W. Picard, and S. Sclaroff, "Photobook: Tools for content-


based manipulation of image databases" In Proceedings of the SPIE: Stor-
age and Retrieval for Image and Video Databases II, San Jose, CA, Volume
2185, pages 34-47, February 1994. Also available as M.I.T. Media Labora-
tory Perceptual Computing Technical Report No.255, November 1993.
InfoScopes 253

[20] A. Pentland, B. Moghaddam, and T. Starner, "View-based and modular


eigenspaces for face recognition", Proceedings of the IEEE Computer So-
ciety Conference on Computer Vision and Pattern Recognition, Seattle,
WA, pages 84-91, June 1994.
[21] H. Samet, "The Quadtree and Related Hierarchical Data Structures", ACM
Computing Surveys, 16(2):187-260, June 1984.
[22] S. Santini and R. Jain, "Similarity matching", IEEE Transactions on Pattern
Analysis and Machine Intelligence, submitted.
[23] D. Swanberg, T. Weymouth and R. Jain, "Domain information model: an
extended data model for insertions and query", Proceedings of the Multi-
media Information Systems, pages 39-51, Intelligent Information Systems
Laboratory, Arizona State University, February 1992.

[24] D. Swanberg, C.-F. Shu, and R. Jain, "Architecture of a multimedia infor-


mation system for content-based retrieval", Audio Video Workshop, San
Diego, California, November 1992.

[25] D. Swanberg, C.-F. Shu, and R. Jain, "Knowledge guided parsing in video
databases", Electronic Imaging: Science and Technology, San Jose, Cali-
fornia, February 1993.

[26] D. White and R. Jain, "Similarity indexing using the SS-tree", IEEE Data
Engineering, submitted.
[27] H. J. Zhang, A. Kankanhalli, and S. W. Smoliar, "Automatic partitioning
of video", Multimedia Systems, 1(1):10-28, 1993.
[28] H. J. Zhang, Y. Gong, S. Smoliar, and S. Y. Tan, "Automatic parsing of
news video", Proceedings of the IEEE Conference on Multimedia Comput-
ing Systems, Boston, Massachusetts, May 1994.
8
SCHEDULING IN MULTIMEDIA
SYSTEMS
A. L. Narasimha Reddy
IBM Almaden Research Center,
650 Harry Road, K56/802,
San Jose, CA 95120, USA

ABSTRACT
In video-on-demand multimedia systems, data has to be delivered to the consumer at regular intervals to ensure smooth playback of video streams. A video-on-demand server has to schedule the service of individual streams to guarantee such smooth delivery of video data. We will present scheduling solutions for the individual components of service in a multiprocessor-based video server.

1 INTRODUCTION
Several telephone companies and cable operators are planning to install large video servers that would serve video streams to customers over telephone lines or cable lines. These projects envision supporting several thousands of customers with the help of one or several large video servers. They aim to store movies in a compressed digital format and route the compressed movie to the home, where it can be uncompressed and displayed. They aim to compete with local video rental stores by offering better service: the ability to watch any movie at any time (avoiding the situation where all the copies of the desired movie are already rented out) and a wider selection of movies. Providing a wide selection of movies requires that a large number of movies be available in digital form. Currently, with MPEG-1 compression, a movie of roughly 90 minutes' duration takes about 1 GB of storage. For a video server storing about 1000 movies (a typical video rental store carries more), we would then have to spend about $500,000 just for storing the movies on disk at a cost of $0.5/MB. This requirement of large amounts of storage implies that
the service providers need to centralize the resources and provide service to a large number of customers to amortize costs. Hence the requirement to build large video servers that can serve a large number of customers. See [5], [14], [2], [15] for some of the projects on video servers.

If such a large video server serves about 10,000 MPEG-1 streams, the server
has to support 10,000 * 1.5 Mbits/sec or about 2 GBytes/sec of I/O band-
width. Multiprocessor systems are suitable candidates for supporting such
large amounts of real-time I/O bandwidth required in these large video servers.
We will assume that a multiprocessor video server is organized as shown in
Figure 1. A number of nodes act as storage nodes. Storage nodes are responsi-
ble for storing video data either in memory, disk, tape or some other medium
and delivering the required I/O bandwidth to this data. The system also has
network nodes. These network nodes are responsible for requesting appropriate
data blocks from the storage nodes and routing them to the customers. Both
these functions can reside on the same multiprocessor node, i.e., a node can
be a storage node, or a network node or both at the same time. Each request
stream would originate at one of the several network nodes in the system and
this network node would be responsible for obtaining the required data for this
stream from the various storage nodes in the system and delivering it to the
consumer. The data transfer from the network node to the consumer's monitor
would depend on the medium of delivery, telephone wire, cable or the LAN.

We will assume that the video data is stored on disk. Storing the video data on current tertiary media such as tape has been shown to be unattractive from a price-performance analysis [3]. Storing the video in memory may be attractive
for frequently accessed video streams. We will assume that the video data is
stored on disk to address the more general problem. The work required to
deliver a video stream to the consumer can then be broken down into three
components: (1) the disk service required to read the data from the disk into
the memory of the storage node, (2) the communication required to transfer
the data from the storage node memory to the network node's memory and (3)
the communication required over the delivery medium to transfer the data from
the network node memory to the consumer's monitor. These three phases of
service may be present or absent depending on the system's configuration. As
pointed out already, if the video data is stored in memory, the service in phase
1 is not needed. If the video server does not employ a multiprocessor system,
the service in phase 2 is not needed. In this chapter, we will deal with the
scheduling problems in phases 1 and 2. If the consumer's monitor is attached
to the network node directly, the service in phase 3 is not needed. Service in
phase 3 is dependent on the delivery medium and we will not address it here.

Figure 1 System model of a multiprocessor video server.

Deadline scheduling [10][6] is known to be an optimal real-time scheduling strategy when the task completion times are known in advance. The disk service time (in phase 1) is dependent on the relative position of the requested block with respect to the head on the disk. The communication time in the network (in phase 2) is dependent on the contention for network resources, and this contention varies with the network load. Service for one video stream requires multiple resources (the disk at the storage node, and links, input ports and output ports in the network), unlike the assumption of requiring only one resource in these studies. Hence, these results cannot be directly applied to our problem.

The organization of the system and the data distribution over the nodes in the system impact the overall scheduling strategy. In the next section, we will describe some of the options in distributing the data and their impact on the different phases of service. In Section 3, we will discuss scheduling algorithms for disk service (phase 1). In Section 4, we will describe a method for scheduling the multiprocessor network resources. Section 5 concludes this chapter with some general remarks and future directions.

2 DATA ORGANIZATION
If a movie is stored completely on a single disk, the number of streams of that movie that can be supported will be limited by the bandwidth of a single disk. As shown earlier in [12], a 3.5" 2-GB IBM disk can support up to 20 streams. A popular movie may receive more than 20 requests over the length of the playback time of that movie. To enable serving a larger number of streams of a single movie, each movie has to be striped across a number of nodes. As we increase the number of nodes for striping, we increase the bandwidth for a single movie. If all the movies are striped across all the nodes, we also improve the load balancing across the system, since every node in the system participates in providing access to each movie.

The width of striping (the number of disks a movie may be distributed on)
determines a number of characteristics of the system. The wider the striping,
the larger the bandwidth for any given movie and the better the load balancing.
A disk failure affects a larger number of movies when wider striping is employed.
When more disk space is needed in the system, it is easier to add a number of
disks equal to the width of striping. Hence, wider striping means a larger unit
of incremental growth of disk capacity. All these factors need to be considered
in determining the width of striping. For now, we will assume that all the
movies are striped across all the disks in the system. In a later section, we will
discuss the effects of employing smaller striping widths. The unit of striping
across the storage nodes is called a block.

Even though movies are striped across the different disks to provide high band-
width for a movie, it is to be noted that a single MPEG-1 stream bandwidth of
1.5 Mbits/sec can be sufficiently supported by the bandwidth of one disk. Re-
quests of a movie stream can be served by fetching individual blocks at a time
from a single disk. Striping provides simultaneous access to different blocks of
the movie from different disks and thus increases the bandwidth available to
a movie. Higher stream rates of MPEG-2 can also be supported by requests
to individual disks. We will assume that a single storage node is involved in
serving a request block.
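
Under the assumption that all movies are striped across all storage nodes, the mapping from a block of a movie to the storage node serving it can be as simple as the round-robin sketch below; the node count, block size, and function names are illustrative assumptions rather than a prescribed layout.

    # Sketch of round-robin striping of movie blocks across storage nodes.

    NUM_STORAGE_NODES = 16
    BLOCK_SIZE = 256 * 1024          # bytes per striping block (illustrative)

    def storage_node_for_block(movie_start_node, block_index, num_nodes=NUM_STORAGE_NODES):
        """Block i of a movie that starts at movie_start_node is placed round-robin."""
        return (movie_start_node + block_index) % num_nodes

    # The network node serving a stream fetches one block at a time from a single node.
    for i in range(5):
        print("block", i, "-> storage node", storage_node_for_block(movie_start_node=3, block_index=i))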

Data organization has an impact on the communication traffic within the system. During the playback of a movie, the network node responsible for delivering that movie stream to the user has to communicate with all the storage nodes where this movie is stored. This results in point-to-point communication from all the storage nodes to the network node (possibly multiple times, depending on the striping block size, the number of nodes in the system, and the length of
the movie) during the playback of the movie. Since each network node will be responsible for a number of movie streams, the resulting communication pattern is random point-to-point communication among the nodes of the system. It is possible to achieve some locality by striping the movies among a small set of nodes and restricting the network nodes for a movie to be among this smaller set of storage nodes.

Figure 2  Progress of disk service of a request: the request release (a0), scheduling (s0, s1, s2) and consumption (d0, d1, d2) events of successive blocks are shown on a time axis from t0 to t8.

3 DISK SCHEDULING
A real-time request can be denoted by two parameters (c, p), where p is the period at which the real-time requests are generated and c is the service time required in each period. It was shown in [10] that tasks can be scheduled by the earliest-deadline-first (EDF) algorithm if and only if the task utilization c_1/p_1 + c_2/p_2 + ... + c_n/p_n <= 1. We will specify real-time requests by specifying the required data rate in kbytes/sec. The time at which a periodic request is started is called the release time of that request. The time at which the request is to be completed is called the deadline of that request. Requests that do not have real-time requirements are termed aperiodic requests.
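
This utilization condition translates directly into an admission-control check: a new stream is admitted only if total utilization stays at or below 1. The following sketch assumes illustrative stream parameters and names; it is not part of the scheduling algorithms described in this chapter.

    # Sketch of utilization-based admission control for periodic (c, p) requests.
    # A new stream is admitted only if sum(c_i / p_i) <= 1 still holds.

    def can_admit(existing_streams, new_stream):
        """Each stream is a (c, p) pair: service time c needed every period p."""
        utilization = sum(c / float(p) for c, p in existing_streams + [new_stream])
        return utilization <= 1.0

    streams = [(30.0, 100.0), (25.0, 100.0)]      # e.g. milliseconds of disk time per period
    print(can_admit(streams, (40.0, 100.0)))      # 0.95 <= 1.0 -> True
    print(can_admit(streams, (50.0, 100.0)))      # 1.05 > 1.0  -> False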

Figure 2 shows the progress of disk service for a stream. The request for block 0 is released at time t0, and the request is actually scheduled at time t1, denoted by event s0. Block 0 is consumed (event d0) beginning at time t2. The time between the consumption of successive blocks of this stream, d_{i+1} - d_i, has to be kept constant to provide glitch-free service to the user. For example, when 256-Kbyte blocks are employed for MPEG-1 streams, this interval is equal to about 1.28 seconds. The time between the scheduling events of successive blocks need not be constant; the only requirement is that the blocks be scheduled sufficiently in advance to guarantee that d_{i+1} - d_i can be kept constant. This is shown in Figure 2. The vertical bars in the figure represent the size of the request block.

In real-time systems, algorithms such as earliest deadline first and least slack time first are used. As pointed out earlier, strict real-time scheduling policies such as EDF may not be suitable candidates because of the random disk service times and the overheads associated with seeks and rotational latency.

Traditionally, disks have used seek optimization techniques such as SCAN or


shortest seek time first (SSTF) for minimizing the arm movement in serving
the requests [4]. These techniques reduce the disk arm utilization by serving
requests close to the disk arm. The request queue is ordered by the relative
position of the requests on the disk surface to reduce the seek overheads. Even
though these techniques utilize the disk arm efficiently, they may not be suitable
for real-time environments since they do not have a notion of time or deadlines
in making scheduling decisions.

Video-on-demand systems may have to serve aperiodic requests also. It is


necessary to ensure that periodic requests do not miss their deadlines while
providing reasonable response times for aperiodic requests. A similar problem
is studied in [9]. I/O requests are known to be bursty. A burst of aperiodic
requests should not result in missing the guarantees for the periodic requests.

The scheduling algorithm should also be fair. For example, shortest seek time
first is not a fair scheduling algorithm, since requests at the edges of the disk
surface may be starved. If the scheduling algorithm is not fair, an occasional
request in a stream may be starved of service and hence miss its deadline.

To guarantee the service of scheduled real-time requests, worst-case assumptions
about seek and latency overheads can be made to bound the random disk
service times by a constant service time. Another approach to making service
times predictable is to make the request size so large that the overheads form
a small fraction of the request service time; this approach, however, may result
in large demands on buffer space. Our approach is to reduce the overheads in
service time by making more efficient use of the disk arm, either by optimizing
the service schedule or by using large requests. By reducing the random
overheads, we make the service time more predictable. We describe two
techniques in the next section, larger requests and delayed deadlines, for
reducing the variance in the service time.

We will consider three scheduling algorithms: CSCAN, EDF, and SCAN-EDF.
CSCAN is a seek-optimizing disk scheduling algorithm which traverses the
disk surface in one direction from the innermost pending request to the outermost
pending request and then jumps back to serving the innermost request [4].

EDF is the earliest deadline first policy. SCAN-EDF is a hybrid algorithm that
incorporates the real-time aspects of EDF and seek optimization aspects of
SCAN. CSCAN and EDF are well known algorithms and we will not elaborate
on them further.

3.1 SCAN-EDF scheduling algorithm


The SCAN-EDF disk scheduling algorithm combines seek optimization techniques
and EDF in the following way. Requests with the earliest deadline are served
first; if several requests have the same deadline, these requests are served by a
seek-optimizing scheduling algorithm.

SCAN-EDF applies seek optimization only to those requests that have the same
deadline. Its efficiency depends on how often these seek optimizations can be
applied, that is, on the fraction of requests that have the same deadlines. SCAN-
EDF serves requests in batches or rounds. Requests are given deadlines at the
end of a batch. Requests within a batch can then be served in any order, and
SCAN-EDF serves the requests within a batch in a seek-optimizing order. In
other words, requests are assigned deadlines that are multiples of the period p.

When the requests have different data rate requirements, SCAN-EDF can be
combined with a periodic fill policy [16] to let all the requests have the same
deadline. Requests are served in a cycle with each request getting an amount
of service time proportional to its required data rate, the length of the cycle
being the sum of the service times of all the requests. All the requests in the
current cycle can then be given a deadline at the end of the current cycle.
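
The cycle construction just described can be sketched as follows. This is an illustrative sketch, not the chapter's implementation; the per-unit service-time constant is an assumed value, while the three data rates are those used in the multiple-data-rate experiments later in this chapter.

def build_cycle(stream_rates_kbps, ms_per_unit_rate=0.05):
    """Each stream gets service time proportional to its data rate; all
    requests in the cycle share one deadline at the end of the cycle."""
    service_times = [rate * ms_per_unit_rate for rate in stream_rates_kbps]
    cycle_length = sum(service_times)      # length of one cycle, in ms
    common_deadline = cycle_length         # deadline at the end of the cycle
    return service_times, cycle_length, common_deadline

rates = [150, 8, 176]                      # kB/sec, as in Section 3.4
print(build_cycle(rates))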

A more precise description of the algorithm is given below.


SCAN-EDF algorithm
Step 1: Let T = the set of tasks with the earliest deadline.
Step 2: If |T| = 1 (there is only a single request in T), service that request.
        Otherwise, let t1 be the first task in T in the scan direction and service t1.
Go to Step 1.

The scan direction can be chosen in several ways. In Step 2, if the tasks are
ordered by track number such that N1 <= N2 <= ... <= Nl, then we obtain a
CSCAN type of scheduling where the scan takes place only from the smallest
track number to the largest track number. If the tasks are ordered such that
N1 >= N2 >= ... >= Nl, then we obtain a CSCAN type of scheduling where the
scan takes place only from the largest track number to the smallest track

number. If the tasks can be ordered in either of the above forms depending on
the relative position of the disk arm, we get an (elevator) SCAN type of algorithm.

SCAN-EDF can be implemented with a slight modification to EDF. Let D_i be
the deadlines of the tasks and N_i be their track positions. The deadlines can
then be modified to D_i + f(N_i), where f() is a function that converts the
track numbers of the tasks into small perturbations of the deadlines. The
perturbations have to be small enough that D_i + f(N_i) > D_j + f(N_j)
whenever D_i > D_j. We can choose f() in various ways. Some of the choices
are f(N_i) = N_i/N_max or f(N_i) = N_i/N_max - 1, where N_max is the maximum
track number on the disk or some other suitably large constant. For example,
let tasks A, B, and C have the same deadline 500 and ask for data from tracks
347, 113, and 851 respectively. If N_max = 1000, the modified deadlines of
A, B, and C become 499.347, 499.113, and 499.851 respectively when we use
f(N_i) = N_i/N_max - 1. When these requests are served by their modified
deadlines, they are served in track order. A request with a later deadline
will be served after these three requests are served. Other researchers have
proposed similar scheduling policies [1] [17] [14].
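
A minimal sketch of this deadline-perturbation scheme follows (illustrative only; the task list matches the worked example above, with N_max = 1000).

N_MAX = 1000                       # maximum track number (as in the example)

def perturbed_deadline(deadline, track):
    # f(N_i) = N_i / N_max - 1, one of the choices given above
    return deadline + track / N_MAX - 1

tasks = [                          # (name, deadline, track)
    ("A", 500, 347),
    ("B", 500, 113),
    ("C", 500, 851),
    ("D", 520, 10),                # later deadline: served after A, B, C
]

order = sorted(tasks, key=lambda t: perturbed_deadline(t[1], t[2]))
print([name for name, _, _ in order])     # ['B', 'A', 'C', 'D']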

3.2 Buffer space tradeoff


Available buffer space has a significant impact on the performance of the system.
Real-time requests typically need some kind of response before the next request
is issued; hence, the deadlines for the requests are made equal to the periods of
the requests. The multimedia I/O system needs to provide a constant data rate
for each request stream. This constant data rate can be provided in various
ways. When the available buffer space is small, the request stream can ask for
small pieces of data more frequently. When the available buffer space is large,
the request stream can ask for larger pieces of data with correspondingly larger
periods between requests. This tradeoff is significant since the efficiency of the
disk service is a varying function of the request size. The disk arm is used more
efficiently when the request sizes are large, and hence it may be possible to
support a larger number of multimedia streams at a single disk. Figure 3(a)
shows two streams providing the same constant stream rate, with the second
request stream issuing requests twice as large at half the frequency of the first
stream. A (2c, 2p) request stream supports the same data rate as a (c, p)
stream if larger buffers are provided, at the same time improving the efficiency
of the disk. However, this improved efficiency has to be weighed against the
increased buffer space requirements. Each request stream requires one buffer
for the consuming process and one buffer for the producing process (the disk).

Figure 3  Request streams with the same data rate requirements. (a) Larger requests. (b) Delayed deadlines.

If we decide to issue requests of size S, then the buffer space requirement for each
stream is 2S. If the I/O system supports n streams, the total buffer space
requirement is 2nS.

There is another tradeoff that is possible. The deadlines of the requests need
not be chosen equal to the periods of the requests. For example, we can defer
the deadlines of the requests by a period and make the deadlines of the requests
equal to 2p. This gives more time for the disk arm to serve a given request and
may allow more seek optimizations than are possible when the deadlines are
equal to the period p. Figure 3(b) shows two streams providing the same
constant stream rate, but with different characteristics of progress along the
time scale. The stream with the deferred deadlines provides more time for the
disk to service a request before it is consumed. This results in a scenario where
the consuming process is consuming buffer 1, the producing process (disk) is
reading data into buffer 3, and buffer 2 has been filled earlier by the producer
and is awaiting consumption. Hence, deferring deadlines raises the buffer
requirement to 3S for each request stream. The extra time available for serving
a given request allows more opportunities for it to be served in the scan
direction. This results in more efficient use of the disk arm and, as a result,
a larger number of request streams can be supported at a single disk. A similar
technique called work-ahead is utilized in [1]. Scheduling algorithms for real-time
requests when the deadlines are different from the periods are reported in [8][13].

Both these techniques, larger requests with larger periods and delayed dead-
lines, increase the latency of service at the disk. When the deadlines are delayed,
the data stream cannot be consumed until two buffers are filled as opposed to
waiting for one filled buffer when deadlines are equal to periods. When larger
requests are employed, longer time is taken for reading a larger block and hence
a longer time before the multimedia stream can be started. Larger requests
increase the response time for aperiodic requests as well since the aperiodic
requests will have to wait for a longer time behind the current real-time re-
quest that is being served. The improved efficiency of these techniques needs
to be weighed against the higher buffer requirements and the higher latency for
starting a stream.
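
The buffer arithmetic above can be summarized in a few lines; the request size and stream count below mirror the 5-track, 20-stream example discussed later in Section 3.4, and are otherwise illustrative.

def total_buffer_kbytes(request_size_kb, num_streams, deferred_deadlines=False):
    # 2 buffers of size S per stream normally, 3 with deferred deadlines
    buffers_per_stream = 3 if deferred_deadlines else 2
    return buffers_per_stream * request_size_kb * num_streams

S = 200      # request size in kbytes (about 5 tracks)
n = 20       # number of streams

print(total_buffer_kbytes(S, n))                           # 8000 kbytes (8 MB)
print(total_buffer_kbytes(S, n, deferred_deadlines=True))  # 12000 kbytes (12 MB)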

3.3 Performance Evaluation


In this section, we compare the three scheduling algorithms CSCAN, EDF and
SCAN-EDF through simulations. We present the simulation model used to
obtain these results.

Simulation model
A disk with the parameters shown in Table 1 is modeled. It is assumed
that the disk uses split-access operations, or zero-latency reads. In split-access
operation, the request is satisfied by two smaller requests if the read-write
head happens to be in the middle of the requested data at the end of the seek
operation. The disk starts servicing the request as soon as any of the requested
blocks comes under the read-write head. For example, if a request asks for
reading blocks numbered 1, 2, 3, 4 from a track of eight blocks 1, 2, ..., 8, and
the read-write head happens to get to block number 3 first, then blocks 3 and 4
are read, blocks 5, 6, 7, 8 are skipped over, and then blocks 1 and 2 are read. In
such operation, a disk read or write of a single track will not take more than a
single revolution. Split-access operation is shown to improve the request
response time considerably in [11]. Split-access operation, besides reducing the
average service time of a request, also helps in reducing the variability in
service time.

Each real-time request stream is assumed to require a constant data rate of
150 kB/sec. This roughly corresponds to the data rate requirements for a
CD-ROM data stream. Each request stream is modeled by an independent
request generator. The number of streams is a parameter to the simulator.

Time for one rotation     11.1 ms
Avg. seek                 9.4 ms
Sectors/track             84
Sector size               512 bytes
Tracks/cylinder           15
Cylinders/disk            2577
Seek cost function        nonlinear
Min. seek time s0         1.0 ms

Table 1  Disk parameters used in simulations.

Aperiodic requests are modeled by a single aperiodic request generator.
Aperiodic requests are assumed to arrive with exponentially distributed
inter-arrival times; the mean time between arrivals is varied from 25 ms to
200 ms. If we allow unlimited service for the aperiodic requests, a burst of
aperiodic requests can disturb the service of real-time requests considerably.
It is necessary to limit the number of aperiodic requests that may be served in
a given period of time. A separate queue could be maintained for these
requests, and these requests can be released at a rate that is bounded by a
known rate. A multimedia server will have to be built in this fashion to
guarantee meeting the real-time schedules. Hence, we modeled the arrival of
aperiodic requests by a single request generator. In our model, if the aperiodic
requests are generated faster than they are being served, they are queued in a
separate queue.

The service policy for aperiodic requests depends on the scheduling policy
employed. In EDF and SCAN-EDF, they are served using the immediate server
approach [9], where the aperiodic requests are given higher priority over the
periodic real-time requests. The service schedule of these policies allows a
certain number of aperiodic requests each period, and when a sufficient number
of aperiodic requests is not present, the real-time requests make use of the
remaining service period. This policy of serving aperiodic requests is employed
so as to provide reasonable response times for both aperiodic and periodic
requests. This is in contrast to earlier approaches, where the emphasis has been
only on providing real-time performance guarantees. In CSCAN, aperiodic
requests are served in the CSCAN order.

Each aperiodic request is assumed to ask for a track of data. The request size
for the real-time requests is varied among 1, 2, 5, or 15 tracks. The effect
of request size on the number of supportable streams is investigated. The period
between two requests of a request stream is varied depending on the request

size to support a constant data rate of 150 kB/sec. The requests are assumed
to be uniformly distributed over the disk surface.

Two systems, one with deadlines equal to the request periods and the second
with deadlines equal to twice the request periods are modeled. A comparison
of these two systems gives insight into how performance can be improved by
deferring the deadlines.

Two measures of performance are studied. The number of real-time streams
that can be supported by each scheduling policy is taken as the primary
measure of performance. We also look at the response time for aperiodic
requests. A good policy will offer good response times for aperiodic requests
while supporting a large number of real-time streams.

Each experiment involved running 50,000 requests of each stream. The maximum
number of supportable streams n is obtained by increasing the number of
streams incrementally until, at n + 1 streams, the deadlines cannot be met.
Twenty experiments were conducted, with different seeds for random number
generation, for each point in the figures. The minimum among these values is
chosen as the maximum number of streams that can be supported; each point
in the figures is obtained in this way. The minimum is chosen (instead of the
average) in order to guarantee the real-time performance.

3.4 Results

Maximum number of streams


Figure 4 shows the results from the simulations. The solid lines correspond to a
system with extended deadlines (deadlines = 2p), and the dashed lines are for
the system where deadlines are equal to the request periods.

It is observed that deferring deadlines improves the number of supportable
streams significantly for all the scheduling policies. The performance
improvement ranges from 4 streams for CSCAN to 9 streams for SCAN-EDF at a
request size of 1 track.

When deadlines are deferred, CSCAN has the best performance. SCAN-EDF
has performance very close to CSCAN. EDF has the worst performance. EDF
scheduling results in random disk arm movement and this is the reason for poor

Figure 4  Performance of different scheduling policies (maximum number of streams versus request size in tracks, for EDF, CSCAN, and SCAN-EDF with extended and nonextended deadlines).

performance of this policy. Figure 4 clearly shows the advantage of utilizing
seek optimization techniques.

Figure 4 also presents the improvements that are possible by increasing the
request size. As the request size is increased from 1 track to 15 tracks, the
number of supportable streams keeps increasing. The knee of the curve seems
to be around 5 tracks or 200 kbytes. At larger request sizes, the different
scheduling policies make relatively less difference in performance. At larger
request sizes, the transfer time dominates the service time. When seek time
overhead is a smaller fraction of service time, the different scheduling policies
have less scope for optimizing the schedule. Hence, all the scheduling policies
perform equally well at larger request sizes.

At a request size of 5 tracks, i.e., 200 kbytes per buffer, a minimum of 2 buffers
per stream corresponds to 400 kbytes of buffer space per stream. This results in
a demand of 400 kbytes * 20 = 8 Mbytes of buffer space at the I/O system for
supporting 20 streams. If deadlines are deferred, this corresponds to a
requirement of 12 Mbytes. When such an amount of buffer space is not
available, smaller request sizes need to be considered.
sizes need to be considered.

At smaller request sizes, deferring the deadlines has a better impact on per-
formance than increasing the request size. For example, at a request size of 1
track and deferred deadlines (with buffer requirements of 3 tracks) EDF sup-
ports 13 streams. When deadlines are not deferred, at a larger request size of
2 tracks and buffer requirements of 4 tracks, EDF supports only 12 streams. A
similar trend is observed with other policies as well. A similar observation can
be made when request sizes of 2 and 5 tracks are compared.

Aperiodic response time


Figure 5 shows the response time for aperiodic requests. The figure shows
the aperiodic response time when 8, 12, 15, and 18 real-time streams are being
supported in the system at request sizes of 1, 2, 5, and 15 tracks respectively.
It is observed that CSCAN has the worst performance and SCAN-EDF has
the best performance. With CSCAN, on average, an aperiodic request has
to wait for half a sweep for service. This may result in waiting behind half
the number of real-time requests. In SCAN-EDF and EDF, aperiodic requests
are given higher priorities by giving them shorter deadlines (100 ms from the
issuing time). In these strategies, requests with shorter deadlines get higher
priority. As a result, aperiodic requests typically wait behind only the current
request that is being served. Among these three policies, the slightly better
performance of SCAN-EDF is due to its lower arm utilization.

From Figures 4 and 5, it is seen that SCAN-EDF performs well under both
measures of performance. CSCAN performs well in supporting real-time requests
but does not have very good performance in serving the aperiodic requests.
EDF does not perform very well in supporting real-time requests but offers
good response times for aperiodic requests. SCAN-EDF supports almost as
many real-time streams as CSCAN and at the same time offers the best
response times for aperiodic requests. When both performance measures are
considered, SCAN-EDF has the better characteristics.

Effect of aperiodic request arrival


Figure 6 shows the effect of the aperiodic request arrival rate on the number of
sustainable real-time streams. It is observed that the aperiodic request arrival
rate has a significant impact on all the policies. Except for CSCAN, all other
policies support fewer than 5 streams at an inter-arrival time of 25 ms. Figure 6
shows that the inter-arrival time of aperiodic requests should not be below
50 ms if more than 10 real-time streams need to be supported at the disk. CSCAN

Figure 5  Aperiodic response time with different scheduling policies (EDF, CSCAN, and SCAN-EDF).

treats all requests equally, and hence a higher aperiodic request arrival rate only
reduces the time available for the real-time request streams and does not alter
the schedule of service. In the other policies, since aperiodic requests are given
higher priorities, a higher aperiodic request arrival rate results in less efficient
arm utilization due to more random arm movement. Hence, the other policies
see more impact on performance due to a higher aperiodic request arrival rate.

Multiple data rates


Figure 7 shows the performance of various scheduling policies when requests
with different data rates are served. The simulations modeled equal numbers of
streams at three different data rates of 150 kB/sec, 8 kB/sec, and 176 kB/sec,
with aperiodic requests arriving at a mean inter-arrival time of 200 ms. The
performance trends are similar to the earlier results.

A more detailed performance study can be found in [12] where several other
factors such as the impact of a disk array are considered.

Figure 6  Effect of the aperiodic request arrival rate on the number of streams (maximum number of streams versus aperiodic inter-arrival time in ms, for EDF, CSCAN, and SCAN-EDF with extended and nonextended deadlines).

3.5 Analysis of SCAN-EDF


In this section, we present an analysis of the SCAN-EDF policy and show how
request service can be guaranteed. We assume that the disk seek time can be
modeled by the equation s(m) = s0 + m * s1, where s(m) is the seek time for m
tracks and s0 is the minimum seek time. This equation assumes that the seek
time is a linear function of the number of tracks. This is a simplifying
assumption to make the analysis easy (in the simulations earlier, we used the
actual measured seek function of one of the IBM disks). The value of s1 can be
chosen such that the seek time function s(m) gives an upper bound on the actual
seek time. Let M denote the number of tracks on the disk and T the track
capacity. We will denote the required data rate for each stream by C. We also
assume that the disk requests are issued at a varying rate, but always in
multiples of the track capacity. Let kT be the request size. Since C is the
required data rate for each stream, the period for a request stream is p = kT/C.
If r denotes the data rate of the disk in bytes/sec, r = T/(rotation time). The
disk is assumed to employ split-access operation and hence incurs no latency
penalty. This analysis assumes that there are no aperiodic requests. These
assumptions are made so that we can establish an upper bound on performance.

Figure 7  Performance of various policies with multiple data rates (maximum number of streams for EDF, CSCAN, and SCAN-EDF).

SCAN-EDF serves requests in batches. Each batch is served in a scan order
for meeting a particular deadline. We assume that a batch of n requests is
uniformly placed over the disk surface. Hence the seek time cost for a complete
sweep of n requests can be given by s1 * M + n * s0. This assumes that the
disk arm sweeps across all the M tracks in serving the n requests. The read
time cost for n requests is given by n * kT/r. The total time for one sweep is
the time taken for serving the n requests plus the time taken to move the disk
arm back from the innermost track to the outermost track. This innermost-to-
outermost seek takes s0 + M * s1 time. Hence, the total time for serving one
batch of requests is given by Q = (n * s0 + M * s1 + n * kT/r) + s0 + M * s1
= n * (s0 + kT/r) + 2M * s1 + s0. The worst case for a single stream results
when its request is the first request to be served in one batch and the last
request to be served in the next batch of requests. This results in roughly 2Q
time between serving two requests of a stream. This implies that the maximum
number of streams n is obtained when p = 2Q, or
n = (kT/C - 4M * s1 - 2 * s0) / (2 * (s0 + kT/r)).
However, this bound can be improved if we allow deadline extension. If we
allow the deadlines to be extended by one period, the maximum number of
streams n is obtained when n = (kT/C - 2M * s1 - s0) / (s0 + kT/r).
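
A small sketch of these two bounds follows. The disk parameters are those of Table 1; since Table 1 lists the seek cost function as nonlinear, the linear slope s1 used here is an assumed, illustrative value.

T = 84 * 512            # track capacity in bytes (sectors/track * sector size)
ROTATION = 0.0111       # time for one rotation, seconds
r = T / ROTATION        # disk data rate in bytes/sec (roughly 3.8 MB/sec)
M = 2577                # cylinders, used here as the seek-distance bound
s0 = 0.001              # minimum seek time, seconds
s1 = 6.6e-6             # assumed linear seek cost per track, seconds
C = 150 * 1024          # required stream data rate, bytes/sec

def max_streams(k, extended):
    """Bound on the number of streams for a request size of k tracks."""
    period = k * T / C                 # p = kT/C
    read = k * T / r                   # read time of one request, kT/r
    if extended:                       # deadlines extended by one period
        return int((period - 2 * M * s1 - s0) / (s0 + read))
    return int((period - 4 * M * s1 - 2 * s0) / (2 * (s0 + read)))

for k in (1, 2, 5, 15):
    print(k, max_streams(k, True), max_streams(k, False))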

The time taken to serve a batch of requests through a sweep, using SCAN-EDF,
has little variance. The possible variances of individual seek times could add
up to a large variance if the requests were served by a strict EDF policy.
SCAN-EDF reduces this variance by serving all the requests in a single sweep
across the disk surface. SCAN-EDF, by reducing the variance, reduces the time
taken for serving a batch of requests and hence supports a larger number of
streams. This reduction in the variance of the service time for a batch of
requests has a significant impact on improving the service time guarantees.
Larger request sizes and the split-access operation of the disk arm also reduce
the variance in service time by limiting the random, variable components of the
service time to a smaller fraction.

Figure 8 compares the predictions of analysis with results obtained from sim-
ulations for extended deadlines. For this experiment, aperiodic requests were
not considered and hence the small difference in the number of streams sup-
portable by SCAN-EDF from Figure 4. It is observed that the analysis is very
close to the simulation results. The error is within one stream.

Figure 8  Comparison of analysis with simulation results (maximum number of streams versus request size for SCAN-EDF, simulations and analysis).



3.6 Effect of SCSI bus contention


In today's systems, disks are connected to the rest of the system through a
peripheral device bus such as a SCSI bus. To amortize the costs of SCSI
controllers, multiple disks may be connected to the system on a single bus.
The SCSI bus, for example, can support 10 MB/sec (or 20 MB/sec with wider
buses). Since most disks have raw data rates in the range of 3-5 MB/sec, two
to three disks can be attached to a single SCSI bus without affecting the total
throughput of the disks. However, even when the raw data rate of the SCSI bus
is fast enough to support two to three disks, in a real-time environment this
shared bus can add delays to individual transfers and may result in missed
deadlines. To study the effect of SCSI bus contention on the throughput of the
real-time streams in a system, we simulated 3 disks attached to a single bus.
Each of these disks has the same characteristics as described earlier in Table 1.
The raw data rate of these disks is 3.8 MB/sec. This implies that the total
throughput of these disks slightly exceeds the rated bandwidth of the SCSI bus
at 10 MB/sec. However, due to seek and latency penalties paid for each access,
the disks do not sustain the 3.8 MB/sec for long periods of time.

The SCSI bus is a priority-arbitrated bus. If more than one disk tries to transfer
data on the bus, the disk with higher priority always gets the bus. Hence, it is
possible that real-time streams being supported by the lower-priority disks may
get starved if the disk with higher priority continues to transmit data. Better
performance may be obtained with other arbitration policies, such as a
round-robin policy. For multimedia applications, other channels such as IBM's
proposed SSA, which operates as a time-division-multiplexed channel, are more
suitable.

Figure 9 shows the impact of SCSI bus contention on the number of streams
that can be supported. The number of streams supported is less than three
times the real-time request capacity of an individual disk. This is mainly due
to the contention on the bus. At a request size of 5 tracks, the ratio of the
number of streams supported in a three-disk configuration to that of a single-
disk configuration varies from 2.1 in the system with extended deadlines to 1.8
in the system without extended deadlines. This again shows that deadline
extension increases the chances of meeting deadlines, in this case by smoothing
over the bus contention delays. Figure 9 assumes that the numbers of streams
on the three disks differ by at most one. If the higher-priority disk is allowed
to support more real-time streams, the total throughput of real-time streams
out of the three disks would be lower. We observed a sharp reduction in the
number of streams supported at the second and third disks when the number

of streams supported at the first disk is increased even by one. For example, at
a request size of 5 tracks and extended deadlines, SCAN-EDF supported 15, 14
and 14 streams at the three disks but only supported 7 streams at the second
and the third disks when the number is raised to 16 at the first disk.

Figure 9  Performance of the SCAN-EDF policy with SCSI bus contention (number of streams versus request size for a single disk and for multiple disks on one SCSI bus, with extended and nonextended deadlines).

Another key difference that is noted is that with SCSI bus contention, there is
a peak in supportable request streams as the request size is increased. With
larger blocks of transfer, the SCSI bus could be busy for longer periods of time
when a disk with lower priority wants to access the bus and thus causing it to
miss a deadline. From the figure it is found that the optimal request size for a
real-time stream is roughly around 5 tracks.

The optimal request size is mainly related to the relative transfer speeds of the
SCSI bus and the raw disk. When a larger block size is used, disk transfers
are more efficient, but as explained earlier, disks with lower priority see larger
delays and hence are more likely to miss deadlines. When a shorter block is
used, disk transfers are less efficient, but the latency to get access to the SCSI
bus is shorter. This tradeoff determines the optimal block size.

Most of the modern disks have a small buffer on the disk arm for storing the
data currently being read by the disk. Normally, the data is filled into this buffer

by the disk arm at the media transfer rate (in our case, 3.8 MB/sec) and
transferred out of this buffer at the SCSI bus rate (in our case, 10 MB/sec).
If this arm buffer is not present, the effective data rate of the SCSI bus will be
reduced to the media transfer rate or lower. When the disk arm buffers are
present, SCSI transfers can be initiated by the individual disks in an intelligent
fashion such that the SCSI data rate can be kept high while ensuring that the
individual transfers complete across the SCSI bus as they are completed at the
disk surface. IBM's Allicat drive utilizes this policy for transferring data in and
out of its 512-kbyte arm buffer, and this is what is modeled in our simulations.
Without this arm buffer, when multiple disks are configured on a single SCSI
bus, the real-time performance will be significantly lower.

4 NETWORK SCHEDULING
We will assume that time is divided into a number of 'slots'. The length of a
slot is roughly equal to the average time taken to transfer a block of movie over
the multiprocessor network from a storage node to a network node. Average
delivery time itself is not enough in choosing a slot; we will comment later on
how to choose the size of a slot. Each storage node starts transferring a block
to a network node at the beginning of a slot and this transfer is expected to
finish by the end of the slot. It is not necessary for the transfer to finish strictly
within the slot but for ease of presentation, we will assume that a block transfer
completes within a slot.

The time taken for the playback of a movie block is called a frame. The length
of the frame depends on the block size and the stream rate. For a block size of
256 Kbytes and a stream rate of 200 Kbytes/sec, the length of a frame equals
256/200 = 1.28 seconds. We will assume that a basic stream rate of MPEG-1
quality at 1.5 Mbits/sec is supported by the system. When higher stream rates
are required, multiple slots are assigned within a frame to achieve the required
delivery rate for that stream. It is assumed that all the required rates are
supported by transferring movie data in a standard block size (which is also
the striping size).

For a given system, the block size is chosen first. For a given basic stream
rate, the frame length is then determined. Slot width is then approximated by
dividing the block size by the average achievable data rate between a pair of
nodes in the system. This value is adjusted for variations in communication
delay. Also, we require that frame length be an integer multiple of the slot

width. From here, we will refer to the frame length in terms of the number of
slots per frame, F.
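
The slot and frame sizing can be sketched as follows. The block size and basic stream rate are those used in this section; the average node-to-node transfer rate is an assumed figure (comparable to the 40 MB/sec link speed mentioned later in the clock synchronization discussion).

import math

BLOCK = 256 * 1024                  # striping block size, bytes
STREAM_RATE = 200 * 1024            # basic stream rate, bytes/sec
NODE_TO_NODE_RATE = 40 * 1024**2    # assumed average transfer rate, bytes/sec

frame = BLOCK / STREAM_RATE         # playback time of one block: 1.28 s
slot = BLOCK / NODE_TO_NODE_RATE    # approximate transfer time of one block

F = math.floor(frame / slot)        # slots per frame (frame = F * slot)
slot = frame / F                    # slot width adjusted upward slightly

print(F, round(slot * 1000, 2), "ms per slot")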

The complete schedule of movies in the system can be shown by a table, as
shown in Figure 10. The example system has 4 nodes, 0, 1, 2, and 3, and
contains 5 movies A, B, C, D, and E. The distribution of movies A, B, C, D, E
across the nodes 0, 1, 2, and 3 is shown in Figure 10(a). For example, movie E
is distributed cyclically across nodes in the order 2, 1, 0, and 3. For this
example, we will assume that the frame length F = 3. Now, if movie E needs to
be scheduled at node 0, data blocks need to be communicated from nodes 2,
1, 0, and 3 to node 0 in different slots. This is shown in Figure 10(b), where
the movie is started in slot 0. Figure 10(c) shows a complete schedule of 4
requests for movies E, C, B, and E that arrived in that order at nodes 0, 1, 2,
and 3 respectively. Each row in the schedule shows the blocks received by a node
in different time slots. The entries in the table indicate the movie and the id
of the sending node. Each column should not have a sending node listed more
than once, since that would constitute a conflict at the sender. A movie stream
has its requests listed horizontally in a row. The blocks of a single stream are
always separated by F slots, in this case F = 3. Node 0 schedules its movie
to start in time slot 0. But node 1 cannot start its movie stream in slot 0, as
it conflicts with node 0 for requesting a block from the same storage node 2,
so it starts in slot 1. Node 2 can also schedule its movie in slot 1, since movie
B starts at a different storage node. Node 3 can only schedule its movie in
slot 2. Each request is scheduled in the earliest available slot. A movie stream
can be started in any column of the table as long as its blocks do not conflict
with the already scheduled blocks. The schedule table is wrapped around, i.e.,
slot 0 is the slot immediately after slot 11. For example, if another request
arrives for movie E at node 2, we can start that request in time slot 3, and
schedule its requests in a wrap-around fashion in time slots 6, 9, and 0 without
any conflict at the source and the destination. The schedule table has FN slots,
where N is the number of storage nodes in the system.

When the system is running at its capacity, each column would have an entry for
each storage node. The schedule in slot j can be represented by a set (n_ij, s_ij)
of network node and storage node pairs involved in a block transfer in slot j.
If we specify F such sets for the F slots in a frame (j = 1, 2, ..., F), we
completely specify the schedule. If a movie stream is scheduled in slot j of a
frame, then it is necessary to schedule the next block of that movie in slot j of
the next frame (i.e., in slot (j + F) mod FN) as well. Once the movie
distribution is given, the schedule of transfers (n_ij, s_ij) in slot j of one frame
automatically determines the pairs (n_ij, s_ij) in the next frame,
s_{i,(j+F) mod FN} being the storage node storing the next block of this movie
and n_{i,(j+F) mod FN} = n_ij. Hence, given a starting entry in the table (row
and column specified), we can

10(a). Movie distribution.

Movie/Blocks   0   1   2   3
A              0   1   2   3
B              1   3   0   2
C              2   0   3   1
D              3   2   1   0
E              2   1   0   3

10(b). Schedule for movie E.

Slot   0    1    2    3    4    5    6    7    8    9    10   11
       E.2            E.1            E.0            E.3

10(c). Complete schedule.

Req    0    1    2    3    4    5    6    7    8    9    10   11
0      E.2            E.1            E.0            E.3
1           C.2            C.0            C.3            C.1
2           B.1            B.3            B.0            B.2
3                E.2            E.1            E.0            E.3

Figure 10  An example movie schedule.



immediately tell what other entries are needed in the table. It is observed that
the F slots in a frame are not necessarily correlated to each other. However,
there is a strong correlation between two successive frames of the schedule, and
this correlation is determined by the data distribution. It is also observed that
the length of the table (FN slots) is equal to the number of streams that the
whole system can support.

Now, the problem can be broken into two pieces: (a) Can we find a data
distribution such that, given an assignment of (n_ij, s_ij) that is source and
destination conflict-free, it produces a source and destination conflict-free
schedule in the same slot j of the next frame? and (b) Can we find a data
distribution such that, given an assignment of (n_ij, s_ij) that is source,
destination and network conflict-free, it produces a source, destination and
network conflict-free schedule in the same slot j of the next frame? The second
part of the problem, (b), depends on the network of the multiprocessor, and that
is the only reason for addressing the problem in two stages. We will propose a
general solution that addresses (a). We then tailor this solution to suit the
multiprocessor network to address problem (b).

4.1 Proposed solution

Part (aJ
Assume that all the movies are striped among the storage nodes starting at node
0 in the same pattern, i.e., block i of each movie is stored on the storage node
given by i mod N, N being the number of nodes in the system. Then, a movie
stream accesses storage nodes in a fixed sequence once it is started at node 0.
If we can start the movie stream, it implies that the source and the destination
do not collide in that time slot. Since all the streams follow the same sequence
of source nodes, when it is time to schedule the next block of a stream, all the
streams scheduled in the current slot request a block from the next storage node
in the sequence and hence do not have any conflicts. In our notation, a set
(n_ij, s_ij) in slot j of a frame is followed by the set (n_ij, (s_ij + 1) mod N)
in the same slot j of the next frame. It is clear that if (n_ij, s_ij) is source
and destination conflict-free, (n_ij, (s_ij + 1) mod N) is also source and
destination conflict-free.

This simple approach makes movie distribution and scheduling straightforward.
However, it does not address the communication scheduling problem. It also
has the following drawbacks: (i) not more than one movie can be started in
Scheduling in Multimedia Systems 279

any given slot. Since every movie stream has to start at storage node 0, node 0
becomes a serial bottleneck for starting movies. (ii) When short movie clips
are played along with long movies, the short clips increase the load on the first
few nodes in the storage node sequence, resulting in non-uniform loads on the
storage nodes. (iii) As a result of (i), the latency for starting a movie may be
high if the request arrives at node 0 just before a long sequence of scheduled
busy slots.

The proposed solution addresses all of the above issues (i), (ii), and (iii) as well
as the communication scheduling problem. The proposed solution uses one
sequence of storage nodes for storing all the movies, but it does not stipulate
that every movie start at node 0. We allow movies to be distributed across the
storage nodes in the same sequence, but with different starting points. For
example, movie 0 can be distributed in the sequence 0, 1, 2, ..., N-1; movie 1
can be distributed in the sequence 1, 2, 3, ..., N-1, 0; and movie k (mod N) can
be distributed in the sequence k, k+1, ..., N-1, 0, ..., k-1. We can choose any
such sequence of storage nodes, with different movies having different starting
points in this sequence.

When movies are distributed this way, we achieve the following benefits: (i)
Multiple movies can be started in a given slot. Since different movies have
different starting nodes, two movie streams can be scheduled to start at their
starting nodes in the same slot. We no longer have the serial bottleneck at the
starting node (we actually do, but only for 1/Nth of the content on the server).
(ii) Since different movies have different starting nodes, even when the system
has short movie clips, all the nodes are likely to see similar workloads and hence
the system is likely to be better load-balanced. (iii) Since different movies have
different starting nodes, the latency for starting a movie is likely to be lower,
since the requests are likely to be spread out more evenly.

The benefits of the above approach can be realized on any network. Again, if
the set (n_ij, s_ij) is source and destination conflict-free in slot j of a frame,
then the set (n_ij, (s_ij + 1) mod N) is guaranteed to be source and destination
conflict-free in slot j of the next frame, whether or not all the movies start at
node 0. As mentioned earlier, it is possible to find many such distributions. In
the next section, it will be shown that we can pick a sequence that also solves
problem (b), i.e., guarantees freedom from conflicts in the network.
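
A small sketch of this distribution rule follows; the node count and movie numbering are illustrative.

N = 4                                # storage nodes, as in the example of Figure 10

def storage_node(movie_id, block):
    start = movie_id % N             # per-movie starting node
    return (start + block) % N

for movie in range(3):
    print(movie, [storage_node(movie, b) for b in range(8)])
# movie 0 -> 0,1,2,3,... ; movie 1 -> 1,2,3,0,... ; movie 2 -> 2,3,0,1,...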

Part (b)
The issues addressed in this section are specific to the network of the system.
We will use IBM's SP2 multiprocessor with an Omega interconnection network
as an example multiprocessor. The solution described is directly applicable
to hypercube networks as well. The same technique can be employed to find
suitable solution for other networks. We will show that the movie distribu-
tion sequence can be carefully chosen to avoid communication conflicts in the
multiprocessor network. The approach is to choose an appropriate sequence of
storage nodes such that if movie streams can be scheduled in slot j of a frame
without communication conflicts, then the consecutive blocks of those streams
can be scheduled in slot j of the next frame without communication conflicts.

With our notation, the problem is to determine a sequence of storage nodes
s_0, s_1, ..., s_{N-1} such that, given a set of nodes (n_ij, s_ij) that is source,
destination and network conflict-free, it is automatically guaranteed that the set
of nodes (n_ij, s_{((i+1) mod N)j}) is also source, destination and network
conflict-free.

First, let us review the Omega network. Figure 11 shows a multiprocessor
system with 16 nodes which are interconnected by an Omega network constructed
out of 4x4 switches. To route a message from a source node whose address is
given by p0p1p2p3 to a destination node whose address is given by q0q1q2q3,
the following procedure is employed: (a) shift the source address left circularly
by two bits (the log of the switch size) to produce p2p3p0p1, (b) use the switch
in that stage to replace p0p1 with q0q1, and (c) repeat the above two steps for
the next two bits of the address. In general, steps (a) and (b) are repeated as
many times as the number of stages in the network. Network conflicts arise in
step (b) of the above procedure when messages from two sources need to be
switched to the same output of a switch.

Now, let us address our problem of guaranteeing freedom from network conflicts
for the set (n_ij, s_{((i+1) mod N)j}) given that the set (n_ij, s_ij) is
conflict-free. Our result is based on the following theorem of Omega networks.
Theorem: If a set of nodes (n_i, s_i) is network conflict-free, then the set of
nodes (n_i, (s_i + a) mod N) is network conflict-free, for any a.
Proof: Refer to [7].
The above theorem states that given a network conflict-free schedule of
communication, a uniform shift of the source nodes yields a network conflict-free
schedule.

Figure 11  A 16-node Omega network used in IBM's SP2 multiprocessor (nodes 00 through 15, addresses 0000 through 1111, interconnected by stages of 4x4 switches).

There are several possibilities for choosing a storage sequence that guarantees
the above property. The sequence 0, 1, 2, ..., N-1 is one of the valid sequences,
a simple solution indeed. Let us look at an example. The set S1 = (0,0),
(1,1), (2,2), ..., (14,14), (15,15) of network-storage node pairs is conflict-free
over the network (identity mapping). From the above theorem, the set S2 = (0,1),
(1,2), (2,3), ..., (14,15), (15,0) is also conflict-free, and can be so verified. If S1
is the conflict-free schedule in a slot j, S2 will be the schedule in slot j of the
next frame, which is also conflict-free.

We have shown in this section that a simple round-robin distribution of movie
blocks in the sequence 0, 1, 2, ..., N-1 yields an effective solution for our problem.
This data distribution with different starting points for different movies solves
(a) the movie scheduling problem, (b) the load balancing problem, (c) the
problem of long latencies for starting a movie, and (d) the communication
scheduling problem.

Now, the only question that remains to be addressed is how we schedule the
movie stream in the first place, i.e., in which slot a movie should be started.
When a request arrives at a node n_i, we first determine its starting storage node
s_0 based on the movie distribution. We look at each available slot j (where
n_i is free and s_0 is free) to see whether the already scheduled movies conflict
for communication with this pair. We search until we find such a slot and
schedule the movie in that slot. Then, the complete length of that movie
is scheduled without any conflicts.
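
The slot search can be sketched as follows, checking only source and destination conflicts (network conflicts are avoided by the data distribution, as argued above). The data structures are illustrative; the movie sequences are those of Figure 10(a), and the search reproduces the start slots of Figure 10(c).

N, F = 4, 3                      # storage nodes and slots per frame (Figure 10)
SLOTS = F * N                    # length of the schedule table
busy_src = [set() for _ in range(SLOTS)]   # sending nodes in use per slot
busy_dst = [set() for _ in range(SLOTS)]   # receiving nodes in use per slot

def find_slot(net, seq):
    """Earliest slot where network node `net` can start a movie whose blocks
    come from the storage nodes in `seq`, one block every F slots."""
    for slot in range(SLOTS):
        ok = all(seq[i] not in busy_src[(slot + i * F) % SLOTS] and
                 net not in busy_dst[(slot + i * F) % SLOTS]
                 for i in range(len(seq)))
        if ok:
            for i in range(len(seq)):
                t = (slot + i * F) % SLOTS
                busy_src[t].add(seq[i])
                busy_dst[t].add(net)
            return slot
    return None

movies = {"E": [2, 1, 0, 3], "C": [2, 0, 3, 1], "B": [1, 3, 0, 2]}
for net, m in [(0, "E"), (1, "C"), (2, "B"), (3, "E")]:
    print(net, m, find_slot(net, movies[m]))    # start slots: 0, 1, 1, 2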

4.2 Other issues

Choosing a slot size


Ideally, we would like all block transfers to complete within a slot. However,
due to variations in delivery time (caused by variations in load and contention in
the network), all the block transfers may not finish in the slot in which they are
initiated. One option is to choose the slot to be large enough that it
accommodates the maximum delivery time for a block. This approach, however,
may not use the network effectively, since it allocates a larger amount of time
than the average delivery time for a block. If the slot is chosen to be the average
delivery time, how do we deal with the transfers that take longer than the
average delivery delay?

Figure 12 shows some results from simulation experiments on a 256-node
4-dimensional torus network with 100 MB/s link transfer speeds. These results
are presented only as an example; similar results have to be obtained for the
network under consideration. In the simulations, block arrival rates are varied
up to the point where the deadlines for the block transfers can still be met by
the network. The figure shows the average time taken for message delivery and
the maximum block delivery time at different request arrival times. It is
observed that the average message delivery time is nearly constant and varies
from 2.8 ms to 2.89 ms over the considered range of arrival times. However, the
maximum delay observed by a block transfer goes up from 5.3 ms to 6.6 ms.
Even though the average message completion time does not vary significantly
over the considered range of arrival rates, the maximum delays are observed to
have a higher variation. If we were to look only at the average block transfer
times, we might have concluded that it is possible to push the system throughput
further, since the request inter-arrival time of 4 ms is still larger than the
average block transfer delay of 2.89 ms. If we were to look only at the maximum
block transfer times, we would have concluded that we could not reduce the
inter-arrival times to below 6 ms. However, the real objective of not missing
any deadlines forced

us to choose a different peak operating point of 4 ms of inter-arrival time (slot
width).

Figure 12  Observed delays in a 4-dimensional 256-node system (maximum and average block delivery time versus request inter-arrival time).

It is clear from the above description that we need to carry out some experi-
ments in choosing the optimal slot size. Both the average and the maximum
delays in transferring a block over the network need to be considered. As men-
tioned earlier, the slot size is then adjusted such that a frame is an integer
multiple of the width of the slot. Since the block transfers are carefully sched-
uled to avoid conflicts, it is expected that the variations in communication
times will be lower in our system.

Different stream rates


When the stream rate is different from the basic stream rate, multiple slots
are assigned within a frame to that stream to achieve the required stream rate.
For example, to realize a 3 Mbits/sec stream rate, 2 slots are assigned to the
same stream within a frame. These two slots are scheduled as if they were two
independent streams. When the required stream rate is not a multiple of the
basic stream rate, a similar method can be utilized, with the last slot of that
stream not necessarily transferring a complete block.
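
The rule reduces to a ceiling division; the basic rate below is the 1.5 Mbits/sec MPEG-1 rate assumed earlier, and the other rates are illustrative.

import math

BASIC_RATE = 1.5e6                  # basic stream rate, bits/sec (MPEG-1)

def slots_per_frame(stream_rate_bps):
    return math.ceil(stream_rate_bps / BASIC_RATE)

for rate in (1.5e6, 3e6, 4e6, 6e6):
    print(rate, slots_per_frame(rate))   # 1, 2, 3 (last slot partly used), 4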

Reducing the stream startup latency


It is possible that when a stream A is requested, the next slot where this stream
could be started is far away in time, resulting in a large startup latency. In such
cases, if the resulting latency is beyond a certain threshold, an already scheduled
stream B may be moved around within a frame to reduce the requested stream's
latency. If stream B is originally scheduled at time T, then stream B can be
moved to any free slot within T + F - 1 while maintaining guarantees on its
deadlines. Figure 13 shows the impact of such a strategy on the distribution of
startup latencies.

Figure 13  An example of the effectiveness of latency reduction techniques (distribution of startup latencies with and without latency reduction).

When network nodes and storage nodes are different


It is possible to find mappings of network nodes and storage nodes to the mul-
tiprocessor nodes that guarantee freedom from network conflicts. For example,
assigning the network nodes the even addresses and the storage nodes the odd

addresses in the network, and distributing the movies in round-robin fashion
among the storage nodes, yields similar guarantees in an Omega network.

Node failures
Before we can deal with the subject of scheduling after a failure, we need to
discuss how the data on the failed node is duplicated elsewhere in the system.
There are several ways of handling data protection, RAID and mirroring being
two examples. RAID increases the load on the surviving disks by 100%, and this
will not be acceptable in a system that has to meet real-time guarantees unless
the storage system can operate well below its peak operating point. Mirroring
may be preferred because the required bandwidths from the data stored in the
system are high enough that the entire storage capacity of a disk drive may not
be utilized; the unutilized capacity can be used for storing a second copy of
the data. We will assume that the storage system does mirroring. We will also
assume that the mirrored data of a storage node is evenly spread among some
set of K, K < N, storage nodes.

Let the data on the failed node f0 be mapped to nodes m0, m1, ..., m_{K-1}.
Before the failure, a stream may request blocks from nodes 0, 1, 2, ..., f0, ...,
N-1 in a round-robin fashion. The mirrored data of a movie is distributed
among m0, m1, ..., m_{K-1} such that the same stream would request blocks in
the following order after a failure: 0, 1, 2, ..., m0, ..., N-1, 0, 1, 2, ..., m1, ...,
N-1, ..., 0, 1, 2, ..., m_{K-1}, ..., N-1, 0, 1, 2, ..., m0, ..., N-1. The blocks that
would have been requested from the failed node are requested from the set of
mirror nodes of that failed node in a round-robin fashion. With this model, a
failure increases the load on the mirrored set of nodes by a factor of (1 + 1/K),
since for every request to the failed node, a node in the set of mirror nodes
observes 1/K requests. This implies that K should be as large as possible to
limit the load increase on the mirror nodes.

Scheduling is handled in the following way after a failure. In the schedule
table, we allow I slots to be free. When the system has no failures, the system
is essentially idle during these I slots. After a failure, we use these slots to
schedule the communication of the movie blocks that would have been served by
the failed node. A data transfer (n_i, f0) between a failed node f0 and a network
node n_i is replaced by another transfer (n_i, m_i), where m_i is the storage
node that has the mirror copy of the block that should have been transferred in
(n_i, f0). If we can pack all the scheduled communication with the mirror nodes
into the available free slots, with some appropriate buffer management, then

we can serve all the streams that we could serve before the failure. Now, let's
examine the conditions that will enable us to do this.

Given that the data on the failed node is now supported by K other nodes, the
total number of blocks that can be communicated in I free slots is K * I.
The failed node could have been busy during (FN - I) slots before the failure.
This implies that K * I >= FN - I, or I >= FN/(K + 1).  (1)

It is noted that no network node n_i can require communication from the failed
node f0 in more than (FN - I)/N slots. Under the assumption of system-wide
striping, once a stream requests a block from a storage node, it does not
request another block from the same storage node for another N - 1 frames.
Since each network node can support at most (FN - I)/N streams before the
failure, no network node requires communication from the failed node f0 in
more than (FN - I)/N slots. Since every node is free during the I free slots,
the network nodes require that I >= (FN - I)/N, or I >= FN/(N + 1).  (2)
The above condition (1) is more stringent than (2).

Ideally, we would like K = N - 1, since this minimizes the load increase on the
mirror nodes. Also, we would like to choose the mirror data distribution such
that if the block transfers from the mirror nodes are guaranteed to be conflict-
free during a free slot j, then they will also be conflict-free in slot j + FN (the
same free slot in the next schedule table), when the transfers would require
data from the next node in the mirror set. In our notation, if the set (n_i, m_i)
is conflict-free in a free slot j, then we would like the set (n_i, m_{(i+1) mod K})
to be conflict-free in slot j + FN.

The schedule of block transfers during the free slots is constructed as follows. A
maximal number of block transfers is found that do not have conflicts in the
network. This set is assigned one of the free slots. With the remaining set of
required block transfers, the above procedure is repeated until all the
communication is scheduled. This algorithm is akin to the problem of finding a
minimal set of matchings of a graph such that the union of these matchings
yields the graph.

We can show an upper bound on the number of free slots required. We can show
that at least 4 blocks can always be transferred without network conflicts, as
long as the sources and destinations have no conflicts, when the Omega network
is built out of 4x4 switches. If a set of four destinations is chosen such that they
differ in the most significant 2 bits of the address, it can be shown that as long
as the sources and destinations are distinct, the block transfers do not collide
in the network. The proof is based on the procedure for switching a block from
a source to a destination: if the destinations are so chosen, it can be shown
that these four transfers use different links in the network. Since at most FN - I
blocks need to be transferred during the free slots, it suffices to have
4 * I >= FN - I, which gives I >= FN/5. This implies that if the network nodes
requiring communication from the failed node are equally distributed over all
the nodes in the system, we can survive a storage node failure with about 20%
overhead.
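
The two free-slot conditions derived above can be evaluated directly; F, N, and K below are illustrative values.

import math

def free_slot_bounds(F, N, K):
    capacity = math.ceil(F * N / (K + 1))   # condition (1): mirror-set capacity
    packing = math.ceil(F * N / 5)          # Omega bound: 4 transfers per slot
    return capacity, packing

F, N = 3, 16
for K in (3, 7, 15):
    print(K, free_slot_bounds(F, N, K))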

Network node failures can be handled in the following way. The movie streams
at the failed node are rerouted (redistributed) evenly to the other network nodes
in the system. This assumes that the delivery site can be reached through any
one of the network nodes. The redistributed streams are scheduled as if the
requests for these streams (with a starting point somewhere over the length of
the movie, not necessarily at the beginning) are new requests.

If a combo node fails, both of the above procedures, for handling the failure of a
storage node and of a network node, need to be invoked.

Clock Synchronization
Throughout this section, it is assumed that the clocks of all the nodes in the
system are somehow synchronized and that the block transfers can be started
at the slot boundaries. If the link speeds are 40 MB/sec, a block transfer of
256 Kbytes requires 6.4 ms, quite a large period of time compared to the
precision of the node clocks, which tick every few nanoseconds. If the clocks
are synchronized to drift at most, say, 600 µs, the nodes observe the slot
boundaries within ±10%. During this drift interval, it is possible that the block
transfers experience collisions in the network; but during the remaining 90% of
the transfer time, the block transfers take place without any contention over
the network. This shows that the clock synchronization requirements are not
very strict. It is possible to synchronize clocks to such a coarse level by
broadcasting a small packet of data at regular intervals to all the nodes through
the switch network.
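The figures above can be checked with a few lines of arithmetic; decimal units
(1 MB = 10^6 bytes, 1 Kbyte = 10^3 bytes) reproduce the 6.4 ms quoted in the text,
and the numbers below are taken directly from the paragraph.

    # Slot length and tolerable clock skew, using the numbers quoted above.
    link_speed = 40e6             # bytes per second
    block_size = 256e3            # bytes per block
    slot_time = block_size / link_speed
    print(slot_time * 1e3)        # ~6.4 ms per block transfer

    max_drift = 600e-6            # seconds of allowed clock skew
    print(max_drift / slot_time)  # ~0.094, i.e. slot boundaries observed within ~10%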

Other Interconnection Networks


The proposed solution may be employed even when the multiprocessor system
is interconnected by a network other than an Omega network. To guarantee
conflict-free transfers over the network, appropriate data distributions for those
networks have to be designed. For hypercube-type networks that can emulate
an Omega network, the same data distribution provides guarantees similar to
those in the Omega network. It can be shown that if movie blocks are distributed
uniformly over all nodes in a hypercube in the same order 0, 1, 2, ..., n - 1 (with different
starting nodes), a conflict-free schedule in one slot guarantees that the set of
transfers required a frame later would also be conflict-free.

For other lower-degree networks, such as a mesh or a two-dimensional torus,
it can be shown that similar guarantees cannot be provided. For example, in
a two-dimensional n x n torus, the average path length of a message is
2 x n/4 = n/2. Given that the system has a total of 4n² unidirectional links,
the average number of transmissions that can be in progress simultaneously is
4n²/(n/2) = 8n, which is less than the number of nodes n² in the system for
n > 8. However, n² simultaneous transfers are possible in a two-dimensional
torus when each node sends a message to a neighboring node along a ring. If
this is the starting position of data transfer in one slot, data transfer in the
following frames cannot be sustained because of the above limitation on the
average number of simultaneous transfers through the network. In such
networks, it may be advantageous to limit the data distribution to a part of
the system so as to limit the average path length of a transfer and thus
increase the number of sustainable simultaneous transfers.
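The link-capacity argument can be restated as a small calculation; the function
below simply encodes the formulas from the paragraph above.

    def sustainable_transfers(n):
        """Average number of simultaneous transfers an n x n torus can sustain."""
        links = 4 * n * n        # unidirectional links in the torus
        avg_path = n / 2         # average hops per message (2 x n/4)
        return links / avg_path  # = 8n

    for n in (4, 8, 16):
        # Compare the sustainable transfer count 8n with the node count n^2.
        print(n, sustainable_transfers(n), n * n)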

Incremental growth
How does the system organization change if we need to add more disks for
putting more movies in the system? In our system, all the disks are filled
nearly to the same capacity since each movie gets distributed across all the
nodes. If more disk capacity is required, we would require that at least one
disk be added at each of the nodes. If the system has N nodes, this would
require N disks. The newly added disks can be used as a set to distribute
movies across all the nodes to obtain similar guarantees for the new movies
distributed across these nodes. If the system size N is large, this may pose a
problem. In such a case, it is possible to organize the system such that movies
are distributed across a smaller set of nodes. For example, the movies can be
distributed across the two sets 0, 2, 4, 6 and 1, 3, 5, 7 in an 8-node machine
to provide similar guarantees as when the movies are distributed across all the
8 nodes in the system. (This result is again a direct consequence of the above
Theorem 1.) In this example, we only need to add 4 new disks for expansion
as opposed to adding 8 disks at once. This idea can be generalized to provide
a unit of expansion of K disks in an N-node system, where K is a factor of N.
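A minimal sketch of this placement policy is given below, assuming a simple
round-robin mapping of a movie's blocks onto a chosen node subset; the function
name and interface are illustrative only.

    def block_to_node(block, subset):
        """Place consecutive blocks of a movie round-robin over a node subset.

        subset: e.g. (0, 2, 4, 6) or (1, 3, 5, 7) in an 8-node system, so that
        an expansion can add disks to only those nodes.
        """
        return subset[block % len(subset)]

    # Movies added after a 4-disk expansion use only the even-numbered nodes:
    print([block_to_node(b, (0, 2, 4, 6)) for b in range(8)])  # [0, 2, 4, 6, 0, 2, 4, 6]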

This shows that the width of striping has an impact on the system's incremental
expansion. The wider the movies are striped across the nodes of the system,
the larger the bandwidth to a single movie but also the larger the unit of
incremental growth.

5 GENERAL DISCUSSION

5.1 Admission Control


Admission control is used to make sure that the system is not forced to operate
at a point where it cannot guarantee service to the scheduled streams. Requests
are admitted only up to the point at which the scheduled streams can still be
guaranteed to meet their deadlines. The admission control policy can be based
on analysis or on simulations. Each component of the service can be analyzed,
and the interaction of these components on the total service can be studied.
The analysis presented in Section 3.5 can be used for the disk service component.
The communication component has to be analyzed similarly.

Alternatively, we could determine the maximum number of streams that can be
supported by the system through simulations. After determining the capacity
of the system, we could rate the usable capacity of the system to be a fraction
of that figure to ensure that we do not miss too many deadlines. In a real
system, a number of other factors, such as CPU utilization and multiprocessor
network utilization, have to be considered as well for determining the capacity
of the system. Analyzing all these factors may become cumbersome and may
make simulations the only available method for determining the capacity of the
system.
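A sketch of such a derated admission test is shown below. The derating fraction
of 0.8 is an illustrative value, not one taken from the text, and the capacity
figure would come from the analysis or simulations described above.

    def admit(new_streams, active_streams, measured_capacity, derate=0.8):
        """Admit new streams only while total load stays under a derated capacity.

        measured_capacity: maximum stream count found by analysis or simulation.
        derate: fraction of that capacity actually used, leaving headroom for
                CPU and network effects (0.8 is illustrative).
        """
        usable = int(measured_capacity * derate)
        return active_streams + new_streams <= usable

    print(admit(2, 78, 100))   # True: 80 streams fit under the derated limit of 80
    print(admit(3, 78, 100))   # False: 81 streams exceed the limit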

5.2 Future work


A number of problems in the design of a video-on-demand server require further
study.

We presented a preliminary study of tolerating disk failures in this chapter.


More work needs to be done in this area. If it is not possible to guarantee
precise scheduling in the presence of failures, alternative scheduling strategies
during normal operation may be attractive.

When the system is expanded, the newly added disks may have different per-
formance characteristics than the already installed disks. How do we handle
the different performance characteristics of different disks?

Providing fast-forward and rewind operations has not been discussed in this
chapter. Depending on the implementation, these operations may result in
varying demands on the system. It is possible to store a second version of the
movie sampled at a higher (fast-forward) rate and then compressed on the disk
for handling these operations. Then, fast-forward and rewind operations will
not cause any extra demands on the system resources but will introduce the
problems of scheduling the proper version of the movie at the right time. These
strategies remain to be evaluated.

Acknowledgements
The work reported here has benefited significantly from discussions and inter-
actions with Jim Wyllie and Roger Haskin of IBM Almaden Research Center.

REFERENCES
[1] D.P. Anderson, Y. Osawa, and R. Govindan, "Real-Time Disk Storage
and Retrieval of Digital Audio/Video data", Technical Report UCB/CSD
91/646, University of California, Berkeley, August 1991.
[2] D.P. Anderson, Y. Osawa, and R. Govindan. "A File System for Contin-
uous Media", ACM Transactions on Computer Systems, November 1992,
pp. 311-337.
[3] A. Chervenak, "Tertiary Storage: An Evaluation of New Applications",
Ph.D. Dissertation, University of California, Berkeley, 1994.
[4] H.M. Deitel, "An Introduction to Operating Systems", Addison Wesley,
1984.
[5] R. Haskin, "The Shark Continuous-Media File Server", Proceedings of
IEEE COMPCON, February 1993.

[6] K. Jeffay, D.F. Stanat, and C.D. Martel, "On Non-Preemptive Schedul-
ing of Periodic and Sporadic Tasks", Proceedings of Real-Time Systems
Symposium, December 1991, pp. 129-139.
[7] D.H. Lawrie, "Access and Alignment of Data in an Array Processor",
IEEE Transactions on Computers, Vol. 24, No. 12, December 1975, pp.
1145-1155.
[8] J.P. Lehoczky, "Fixed Priority Scheduling of Periodic Task Sets with Ar-
bitrary Deadlines", Proceedings of Real-Time Systems Symposium, December
1990, pp. 201-212.

[9] T.H. Lin and W. Tarng, "Scheduling Periodic and Aperiodic Tasks in Hard
Real-Time Computing Systems", Proceedings of SIGMETRICS, May 1991,
pp. 31-38.

[10] C.L. Liu and J.W. Layland, "Scheduling Algorithms for Multiprogram-
ming in a Hard Real-Time Environment", Journal of the ACM, 1973, pp. 46-
61.

[11] A.L. Narasimha Reddy, "A Study of I/O System Organizations", Proceed-
ings of Int. Symposium on Computer Architecture, May 1992.

[12] A.L. Narasimha Reddy and J. Wyllie, "Disk Scheduling in a Multimedia


I/O System", Proceedings of ACM Multimedia Conference, August 1992.

[13] W.K. Shih, J.W. Liu, and C.L. Liu, "Modified Rate Monotonic Algorithm
for Scheduling Periodic Jobs with Deferred Deadlines", Technical Report,
University of Illinois, Urbana-Champaign, September 1992.

[14] F.A. Tobagi, J. Pang, R. Baird, and M. Gang, "Streaming RAID: A Disk
Storage System for Video and Audio Files", Proceedings of ACM Multi-
media Conference, August 1993, pp. 393-400.

[15] H.M. Vin and P.V. Rangan, "Designing File Systems for Digital Video
and Audio", Proceedings of 13th ACM Symposium on Operating Systems
Principles, 1991.
[16] J. Yee and P. Varaiya, "Disk Scheduling Policies for Real-Time Multime-
dia Applications", Technical Report, University of California, Berkeley,
August 1992.

[17] P.S. Yu, M.S. Chen, and D.D. Kandlur, "Grouped Sweeping Scheduling
for DASD-Based Multimedia Storage Management", Multimedia Systems,
Vol. 1, 1993, pp. 99-109.
9
VIDEO INDEXING AND
RETRIEVAL
Stephen W. Smoliar and HongJiang Zhang
Institute of Systems Science, National University of Singapore
Singapore

1 INTRODUCTION

1.1 Motivation
Video technology has developed thus far as a technology of images, but little
has been done to help us use those images effectively. We can buy a camera
that "knows" how to focus itself properly or compensate for our inability to
hold it steady without a tripod; but no camera knows "where the action is"
during a football game or even a press conference. A camera shot can give us
a clear image of the ball going through the goal posts, but only if we find the
ball for it.

The effective use of video is beyond our grasp because the effective use of its
content is still beyond our grasp. In this Chapter we shall address four areas
in which software can make the objects of video content more accessible:

Partitioning: We must begin by identifying the elemental index units for


video content. In the case of text, these units are words and phrases, the
entities we find in the index of any book. For video we speak of generic
clips which basically correspond to individual camera shots.

Representation and classification: Once a generic clip has been identified,


it is necessary to represent its content. This assumes that we have an
ontology which embodies our objects of interest and that video content
may be classified according to this ontology, but classification is inherently
problematic. This is because it is fundamentally a subjective act of the
classifying agent [K+91], which makes it very likely that any given video
may be classified according to multiple ontologies.

Indexing and retrieval: One way to make video content more accessible is
to store it in a database [SSJ93]. Thus, there are also problems concerned
with how such a database should be organized, particularly if its records
are to include images as well as text. Having established how material can
be put into a database, we must also address the question of how that
same material can be effectively retrieved, either through directed queries
which must account for both image and text content or through browsing
when the user may not have a particularly focused goal in mind.
Interactive tools: Most of our experiences with video involve sitting and
watching it passively. For video to be an information resource, we shall
need tools which facilitate interacting with it. These tools will make it
more likely that the functionality of the other three areas in this list will
actually be employed.

1.2 Basic Concepts


The tasks of representation and indexing assume that we are working with
material which is structured. Achieving these tasks thus requires characterizing
the nature of this structure. We call such structural analysis parsing [ZLS95]
for its similarity to recognizing the syntactic structure of linguistic utterances.

If the word is the fundamental syntactic unit of language, then the fundamental
unit of video and film is the shot, defined to be "one uninterrupted run of
the camera to expose a series of frames" [BT93]. The shot thus consists of a
sequence of image units, which are the frames. Often, it is desirable to represent
a shot by one or more of its frames which capture its content; such frames are
known as key frames [O'C91].

Much of the analysis of frames and shots is concerned with the extent to which
they are perceptually similar. This means that it is necessary to define some
quantitative representation of qualitative differences. This representation is
called a difference metric [ZKS93].

One approach to representation of content involves identifying some charac-


teristic set of content features, such as color, texture, shape of component ob-
jects, and relationships among edges. Properties of these features may then
be utilized in trying to retrieve images through content-based queries. These
queries are processed most efficiently if the images are indexed according to
quantitative representations of their content features, an organization known
as content-based indexing. However, content-based queries rarely can be pro-
cessed as exactly as conventional alphanumeric queries. The result is more likely
to be a set of suitable candidates than a set of images which exactly match what
the user has specified [F+94]. If this number of candidates is sufficiently large,
the user will also need browsing facilities [ZSW95] to review them quickly and
select those which are closest to what he had in mind.

1.3 State of the Art


There is very little currently available by way of systems which manage video
indexing and retrieval. Systems like Aldus Fetch [See92] can handle video
objects, but retrieval is restricted to searching through relatively brief text
descriptions. A key problem is that the volume of most video collections still
strains the capacity of even "very large" databases. On the other hand image
databases are far more feasible, so currently the most practical way to deal
with video resources is to construct a database from key frames selected from
all video source material.

While there is now extensive experimental work in the development of systems


which support content-based queries, the only general product to have been
released thus far has been the Ultimedia Manager 1.0 from IBM [Sey94]. This
system integrates text annotations of images with queries based on the features
currently handled by QBIC (IBM's Query By Image Content): color, texture,
and shape. On the other hand developers at Virage have decided to concen-
trate on query support for specific applications, anticipating that problems of
classification and indexing will be more manageable if they are constrained by
a given domain model.

2 PARSING

2.1 Techniques

Temporal Segmentation
Assuming that our basic indexing unit is a single uninterrupted camera shot,
temporal segmentation is the problem of detecting boundaries between con-
secutive camera shots. As was observed in Section 1.2, the general approach
to solution has been the definition of a suitable quantitative difference metric
which represents significant qualitative differences between frames. A segment
boundary can then be declared whenever that metric exceeds a given threshold.

One of the most successful of these metrics uses a histogram of intensity levels,
since two frames with similar content will show little difference in their
respective histograms. The histogram is represented as a function H_i(j), where
i is the frame number and j is the code for a specific histogram bin. The simplest
way to define histogram bins is as ranges of intensity values. However, it is also
possible to define bins which correspond to intensity ranges of color components,
making the histogram a somewhat richer representation of color content [NT92].
Regardless of how the bins are defined, the difference between the ith frame and
its successor may be computed as a discrete L1-norm [Rud66] as follows:

SD_i = \sum_{j=1}^{G} |H_i(j) - H_{i+1}(j)|    (9.1)

(G is the total number of histogram bins.)
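In Python, Equation 9.1 and the associated threshold test might look as follows;
the function names and the threshold parameter are illustrative.

    def histogram_difference(h_i, h_next):
        """L1 histogram difference SD_i of Equation 9.1.

        h_i, h_next: histograms of frame i and frame i+1, one count per bin
        (G bins in each).
        """
        assert len(h_i) == len(h_next)
        return sum(abs(a - b) for a, b in zip(h_i, h_next))

    def is_camera_break(h_i, h_next, threshold):
        # A segment boundary is declared when the metric exceeds the threshold.
        return histogram_difference(h_i, h_next) > threshold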

An alternative to L1 histogram comparison is the following χ² metric proposed
in [NT92] for enhancing differences between the frames being compared:

(9.2)

However, experimental results reported in [ZKS93] showed that this also
increases the difference due to camera or object movements. Therefore, the
overall performance is not necessarily better than that achieved by using
Equation 9.1, while Equation 9.2 also requires more computation time.

If a video source is compressed, it would be advantageous to segment that source


directly, saving on the computational cost of decompression and lowering the
overall magnitude of the data which must be processed. Also, elements of a
compressed representation, such as the DCT coefficients and motion vectors
in JPEG and MPEG data streams, are useful features for effective content
comparison [ZLS95]. The pioneering work on image processing based on DCT
coefficients was conducted by Farshid Arman and his colleagues at Siemens
Corporate Research [AHC93]. A subset of the coefficients of a subset of the
blocks of a frame is extracted to construct a vector representation for that
frame:
(9.3)
An L1 difference metric between two frames is then defined as a normalized
inner product:

\Psi = 1 - \frac{|V_i \cdot V_{i+\phi}|}{|V_i| |V_{i+\phi}|}    (9.4)

(φ is the number of frames between the two frames being compared.)
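A small sketch of Equation 9.4, assuming the DCT coefficient vectors of the two
frames have already been extracted and are non-zero:

    import math

    def dct_vector_difference(v_i, v_j):
        """Difference of Equation 9.4 between the coefficient vectors of two
        frames: 1 - |V_i . V_j| / (|V_i| |V_j|).  Assumes non-zero vectors."""
        dot = sum(a * b for a, b in zip(v_i, v_j))
        norm = math.sqrt(sum(a * a for a in v_i)) * math.sqrt(sum(b * b for b in v_j))
        return 1.0 - abs(dot) / norm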

An alternative L1-norm involves comparing the DCT coefficients of corresponding
blocks of consecutive video frames [ZLS95]. More specifically, let c_{l,k}(i) be
a DCT coefficient of block l in frame i, where k is the coefficient index (ranging
from 1 through K); then the content difference of block l in two frames which
are φ frames apart can be measured as:

(9.5)

We can say that a particular block has changed across the two frames if its
difference exceeds a given threshold t:

Diff_l > t    (9.6)

If D(i, i + φ) is defined to be the percentage of blocks which have changed,
then, as with other difference metrics, a segment boundary is declared if

D(i, i + φ) > T_b    (9.7)

where T_b is the threshold for camera breaks.
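The block-change percentage and the boundary test of Equations 9.6 and 9.7 can
be sketched as follows, assuming the per-block differences of Equation 9.5 have
already been computed:

    def changed_block_ratio(block_diffs, t):
        """D(i, i+phi): fraction of blocks whose difference exceeds t (Equation 9.6)."""
        changed = sum(1 for d in block_diffs if d > t)
        return changed / len(block_diffs)

    def is_segment_boundary(block_diffs, t, t_b):
        # Equation 9.7: declare a boundary when the changed-block ratio exceeds T_b.
        return changed_block_ratio(block_diffs, t) > t_b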


In an MPEG data stream motion vectors are predicted and/or interpolated
from adjacent frames by motion compensation, and the residual error after
motion compensation is then transformed into DCT coefficients and coded.
However, if this residual error exceeds a given threshold for certain blocks,
motion compensation prediction is abandoned; and those blocks are represented
by DCT coefficients. Such high residual error values are likely to occur in many,
if not all, blocks across a camera shot boundary. If M is the number of valid
motion vectors for each P frame, and the smaller of the numbers of valid forward
and backward motion vectors for each B frame, and T is a threshold value close
to zero, then

M < T    (9.8)

will be an effective indicator of a camera boundary before or after (depending
on whether interpolation is forward or backward) the B/P frame [ZLS95].

For MPEG sources it is possible to combine data from both DCT coefficients
and motion vectors in a hybrid technique. The first step is to apply a DCT
comparison, such as the one defined by Equation 9.7, to the I frames with a
large skip factor φ to detect regions of potential gradual transitions, breaks,
camera operations, or object motion. The large skip factor reduces processing
time by comparing fewer frames. Furthermore, gradual transitions are more
likely to be detected as potential breaks, since the difference between two more
"temporally distant" frames could be larger than the break threshold T_b. The
drawbacks of using a large skip factor, false positives and low temporal
resolution for shot boundaries, are then recovered by a second pass, applied only
to the neighborhood of the potential breaks and transitions, in which the
difference metric of Equation 9.8 is applied to all the B and P frames of those
selected sequences to confirm and refine the break points and transitions
detected by the DCT comparison.
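A rough sketch of this two-pass procedure is given below. The callables standing
in for the decoder (i_frame_diff and bp_motion_count) and the frame ranges are
hypothetical abstractions; the actual implementation in [ZLS95] works directly on
MPEG data structures.

    def hybrid_parse(num_frames, i_frame_diff, bp_motion_count, phi, t_b, t_motion):
        """Two-pass hybrid parsing sketch.

        i_frame_diff(i):    DCT-based difference between I frames i and i+phi.
        bp_motion_count(j): valid motion-vector count M for frame j (Equation 9.8).
        """
        # Pass 1: compare temporally distant I frames to find candidate regions.
        candidates = [i for i in range(0, num_frames - phi, phi)
                      if i_frame_diff(i) > t_b]
        # Pass 2: within each candidate region, inspect every frame; in practice
        # only the B and P frames carry the motion vectors examined here.
        boundaries = []
        for start in candidates:
            for j in range(start, start + phi):
                if bp_motion_count(j) < t_motion:
                    boundaries.append(j)
        return boundaries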

Unfortunately, single threshold techniques tend to degrade when confronted
with gradual transitions, such as those implemented by special effects like fades,
wipes, and dissolves [ZKS93]. However, a method which employs two thresholds,
called twin-comparison, is equally reliable when applied to both pixel and
compressed representations [ZLS95]; but it also identifies frames involving
camera operations, such as panning, tilting, and zooming, as false positives.
However, because each of these operations has its own characteristic pattern
of motion vectors [ZKS93], they may be detected by examining those vectors.
An alternative approach is to examine spatiotemporal slices of video sequences
[AT94]. These images also provide characteristic patterns for camera operation,
and they may be used for abstraction and representation of shot content.

Key Frame Extraction


Perhaps the easiest way to represent a camera shot is to "abstract" it as a set
of images, extracted from the shot as representative frames, usually called key
frames [O'C91]. The process of key frame extraction can be robustly automated
and parameterized in such a way that the number of key frames extracted for
each shot depends on features of the shot content, variations in those features,
and the camera operations involved [ZSW95]. The technique is based on the
same sort of difference metric used for temporal segmentation. In this case
computation takes place within a single shot. The first frame is proposed as a
key frame, and consecutive frames are compared against that candidate. A two-
threshold technique, similar to twin-comparison, identifies a frame significantly
different from the candidate; and that frame is proposed as another candidate,
against which successive frames are compared. Users can specify the density of
the detected key frames by adjusting the two threshold values.
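A simplified sketch of this candidate mechanism is shown below; it collapses the
two thresholds into one for brevity, so it only approximates the scheme described
in the text. Any frame-difference metric, such as the histogram comparison of
Equation 9.1, can be supplied.

    def extract_key_frames(frames, difference, threshold):
        """Propose key frames within one shot by comparing frames against the
        current candidate and promoting any frame that differs significantly."""
        if not frames:
            return []
        key_frames = [frames[0]]           # the first frame is the first candidate
        candidate = frames[0]
        for frame in frames[1:]:
            if difference(candidate, frame) > threshold:
                key_frames.append(frame)   # significantly different: new candidate
                candidate = frame
        return key_frames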

Model-Based Parsing
Automatic extraction of "semantic" information of general video programs is
outside the capability of current signal analysis technologies [PPS94]. On the
other hand "content parsing" may be possible with a structure model based on
domain knowledge. Such a model may represent spatial order within individual
images and/or temporal order across a sequence of shots.

Television news can usually be so modelled: there tends to be spatial structure


within the anchorperson shots and temporal structure in the order of shots
and episodes. Figure 1 shows an example of temporal structure: It is a simple
sequence of news items (possibly interleaved with commercials), each of which
may include an anchorperson shot at its beginning and/or end. Individual
news shots are, as a rule, not easily classified by syntactic properties, with the
possible exception of certain regular features, such as weather, sports, and/or
business. However, those regular features tend to be very consistent in their
spatial properties. Anchorperson shots are similarly consistent, as may be
seen in Figure 2 [Z+95a]. Parsing thus relies on classifying each camera shot
according to categories which, while relatively coarse, can still capture much of
the significant semantic content of the program material.

Figure 1 The temporal structure of a typical news program



Figure 2 The spatial structure of a typical anchorperson shot

Algorithms used               Nd    Nm    Nf    Np    Nz
Gray-level comparison        101     8     9     2     1
χ² gray-level comparison      93    16     9     2     3
Color code comparison         95    14    13     1     2

Table 1  Detecting gradual transitions with three types of twin-comparison
algorithms applied to a documentary video: Nd, the total number of transitions
correctly detected; Nm, the number of transitions missed; Nf, the number of
transitions misdetected; Np, the number of "transitions" actually due to camera
panning; Nz, the number of "transitions" actually due to camera zooming

2.2 Examples

Temporal Segmentation
Some experimental results which summarize the efficacy of twin-comparison
are presented in Table 1 [ZKS93]. These figures are based on applying three
difference metrics to a documentary video. The first two rows of Table 1 show
the results of applying difference metrics 9.1 and 9.2, respectively, to gray-level
histograms. The third row shows the result of applying difference metric 9.1 to
histograms of a six-bit color code.

Table 1 shows that histogram comparison, whether the bins are based on gray-
level or color intensity, is highly accurate in detecting both camera breaks and
gradual transitions. Color gives the best results: besides the high accuracy, it
is also the fastest of the three algorithms. (This is primarily because the color
histogram requires only 64 histogram bins, since the code is based on the two
high-order bits of the red, green, and blue intensities; gray-level, on the other
hand, is represented as an 8-bit value, yielding a 256-bin histogram. In other
words six bits of a color code appear to provide more effective information than
eight bits of gray level.) Finally, the rightmost two columns in Table 1 account
for transitions detected by twin-comparison which are actually due to camera
operation and are detected by analysis of motion content.

This documentary was also compressed in MPEG, and Figure 3 shows sequences
of the three difference metrics discussed in Section 2.1. A good example of the
advantage of the hybrid approach is break 2 in Figure 3(a), which is missed
by the DCT correlation algorithm but is clearly recognized by motion vector
comparison. The images compared across this break (as well as those across
break 3 for control) are illustrated in Figure 4. Figure 5, on the other hand,
shows the camera pan corresponding to segment T1 in Figure 3(b), which is
identified as a gradual transition by twin-comparison (T_t is the second
threshold for twin-comparison) and subsequently identified as a pan by motion
vector analysis.

Figure 3 (a) Video parsing results on a test video compressed in MPEG with
1 intra-picture frame, 1 predictively coded frame, and 4 bi-directionally
predictively coded frames in every 6 frames: DCT coefficient correlation using
Arman's difference value Ψ; difference values are computed between successive
intra-picture frames. T_b is the threshold for detecting a camera break. T_t is
the second threshold used in twin-comparison.
Figure 3 (continued)
(b) Pair-wise block comparison of DCT coefficients based on difference value
D(i, i + φ); difference values are computed between successive intra-picture
frames.

Figure 3 (continued)
(c) Motion-based comparison based on M, the number of valid motion vectors;
motion vectors are examined in both types of predictively coded frames.




Figure 4 Transitions for breaks 2 and 3 as indicated in the graphs. Break


3 (below) was detected by DCT coefficient correlation, while break 2 (above)
was missed but was detected by motion vector comparison.

Video               Length (sec.)    Nd    Nm    Nf    Nk
Stock footage 1             451.8    35     1     1   116
Stock footage 2            1210.7    78     1     5   271
Singapore                   173.8    31     1     0    71
Dance                      2109.1    90    17     4   205

Table 2  Segmentation and key frame extraction applied to three types of
source video: Nd, the total number of transitions correctly detected; Nm, the
number of transitions missed; Nf, the number of transitions misdetected; Nk,
the number of key frames extracted

Key Frame Extraction


Table 2 [ZSW95] lists the results of three tests which apply key frame ex-
traction to the results of temporal segmentation. The first test was based on

Figure 5 A camera pan corresponding to segment T1

two "Stock footage" videos consisting of unedited raw shots of various lengths,
covering a wide range of scenes and objects. "Singapore," the second test, is
travelogue material from an elaborately produced documentary which draws
upon a variety of sophisticated editing effects. Finally, the "Dance" video is
the entirety of Changing Steps, a "dance for television" conceived by choreog-
rapher Merce Cunningham and video designer Elliot Caplan, which contains
shots with fast moving objects (dancers), complex and fast changing lighting
and camera operations, and highly imaginative editing effects, many of which
are far less conventional than those in "Singapore." As shown in Table 2, the
temporal segmentation algorithms perform very well with an accuracy over 95%
(counting both missing and false detection as errors) for the first three video
sequences. Most of the missed detections of shot boundaries in "Dance" are
due to special editing effects in which techniques such as fades and dissolves
are combined with lighting changes in manners which often leave the viewer
unaware that a shot change is taking place.

For all shots which are correctly detected, at least one key frame has been
detected to represent the shot. As was observed in Section 2.1, the density
of key frames is under user control; and these trials yielded an average of
between two and three key frames per shot. The test results also demonstrate
abstraction of camera operations (both panning and zooming) by key frames.
Figure 6 illustrates a "soft" VCR which displays the extracted key frames for
the camera shot currently being viewed. In this particular example those key
frames summarize a shot which zooms out.

Figure 6 Video player for examining extracted key frames

Key frame extraction actually compensated for segmentation errors in "Dance."
Thus, in many of the cases where a shot boundary was missed, one or more
additional key frames were detected to represent the material in the missed
shot. This can be seen in the display in Figure 7 of key frames assumed to be
from a single shot. The key frame at 12:35.15 represents a fade out of the shot
which also includes the key frame at 11:26.18. After this shot fades out, the
shot represented by the key frame at 12:39.09 fades in. Then the key frame at
12:53.09 captures the middle of a missed dissolve transition. Key frame
detection is thus more robust than shot boundary detection.

Figure 7 Compensation for temporal segmentation errors by extracted key


frames

Program    Ntotal    Np    NA    NAf    NAm
SBC1          549    92    23      0      1
SBC2          409    80    31      3      0

Table 3  Experiment results in detecting anchorperson shots: Ntotal, total
number of shots in the news program; Np, number of potential anchorperson
shots identified by the temporal variation analysis; NA, number of anchorperson
shots finally identified; NAf, false positives; NAm, missed anchorperson shots;
SBC1, first half-hour broadcast of Singapore Broadcasting Corporation (now
Television Corporation of Singapore) news; SBC2, second half-hour broadcast
of SBC news

Program     N    Ns    Nm    Nf
SBC1       20    18     2     0
SBC2       19    18     1     0

Table 4  Experiment results in detecting news items: N, number of news items
manually identified by watching the programs; Ns, news items identified by the
system; Nm, news items missed by the system; Nf, news items falsely identified
by the system


Model-Based Parsing
Table 3 [Z+95a] lists the numbers of video shots and anchorperson shots iden-
tified by the news parsing system discussed in Section 2.1. Over 95% of the
anchorperson shots were detected correctly. The missing anchorperson shot in
SBC1 was due to the anchorperson not facing the camera. There are a few
false positives in both programs, where an interview scene was falsely detected
as an anchorperson shot.

Table 4 [Z+95a] lists the numbers of news items identified by the news pars-
ing system and the numbers manually identified by watching the programs.
The system identifies news items with over 95% accuracy. The missed news
items resulted from the assumption that each news item in the program starts
with an anchorperson shot followed by a sequence of news shots. However,
this condition can be violated by both news items which are only read by an
anchorperson without news shots and by news items which start without an
anchorperson shot. Those limitations are difficult to overcome with only image
analysis techniques, and content analysis would benefit from the texts of either
teleprompter scripts or closed captions.

3 REPRESENTATION AND
CLASSIFICATION
3.1 Techniques

Image Features
While key frame extraction constitutes an approach to representation-the
abstraction of motion video into static images-there remains the question of
how the content of those images may be represented. One of the most appealing
approaches is to work with descriptions based on properties which are inherent
in the images themselves: the patterns, colors, textures, and shapes of image
objects, and related layout and location information. For many applications
such queries may be either supplemental or preferable to text, if not either
necessary or easier to formulate. For example, when it is necessary to verify
that a trade mark or logo has not been used by another company, the easiest
way is to query an image database system for all images similar to the proposed
pattern [K+ 91]. Search is driven by identifying specific features of that pattern
which need to match images in the database.

There are two ways in which retrieval based on image features may be ap-
proached. We call the first model-based, because it assumes some a priori
knowledge-the model-of how images are structured. As we saw in Section
2.1, models can be very useful in classifying anchorperson shots in news broad-
casts or similar corpora of highly stereotyped material. However, the most
important content of a news program is the collection of news stories being
presented. These are not at all as stereotyped, so it is not as feasible to develop


models which will adequately represent key frames extracted from those stories.

The second approach requires a more general model of which features should be
examined and how different instances of those features should be compared for
proximity. On the basis of results which have been established to date, those
features which have been seen to be most effective have been color, texture,
shape of component objects, and relationships among edges which may be
expressed in terms of line sketches. Extensive research has been carried out
to address how those features may be represented; and currently the most
comprehensive results may be found in the designs of IBM's QBIC [F+94], the
Photobook project at The MIT Media Laboratory [PPS94], and the SWIM
(Show What I Mean) project at the Institute of Systems Science [G+94].

Frame-Based Management of Text


If text is to be used for representation, then it is likely to be most effective if
usage is suitably constrained. In many ways a database may be regarded as
one of the best ways to constrain the presentation of text material. Conse-
quently, information concerned with the content of video source material which
is maintained in a database [SSJ93] would serve as a well-structured presenta-
tion of text descriptions. However, it is also important that these descriptions
capture semantic properties of the content [PPS94]; and perhaps the most im-
portant semantic property is classification with respect to a collection of topical
categories-the property which governs the construction of most book indexes.
Such a representation may be implemented as a frame-based system in order
to represent such topical categories as a class hierarchy [SZW94]. This hier-
archical structure would reflect the indented structure found in the indexes of
many books.

3.2 Examples of Representation Based on


Image Features
Let us now consider examples of how key frames from a video may be repre-
sented on the basis of color, texture, shape, and edge features.

Color
Color histograms may be defined for key frames just as they are defined for
frame comparison in temporal segmentation. They may again be compared
with an L1-norm. If Q and I are histograms corresponding, for example, to
a query image and an image in a database, then a suitable L1-norm is the
following:

\sum_{i=1}^{N} \min(I_i, Q_i)    (9.9)

This value may then be normalized by the total number of pixels in one of the
two histograms:

(9.10)

This value will always range between 0 and 1. Previous work has shown that
this metric is fairly insensitive to change in image resolution, histogram size,
occlusion, depth, and view point [SB91].
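A sketch of Equations 9.9 and 9.10, normalizing here by the pixel count of the
query histogram (the text leaves the choice between the two histograms open):

    def histogram_intersection(query, image):
        """Histogram intersection (Equation 9.9) normalized to [0, 1] by the
        total pixel count of the query histogram.  Assumes a non-empty query."""
        raw = sum(min(q, i) for q, i in zip(query, image))
        return raw / sum(query)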

The biggest disadvantage of a histogram is that it lacks any information about


location. This problem may be solved by dividing an image into sub-areas and
calculating a histogram for each of those sub-areas. Increasing the number of
sub-areas increases the information about location; but it also increases the
memory required to store histograms and the time required to compare them.
A viable approach seems to be to partition an image into a 3 x 3 array of 9
sub-areas; thus, 10 histograms need to be calculated, one for the entire image
and one for each of the 9 sub-areas [G+94].

Texture
Texture has long been recognized as being as important a property of im-
ages as is color, if not more so, since textural information can be conveyed
as readily with gray-level images as it can in color. Nevertheless, there is an
extremely wide variety of opinion concerned with just what texture is and how
it may be quantitatively represented [TJ93]. The so-called "Tamura features"
[TMY76] are particularly notable for their quantification of psychological at-
tributes. These features are usually known by the names coarseness, direction-
ality, and contrast. Coarseness is a measure of the granularity of the texture. It
is derived from moving averages computed over windows of different sizes; one
of these sizes gives an optimum fit in both the horizontal and vertical direc-
tions and is used to calculate the coarseness metric. Directionality is computed
from distributions of magnitude and direction of gradient at all pixels. The


quantification of contrast is based on the statistical distribution of pixel in-
tensities. Other effective representations of texture include the Simultaneous
Autoregressive (SAR) model [KK87], and Wold features [PPS94].

Shape
Color and texture are useful features in representing both scenes and the objects
within them. However, there are also properties which may only be defined over
individual objects. For purposes of indexing and retrieval, the most important
of these features are geometric properties of objects, including shape, size, and
location. Quantitative representations of these properties may be based on
standard techniques in digital image processing [G+94].

Sketch Features
One of the more intuitive approaches to describing an image in visual terms for
the sake of retrieving it is to provide a sketch of what is to be retrieved. The
richness of features in such a sketch will, of course, depend heavily on the skill
of the user who happens to be doing the sketching, since a well-drawn sketch
can make good use of techniques such as coloring and shading. However, as
sort of a "lowest common denominator ," we may assume that a sketch is a
simple line drawing which captures some basic information about the shapes
and orientations of at least some of the objects in the image.

Under this assumption the features of an image which would guide any attempt
at retrieval would be some set of edges associated with a reasonably abstract
representation of the image. One technique for constructing such an edge-based
representation has been developed by the Electrotechnical Laboratory at MITI
in Japan [K+92]. The measurement of similarity is then applied to an edge-
based representation of each image, compared against a pixel representation of
a sketch.

An example of experimental results using sketch-based indexing and retrieval


is shown in Figure 8. As one can see, the top five candidates agree very well
with the query image, indicating the effectiveness of the technique. The major
drawback of this approach is that similar sketch patterns with different ori-
entation or scale will not be retrieved when compared with the query image.
Overcoming this problem requires more sophisticated edge representation and
matching algorithms.

Figure 8 An example of sketch-based image retrieval

4 INDEXING AND RETRIEVAL


4.1 Techniques
A major problem in dealing with large image databases is efficiency of retrieval.
One of the key issues in achieving such efficiency is the design of a suitable in-
dexing scheme. Content-based image retrieval can only be effective if it entails
reducing the search from a large and unmanageable number of images to a few
that the user can quickly browse. If image features, such as those discussed in
Section 3.2, are pre-computed before images are added to the database, then
each image will be represented by a feature vector which corresponds to a point
in a multi-dimensional feature space. Thus, fast query processing requires a
multi-dimensional indexing technique which will function efficiently over a large
database. There are three popular approaches to multi-dimensional indexing:
R-trees (particularly the R*-tree); linear quadtrees; and grid files. However,
these methods tend to explode geometrically as the number of dimensions in-
creases; so, for a sufficiently high dimensionality, the technique is no better
than sequential scanning [F+94].

This problem is best solved by filtering the search space by a preprocessing
step which may allow some false hits, but not false dismissals. Mathematically,
this preprocessing requires mapping each feature vector X into a vector X' in
a more suitable space, in which all distances will underestimate the distances
in the actual search space [F+94]. This mapping will guarantee that a query
based on X' space will not miss any actual hit, but may contain some false
hits. This reduces the problem to defining a suitable X' space.

4.2 Examples of Query Processing


A query may be formulated as an example image, either created or selected
by the user. The example may provide color and texture patterns, edge rela-
tionships, layout or structural descriptions, or other graphical information. To
accommodate these options in an interface, there are three basic approaches:
template manipulation, drawing, and selection.

Query by Visual Templates


Visual templates may be used when a user wants to retrieve images which
consist of some known color and/or texture patterns, such as a sunny sky, sea,
beach, lawn, or forest. To support this, a set of pre-defined templates is stored
and can be assigned to template maps. A template map is based on the same
division of an image into a 3 x 3 array of 9 sub-areas which was discussed in
Section 3.2. Different templates may be assigned to any subset of these sub-
areas, and the resulting assignment serves as a query image. This approach is
especially important if the user cannot provide any sort of painting or sketch
and cannot find a useful example image. The map also has the advantage that
unassigned areas serve as "don't care" specifications. Thus, Figure 9 illustrates
an example of a query which is formulated strictly with respect to the color
content of the left and right periphery. As a result, the retrieved images exhibit
considerable variation in their central areas.

Drawing a Query
A natural way to specify a visual query is to let the user paint or sketch an
image. A feature-based query can then be formed by extracting the visual
features from the painted image. The sketch image shown in Figure 8 is an
example of such a query: the user draws a sketch with a pen-based device or
photo manipulation tool. The query may also be formed by specifying objects
in target images with approximate location, size, shape, color, and texture on a
drawing area using a variety of tools, similar to the template maps, to support
such paintings. This is one of the interface approaches currently being pursued
in QBIC [Sey94]. In most of the cases, coarse specification is sufficient, since a
query can be refined based on retrieval results. This means it is not necessary
for the user to be a highly skilled artist. Also, a coarse specification of object
features can absorb many errors between user specifications and the actual
target images.

Figure 9 A menu of picture images for query by visual templates

Figure 9 (continued) A query image composed from selections from the menu

Query by Visual Examples


A user may still have difficulty specifying an effective query, perhaps for want of
a clear idea of what is desired. Such a user may formulate an incomplete query
which retrieves a preliminary set of images. However, it may be that none of
these are correct; but there is one which is visually similar to the desired target.
In such cases the user can select that image as an example, and the system will
search for all images which consist of similar features. To support such query
by example, two options should be provided in the interface: a query may be
based on either the entire example image or a specific region of that image.

Incomplete Queries
Supporting incomplete queries is important for image databases, since user
descriptions often tend to be incomplete. Indeed, asking for a complete query
is often impractical. To accommodate such queries, search should be confined
to a restricted set of features (a single feature value, if necessary). All query
formation environments presented above can provide the option of specifying
which features will be engaged in the search process. For instance, a query can
be formed based on template selection that will retrieve all images containing
a red car, regardless of whether the car is in the center or any other part of
the image and whatever size it may be. Furthermore, the user should have the
option of modifying the feature vector which drives the query (modifying the
shade of red of the car, for example) [F+94].

From Image to Video


Because these retrieval techniques apply to images, they can only serve to
retrieve key frames. However, users are likely to be interested in the video
context from which these key frames were extracted. Thus, the time code of a
retrieved key frame can be used to cue a video player, such as the one illustrated
in Figure 6. In this environment the user may play the original video, with the
option of stopping at each key frame, or may sequence through the key frames
in a rapid browsing mode [ZSW95]. Other techniques for browsing either a
shot or an entire video will now be reviewed.

5 INTERACTIVE TOOLS

5.1 Micons
An icon is a visual representation of some unit of information. If that infor-
mation is time-related, then representing it as an icon requires being able to
account for the temporal dimension. The video icon (also called a "micon" or
"movie icon") is such a representation of a video source [A +92] . It is constructed
according to a very simple principle: If a single frame of video is represented by
a two-dimensional array of pixels, then the video itself may be represented by
a three-dimensional volume, where the third dimension represents the passage
of time. Thus, a micon may be constructed by stacking successive frames of
a video, one behind the other, enabling the user to see the frame on the front
and the pixels along the upper and side faces (Figure 10).

Figure 10 A micon of the takeoff of Apollo 11

It is important to observe the extent to which the pixel traces recorded on the
top and side faces of a micon can bear information as useful as frame images.
The video source material for Figure 10 is a brief clip of the ascent of the
rocket which launched the Apollo 11 mission to the moon. As the rocket rises,
it cuts across the top row of pixels in each frame. The sequence of pixels in
this row in successive frames thus contains a "trace" of the horizontal layers
of the rocket which cross in each frame. The top face of the micon thus serves
as a "tomogram" [AT94] in which the body of the rocket is reconstructed in a
spatiotemporal image.
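With the video treated as a three-dimensional array, the top face and an interior
tomogram of a micon reduce to simple slicing operations. The sketch below assumes
a NumPy array of gray-level frames and an arbitrarily chosen slice row; both are
illustrative.

    import numpy as np

    def micon_slices(video):
        """Spatiotemporal faces of a micon built from a video volume.

        video: array of shape (num_frames, height, width), i.e. the frames
        stacked along the first axis as in the construction described above.
        """
        top_face = video[:, 0, :]      # trace of the top pixel row over time
        row = video.shape[1] // 2      # an interior row, e.g. above the ankles
        tomogram = video[:, row, :]    # horizontal slice through the volume
        return top_face, tomogram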

Figure 11 A spatiotemporal image illustrating traces of dancers' legs

Tomograms may also be constructed from the interior of a micon, as well as


its outer faces. An example is illustrated in Figure 11 [Z+95b]. In this case the
source is an excerpt from the Changing Steps video; and the spatiotemporal
image has been constructed by taking a horizontal slice through the micon just
above the dancers' ankles. This particular tomogram is valuable because it
yields leg traces of all the dancers, providing a spatiotemporal representation


of the "floor-plan" for the choreography.

If a shot includes camera action, such as panning and zooming, then the spatial
coordinates of a micon will no longer correspond to those of the physical space
being recorded. However, a Hermart transform [A+92] may be used to construct
a micon which is more like "extruded plastic" than a rectangular solid. As the
camera zooms in and out, the size of the frame images shrink and expand,
respectively; and if the camera pans to the right, then the entire frame will be
displaced to the right a corresponding distance. The resulting micon is then
accurately embedded into its two spatial coordinates.

5.2 Hierarchical Browsing


Hierarchical browsers are designed to provide random access to any point in
a given video: a video sequence is spread in space and represented by frame
icons which function rather like a light table of slides. As shown in Figure 12,
at the top of the hierarchy, a whole video is represented by five key frames,
each corresponding to a segment consisting of an equal number of consecutive
camera shots. Any one of these segments may then be subdivided to create
the next level of the hierarchy. This approach was inspired by the Hierarchical
Video Magnifier [MCW92], which treats a video source as a sequence of frames
which may be subdivided into consecutive subsequences, each of equal length.
Any of these subsequences may then be similarly subdivided, and the process
can continue until the user gets to the single frame level. Unfortunately, it is
only at this lowest level that the user is capable of detecting boundaries between
camera shots. Also, the images selected for display follow some objective rule,
such as the frame in the middle of the subsequence being represented. Our own
browser was designed to exploit parsing knowledge. As we descend through
the hierarchy, our attention focuses on smaller groups of shots, single shots,
the representative frames of a specific shot, and finally a sequence of frames
represented by a key frame [ZSW95].
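The top-level grouping of such a browser can be sketched as a partition of the
shot list into (nearly) equal consecutive segments; descending a level re-applies
the same split to the selected segment. The function below is an illustrative
sketch, not the actual browser code.

    def browser_level(shots, groups=5):
        """Split a list of camera shots into a fixed number of segments, each
        holding roughly the same number of consecutive shots."""
        size = max(1, -(-len(shots) // groups))   # ceiling division
        return [shots[i:i + size] for i in range(0, len(shots), size)]

    # Example: 23 shots grouped into five top-level segments of sizes 5, 5, 5, 5, 3.
    print([len(g) for g in browser_level(list(range(23)))])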

Figure 12 A hierarchical browser of a full-length video

6 CONCLUSION

6.1 Open Research Problems

Audio
As any film-maker knows, the audio track provides a very rich source of in-
formation to supplement our understanding of any video [BT93]. Thus, any
attempt to work with video content must take the sound-track into account
as well. Logging a video requires decomposing our auditory perceptions into
objects just as we do with our visual perceptions. Unfortunately, we currently
know far less about the nature of "audio objects" than we do about correspond-
ing visual objects [Smo93]. Nevertheless, there are definitely the beginnings of
models of audio events, some of which are similar to models used in image-based
content parsing. For example, in a sports video, very loud shouting followed by
a long whistle might indicate that someone has scored a goal, which should be
recognized by content analysis as an "event." Clearly, however, audio analysis
is a new frontier which needs considerable exploration within the discipline of
multimedia.

Parsing Models
In retrospect it is not surprising that the news program material discussed in
Section 2.1 is relatively easy to parse. Its temporal and spatial predictability
make it easier for viewers to pick up those items which are of greatest interest.
However, because more general video program material tends to be concerned
with entertainment, which, in turn, is a matter of seizing and holding viewers'
attention, the success of a program often hinges on the right balance between
the predictable and the unpredictable. Thus, the very elements which often
make a program successful are those which would confound attempts to au-
tomate parsing. On the other hand those unpredictable elements can only be
detected and appreciated in the context of knowledge of what is predictable or
anticipated. What this means is that a variety of other video programs should
also admit of parsing models. Some of these models may not be as detailed as
those which characterize news programs, but they will still be based on spatial
or temporal features which assist the viewer in appreciating the nature of the
content. Consequently, a major area of research will involve identifying those
tools and techniques which may be employed in modeling different kinds of
video program material.

6.2 Vision of the Future


The vision of the future which this technology anticipates is one of digital
libraries whose resources are readily accessible from computer workstations, no
matter how remote those workstations may be from the resources themselves.
Thus far the emphasis on developing digital libraries has been on the text
content of books collected by current libraries, but the digital libraries of the
future must be able to accommodate multimedia source material as readily as
those current libraries accommodate text. Software tools for parsing the content
of video sources, generating index structures, and enabling both focused query
retrieval and casual browsing will provide the necessary infrastructure for a
digital library which can accommodate video source material as readily as it
accommodates text [Z+95b]. These tools will be enhanced as researchers apply
them to case studies based on different video sources.

Acknowledgments
Figures 1, 2, and 11 are reproduced with the permission of Springer-Verlag, the
first two having appeared in Volume 2, Number 6 of the journal Multimedia Sys-
tems and the third having appeared in the book Advances in Digital Libraries.
Figures 3, 4, and 5 are reproduced with the permission of Kluwer Academic
Publishers, having first appeared in Volume 1, Number 1 of the journal Mul-
timedia Tools and Applications. The images in Figures 7 and 12 have been
used with the permission of the Cunningham Dance Foundation, and those in
Figure 8 were provided with the kind permission of the Television Corporation
of Singapore.

REFERENCES
[A+92] A. Akutsu et al. Video indexing using motion vectors. In Visual Com-
munications and Image Processing '92, pages 1522-1530, Boston, MA,
November 1992. SPIE.
[AHC93] F. Arman, A. Hsu, and M.-Y. Chiu. Image processing on compressed
data for large video databases. In Proceedings: ACM Multimedia 93, pages
267-272, Anaheim, CA, August 1993. ACM.
[AT94] A. Akutsu and Y. Tonomura. Video tomography: An efficient method
for camerawork extraction and motion analysis. In Proceedings: ACM
Multimedia 94, pages 349-356, San Francisco, CA, October 1994. ACM.

[BT93] D. Bordwell and K. Thompson. Film Art: An Introduction. McGraw


Hill, New York, NY, fourth edition, 1993.
[F+94] C. Faloutsos et al. Efficient and effective querying by image content.
Journal of Intelligent Information Systems, 3:231-262, 1994.

[G+94] Y. Gong et al. An image database system with content capturing and
fast image indexing abilities. In Proceedings of the International Confer-
ence on Multimedia Computing and Systems, pages 121-130, Boston, MA,
May 1994. IEEE.
[K+91] T. Kato et al. A cognitive approach to visual interaction. In Interna-
tional Conference on Multimedia Information Systems '91, pages 109-120,
SINGAPORE, January 1991. ACM, McGraw Hill.

[K+92] T. Kato et al. A sketch retrieval method for full color image database:
Query by visual example. In Proceedings: 11th International Conference on
Pattern Recognition, pages 530-533, Amsterdam, HOLLAND, September
1992. IAPR, IEEE.

[KK87] A. Khotanzad and R. L. Kashyap. Feature selection for texture recog-


nition based on image synthesis. IEEE Transactions on Systems, Man,
and Cybernetics, 17(6):1087-1095, November 1987.

[MCW92] M. Mills, J. Cohen, and Y. Y. Wong. A magnifier tool for video data.
In Proceedings: CHI'92, pages 93-98, Monterey, CA, May 1992. ACM.

[NT92] A. Nagasaka and Y. Tanaka. Automatic video indexing and full-video


search for object appearances. In E. Knuth and L. M. Wegner, editors, Vi-
sual Database Systems, II, volume A-7 of IFIP Transactions A: Computer
Science and Technology, pages 113-127. North-Holland, Amsterdam, THE
NETHERLANDS, 1992.

[O'C91] B. C. O'Connor. Selecting key frames of moving image documents:


A digital environment for analysis and navigation. Microcomputers for
Information Management, 8(2):119-133, June 1991.

[PPS94] A. Pentland, R. W. Picard, and S. Sclaroff. Photobook: Tools for


content-based manipulation of image databases. In W. Niblack and
R. Jain, editors, Symposium on Electronic Imaging Science and Technol-
ogy: Storage and Retrieval for Image Video Databases II, pages 34-47, San
Jose, CA, February 1994. IS&TjSPIE.

[Rud66] W. Rudin. Real and Complex Analysis. McGraw-Hill Series in Higher


Mathematics. McGraw-Hill, New York, NY, 1966.

[SB91] M. J. Swain and D. H. Ballard. Color indexing. International Journal


of Computer Vision, 7(1):11-32, 1991.
[See92] A. N. Seeley. User Guide: Aldus Fetch Version 1.0. Aldus Corporation,
Seattle, WA, first edition, November 1992.

[Sey94] IBM unleashes QBIC image-content search. Seybold Report on Desktop


Publishing, 9(1), September 1994.

[Smo93] S. W. Smoliar. Classifying everyday sounds in video annotation. In


T.-S. Chua and T. L. Kunii, editors, Multimedia Modeling, pages 309-313,
SINGAPORE, November 1993.
322 CHAPTER 9

[SSJ93] D. Swanberg, C.-F. Shu, and R. Jain. Knowledge guided parsing in


video databases. In Symposium on Electronic Imaging: Science and Tech-
nology, San Jose, CA, 1993. IS&T/SPIE.
[SZW94] S. W. Smoliar, H. J. Zhang, and J. H. Wu. Using frame technology to
manage video. In Second Singapore International Conference on Intelligent
Systems, pages B189-B194, SINGAPORE, November 1994.
[TJ93] M. Tuceryan and A. K. Jain. Texture analysis. In C. H. Chen, L. F. Pau,
and P. S. P. Wang, editors, Handbook of Pattern Recognition and Computer
Vision, chapter 4.2, pages 235-276. World Scientific, SINGAPORE, 1993.
[TMY76] H. Tamura, S. Mori, and T. Yamawaki. Texture features correspond-
ing to visual perception. IEEE Transactions on Systems, Man, and Cy-
bernetics, 6(4):460-473, April 1976.
[Z+95a] H. J. Zhang et al. Automatic parsing and indexing of news video.
Multimedia Systems, 2(6):256-266, 1995.
[Z+95b] H. J. Zhang et al. A video database system for digital libraries. In N. R.
Adam, B. Bhargava, and Y. Yesha, editors, Advances in Digital Libraries,
Lecture Notes in Computer Science. Springer Verlag, Berlin, GERMANY,
1995. To appear.
[ZKS93] H. J. Zhang, A. Kankanhalli. and S. W. Smoliar. Automatic parti-
tioning of full-motion video. Multimedia Systems, 1(1):10-28, 1993.
[ZLS95] H. J. Zhang, C. Y. Low, and S. W. Smoliar. Video parsing and brows-
ing using compressed data. Multimedia Tools and Applications, 1(1):91-
113, February 1995.
[ZSW95] H. J. Zhang, S. W. Smoliar, and J. H. Wu. Content-based video
browsing tools. In A. A. Rodriguez and J. Maitan, editors, Symposium on
Electronic Imaging Science and Technology: Multimedia Computing and
Networking 1995, San Jose, CA, February 1995. IS&T/SPIE.
Index

• Admission control 133, 289
  -adaptive 140
  -deterministic 133-134
  -statistical 134-135
• ATM (Asynchronous Transfer Mode) 146, 156-162, 165-168, 171, 173, 203
  -adaptation layers 163-164
  -cells 158
  -connections 159
  -header 158
  -protocols 162
• Audio class 9, 11
• BaseObject class 4, 6
• Block-matching techniques 77
• CCITT video format 67
• Clip object 14
• Composite class 9, 15
• Composite timeline 20
• Compression measures 57
• Compression techniques 46
  -classification 46
• Constrained block allocation 206
• Content-based image retrieval 311
• DCT coefficients 50, 52, 297-298, 301-303
• Disk arrays 129
• Disk scheduling 259
  -C-SCAN 260-261, 265-271
  -EDF 259, 261, 264-271
  -SCAN 133, 138, 260-261
  -SCAN-EDF 261-272
• Distance learning 109
• Domain knowledge 244
• Earcons 105-106
• Entropy encoder 50
• Ethernet 149-151, 166
  -fast 151, 166, 173
  -isochronous 151, 166
  -priority 151, 166
  -switched 151, 166, 173
• FDCT 50
• FDDI 149, 154-155, 166, 203
• HDTV 49, 72, 147, 205-206
• Hierarchical browsing 317-318
• Hierarchical database schema 205
• Huffman coding 53
• Huffman encoder 53
• Hypertext models 187
• IDCT 56, 69
• Image databases 239
• Image features 307
  -color 309
  -texture 309
  -shape 310
  -sketch 310
• Image indexing and retrieval 311
• Information superhighways 167, 169
• Interactive tools 315
• Interactive video 246
  -multiple perspective 246-249
• Intermedia synchronization 200
  -content-based 200
• Internet 168-169, 171-172
• ISDN 67-68, 149, 155-156, 166
• JPEG 7, 46-47, 49, 54, 61-66, 297
  -codec 50
  -encoder 51
  -sequential 59-60
• Key frames 294
  -extraction 298, 303, 305-307
• LAN (Local Area Network) 145-146, 152, 167-169
• Media block 129
• Media streams 97-98
• Metaphors 92-93
• MHEG 21-24
• Motion estimation 76
• Motion JPEG (MJPEG) 49, 75
• MPEG 7, 46-47, 72-77, 297-298, 301
  -audio decoder 80, 82
  -audio encoder 80, 82
  -video decoder 74, 81
  -video encoder 74, 81
• MPEG-1 72, 140, 146, 255-256, 258
• MPEG-2 125, 147, 258
• Multimedia databases 37, 204, 223
• Multimedia information systems 217, 220
• Multimedia interfaces 87, 90
  -designing 91
• Multimedia objects 1, 17-18, 181, 201
  -composite 19-20
  -distributed 37
  -hierarchical classification 201-202
  -retrieval of 132
• Multimedia storage servers 125-126, 128
• Multimedia storage systems 123
• Multimedia synchronization 177, 204, 206, 209
• Network scheduling 275
• Object composition model 27
• Object hierarchy 28
• Object playback model 32
• Parsing 294, 296
  -model-based 297, 306
  -models 319
  -video 301
• Px64 46-47, 67
  -decoder 69, 71
  -encoder 69-70
• Quality of Service (QOS) 161, 173, 203, 211
• Query processing 312
• Real-time storage and retrieval 124
• Real-time synchronization 182-183
• RGB representation 48, 63
• Root Mean Square (RMS) 58-59
• Seek optimization 261
• Semantic knowledge 241
• SONET 164-165
• Sound 98, 101-104
  -harmonic content 101
  -spatial location 101
  -timing 101
  -types 103
• Spatial composition 11-12
• Storage hierarchies 131
• Striping unit 129
• Synchronization model 181
  -Dynamic Timed Petri Nets (DTPN) 193-197, 204
  -fuzzy 197
  -language based 181
  -Object Composition Petri Nets (OCPN) 190-193, 204
  -Petri nets based 185-187
  -Time Petri Nets (TPN) 186
• Temporal composition 11, 13
• Temporal Media class 6-8
• Temporal segmentation 296, 300
• Temporal transformations 19
• Time Flow Graph (TFG) model 197-200, 204
• Token ring 152, 166
  -priority 152-153, 166, 173
• Traffic source modeling 212
• Trellis hypertext 188-189
• Video class 7, 9-10
• Video databases 245
• Video indexing and retrieval 293-294
• Video server 140, 257
• Virtual reality 105, 107-108
• WAN (Wide-Area Network) 145, 157, 167
• YCbCr format 48
• YUV representation 48, 63
