MULTIMEDIA SYSTEMS AND TECHNIQUES
THE KLUWER INTERNATIONAL SERIES
IN ENGINEERING AND COMPUTER SCIENCE
Consulting Editor
Borko Furht
Florida Atlantic University
edited by
Borko Furht
Florida Atlantic University
"
~.
PREFACE xi
1 MULTIMEDIA OBJECTS
Rei Hamakawa and Atsushi Atarashi 1
1 Introduction 1
2 A Class Hierarchy for Multimedia Objects 4
3 Composite Multimedia Object Model by Gibbs et al. 17
4 MHEG 21
5 Object Composition and Playback Models by Hamakawa et al. 26
6 Conclusion 36
REFERENCES 37
5 MULTIMEDIA NETWORKS
Borko Furht and Hari Kalva 145
1 Introduction 145
2 Traditional Networks and Multimedia 149
3 Asynchronous Transfer Mode 156
4 Summary of Network Characteristics 165
5 Comparison of Switching Technologies for Multimedia Communications 166
6 Information Superhighways 167
7 Conclusion 171
6 MULTIMEDIA SYNCHRONIZATION
B. Prabhakaran 177
1 Introduction 177
2 Language Based Synchronization Model 181
3 Petri Nets Based Models 185
4 Fuzzy Synchronization Models 197
5 Content-based Inter-media Synchronization 200
6 Multimedia Synchronization and Database Aspects 204
7 Multimedia Synchronization and Object Retrieval Schedules 206
8 Multimedia Synchronization and Communication Requirements 209
9 Summary and Conclusion 213
REFERENCES 214
INDEX 323
CONTRIBUTORS
Ramesh Jain
University of California at San Diego
San Diego, California
Hari Kalva
Columbia University
New York, New York
B. Prabhakaran
Indian Institute of Technology
Madras, India
P. Venkat Rangan
University of California at San Diego
San Diego, California
PREFACE
Multimedia computing has emerged in the last few years as a major area of
research. Multimedia computer systems have opened a wide range of applica-
tions by combining a variety of information sources, such as voice, graphics,
animation, images, audio, and full-motion video. Looking at the big picture,
multimedia can be viewed as the merging of three industries: computer, com-
munication, and broadcasting industries.
This book is the first of a two-volume set on Multimedia Systems and Applications. It comprises nine chapters and covers fundamental concepts
and techniques used in multimedia systems. The topics include multimedia ob-
jects and related models, multimedia compression techniques and standards,
multimedia interfaces, multimedia storage techniques, multimedia communica-
tion and networking, multimedia synchronization techniques, multimedia in-
formation systems, scheduling in multimedia systems, and video indexing and
retrieval techniques.
The second book on Multimedia Tools and Applications covers tools applied in
multimedia systems including multimedia application development techniques,
multimedia authoring systems, and tools for content-based retrieval. It also
presents several key multimedia applications including multimedia publishing
systems, distributed collaborative multimedia applications, multimedia-based
education and training, videoconferencing systems, digital libraries, interactive
television systems, and multimedia electronic message systems.
This book is intended for anyone involved in multimedia system design and
applications and can be used as the textbook for a graduate course on multi-
media.
I would like to thank all authors of the chapters for their contributions to this book. Special thanks for formatting and finalizing the book go to Donna Rubinoff from Florida Atlantic University.
Borko Furht
MULTIMEDIA SYSTEMS
AND TECHNIQUES
1
MULTIMEDIA OBJECTS
Rei Hamakawa and Atsushi Atarashi
C&C Research Laboratories,
NEC Corporation, 1-1 Miyazaki 4-Chome, Miyamae-ku,
Kawasaki, KANAGAWA 216, Japan
ABSTRACT
This chapter describes multimedia objects. The special suitability to multimedia of
the object-oriented approach has recently become increasingly clear. We first describe
the general concept of multimedia objects, and explain the merits of an object-oriented
approach in multimedia applications. We then summarize recent important research
activities in the field of multimedia objects and briefly discuss those unresolved issues
which are most likely to be subjects of significant future studies.
1 INTRODUCTION
The phrase "multimedia objects" refers to elements of multimedia data, such as
video, audio, animation, images, text, etc., that are used as objects in object-
oriented programming.
Messages Messages are requests sent to objects to get them to perform specific
operations.
Classes A class is a specification of the common structure (the concrete rep-
resentation of state) and behavior of a given set of objects. Objects in the
set covered by a given class may be referred to as instances of that class,
and creating a new object in the set may be referred to as instantiation.
Subclasses and inheritance A class can exist in relationship with a subclass
below it. With respect to its subclass, this original class exists as a su-
perclass. The objects covered by a subclass share state and behavior with
those covered by a superclass. The downward sharing of structure and
behavior is referred to as inheritance.
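As a concrete, deliberately minimal sketch of these terms — our illustration, not taken from the text — the following C++ fragment shows a class, a subclass inheriting its behavior, instantiation, and a message send:

#include <iostream>

class Media {                          // a class: shared structure and behavior
public:
    virtual void play() { std::cout << "playing generic media\n"; }
    virtual ~Media() {}
};

class Video : public Media {           // a subclass: inherits from Media
public:
    void play() override { std::cout << "playing video frames\n"; }
};

int main() {
    Media* m = new Video();            // instantiation: creating an instance
    m->play();                         // message: request the 'play' operation
    delete m;
    return 0;
}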
While increasing advances in the power and reach of graphical interfaces have
given users important new freedom of action, they have also made application
programming a significantly more complex task. No longer bound to a specific
order of actions, such as might be found in procedure-oriented interfaces, users
may perform actions in whatever order they please, and the program must be
capable of reacting to this new unpredictability. Object-oriented programming
is particularly well-suited to such event-driven programming because with
it the programmer can simply treat each button or menu item as a separate
object with its own individual behavior.
Please note that the class hierarchy we present here is a very simple one. De-
tailed issues of implementation, which are very important in real systems, are
unnecessary for our purposes here.
Designing the BaseObject class requires careful attention to the proper level of
abstraction. When the level of abstraction has been properly chosen, we are
fully able to enjoy the many advantages of the object-oriented approach.
For example, application programmers can import any multimedia object into a document with a single, generic piece of code. They need to know none of the details of specific classes of
multimedia objects. When the designers of the class hierarchy are not skillful
enough in their object abstraction, the application programmers have to use
a number of different pieces of code to create the module, each piece being
designed to import objects of a specific class. This can be achieved only after a
painful process of looking at the class definition and determining how to import
objects into documents.
class BaseObject {
    // object's dimension and location
    int width;    // width of the object
    int height;   // height of the object
    int xpos;     // x-position on display
    int ypos;     // y-position on display
    // methods for drawing/moving
    void draw();
    void move(int deltax, int deltay);
};
Each video frame has a temporal coordinate assigned to it. If we start playing the video, frames
are selected and displayed in response to the current value of the temporal
coordinate.
TemporalMedia objects, then, are active objects[42], i.e., those which sponta-
neously try to detect situations in which they are required to perform opera-
tions.
For example, once you say 'start playback' to a TemporalMedia object, you need
not send any further messages until you want to stop or suspend playback. It
is the responsibility of TemporalMedia objects to find frames or samples for
playback, not that of the application programmer or user.
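A sketch of how such an active object might be realized (our illustration; a thread-based design is just one possibility, and all names here are ours):

#include <atomic>
#include <chrono>
#include <thread>

class TemporalMedia {
    std::thread worker;
    std::atomic<bool> playing{false};
public:
    void start() {                          // the 'start playback' message
        playing = true;
        worker = std::thread([this] {
            while (playing) {               // the object itself locates the next
                presentNextFrame();         // frame or sample; no further messages
                std::this_thread::sleep_for(std::chrono::milliseconds(33));
            }
        });
    }
    void stop() {                           // the 'stop' message
        playing = false;
        if (worker.joinable()) worker.join();
    }
    virtual void presentNextFrame() {}      // overridden by Video, Audio, etc.
    virtual ~TemporalMedia() { stop(); }
};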
The most critical issue in video data handling is its large size. Although ma-
nipulating uncompressed video data is easy and useful for short lengths, com-
pression is indispensable when we try to handle greater lengths. Compression
schemes include MPEG[11], Motion JPEG[31] and H.261[24].
This type of hierarchy has several advantages. First, the application program-
mers can deal with video objects without being concerned about such details
as compression and decompression schemes. They need only write programs to
accord with the definition of the Video class. Secondly, if a new decompression
scheme becomes available, the class hierarchy designers need only define a new
subclass below the appropriate compression scheme class. They do not need
to modify the existing class hierarchy in any other way. Similarly, application
programmers need only add to the existing program a code for the new sub-
class. They need not modify any part of the existing application programs.
The same principle applies when a new compression scheme becomes available.
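The extension pattern described above can be sketched as follows (the class and method names are our assumptions, not the book's):

struct Frame { /* decoded pixel data */ };

class Video {
public:
    virtual Frame decodeFrame(int n) = 0;   // the interface fixed by the Video class
    virtual ~Video() {}
};

class MPEGVideo : public Video {            // one decompression scheme subclass
public:
    Frame decodeFrame(int n) override {
        // MPEG-specific decompression would go here
        return Frame{};
    }
};

// When a new scheme appears, only one new subclass is added; neither the
// existing hierarchy nor existing application programs need to change.
class MotionJPEGVideo : public Video {
public:
    Frame decodeFrame(int n) override {
        // Motion JPEG-specific decompression would go here
        return Frame{};
    }
};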
It is often the case that we would like to use only a part of an existing temporal
object as a component, or we would like to include as a component an object
being played at a different speed than that for which it was originally designed.
One obvious approach would be to edit original objects into new objects satisfying the requirements. This would be, however, extremely inefficient in terms
of both processing time and storage space.
Figure 11 gives sample definitions for objects in the Composite class. It includes
information about components, as well as about methods for manipulating
components.
When designing classes for the objects, care must be taken with regard to the
capturing device configuration and the data format.
Text Class Text objects contain a large amount of text information. They
can be used to hold detailed descriptions of multimedia objects, or to hold help
messages.
Image Class Image objects contain two-dimensional images (or bitmap data).
They can be used to implement video browsers.
Graphics Class Graphics objects are used to draw such graphical objects as
lines, rectangles, and ovals.
GUI objects are used for getting input from the keyboard or mouse and for controlling the application. The following are brief descriptions of the most often used GUI objects:
Button class A Button object has a rectangular region to display its graphical
view and a 'callback' function attached to it. When the user moves a mouse into
the rectangular region of a button and clicks, the callback function attached to
it is invoked.
Menu class A menu object holds multiple items and lets the user choose one
of them. A menu object can be used to choose multimedia data for playback.
Field class A field object contains a few lines of text information, and it can
be used to display a short description of multimedia data or to input keywords
for a multimedia data query.
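A minimal Button sketch (our illustration; real GUI toolkits differ in detail) showing the rectangular region and the attached callback:

#include <functional>

class Button {
    int x, y, width, height;                 // the button's rectangular region
    std::function<void()> callback;          // invoked when the button is clicked
public:
    Button(int x, int y, int w, int h) : x(x), y(y), width(w), height(h) {}
    void setCallback(std::function<void()> cb) { callback = std::move(cb); }
    void handleClick(int mx, int my) {       // called from the toolkit's event loop
        bool inside = mx >= x && mx < x + width && my >= y && my < y + height;
        if (inside && callback)
            callback();                      // click inside the region: fire callback
    }
};

An application might then attach behavior with, for example, playButton.setCallback([&]{ video.start(); });.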
• Source objects have only output ports, and they produce multimedia data
values. One example would be an object which records live audio data and
outputs a sequence of digital audio samples.
• Sink objects have only input ports, and they consume multimedia data
values. One example would be an object which receives a sequence of video
frames and outputs them to a display.
• Filter objects have both input and output ports, and they both produce
and consume multimedia data values. Examples include (1) an object
which duplicates its input and then outputs that through two separate
output ports, or (2) an object which converts one format to another.
[Figure: graphical notation — multimedia objects, ports, and dataflow arrows]
Object time is specific to a given multimedia object. Each object can specify:
1) the origin of its object time with respect to world time, 2) its speed for
processing multimedia data values, and 3) the orientation of its object time
with respect to world time. These specifications are implemented with three temporal transformations: Translate, Scale, and Invert.
• Invert flips the orientation of object time back and forth between "for-
ward" and "reverse".
[Figure: the effect of Translate, Scale, and Invert on an original object's timeline]
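Taken together, the three transformations define a mapping between world time and object time. A small sketch (our formulation, assuming Invert reflects object time across the object's duration):

struct ObjectClock {
    double origin   = 0.0;    // Translate: where object time starts in world time
    double speed    = 1.0;    // Scale: object-time units per world-time unit
    bool   inverted = false;  // Invert: "forward" vs. "reverse" orientation
    double duration = 0.0;    // object-time length, used when inverted

    double toObjectTime(double worldTime) const {
        double t = (worldTime - origin) * speed;
        return inverted ? duration - t : t;
    }
};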
Let us consider the example of creating a new composite object c1.
The dataflow relationships of component objects can be defined with the previously introduced graphical notation, as seen in Figure 15, which illustrates the dataflow relationship for object c1 for the time interval [t1, t2]. During this interval, video frame sequences from video1 and video2 are processed by the
[Figure 15: dataflow for composite object c1 — video1 and video2 feed the dve object between times t0 and t3]
digital video effect object dve, which produces a new sequence of video frames and sends it to the video display object.
4 MHEG
The number of standards, both organized and de-facto, being applied to multi-
media is bewildering, and the MHEG (Multimedia and Hypermedia information
coding Expert Group), operating jointly for the ISO (the International Orga-
nization for Standardization) and the IEC (the International Electrotechnical
Commission), is a working group applying the object-oriented approach to the
development of a standard format for multimedia and hypermedia interchange
[6][20][32]. MHEG is concerned with the composition of time-based media
objects whose encodings are determined by other standards.
A multimedia object is essentially useless on its own, and only gains usefulness
in the context of a multimedia application; for interchange between applica-
tions, we need an application-independent format for representing objects. The
aim of this standard is the coded representation of final form multimedia and
hypermedia information objects that will be interchanged as units within or
across services and applications (Figure 16). The means of interchange may
include storage, local area network, wide area telecommunications, broadcast
telecommunications, etc. MHEG provides this format in the form of standard
"MHEG objects," which are classified as shown in Figure 179 .
MHEG classes are determined solely on the basis of object attributes (data
items), not on the basis of their operations, and thus, the hierarchy is limited
to attribute inheritance.
[Figure 16: interchange of MHEG objects — a sender converts from its internal format to the MHEG object format, and the receiver converts back]
[Figure 17: classification of MHEG objects]
MH-OBJECT
  ACTION
  LINK
  MODEL
    SCRIPT
    COMPONENT
      CONTENT
        MULTIPLEXED CONTENT
      COMPOSITE
  CONTAINER
  DESCRIPTOR
Link Object: Specifies a set of relationships between one object and another;
specifically, it determines that when certain conditions for the first object (com-
monly referred to as a "source") are satisfied, certain actions will be performed
on the second object (commonly referred to as a "target").
4.1 Example
Let us assume the following very simple example of a multimedia system, so as better to understand MHEG objects:
ContentObject-class:
Object-Number: 1 // Button
Classification: Graphics
Encoding Type: JPEG
Original Size: 70pt, 20pt

ContentObject-class:
Object-Number: 2 // Video
Classification: Video
Encoding Type: MPEG
Original Size: 160pt, 120pt
Original Duration: 10 sec

ContentObject-class:
Object-Number: 3 // Audio
Classification: Audio
Encoding Type: MIDI
Original Duration: 10 sec
• Three rt-content-objects
Three rt-content-objects are created from the above three content objects.
rt-content-objects 1.1, 2.1, and 3.1 correspond to content objects 1, 2, and
3, respectively.
Link-class
Object-Number: 4
Link-condition
Source object number: 1.1
Previous-Condition
Status-Value: not-selected
Current-Condition
Status-Value: selected
Link-Effect Action object Number 5
Action-class
Object-Number: 5
Target Object Set: 2.1, 3.1
Synchro-Indicator: parallel
Synchronized-Actions
Action
Object2.1 Run
Action
Object3.1 Run
Set-Button-Style:
Target object: Obj1.1
Initial-State: selectable, not-selected
All objects described above are created by the MHEG engine. When a user presses the "GUIDE" button, rt-content object 1.1, link object 4 and action object 5 are activated, and video (rt-content object 2.1) and audio (rt-content object 3.1) are played simultaneously.
[Figure: material is organized into objects by the composition model and presented to the user through the playback model]
1. Temporal glue
As in TeX[22], the typesetting system intended for the creation of beautiful books, glue is an object which can stretch or shrink in two-dimensional positional space. This glue can be extended into temporal space, making it "temporal glue", and introduced into multimedia object models (see Figure 20, and the code sketch after this list).
Each multimedia object will then have glue attributes (normal, stretch,
and shrink) in three dimensional space (2-dimensional position and time).
It is also possible to provide a special object, called a Glue object, which
does not exist as an entity in itself, but which has only glue attributes.
2. Object hierarchy
The object composition model employs a hierarchical structure composed
of multimedia objects (Figure 21).
The complete layout of composite objects, such as the time length of each
object, is determined when the highest ranking composite object is deter-
mined. When any multimedia object is edited, the attributes of all related
composite objects are automatically recalculated to conform to the change.
3. Relative location
In one common approach to constructing multimedia objects, the timeline
model, individual multimedia objects are located on an absolute timeline
scale (see Figure 22). The object composition model differs from the time-
line model in that it is unnecessary to decide the precise time line location
for each object. Only the relative locations among objects in time and
space need be defined. Once objects are composed, their absolute loca-
tions (in both time and space) are calculated automatically.
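The glue attributes from point 1 can be sketched as a small structure (our illustration, following the TeX glue model; in sequential composition the sizes simply add, which is one way the recalculation in point 2 can propagate up the hierarchy):

struct Glue {
    double normal;   // preferred (normal) size, spatial or temporal
    double stretch;  // how far the size may grow beyond normal
    double shrink;   // how far the size may shrink below normal
};

// Composing two objects in sequence: the composite's glue is the sum of
// the components' glue, so stretching and shrinking can be distributed
// once the highest ranking composite object's layout is fixed.
Glue sequential(const Glue& a, const Glue& b) {
    return { a.normal + b.normal, a.stretch + b.stretch, a.shrink + b.shrink };
}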
Properties
General information about multimedia data, such as data type, location,
etc.
[Figure 20: temporal glue — an object's maximum (stretch) size, normal size, and minimum (shrink) size]
Hierarchy
Information about how objects are combined.
[Figure 22: the timeline model — objects placed on tracks 1 through 5 along an absolute timeline]
Glue Attributes
Values of temporal glue (i.e., normal, stretch, and shrink sizes), as well as spatial glue attributes.
We may note here that the concepts which most lend this model its character-
istic nature are the concepts of relative location and temporal glue.
Objx = SEBox(ObjA, ObjB)
Objy = TBBox(Objx, ObjC)
Objz = LRBox(ObjD, Objy)
Overlay This is used to overlay one object with another object in the time-
dimension.
When playing a video object and an audio object simultaneously, the operation is written as follows:
Overlay(VideoObj, AudioObj)
Loop This is a type of glue used to repeat an original object for a designated
length of time.
In this model, because such static media as texts, still pictures, etc. do not
contain information regarding the temporal dimension, loop is used to add
temporal glue attributes to their other attributes when they are employed with
dynamic media (audio, video etc.) in composite objects.
Additionally, the following two methods are provided to facilitate working with
objects:
Mark This function serves to mark an object at a certain point in time, and to
add to the object a title which indicates some feature of object-content relevant
to that point in time.
Mark(Obj,Time,Title)
Constraint This function attaches constraints to objects and is used primarily
for synchronization, so as, for example, to ensure that a given audio object
always ends at the same instant as a given video object, etc.
Constraint (Condition)
A constraint may be attached to an object with regard to its start, end, or
a point marked on it. For example, Constraint(Obj1.start=Obj2.start), Constraint(Obj1.mark1=Obj2.end).
The time length of each object is determined when the highest ranking com-
posite object has been determined (Figure 25).
The time length of this highest ranking composite object is the normal time length of its glue attributes (see [16] for a more detailed description of calculation methods).
[Figure 25: the glue properties of a composite object determine the actual locations (x, y, t) of its components]
[Figure: playback architecture — the Viewer class (with start, stop, pause, and resume interfaces), the Context class, and the Media class, connected to composite objects and the display]
The Media class represents multimedia data. All objects created in the object
composition model belong to this class, or its subclasses. (Sound is an example
of a Media subclass used to manage digital sound data.)
The Context class keeps track of the playback status of each object, such as
"What is the current video data frame?", "What is the current status of audio
data?", "Where is data displayed on the window?". etc .
The Viewer class has the information required for display, such as the posi-
tion coordinates for a window, etc. It also provides convenient interfaces to
programmers, such as play, stop, and pause. Viewer is a general management
class, implemented to manage both audio and video data. It has no subclasses.
12 Xavier can be obtained via anonymous ftp from interviews.stanford.edu in /pub/contrib.
6 CONCLUSION
We have described the basics of multimedia objects and important research
activities in the area. Among the several commercial products that adopt
an object-oriented approach in handling multimedia data are QuickTime[2]
and CommonPoint[40]. Efforts other than MHEG at defining international
standards for multimedia data handling include PREMO[17], HyTime[33] and
HyperODA[21].
Object Composition
One particularly important issue for the design of composite objects is the
question of how best to describe or define the temporal relationships among
their component objects. This is analogous to the issue of how to define similar
temporal relationships that must be specified in such other fields as database
design, cognitive science, and computational geometry.
Much research in this area is based on the interval-based temporal logic pro-
posed by Allen[1]. Allen has shown that there are thirteen primitives necessary to represent any interval relationship. Little et al. extended such interval
relationships to n-ary and reverse temporal relationships[25]. The n-ary model
is particularly elegant in that, without requiring very many levels of hierar-
chy, it captures an arbitrary number of component objects having a common
temporal relationship.
Weiss[43][44] has proposed a data model called algebraic video for composing,
searching, and playing back digital video presentations, and has demonstrated
a prototype system which can create new video presentations with algebraic combinations of video segments.
The Berkeley CMT (Continuous Media Toolkit) tries to address these issues [35]. (See http://www-plateau.cs.berkeley.edu/ for more information on the Berkeley Continuous Media Toolkit; the documents and software can be obtained from ftp://mm-ftp.cs.berkeley.edu/pub/multimedia/.) It introduces a simple mechanism for creating distributed objects, a mechanism for synchronizing distributed objects, and a best-effort protocol for realtime multimedia data transmission on top of the existing UDP/IP protocol.
As we use more and more multimedia data, we need to find more efficient ways
to store and retrieve it, and much research is being conducted in the field of
multimedia databases.
Oomoto and Tanaka[29] have proposed a video-object data model and defined
several operations, such as interval projection, merge, overlap, etc. for com-
posing new video objects. They have also introduced the concept of "interval-
inclusion inheritance," which describes how a video object inherits attribute/value
pairs from another video object.
Further intensive research is necessary with regard to the questions of how best
to construct multimedia databases and how best to integrate multimedia object
technology with them.
REFERENCES
[1] Allen, J.F., "Maintaining Knowledge about Temporal Intervals," Communications of the ACM, Vol. 26, No. 11, pp. 832-842, 1983.
[5] Buchanan M.C., and Zellweger P.T., "Automatic Temporal Layout Mech-
anisms," The proceedings of the ACM Multimedia 93, pp. 341-350, 1993.
[6] Buford, J.F.K., ed., "Multimedia Systems," Addison-Wesley, 1994.
[7] Champeaux, D. de, Lea, D., and Faure, P., "Object-Oriented System De-
velopment," Addison-Wesley, 1993.
[8] Coleman, D., et al., "Object-Oriented Development - The Fusion Method,"
Prentice Hall, 1994.
[9] Coplien, J.O., "Advanced C++ Programming Styles and Idioms," Addison-Wesley, 1992.
[10] Ferrari, D., Banerjea, A., and Zhang, H., "Network Support for Multimedia - A Discussion of the Tenet Approach," Technical Report TR-92-072, International Computer Science Institute, University of California at Berkeley, 1992.
[11] Le Gall, D., "MPEG: a video compression standard for multimedia applications," Communications of the ACM, April 1991, Vol. 34, No. 4, pp. 46-58.
[12] Gibbs, S. "Composite Multimedia and Active Objects," The proceedings
of OOPSLA 91 (Conference on Object-Oriented Programming Systems,
Languages, and Applications), pp. 87-112.
[13] Gibbs, S., Breiteneder, C., and Tsichritzis, D., "Data Modeling of Time-Based Media," The proceedings of the ACM-SIGMOD 1994 Conference on Management of Data, pp. 91-102.
[14] Gibbs, S., and Tsichritzis, D.C., "Multimedia Programming - Objects,
Environments and Frameworks," Addison-Wesley, 1995.
[15] Hamakawa, R., Sakagami, H., and Rekimoto, J., "Audio and Video Extensions to Graphical User Interface Toolkits," Third International Workshop on Network and Operating System Support for Digital Audio and Video, Lecture Notes in Computer Science 712, Springer-Verlag, pp. 399-404, 1992.
[16] Hamakawa, R., and Rekimoto, J., "Object Composition and Playback Models for Handling Multimedia Data," Multimedia Systems, Springer-Verlag, Vol. 2, 1994, pp. 26-35.
[17] Herman, I., Carson, G.S., et al., "PREMO: An ISO Standard for a Presentation Environment for Multimedia Objects," The proceedings of the ACM Multimedia 94, pp. 111-118, 1994.
[18] ISO/IEC, "Specification of Abstract Syntax Notation One (ASN.1)," 2nd
ed, IS 8824, 1990.
[23] Linton, M., Vlissides, J., and Calder, P., "Composing user interfaces with InterViews," Computer, Feb. 1989, pp. 8-22.
[29] Oomoto, E., and Tanaka, K., "OVID: Design and Implementation of a Video-Object Database System," IEEE Transactions on Knowledge and Data Engineering, August 1993, pp. 629-643.
[30] Patel, K., Smith, B.C. and Rowe, L.A. "Performance of a Software MPEG
Video Decoder," The proceedings of ACM Multimedia 93, pp. 75-82.
[31] Pennebaker, W.B. "JPEG still image data compression standard," Van
Nostrand Reinhold, New York, 1992.
[33] DeRose, S.J., and Durand, D.G. "Making Hypermedia Work - A User's
Guide to HyTime," Kluwer Academic Publishers, 1994.
[34] van Rossum, G., "FAQ: Audio File Formats," can be obtained from anonymous ftp at ftp://ftp.cwi.nl/pub/audio as files AudioFormats.part[12], 1995.
[35] Rowe, L.A., Patel, K., et al., "MPEG Video in Software: Representation, Transmission and Playback," The proceedings of IS&T/SPIE 1994 International Symposium on Electronic Imaging: Science and Technology.
[41] Watabe, K., Sakata, S., et al., "Distributed Desktop Conferencing System with Multiuser Multimedia Interface," IEEE Journal on Selected Areas in Communications, Vol. 9, No. 4, pp. 531-539, 1991.
[43] Weiss, R., Duda, A., and Gifford, D.K., "Content-Based Access to Algebraic Video," The proceedings of the International Conference on Multimedia Computing and Systems, pp. 140-151, 1994.
[44] Weiss, R., Duda, A., and Gifford, D.K., "Composition and Search with a Video Algebra," IEEE Multimedia, pp. 12-25, Spring 1995.
[45] Woelk, D., Kim, W., and Luther, W., "An Object-Oriented Approach to Multimedia Databases," The proceedings of the ACM-SIGMOD 1986 Conference on Management of Data, pp. 311-325.
2
COMPRESSION TECHNIQUES
AND STANDARDS
Borko Furht
Department of Computer Science and Engineering,
Florida Atlantic University,
Boca Raton, Florida, U.S.A.
ABSTRACT
This chapter covers multimedia compression techniques and standards: (a) JPEG
compression standard for full-color still image applications, (b) H.261 standard for
video-based communications, and (c) MPEG standard for intensive applications of
full-motion video, such as interactive multimedia. We describe all the components
of these standards including their encoder and decoder architectures. Experimental
data are also presented.
1 INTRODUCTION TO MULTIMEDIA
COMPRESSION
• Relatively slow storage devices which do not allow playing back uncom-
pressed multimedia data (specifically video) in real time, and
• The present network's bandwidth, which does not allow real-time video
data transmission.
• 500 maps (on average 640x480x16 bits = 0.6 MB/map) - total 0.3 GB,
• 60 minutes of stereo sound (176 KB/second) - total 0.6 GB,
• Text 2:1,
• Maps 10:1,
• Animation 50:1,
Figure 1 gives the storage requirements for the encyclopedia before and after
compression. When using compression, storage requirements will be reduced
from 111.1 GB to only 2.96 GB, which is much easier to handle.
Figure 1 Storage requirements for the encyclopedia before and after compression.
The final reason for compression of multimedia data is the limited bandwidth
of present communication networks. The bandwidth of traditional networks
(Ethernet, token ring) is in tens of Mb/sec, which is too low even for the
transfer of only one motion video in uncompressed form. The newer networks,
such as ATM and FDDI, offer a higher bandwidth (in hundreds of Mb/sec
to several Gb/sec), but only few simultaneous multimedia sessions would be
possible if the data is transmitted in uncompressed form.
Modern image and video compression techniques offer a solution to this problem by reducing these tremendous storage requirements. Advanced compression techniques can compress a typical image at ratios ranging from 10:1 to 50:1. Very high compression ratios, of up to 2000:1, can be achieved in compressing video signals.
The lossy techniques are classified into: (1) prediction based techniques, (2)
frequency oriented techniques, and (3) importance-oriented techniques. Pre-
diction based techniques, such as ADPCM, predict subsequent values by ob-
serving previous values. Frequency oriented techniques apply the Discrete Co-
sine Transform (DCT), which is related to the fast Fourier transform. Importance-
oriented techniques use other characteristics of images as the basis for compres-
sion. For example, DVI technique uses color lookup tables and data filtering.
The hybrid compression techniques, such as JPEG, MPEG, and px64, com-
bine several approaches, such as DCT and vector quantization or differential
pulse code modulation. Recently, standards for digital multimedia have been
established based on these three techniques, as illustrated in Table 1.
[Figure: the RGB color cube — gray values lie along the R = G = B diagonal]
The YUV representation is more natural for image and video compression. The RGB to YUV transformation, defined by the CCIR 601 standard, is given by the following equations:

Y = 0.299R + 0.587G + 0.114B   (2.1)

U = B - Y   (2.2)

V = R - Y   (2.3)

where Y is the luminance component, and U and V are two chrominance components.
Chrominance components Cb and Cr are then computed as:

Cb = U/2 + 0.5   (2.4)

Cr = V/1.6 + 0.5   (2.5)

In this way, chrominance components Cb and Cr are always in the range [0,1].
At the input of the encoder, the original unsigned samples, which are in the range [0, 2^p - 1], are shifted to signed integers with range [-2^(p-1), 2^(p-1) - 1]. For example, for a grayscale image, where p = 8, the original samples with range [0, 255] are shifted to the range [-128, +127].
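For p = 8 the level shift is one subtraction per sample (a sketch):

// Shift unsigned samples [0, 255] to signed values [-128, +127].
void levelShift(const unsigned char sample[64], int block[64]) {
    for (int i = 0; i < 64; i++)
        block[i] = (int)sample[i] - 128;
}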
Then, the source image is divided into 8x8 blocks, and samples from each block
are transformed into the frequency domain using the Forward Discrete Cosine Transform, according to the following equation:
F(u,v) = [C(u)/2] [C(v)/2] Σ(x=0..7) Σ(y=0..7) f(x,y) cos[(2x+1)uπ/16] cos[(2y+1)vπ/16]   (2.6)
[Figure 3: block diagrams of the baseline JPEG encoder and decoder]
where:

C(u) = 1/√2 for u = 0; C(u) = 1 for u > 0
C(v) = 1/√2 for v = 0; C(v) = 1 for v > 0
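A direct, unoptimized implementation of equation (2.6) as a sketch; the fast DCT algorithms of [PM93, HM94] restructure exactly this computation:

#include <cmath>

// Forward DCT of one 8x8 block, following equation (2.6).
void fdct(const double f[8][8], double F[8][8]) {
    const double pi = 3.14159265358979323846;
    for (int u = 0; u < 8; u++)
        for (int v = 0; v < 8; v++) {
            double cu = (u == 0) ? 1.0 / std::sqrt(2.0) : 1.0;  // C(u)
            double cv = (v == 0) ? 1.0 / std::sqrt(2.0) : 1.0;  // C(v)
            double sum = 0.0;
            for (int x = 0; x < 8; x++)
                for (int y = 0; y < 8; y++)
                    sum += f[x][y] * std::cos((2 * x + 1) * u * pi / 16.0)
                                   * std::cos((2 * y + 1) * v * pi / 16.0);
            F[u][v] = (cu / 2.0) * (cv / 2.0) * sum;
        }
}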
The F(0,0) coefficient is called the "DC coefficient", and the remaining 63 coefficients are called the "AC coefficients". For a grayscale image, the obtained DCT coefficients are in the range [-1024, +1023], which requires 3 additional bits for their representation compared to the original image samples. Several fast DCT algorithms are proposed and analyzed in [PM93, HM94].
For a typical 8x8 image block, most spatial frequencies have zero or near-zero
values, and need not be encoded. This is illustrated in the JPEG example,
presented later in this section. This fact is the foundation for achieving data
compression.
In the next block, quantizer, all 64 DCT coefficients are quantized using a
64-element quantization table, specified by the application. The quantization
reduces the amplitude of the coefficients which contribute little or nothing to
the quality of the image, with the purpose of increasing the number of zero-
value coefficients. Quantization also discards information which is not visually
significant. The quantization is performed according to the following equation:
Fq(u,v) = Round[F(u,v) / Q(u,v)]   (2.7)
A set of four quantization tables is specified by the JPEG standard for compliance testing of generic encoders and decoders; they are given in Table 2. In the JPEG example, presented later in this section, a quantization formula is used to produce quantization tables.
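Equation (2.7) in code (a sketch):

#include <cmath>

// Quantize all 64 DCT coefficients with a 64-element quantization table.
void quantize(const double F[8][8], const int Q[8][8], int Fq[8][8]) {
    for (int u = 0; u < 8; u++)
        for (int v = 0; v < 8; v++)
            Fq[u][v] = (int)std::lround(F[u][v] / Q[u][v]);  // Round[F(u,v)/Q(u,v)]
}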
After quantization, the 63 AC coefficients are ordered into the "zig-zag" se-
quence, as shown in Figure 4. This zig-zag ordering will help to facilitate
the next phase, entropy encoding, by placing low-frequency coefficients, which
are more likely to be nonzero, before high-frequency coefficients. When the
coefficients are ordered zig-zag, the probability of coefficients being zero is an
increasing monotonic function of the index.

[Table 2: the four quantization tables specified by the JPEG standard for compliance testing of generic encoders and decoders]

The DC coefficients, which represent an average value of the 64 image samples, are encoded using predictive coding techniques, as illustrated in Figure 5.
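The zig-zag ordering described above is usually implemented with a fixed index table; this is the standard 8x8 zig-zag scan order (a sketch):

// Zig-zag scan order for an 8x8 block (row-major indices 0..63);
// index 0 is the DC coefficient, the rest are the 63 AC coefficients.
const int zigzag[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

void zigzagScan(const int block[64], int out[64]) {
    for (int i = 0; i < 64; i++)
        out[i] = block[zigzag[i]];   // low-frequency coefficients come first
}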
Finally, the last block in the JPEG encoder is the entropy coding, which
provides additional compression by encoding the quantized DCT coefficients
into more compact form. The JPEG standard specifies two entropy coding
methods: Huffman coding and arithmetic coding. The baseline sequential
JPEG encoder uses Huffman coding, which is presented next.
The Huffman encoder converts the DCT coefficients after quantization into a compact binary sequence using two steps: (1) forming an intermediate symbol sequence, and (2) converting the intermediate symbol sequence into a binary sequence using Huffman tables.

[Figure 4: the zig-zag ordering of DCT coefficients — DC, then AC01 through AC77, scanned in order of increasing spatial frequency]

[Figure 5: predictive coding of DC coefficients — for block i, the difference DCi - DCi-1 from the previous block's DC value is encoded]
In the intermediate symbol sequence, each nonzero AC coefficient is represented by a pair of symbols: Symbol-1 (RUNLENGTH, SIZE) and Symbol-2 (AMPLITUDE). RUNLENGTH is the number of consecutive zero-valued AC coefficients preceding the nonzero AC coefficient. The value of RUNLENGTH is in the range 0 to 15, which requires 4 bits for its representation.

SIZE is the number of bits used to encode AMPLITUDE. The number of bits for AMPLITUDE is in the range of 0 to 10 bits, so 4 bits are needed to code SIZE.
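A sketch of forming the intermediate symbol sequence (our illustration; the (15,0) symbol used for runs longer than 15 is standard JPEG practice, though not described above):

#include <cstdlib>
#include <vector>

struct Symbol { int runlength, size, amplitude; };

// SIZE = number of bits needed to represent |amplitude| (0 for amplitude 0).
int sizeOf(int amplitude) {
    int a = std::abs(amplitude), bits = 0;
    while (a > 0) { bits++; a >>= 1; }
    return bits;
}

std::vector<Symbol> makeSymbols(const int ac[63]) {
    std::vector<Symbol> symbols;
    int run = 0;
    for (int i = 0; i < 63; i++) {
        if (ac[i] == 0) { run++; continue; }
        while (run > 15) { symbols.push_back({15, 0, 0}); run -= 16; }  // long run
        symbols.push_back({run, sizeOf(ac[i]), ac[i]});
        run = 0;
    }
    symbols.push_back({0, 0, 0});   // (0,0) = EOB terminates the block
    return symbols;
}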
For example, the partial sequence 0,0,0,0,0,0,476 (six consecutive zeros followed by the value 476, which requires 9 bits) is represented by the symbol pair (6,9)(476), and a single zero followed by the value 12 (4 bits) is represented by (1,4)(12). The symbol (0,0) means 'End of block' (EOB) and terminates each 8x8 block.
In the JPEG sequential decoding, all the steps from the encoding process are inverted and implemented in reverse order, as shown in Figure 3. First, an entropy decoder (such as Huffman) is applied to the compressed image data. The binary sequence is converted to a symbol sequence using Huffman tables (VLC coefficients) and VLI decoding, and then the symbols are converted into DCT coefficients. Then, the dequantization is implemented using the following function:

F'(u,v) = Fq(u,v) x Q(u,v)   (2.8)

where Q(u,v) are quantization coefficients obtained from the quantization table.
The Inverse Discrete Cosine Transform is then applied:

f(x,y) = (1/4) Σ(u=0..7) Σ(v=0..7) C(u) C(v) F(u,v) cos[(2x+1)uπ/16] cos[(2y+1)vπ/16]   (2.9)
where:

C(u) = 1/√2 for u = 0; C(u) = 1 for u > 0
C(v) = 1/√2 for v = 0; C(v) = 1 for v > 0
The last step consists of shifting back the decompressed samples to the range [0, 2^p - 1].
There is a trade-off between the compression ratio and the picture quality.
Higher compression ratios will produce lower picture quality and vice versa.
Quality and compression can also vary according to source image characteris-
tics and scene content. A measure for the quality of the picture, proposed in
[Wal95], is the number of bits per pixel in the compressed image (Nb). This measure is defined as the total number of bits in the compressed image divided by the number of pixels:

Nb = (total number of bits in the compressed image) / (number of pixels)   (2.10)

According to this measure, four different picture qualities are defined [Wal91], as shown in Table 3.
Another measure is the root mean square (RMS) error between the original and the decompressed image:

RMS = sqrt[ (1/N) Σ(i=1..N) (f_i - g_i)^2 ]   (2.12)

where:

f_i - the value of pixel i in the original image,
g_i - the value of pixel i in the decompressed image, and
N - the total number of pixels.

The RMS shows the statistical difference between the original and decompressed images. In most cases the quality of a decompressed image is better with lower RMS. However, in some cases it may happen that the quality of a decompressed image with higher RMS is better than one with lower RMS.
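The RMS measure of equation (2.12) in code (a sketch):

#include <cmath>
#include <cstddef>

double rms(const unsigned char* original, const unsigned char* decompressed,
           std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; i++) {
        double d = (double)original[i] - (double)decompressed[i];
        sum += d * d;                 // squared pixel difference
    }
    return std::sqrt(sum / n);        // root of the mean squared difference
}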
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        Q[i][j] = 1 + (1 + i + j) * quality;
The parameter quality specifies the quality factor, and its recommended range
is from 1 to 25. Quality = 1 gives the best quality, but the lowest compression
ratio, and quality = 25 gives the worst quality and the highest compression
ratio. In this example, we used quality = 2, which generates the quantization
table shown in Figure 6d.
Quantization table for quality = 2 (left) and the quantized DCT coefficients of the sample block (right):

 3  5  7  9 11 13 15 17        61 -3  2  0  2  0  0 -1
 5  7  9 11 13 15 17 19         4 -4  2  0  0  0  0  0
 7  9 11 13 15 17 19 21        -1 -2  0  0 -1  0 -1  0
 9 11 13 15 17 19 21 23         0  0  1  0  0  0  0  0
11 13 15 17 19 21 23 25         0  0  0  0  0  0  0  0
13 15 17 19 21 23 25 27         0  0 -1  0  0  0  0  0
15 17 19 21 23 25 27 29         0  0  0  0  0  0  0  0
17 19 21 23 25 27 29 31         0  0  0  0  0  0  0  0
The Huffman table used in this example is the one proposed in the JPEG standard for luminance AC coefficients [PM93]; the partial table needed to code the symbols from Figure 6g is given in Table 4.
(RUNLENGTH, SIZE)   Codeword
(0,0) EOB           1010
(0,1)               00
(0,2)               01
(0,3)               100
(1,2)               11011
(2,1)               11100
(3,1)               111010
(4,1)               111011
(5,2)               11111110111
(6,1)               1111011
(7,1)               11111010
Note that the DC coefficient is treated as being from the first 8x8 block in the
image, and therefore it is coded directly (not using predictive coding as all the
remaining DC coefficients).
The described sequential JPEG algorithm can be easily expanded for compres-
sion of color images, or in a general case for compression of multiple-component
images. The JPEG source image model consists of 1 to 255 image components
[Wal91, Ste94], called color or spectral bands, as illustrated in Figure 7.
Figure 7 JPEG color image model.
For example, both RGB and YUV representations consist of three color com-
ponents. Each component may have a different number of pixels in the hor-
izontal (Xi) and vertical (Yi) axes. Figure 8 illustrates two cases of a color
image with 3 components. In the first case, all three components have the
same resolutions, while in the second case they have different resolutions.
Block diagrams of the encoder and decoder for color JPEG compression are
identical to those for grayscale image compression, shown in Figure 3, except
the first block into encoder is a color space conversion block (for example, RGB
to YUV conversion), and at the decoder side the last block is the inverse color conversion, such as YUV to RGB.

Figure 8 A color image with 3 components: (a) with same resolutions, (b) with different resolutions.
Figure 9 An example of JPEG compression using FAU's JPEG codec, at quality level 10. The codec was developed by D. Schenker [F+95].
Figure 10 User interface for FAU's JPEG codec and compression results for quality factors 1, 5, and 10.
Figure 11 FAU's JPEG codec results for quality factor 25, and for three and one DCT coefficients.
The px64 Kbps compression standard is intended to cover the entire ISDN
channel capacity (p = 1, 2, ... 30). For p = 1 to 2, due to limited available
bandwidth, only desktop face-to-face visual communications (videophone) can
be implemented using this compression algorithm. However, for p ≥ 6, more
complex pictures are transmitted, and the algorithm is suitable for video con-
ferencing applications.
Component          CIF            QCIF
Luminance (Y)      288 x 352      144 x 176
Chrominance (Cb)   144 x 176      72 x 88
Chrominance (Cr)   144 x 176      72 x 88
Cr = Br / BA = (3 Mbits/sec) / (64 Kbits/sec) ≈ 47
Assuming a frame rate of 30 frames/sec, the required bandwidth for the trans-
mission of videoconferencing data becomes:
Cr = Br / BA = (36.4 Mbits/sec) / (640 Kbits/sec) ≈ 57
Algorithm Structure
The px64 video compression algorithm combines intraframe and interframe
coding to provide fast processing for on-the-fly video. The algorithm proceeds as follows.
The algorithm begins by coding an intraframe block using the DCT transform
coding and quantization (intraframe coding), and then sends it to the video
multiplex coder. The same frame is then decompressed using the inverse quan-
tizer and IDCT, and then stored in the picture memory for interframe coding.
During the interframe coding, the prediction based on the DPCM algorithm
is used to compare every macro block of the actual frame with the available
macro blocks of the previous frame, as illustrated in Figure 13. To reduce the
encoding delay, only the closest previous frame is used for prediction.
Then, the difference, called the error terms, is DCT-coded and quantized, and
sent to the video multiplex coder with or without the motion vector. At the
final step, entropy coding (such as Huffman encoder) is used to produce more
[Figure: block diagram of the px64 encoder — the video multiplex coder (Huffman encoder) produces the compressed video stream; motion vectors are fed to the coder]
compact code. For interframe coding, the frames are encoded using one of the
following three techniques:
A typical px64 decoder, shown in Figure 14, consists of the receiver buffer, the Huffman decoder, inverse quantizer, IDCT block, and the motion-compensation predictor, which includes frame memory.
[Figure 14: block diagram of the px64 decoder — receiver buffer, VLC decoder, inverse quantizer, IDCT, and a motion-compensation predictor with frame memory and a switchable loop filter]
The motion estimation algorithms are briefly discussed in the next section on
the MPEG algorithm.
Each of the layers contains headers, which carry information about the data
that follows. For example, a picture header includes a 20-bit picture start code,
video format (CIF or QCIF), frame number, etc. A detailed structure of the
headers is given in [Lio91].
[Figure: positions of luminance and chrominance samples, and the hierarchical block structure — a macroblock (MB) consists of four Y blocks plus Cb and Cr; a group of blocks (GOB) is 3 x 11 MBs; QCIF contains 3 GOBs and CIF contains 12 GOBs]
intended for compression of full-motion video consisting of small frames and requiring slow refresh rates. The data rate required is 9-40 Kbps, and the target applications include interactive multimedia and video telephony. This standard requires the development of new model-based image coding techniques for human interaction and low-bit-rate speech coding techniques [Ste94].
The MPEG algorithm is intended for both asymmetric and symmetric applica-
tions. Asymmetric applications are characterized by frequent use of the decom-
pression process, while the compression process is performed once. Examples
include movies-on-demand, electronic publishing, and education and training.
Symmetric applications require equal use of the compression and decompression
processes. Examples include multimedia mail and videoconferencing.
When the MPEG standard was conceived, the following features were identified as important: random access, fast forward/reverse searches, and reverse playback, among others.
The MPEG standard consists of three parts: (1) synchronization and multi-
plexing of video and audio, (2) video, and (3) audio.
Frame Structures
In the MPEG standard, frames in a sequence are coded using three different
algorithms, as illustrated in Figure 17.
I frames (intra images) are self-contained and coded using a DCT-based technique similar to JPEG. I frames are used as random access points in MPEG streams, and they give the lowest compression ratios within MPEG.

P frames (predicted images) are coded using forward predictive coding, where the actual frame is coded with reference to a previous frame (I or P). This process is similar to H.261 predictive coding, except the previous frame is not always the closest previous frame, as in H.261 coding (see Figure 13). The compression ratio of P frames is significantly higher than that of I frames.
[Figure 17: a sequence of I, P, and B frames with forward and bidirectional prediction]
Note that in Figure 17, the first 3 B frames (2, 3 and 4) are bidirectionally
coded using the past frame I (frame 1), and the future frame P (frame 5).
Therefore, the decoding order will differ from the encoding order. The P frame
5 must be decoded before B frames 2, 3 and 4, and I frame 9 before B frames
6, 7 and 8. If the MPEG sequence is transmitted over the network, the actual
transmission order should be {I, 5, 2, 3, 4, 9, 6, 7, 8}.
(I B B P B B P B B) (I B B P B B P B B) ...
Motion Estimation
The coding process for P and B frames includes the motion estimator, which
finds the best matching block in the available reference frames. P frames are
always using forward prediction, while B frames are using bidirectional predic-
tion, also called motion-compensated interpolation, as illustrated in Figure 18 [A+93b].
[Figure 18: motion-compensated interpolation — a block B in the current frame is predicted from the matching block A in the previous frame and block C in the future frame as B = (A + C)/2]
Motion estimation is used to extract the motion information from the video
sequence. For every 16 x 16 block of P and B frames, one or two motion
vectors are calculated. One motion vector is calculated for P and forward
and backward predicted B frames, while two motion vectors are calculated for
interpolated B frames.
The MPEG standard does not specify the motion estimation technique; however, block-matching techniques are likely to be used. In block-matching techniques,
the goal is to estimate the motion of a block of size (n x m) in the present
frame in the relation to the pixels of the previous or the future frames. The
block is compared with a corresponding block within a search area of size (m + 2p) x (n + 2p) in the previous (or the future) frame, as illustrated in Figure 19a. In a typical MPEG system, a match block (or a macroblock) is 16 x 16 pixels (n = m = 16), and the parameter p = 6 (Figure 19b).
[Figure 19: (a) a search area of size (m + 2p) x (n + 2p) around an n x m block; (b) for a 16 x 16 macroblock with p = 6, the search area is 28 x 28 pixels]
Many block matching techniques for motion vector estimation have been de-
veloped and evaluated in the literature, such as: (a) the exhaustive search (or
brute force) algorithm, (b) the three-step-search algorithm [K+81, L+94], (c)
the 2-D logarithmic search algorithm [JJ81], (d) the conjugate direction search
algorithm [SR85], (e) the parallel hierarchical 1-D search algorithm [C+91],
and (f) the modified pixel-difference classification, layered structure algorithm
[C+94] . These algorithms are described in detail in [FSZ95] .
These block matching techniques for motion estimation obtain the motion vec-
tor by minimizing a cost function. The following cost functions have been
proposed in the literature:
MAD(dx,dy) = (1/mn) Σ(i=-n/2..n/2) Σ(j=-m/2..m/2) |F(i,j) - G(i+dx, j+dy)|   (2.13)
where:

F(i,j) - represents a macroblock of size (n x m) pixels in the present frame,
G(i,j) - represents the same macroblock from a reference frame (past or future), and
(dx, dy) - the displacement being evaluated.

For a typical MPEG system, with m = n = 16 and p = 6, the MAD function becomes:
MAD(dx,dy) = (1/256) Σ(i=-8..8) Σ(j=-8..8) |F(i,j) - G(i+dx, j+dy)|   (2.14)

where dx = {-6, ..., 6} and dy = {-6, ..., 6}.
MSD(dx,dy) = (1/mn) Σ(i=-n/2..n/2) Σ(j=-m/2..m/2) [F(i,j) - G(i+dx, j+dy)]^2   (2.15)
To reduce the computational complexity of the MAD, MSD, and CCF cost functions, Gharavi and Mills have proposed a simple block matching criterion, called Pixel Difference Classification (PDC) [GM90]. The PDC criterion is defined as:

PDC(dx,dy) = Σ(i) Σ(j) T(dx, dy, i, j)

where T(dx, dy, i, j) is the binary representation of the pixel difference, defined as:

T(dx, dy, i, j) = 1 if |F(i,j) - G(i+dx, j+dy)| ≤ t, and 0 otherwise

where t is a predefined threshold.
In this way, each pixel in a macro block is classified as either a matching pixel
(T=l), or a mismatching pixel (T=O). The block that maximizes the PDC
function is selected as the best matched block.
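As an illustration of these criteria, here is an exhaustive-search sketch using the MAD cost of equation (2.14) (our code; the caller must keep the search window inside the frame):

#include <cstdlib>

// Exhaustive block matching for one 16x16 macroblock with search range p.
// F: current frame, G: reference frame, both width pixels wide, row-major.
// (bx, by): top-left corner of the macroblock in F.
void bestMotionVector(const unsigned char* F, const unsigned char* G,
                      int width, int bx, int by, int p,
                      int* bestDx, int* bestDy) {
    double bestMad = 1e30;
    for (int dx = -p; dx <= p; dx++)
        for (int dy = -p; dy <= p; dy++) {
            double sum = 0.0;
            for (int i = 0; i < 16; i++)
                for (int j = 0; j < 16; j++)
                    sum += std::abs((int)F[(by + i) * width + (bx + j)]
                                  - (int)G[(by + dy + i) * width + (bx + dx + j)]);
            double mad = sum / 256.0;            // MAD over the 16x16 block
            if (mad < bestMad) { bestMad = mad; *bestDx = dx; *bestDy = dy; }
        }
}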
For the bidirectional prediction used for B frames, the prediction is formed as:

G(i,j) = [G1(i + dx1, j + dy1) + G2(i + dx2, j + dy2)] / 2   (2.20)

where G1(i,j) is the same macroblock in a previous frame, (dx1, dy1) is the corresponding motion vector, G2(i,j) is the same macroblock in a future frame, and (dx2, dy2) is its corresponding motion vector.
The difference between predicted and actual macroblocks, called the error terms E(i,j), is then calculated as:

E(i,j) = F(i,j) - G(i,j)
I frames are created similarly to JPEG encoded pictures, while P and B frames
are encoded in terms of previous and future frames. The motion vector is
estimated, and the difference between the predicted and actual blocks (error
terms) are calculated. The error terms are then DCT encoded and finally the
entropy encoder is used to produce the compact code.
[Figure: block diagram of the MPEG encoder — an RGB->YUV color space converter, FDCT, quantization, and entropy encoder; for P and B frames the error terms are FDCT-coded, with the picture type controlling the path]
[Figure: block diagram of the MPEG decoder — buffer, VLC/FLC decoder and demultiplexer, inverse quantizer, and IDCT, with previous and future picture stores, motion vectors, and inter/intra and picture-type controls]
A stereo audio signal, sampled at 44.1 KHz with 16 bits/sample, requires a data rate of about 1.4 Mbits/s [Pen93]. Therefore, there is a need to compress audio data as well.
The MPEG audio compression algorithm comprises the following three operations:
• The audio signal is first transformed into the frequency domain, and the
obtained spectrum is divided into 32 non-interleaved subbands.
• For each subband, the amplitude of the audio signal is calculated, as well
as the noise level is determined by using a "psychoacoustic model". The
psychoacoustic model is the key component of the MPEG audio encoder
and its function is to analyze the input audio signal and determine where
in the spectrum the quantization noise should be masked.
• Finally, each subband is quantized according to the audibility of quantiza-
tion noise within that band.
The MPEG audio encoder and decoder are shown in Figure 22 [Pen93, Ste94].
[Figure 22: the MPEG audio encoder (filter bank, psychoacoustic model, quantization, and bit-stream packing) and decoder (bit-stream unpacking, frequency sample reconstruction, and frequency-to-time mapping)]
The input audio stream simultaneously passes through a filter bank and a psychoacoustic model. The filter bank divides the input into multiple subbands, while the psychoacoustic model determines the masking thresholds that drive the quantizer.
The MPEG audio standard specifies three layers for compression: layer 1 repre-
sents the most basic algorithm and provides the maximum rate of 448 Kbits/sec,
layers 2 and 3 are enhancements to layer 1 and offer 384 Kbits/sec and 320
Kbits/sec, respectively. Each successive layer improves the compression perfor-
mance, but at the cost of greater encoder and decoder complexity.
At the beginning of the sequence layer there are two entries: the constant bit
rate of a sequence and the storage capacity that is needed for decoding. These
parameters define the data buffering requirements. A sequence is divided into
a series of GOPs. Each GOP layer has at least one I frame as the first frame in
GOP, so random access and fast search are enabled. GOPs can be of arbitrary
structure (I, P and B frames) and length. The GOP layer is the basic unit for
editing an MPEG video stream.
The picture layer contains a whole picture (or a frame). This information
consists of the type of the frame (I, P, or B) and the position of the frame in
display order.
The bits corresponding to the DCT coefficients and the motion vectors are
contained in the next three layers: slice, macroblock, and block layers. The
block is a (8x8) DCT unit, the macroblock is a (16x16) motion compensation
unit, and the slice is a string of macroblocks of arbitrary length. The slice layer
is intended to be used for resynchronization during a frame decoding when bit
errors occur.
5 CONCLUSION
In this chapter we presented three video and image compression standards and
related algorithms applied in multimedia applications: JPEG, H.261 (or px64),
and MPEG. The most popular uses of JPEG image compression technology include photo ID systems, telecommunications of images, military image systems,
and distributed image management systems.
Four distinct applications of the compressed video, based on H.261 and MPEG
standards, can be summarized as: (a) consumer broadcast television, (b) con-
sumer playback, (c) desktop video, and (d) videoconferencing.
Consumer broadcast television includes home digital video delivery and typi-
cally requires a small number of high-quality compressors and a large number
of low-cost decompressors. Expected compression ratio is about 50:1.
Desktop video, which includes systems for authoring and editing video presen-
tations, is a symmetrical application requiring the same number of encoders
and decoders. The expected compression ratio is relatively small in the range
2-50:1.
Other promising multimedia compression techniques, which are not yet standards, include wavelet-based compression, subband coding, and fractal compression [FSZ95].
REFERENCES
[A+93b] R. Aravind, G. L. Cash, D. C. Duttweller, H.-M. Hang, B. G. Haskel,
and A. Puri, "Image and Video Coding Standards", AT&T Technical Jour-
nal, Vol. 72, January/February 1993, pp. 67-88.
[FSZ95] B. Furht, S. W. Smoliar, and H. J. Zhang, "Video and Image Processing in Multimedia Systems," Kluwer Academic Publishers, Norwell, MA, 1995.
1 INTRODUCTION
Every scientist and engineer dreams of reducing a problem to a few basic prin-
ciples. Although our understanding of the human interface will continue to
grow, it is unlikely that we will ever see a time when there is a simple, compre-
hensive model for the user interface. User interfaces are becoming increasingly
diverse in their construction. Computer users, unless they are programmers,
use languages designed for specific applications or languages based on interac-
tion techniques, such as point-and-click, which require very little training to
use. The interfaces commonly used now by computer users, the mouse, key-
board and screen, will be confined to those with desk jobs, and others will use a
variety of input devices, from voice and pen to virtual reality interfaces. In the
future, many of our interfaces will receive input through observing users rather
than waiting for human action as in interfaces for tracking and teleconferencing.
2 WHAT IS AN INTERFACE?
sending the operating system messages; or, we may be communicating with the
application.
Dannenberg's Piano Tutor has an interface that can hear the notes of a piano
and compare them to notes on a page [16]. The output media for the Piano
Tutor are: video, pre-recorded voice, computer graphics display, and synthe-
sized music. Outputs may include text (remediation) appearing over notes on
a scale, diagrams of various types, voice comments, and music. The inputs are
music played by the student and selections made from menus. The inputs and
output are scattered over various physical locations.
The purpose of multimedia interfaces is to make communication between computers and humans more expressive. Learning how to use multimedia interfaces will take time. Computers have abilities that people don't have, of course, or we would never build computers; we would just hire people to do the job. In addition to speed and endurance, computers have sensory
and computational abilities people don't have. Devices can be fashioned for
computers that give them keener ears, better eyes, etc. In spite of all of this,
we have not been able to design computers that have the human abilities for
abstract thought and basic comprehension.
3 DESIGNING MULTIMEDIA
INTERFACES
User interfaces are hard to build. In this section we will examine some of the
reasons why this is the case. To design and build a user interface, the following
steps are generally executed: 1) Identify the needs of the users; 2) Construct
a scenario; 3) Find a metaphor; 4) Provide a design rationale; 5) Design the
system; 6) Build a prototype; 7) Test it; and 8) Modify the design.
For the multimedia interface, there is the added complexity of more component
parts and a more difficult integration. But the greatest difficulty of all is the
design of a multimedia interface that takes advantage of multiple senses.
We are only just beginning to learn how to do that. Not only is the interface
itself changing with the new technologies brought about by multimedia, but
the way we design interfaces is changing. Before media are selected or a syntax
and semantics for the interface chosen, what does the designer do?
The problem of understanding the user's needs and work patterns is partic-
ularly acute in the design of multimedia interfaces where there may not be
similar pre-existing tools to study and the user may be trying to simultane-
ously coordinate several different devices required for input or output. The
techniques now known as contextual design began to emerge in the late 1980's
[49]. The characteristic technique that has evolved from these designers is to
observe the user, perhaps questioning, but never interfering, in the context of a working environment, using the methodology of ethnography, a subdiscipline
of anthropology. Bodker [8] argues that the user interface fully reveals itself to
us when it is in use. Interface designers now refer to artifacts, that is, products
of human activity, usually tools with which we do our work. More formally,
"Artifacts are things that mediate the action of a human being toward another
subject or an object." Springmeyer, Blattner & Max [39] used these techniques
to discover how to design tools for scientific data analysis, while Springmeyer
[40] analyzed the process and designed prototype tools using contextual design
methods for her investigation. Border resources are socially shared environments of work that establish a genre or a style for the work. Increasingly, the border of the work context has become more important [10].
What is a metaphor? We have certain experiences in our daily lives with which we are very familiar; for example, throwing something into a wastebasket. We can "throw out" or destroy a file by dragging its icon to an icon of a wastebasket. The theory is that by understanding the characteristics of real-world objects with which we are familiar, we can quickly learn a computer interface by analogy. Text editors were designed to be similar to typewriters; form interfaces were lifted directly from the forms used in offices. The desktop metaphor was the basis for the interface developed at Xerox PARC and is now used widely on the Mac, Sun, and Microsoft Windows. However, metaphors are like old friends that have become an embarrassment because of their bad manners; user interface designers are quick to disassociate themselves from
them. Nelson [33] says, "I have never personally seen a desktop where pointing
at a lower piece of paper makes it jump to the top." He believes that the
problem with metaphors is that we wish to design things that are not like
physical objects, and the details of whose behavior may float free.
We must start building the interface from some consistent, coherent model of
how the interface is going to work. Some psychologists believe that learning by
analogy may be the only way humans learn [1]. Some applications are more
obviously associated with a metaphor than others and often the metaphor is
already embedded in an application. If not, a metaphor is still not a bad place to
start. The problem is coming up with appropriate metaphors that don't mislead
the user [20]. We must examine carefully whether the metaphor has sufficient
structure, if it is suitable to the user's background and level of knowledge, if it
can be easily represented, and if it can be extended when required. Metaphors fail where there are mismatches. In order to bridge the gap between a metaphor and
a mismatch, the interface designer often resorts to a composite metaphor [11].
But even this ultimately fails and more mismatches occur. Nardi and Zarmer
[31] discovered through ethnographic studies that "users do not do their work
from full-blown 'mental models,' but instead incrementally develop external,
physical models of their problems. The external models focus cognition and
problem-solving activity."
It has been shown that the greater the extent to which a computer possesses
characteristics that are associated with humans, the greater the extent to which
individuals will use rules derived from human interaction when interacting with
computers [32]. Do we want human faces staring out at us in our interface? Do
we want our computers to ask us how we are feeling today? Like the oversim-
plified tunes of singing commercials used in radio years ago, an oversimplified
character in our interface will annoy. This is a sign of poor design, not of
the evils of anthropomorphism. Brennan [18] concludes that we should stop
worrying about anthropomorphism and work on making systems coherent and
usable. Literature, films, and theater all create characters that draw us into
their virtual worlds. The interactivity of computers is just bringing a new
dimension into the simulation of reality and perhaps we find this threatening.
Anyone who has tried to search videotapes for particular events is familiar with
the difficulty of identifying and searching for shots or cuts, that is, one or more
contiguous frames representing a continuous action in time and space. Now
magnify this problem to searching archives that have thousands of videos or
films. Some machine-readable annotation of video has been attempted, but
only the SMPTE time code accompanies motion picture images as generated.
Logs are sometimes created after the video or film has been completed. The
problem is to create video annotations that allow the user to search for and to
retrieve items of interest. At this point, the philosophy of how to attack the
problem diverges greatly. Some researchers devise systems where annotation
can be done automatically in machine-readable form. Of course, in this case, the descriptions are limited to changes that can be detected by a machine, such as time, space, color, and shape. The IMPACT project [45] creates descriptions of cut separations, motion of cameras and filmed objects, tracks and contour lines of objects, existence of objects, and periods of existence, recorded automatically. The basic structure of video information in the IMPACT system is hierarchical, with scenarios on top, scenes in the middle, and cuts on the bottom. The cuts,
scenes and scenarios are linked by hyperlinks. Teodosio and Bender [43] use
a method of salient stills to extract information that is representative of a
sequence of video frames. Salient still images do not represent one moment in time, as do photographs or a single video frame, but rather an aggregate of the temporal changes that occur in a moving image sequence, in which salient features are preserved. The resulting image is a composite of images and may not look like any one particular frame. It is a complicated technique based on optical flow and signal processing; however, salient stills can be created without user intervention.
In Media Streams, the annotations using these categories do not describe video
clips, but are themselves temporal extents describing content within a video
stream. As stream-based annotations they support multiple layers of overlap-
ping descriptions which, unlike clip-based annotations, enable video to be dy-
namically resegmented at query time. Video clips change from being fixed
segmentations of the video stream, to being the results of retrieval queries into
a semantic network of stream-based annotations. Unlike keyword-based video
retrieval systems, Media Streams supports query of annotated video according
to its temporal, semantic, and relational structure. A query for a video sequence will not only search the annotated video streams for a matching sequence, but can also return newly resegmented sequences assembled from the stream-based annotations.
The ability to hear sound is one of our basic senses. In spite of this, sound has
been slow to progress as an integral part of the human-computer interface.
There are many historical reasons why this is the case: the letters of the al-
phabet, when typed into a keyboard, were easily interpreted into binary form for textual display upon a screen or printed page. Voice input has many difficulties when used as an input medium, comparable to the difficulties of handwritten input, where errors occur because the input is not precise. In the past, nonspeech audio has been associated with music and has not been used for conveying information, with certain exceptions such as bugle calls, fog horns, talking drums, etc., which were not universally known and were limited in scope [5].
Figure 2 The Icon Space for Media Streams is the interface for the selection, compounding, and grouping of iconic descriptors.
Figure 3 The Media Time Line is the core browser and viewer in Media
Streams.
• harmonic content
- pitch and register (tone, melody, harmony)
- waveshape (sawtooth, square, ...)
- timbre, filters, vibrato, and equalization
- intensity/volume/loudness
- envelope (attack, decay, sustain, release)
• timing
- duration, tempo, repetition rate, duty cycle, rhythm, syncopation
• spatial location
- direction (azimuth, elevation)
- distance/range
• ambience: presence, resonance, reverberance, spaciousness
• representationalism: literal, abstract, mixed.
Sampled sounds are digital recordings of sounds which we can hear. These
sounds have the advantage of immediate recognizability and ease of implemen-
tation into computer interfaces. Synthesized sounds are those sounds which are
created algorithmically on a computer. They can be made to sound similar to
real-world sounds through sound analysis (such as Fourier analysis) and trial-
and-error methods. Since synthesized sounds are created algorithmically, it is
easy to modify such a sound in real time by altering attributes like amplitude
(volume), frequency (pitch), or the basic waveform function (timbre). Further-
more, it is easy to add modulation of amplitude or frequency in real time to
create the effects of vibrato or tremolo without changing the basic sound. It
is for these reasons that sound synthesis is so popular in music creation today.
Synthesized sounds offer a high degree of flexibility with a reasonable amount
of ease. A drawback of synthesized sound is that each algorithm used typi-
cally mimics some sounds very well and others not as well. For instance, bell
sounds can be synthesized very well using a ring modulation algorithm but that
same algorithm cannot produce the waveform necessary to make the sound of
a French horn.
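Because these modifications are purely algorithmic, they are easy to express in code. The sketch below is a minimal illustration with assumed rates and depths: vibrato modulates the instantaneous frequency, tremolo modulates the amplitude, and ring modulation (multiplying two sinusoids) yields the inharmonic partials characteristic of bell sounds.

import math

SAMPLE_RATE = 44100  # samples per second

def tone(freq, seconds, vibrato_hz=0.0, vibrato_depth=0.0,
         tremolo_hz=0.0, tremolo_depth=0.0):
    """Sine tone with optional vibrato (FM) and tremolo (AM)."""
    samples, phase = [], 0.0
    for n in range(int(seconds * SAMPLE_RATE)):
        t = n / SAMPLE_RATE
        # vibrato: slowly modulate the instantaneous frequency
        f = freq * (1.0 + vibrato_depth * math.sin(2 * math.pi * vibrato_hz * t))
        phase += 2 * math.pi * f / SAMPLE_RATE
        # tremolo: slowly modulate the amplitude
        amp = 1.0 - tremolo_depth * (0.5 + 0.5 * math.sin(2 * math.pi * tremolo_hz * t))
        samples.append(amp * math.sin(phase))
    return samples

def ring_modulate(freq_a, freq_b, seconds):
    """Bell-like timbre: the product of two sinusoids produces sum and
    difference frequencies (inharmonic partials)."""
    n = int(seconds * SAMPLE_RATE)
    return [math.sin(2 * math.pi * freq_a * i / SAMPLE_RATE) *
            math.sin(2 * math.pi * freq_b * i / SAMPLE_RATE) for i in range(n)]

expressive = tone(440.0, 1.0, vibrato_hz=6, vibrato_depth=0.01,
                  tremolo_hz=6, tremolo_depth=0.2)
bell = ring_modulate(440.0, 562.0, 1.0)  # non-integer ratio -> bell-like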
Because sampled sounds are digital recordings, any sound that can be heard
can be produced with extremely high accuracy. However, the amount of work
required to attain equal flexibility in modification, compared with synthesized
sounds, is very high. Typically, sampled sounds are modified only in amplitude
(volume) and frequency (pitch). When sampled sounds are altered in frequency,
care must be taken to modify the sound only within a certain range, because the sound usually loses its unique acoustic characteristics in proportion to the amount of deviation from the originally sampled sound. Other modifications to the sound, such as modulation, are typically not done because they require too much computation and because they produce sounds which are no longer identified with the original source.
Music
A powerful use of music is found in film scores. Music comes to bear in helping to realize the meaning of the film, in stimulating and guiding the emotional response to the visuals. Music serves as a kind of cohesive, filling in empty spaces in the action or dialogue. The color and tone of music can give a picture more richness and vitality and pinpoint emotions and actions [44]. It is the
ability of music to influence an audience subconsciously that makes it truly
valuable to the cinema. Finally, audio can reflect the sounds of the scene in
which the picture is placed. Music specific to particular cultures is used in the
study of history, geography, and anthropology. A scene placed in a geographical
context may be enhanced by local music. Care must be taken when music is
used in programs that are used frequently, because music can be annoying if
the same piece is heard repetitively.
Speech
Speech is required for detailed and specific information. It is through speech
(rather than through other sounds) that we communicate precise and abstract
ideas. Speech may be used as input as well as output in the computer interface.
Recent advances in speech recognition systems have made it possible to use
a natural speech style and to allow casual users to work easily with a speech
system [37]. In spite of this, very little is known about building successful speech
interfaces for two-dimensional displays, let alone three-dimensional interfaces.
Auditory displays
Auditory displays include the interpretation of data into sound, such as the
association of tones with charts, graphs, and algorithms, or sound in scientific
visualization. These auditory display techniques were used to enable the lis-
tener to picture in his or her mind real-world objects or data. An example of
auditory display is SoundGraphs [28], in which points on an x-y graph were translated into sonic equivalents, with pitch representing the y-axis and time the x-axis. (A nonlinear correction factor was used.) Recently, Blattner, Greenberg, and
Kamegai [3] enhanced the turbulence of fluids with sound, where audio was
tied to the various aspects of fluid flow and vortices.
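A minimal sketch of this kind of auditory display: sweep along the data in time (the x-axis) and map each y-value to a pitch. The frequency range and note duration are assumptions, and the nonlinear correction used in SoundGraphs is omitted.

def sonify(ys, f_low=220.0, f_high=880.0, note_ms=80):
    """Map a data series to (frequency, duration) pairs for playback."""
    lo, hi = min(ys), max(ys)
    span = (hi - lo) or 1.0
    notes = []
    for y in ys:                      # successive points -> successive notes
        frac = (y - lo) / span        # y-value -> position in pitch range
        notes.append((f_low + frac * (f_high - f_low), note_ms))
    return notes

# A rising-then-falling curve is heard as a pitch sweep up and back down.
for freq, ms in sonify([0, 1, 4, 9, 16, 9, 4, 1, 0]):
    print(f"{freq:6.1f} Hz for {ms} ms")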
Sound can also alert users to events, for example, that a meeting begins at 2 pm or that a syntax error has occurred. Auditory signals are detected more quickly than visual signals and produce an alerting or orienting effect [47].
Nonspeech signals are used in warning systems and aircraft cockpits. Alarms
and sirens fall into this category, but these have been used throughout history,
long before the advent of electricity. Examples are military bugle calls, post-
horns, and church bells that pealed out time and announcements of important
events. Work on auditory icons was done by Gaver [22] and earcons by Blattner,
Sumikawa, and Greenberg [2]. Auditory icons use sampled real-world sounds of objects hitting, breaking, and tearing for messages. Gaver used the term "everyday listening" to explain our familiarity with the sounds of common objects around us. Because most messages are abstract, auditory icons use analogy or simple association to map to their meaning. Earcons are musical fragments, called motives, whose musical parameters are varied to obtain a variety of related sounds. Earcons are described in greater detail in Section 4.3.
Musical motives are most commonly used as earcons; however, earcons can be any auditory message, such as real-world sounds, single notes, or sampled sounds of musical instruments.
To display more than one earcon, the temporal location of each earcon with respect to the others has to be identified. Two primary methods are used: overlaying and sequencing of earcons [4]. Some sort of merging or melding into a new sound could also be considered; for example, the pitches of two notes can be combined into a third pitch. Programs typically play audio without regard to the overall auditory system state. As a result, voices may be played simultaneously or they may occur with several nonspeech messages, making the auditory display incoherent. An audio server is being constructed that blends the sounds of voice, earcons, music, and real-world sounds in a way that will make each auditory output intelligible [35].
Many abstract messages do not have appropriate iconic images, and the association of the auditory icon with its message must be learned. Several experiments have shown that earcons are preferred over many other types of sonification and can be used successfully.
Brewster, Wright, and Edwards (1993) found earcons to be an effective form of auditory communication. They recommended these six basic changes in earcon form to make earcons more easily recognizable by users: 1) Use synthesized musical timbres; 2) Pitch changes are most effective when used with rhythm changes; 3) Changes in register should be several octaves; 4) Rhythm changes must be as different as possible; 5) Intensity levels must be kept close; and 6) Successive earcons should have gaps between them.
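The sketch below shows, under an assumed note representation, how a family of earcons can be derived from a single motive in the spirit of [2] and the guidelines above: register and rhythm transformations yield related but distinguishable family members, and sequenced earcons are separated by silent gaps.

def transpose(motive, semitones):
    """Register change; the guidelines suggest shifts of an octave or more."""
    return [(p if p is None else p + semitones, d) for p, d in motive]

def stretch(motive, factor):
    """Rhythm change; rhythms should differ as much as possible."""
    return [(p, d * factor) for p, d in motive]

def sequence(earcons, gap_beats=0.5):
    """Concatenate earcons, inserting an audible gap between them."""
    out = []
    for e in earcons:
        out.extend(e)
        out.append((None, gap_beats))   # None marks a rest
    return out

# Each note is (semitone offset, duration in beats); None is a rest.
file_motive = [(0, 0.5), (4, 0.5), (7, 1.0)]       # parent "file" earcon
file_created = transpose(file_motive, +12)         # child: up one octave
file_deleted = stretch(transpose(file_motive, -12), 2.0)

print(sequence([file_created, file_deleted]))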
Earcons are necessarily short because they must be learned and understood
quickly. Earcons were designed to take advantage of chunking mechanisms and
hierarchical structures that favor retention in human memory. Furthermore,
they use recognition rather than recall. The tests run by Brewster, Wright,
and Edwards had no training period associated with them. Tested subjects
heard earcons only once before the test. In spite of this, the subjects could use
them effectively. If earcons are to be used by the majority of computer users,
they must be learned and understood as quickly as possible, taking advantage
of all techniques that may help the user recognize them.
In telepresence, cameras and other sensors, such as microphones, interact at a remote site to give the user the experience of being in another location. Several technologies are related to VR and telepresence: in see-through environments, the view of the scene is overlaid with a drawing or other graphics, and desktop VR, displayed on a desktop monitor, gives the impression of looking at a virtual scene through a window. VR is not new: the original ideas go back to 1965 or earlier, but only recently has technology developed to a point where these ideas can be implemented for practical use [42]. VR is the ultimate multimedia interface, and that is why it is included in this overview; however, because of its distinctive hardware and interaction styles, VR is often put in a category of its own, apart from multimedia.
A teleoperated microsurgical robot is being used for eye surgery in the MSR-1
system [23]. Visual, auditory, and mechanical information is relayed in both directions between the master unit and the slave. Images are relayed to a surgeon wearing a helmet and to an adjacent screen. There is also a stereo tone whose amplitude and/or frequency is a function of the forces
experienced at the tool-tissue interface. The surgeon holds pseudotools, shaft-
shaped like microsurgical scalpels, that project from the left and right limbs of
a mechanical master. The pseudotools control the behavior of the microsurgical
tools that perform the actual surgery. The movements of the left and right limbs of the pseudotools are scaled down by a factor of 1 to 100 at the microsurgical tools, which are wielded by a microsurgical robot that performs the surgery. The master
and slave subsystems communicate through a computer system that enhances
and augments images, filters hand tremor, performs coordinate transformation
of the surgeon's hand motions, and makes safety checks. The limbs of the master and robot slave are in one-to-one correspondence. The operator feels not only the magnitude, but also the direction of the micromotion and the resistance
experienced by the slave during operation. The surgeon is able to feel through
the pseudotools the forces experienced by the microtools. Students can train
using a mannequin (see Figure 4).
Figure 4 The microsurgical robot MSR-1 and the associated virtual environment for eye surgery.
Unlike the personal computers that were donated to schools or made available at low prices, distance learning is pricey indeed. While the use of computers in the classroom required teachers to learn computer skills, distance learning does not necessitate a great deal of technical training on the part of those using the systems. At the university level, distance learning can provide specialized courses and resources unavailable at a single educational institution. For those of us actively involved with bringing distance learning to full-time students in higher education, the obstacles are fierce: faculty apathy, academic rules hindering the creation of intercampus programs, coordination difficulties, and, most of all, the enormous expense of obtaining equipment for two-way video transmission, that is, teleconferencing equipment. But this situation will
change rapidly as communication costs drop and new high-speed networks are
installed. The benefits of this new technology will be magnified even more
as we move distance education into high schools, junior high, and elementary
schools. The one-room schoolhouse can become the one-world schoolhouse.
With a teleconferencing center instead of traditional schoolrooms, classes can be taught statewide, and materials can be drawn from all over the world. The local teacher will become a coordinator and a tutor, selecting materials, evaluating student progress, and assigning classes for students. The first basic requirement for all of this is high-speed communication at reasonable cost.
Assistive computer products are virtually all multimedia. There is a great di-
versity among input/output devices used for these purposes. Much of the work
for assistive technologies comes from university and government research laboratories and uses new research in multimedia, artificial intelligence, wireless communications, and, of course, psychology, physiology, and the simulation and modeling of physical systems. In the remainder of this section, a few examples of
some highly ingenious interfaces for the disabled are described. The emphasis
here is on novelty to show the diverse problems associated with providing tech-
nical assistance to the disabled. One of the products below has touch output,
while the other responds to weight as input and emits sounds as output.
and closed-captioned television (for entertainment and news). Ralph can give
a person who is deaf/blind more options for receiving information, enhance the privacy of their conversations, and remove their total dependence on an interpreter for communication.
The BBB is an input device: a plastic, waterproof blanket a baby lies on.
The BBB has imbedded in it 12 switches regularly spaced over a 3 foot by 4
foot grid. The blanket has two layers of polymer with urethane sponge material
between the layers. The switches are connected to a cable that is hooked up to
a Macintosh. As the baby moves, the BBB makes various sounds in response to
the movements (see Figure 6). Normal infants gradually develop the concept of
cause and effect through exploration of their environment. Physically disabled
infants are limited in their ability to explore their environments and to vocalize
and are slow to deduce the nature of interactions with objects. The BBB
enables motor-impaired infants to experiment with their environments and to
develop cause-effect skills found in normal infants.
natural input devices were technically beyond our ability to produce. (There is a subtle difference between voice and speech: voice is the ability to vocalize, which enables us to speak; speech is a system of sounds that we interpret for the purpose of communication.) Times have changed, and a wide variety of input devices are now available that take advantage of human senses. Certainly
the success of the mouse was partly responsible for an examination of the use
of gesture in the computer interface. Mouse input is difficult to control, and
nearly impossible to use for handwriting. The pen is considered a natural and
ergonomic computer input device; it is small, unobtrusive, flexible, and can
be manipulated easily by casual users. However, current pen interfaces are
awkward and difficult to use. Oviatt [34] makes the case that pen alone may
never be the universal interface that replaces the keyboard and mouse as the
primary input device. Handwriting is slower than typing, and recognition of
all pen-written symbols is error-prone and ambiguous [13].
Many believe that speech interfaces will be the primary means of communicat-
ing with a computer. However, speech understanding has recognition problems
that make speech a difficult modality for input. Also, speech is not an effective
technology for the input and manipulation of graphical objects. When speech
and gesture are combined through the use of pen and voice, something surprising happens: they complement each other and overcome the problems experienced by each modality. Oviatt [34] recommends that the following strategy
be adopted in connection with pen and voice interfaces: "Combine naturally
complementary modalities in a manner that optimizes the individual strengths
of each, while simultaneously overcoming each of their weaknesses."
The integration of the two modalities of speech and gesture has been studied in the context of collaborative work, where people communicate with one another with the assistance of a computer. The integration of speech and gesture
other with the assistance of a computer. The integration of speech and gesture
when the problem is limited to a person communicating with a computer is not
understood. Issues in speech input are tied to those of natural language under-
standing, which is plagued with problems such as resolving anaphoric references, word sense disambiguation, the attachment of prepositional phrases, and the lack of ability to reason [13]. If the language used as input is constrained to a non-natural language, the user may have to go through a learning procedure similar to learning command lines in a text-based interface. Nevertheless,
speaking to a computer has many advantages: speech is faster than even typewriter input for all but the most proficient typists; speech can be precise; and speech is spontaneous.
commands. The designers of PenPoint did not recommend the use of voice
input, except as an adjunct tool to PenPoint. Some very early work that com-
bined natural language and pointing in an intelligent interface used a touch screen to specify graphics and accept input from touch. A well-known interface called "Put-That-There" manipulated the display with voice and gesture input [9]. The whole question of manipulating
displays with voice and gesture is being re-examined 15 years after this seminal
article was published. There are some preliminary investigations into a unified
view of language that communicates through verbal, visual, tactile, or gestural
information. Written and spoken natural language is often accompanied by
pictures, gestures, and sounds other than spoken words. Blattner and Milota
[6] are investigating a language that integrates pen and speech input. To make the system truly useful, it must be usable with only pen input, only voice input, or the combination of both. Pen and voice input may be the primary way we communicate with computers in the future, but the issues are so poorly understood that it will be many years before usable pen-voice systems are developed.
5 CONCLUSION
Pandora was left with a box she was instructed not to open. But her curiosity
overcame her good sense and she finally opened the box. Malevolent forces emerged from the box to plague the world and cause mischief; only hope was left in the box.

A computer, like a box, is a device for storing information; when one examines it, a thousand demonic problems come flying out. In this chapter some of the
difficult problems of how to construct interfaces and how we might cope with
them were considered. Whether they will turn malevolent and plague us, or
bring solutions that make our lives easier, happier, and richer remains to be
seen. Hope remains with us.
Acknowledgements
This work was performed with partial support of NSF Grant IRI-9213823 and
under the auspices of the U.S. Department of Energy by Lawrence Livermore
National Laboratory under Contract No. W-7405-Eng-48.
REFERENCES
[1] John R. Anderson, "Cognitive Psychology and Its Implications." Second
Edition, New York: W. H. Freeman and Company, 1985.
[2] M. M. Blattner, D. A. Sumikawa, and R. M. Greenberg, "Earcons and Icons: Their Structure and Common Design Principles," Human-Computer Interaction, Vol. 4, No. 1, 1989, pp 11-44.
[3] Meera M. Blattner, Robert M. Greenberg, and Minao Kamegai, "Listen-
ing to Turbulence: An Example of Scientific Audiolization," Multimedia
Interface Design, (M. Blattner and R. Dannenberg, eds), ACM Press, New
York and Addison-Wesley, Reading, Massachusetts, 1992, pp 87-102.
[4] Meera M. Blattner, Albert L. Papp III, and Ephraim P. Glinert, "Sonic Enhancement of Two-Dimensional Graphic Displays," Auditory Displays: The Proceedings of the First International Conference on Auditory Display (G. Kramer, ed.), Addison-Wesley, Santa Fe Institute Series, Reading, Massachusetts, 1994.
[5] M. M. Blattner and R. M. Greenberg, "Communicating and Learning Through Non-Speech Audio," Multimedia Interface Design in Education, A. Edwards and S. Holland (Eds), Springer-Verlag, NATO ASI Series F, 1992, pp 133-143.
[6] Meera M. Blattner and Andre D. Milota, "Multimodal Interfaces Using Pen and Voice Input," submitted for publication.
[7] Meera M. Blattner, "In Our Image: Interface Design in the 1990s," IEEE
Multimedia, Vol. 1, No.1, IEEE Press, 1994, pp 25-36.
[8] Susanne Bodker, "A Human Activity Approach to User Interfaces,"
Human-Computer Interaction, Vol. 4, Lawrence Erlbaum, pp 171-195.
[9] R. A. Bolt, "Put-That-There: Voice and Gesture at the Graphics Interface," ACM Computer Graphics, 14(3), 1980, pp 262-270.
[10] John Seely Brown and Paul Duguid, "Borderline Issues: Social and Material Aspects of Design," Human-Computer Interaction, Vol. 9, No. 1, Lawrence Erlbaum, pp 3-36.
[15] Roger Dannenberg and Meera Blattner, "Introduction to the Book," Multimedia Interface Design, (M. Blattner and R. Dannenberg, Eds), ACM Press, New York and Addison-Wesley, Reading, Massachusetts, 1992, pp xvii-xxv.
[19] Herbert Dreyfus and Stuart Dreyfus, "Mind over Machine: The Power of Human Intuition and Expertise in the Era of the Computer," The Free Press, New York, 1986.
[21] Harriet J. Fell, Hariklia Delta, Regina Peterson, Linda J. Ferrier, Zehra Mooraj, and Megan Valleau, "Using the Baby-Babble-Blanket for Infants with Motor Problems," ACM ASSETS '94, October 31-November 1, 1994, Marina del Rey, CA.
[24] Hiroshi Ishii, Minoru Kobayashi, and K. Arita, "Iterative Design of Seamless Collaboration Media," Communications of the ACM (CACM), Special Issue on Internet Technology, ACM, Vol. 37, No. 8, August 1994, pp 83-97.
[25] David L. Jaffe, "An Overview of Programs and Projects at the Rehabilitation Research and Development Center," ACM ASSETS '94, October 31-November 1, 1994, Marina del Rey, CA, pp 69-76.
[26] Gregory Kramer, editor, "Auditory Display: the Proceedings of the First
International Conference on Auditory Display," Addison-Wesley, Santa Fe
Institute Series, Reading, MA, 1994.
[27] Brenda Laurel, Tim Oren, and Abbe Don, "Issues in Multimedia Interface
Design: Media Integration and Interface Agents," Multimedia Interface
Design, (M. Blattner and R. Dannenberg, eds), ACM Press, New York
and Addison-Wesley, Reading, Massachusetts, 1992, pp 53-64.
[35] Albert L. Papp and Meera M. Blattner, "A Centralized Audio Presenta-
tion Manager," Auditory Displays: Proceedings of the 2nd International
Conference on Auditory Display, 1995, Addison-Wesley, Santa Fe Series,
In Press.
[41] Lucy Suchman, "Plans and Situated Actions: The Problem of Human-
Machine Communication," Cambridge: Cambridge University Press, 1987.
[43] Laura Teodosio and Walter Bender, "Salient Video Stills: Content and
Context Preserved," ACM Multimedia '93, August 1-6, 1993, Anaheim,
California, pp. 39-46.
[44] Tony Thomas, "Music for the Movies," A.S. Barnes and Company, New
York, 1973.
[45] Hirotada Ueda, Takafumi Miyatake, Shigeo Sumino and Akio Nagasaka,
"Automatic Structure Visualization for Video Editing," ACM INTERCHI
'93, April 24-29, 1993, Amsterdam, pp 137-141.
[46] Benjamin Watson, "A Survey of Virtual Reality in Japan," Presence, Vol.
3, No. 1, Winter 1994, MIT Press, pp 1-18.
[49] Dennis Wixon, Karen Holtzblatt, and Stephen Knox, "Contextual Design: An Emergent View of System Design," Human Factors in Computing Systems, ACM CHI '90, Seattle, WA, April 1-5, 1990, pp 329-336.
4
MULTIMEDIA STORAGE SYSTEMS
Harrick M. Vin* and P. Venkat Rangan**
*Department of Computer Sciences, University of Texas at Austin, USA
**University of California at San Diego, USA
ABSTRACT
Multimedia storage servers provide access to multimedia objects including text, im-
ages, audio, and video. Due to the real-time storage and retrieval requirements and
the large storage space and data transfer rate requirements of digital multimedia,
however, the design of such servers fundamentally differs from that of conventional storage servers. The architectures and algorithms required for designing digital multimedia storage servers are the subject matter of this chapter.
1 INTRODUCTION
Recent advances in computing and communication technologies have made it
feasible and economically viable to provide on-line access to a variety of infor-
mation sources such as books, periodicals, images, video clips, and scientific
data. The architecture of such services consists of multimedia storage servers
that are connected to client sites via high-speed networks [13]. Clients of such
a service are permitted to retrieve multimedia objects from the server for real-
time playback at their respective sites. Furthermore, the retrieval may be
interactive, in the sense that clients may stop, pause, resume, and even record
and edit the media information if they have permission to do so.
• Large data transfer rate and storage space requirements: Playback of digital video and audio consumes data at a very high rate (see Table 1).
Thus, a multimedia service must provide efficient mechanisms for storing,
retrieving and manipulating data in large quantities at high speeds.
• Real-time storage and retrieval: Digital audio and video (often referred
to as "continuous" media) consist of a sequence of media quanta (such as
video frames or audio samples) which convey meaning only when presented
continuously in time. This is in contrast to a textual object, for which spa-
tial continuity is sufficient. Furthermore, a multimedia object, in general,
may consist of several media components whose playback is required to be
temporally coordinated.
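The arithmetic behind both requirements is simple but unforgiving, as the sketch below illustrates for two assumed formats (CD-quality stereo audio and uncompressed 640x480, 24-bit, 30 frames/s video; these are illustrative choices, not entries from Table 1):

def audio_rate_bps(sample_hz=44100, channels=2, bits=16):
    return sample_hz * channels * bits

def video_rate_bps(w=640, h=480, bits_per_pixel=24, fps=30):
    return w * h * bits_per_pixel * fps

for name, bps in [("CD-quality audio", audio_rate_bps()),
                  ("uncompressed video", video_rate_bps())]:
    gbytes_per_hour = bps * 3600 / 8 / 1e9
    print(f"{name:>18}: {bps / 1e6:7.2f} Mbit/s, {gbytes_per_hour:6.1f} GB/hour")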
The main goal of this chapter is to provide an overview of the various issues in-
volved in designing a digital multimedia storage server, and present algorithms
for addressing the specific storage and retrieval requirements of digital multime-
dia. Specifically, to manage the large storage space requirements of multimedia
data, we examine techniques for efficient placement of media information on
individual disks, large disk arrays, as well as hierarchies of storage devices
(Section 3). To address the real-time recording and playback requirements, we
discuss a set of admission control algorithms which a multimedia server may employ before accepting a new client (Section 4).
Therefore, the challenge for the server is to keep enough data in stream buffers at all times so as to ensure that the playback processes do not starve [8]. In the simplest case, since the data transfer rates of disks are significantly higher than the real-time data rate of a single stream (e.g., the maximum throughput of modern disks is of the order of 3-4 MBytes/s, while that of an MPEG-2 encoded video stream is 0.42 MBytes/s, and uncompressed CD-quality stereo audio is about 0.2 MBytes/s), employing a modest amount of buffering will enable conventional file and operating systems to support continuous storage and retrieval of isolated media streams.
server must employ flexible placement strategies that minimize copying of media
information during editing.
In order to explore the viability of various placement models for storing digital
continuous media on conventional magnetic disks, let us first briefly review
some of the fundamental characteristics of magnetic disks. Generally, magnetic
disks consist of a collection of platters, each of which is composed of a number
of circular recording tracks (see Figure 1). Platters spin at a constant rate.
Moreover, the amount of data recorded on tracks may increase from the innermost track to the outermost track (e.g., zoned disks). The storage space of
each track is divided into several disk blocks, each consisting of a sequence of
physically contiguous sectors. Each platter is associated with a read/write head
that is attached to a common actuator. A cylinder is a stack of tracks at one
actuator position.
Figure 1 Architecture of a magnetic disk: platters with circular recording tracks, read/write heads attached to a common actuator, and a common direction of rotation.
In such an environment, the access time of a disk block consists of three com-
ponents: seek time, rotational latency, and data transfer time. Seek time is the
time needed to position the disk head on the track containing the desired data,
and is a function of the initial start-up cost to accelerate the disk head as well
as the number of tracks that must be traversed. Rotational latency, on the
other hand, is the time for the desired data to rotate under the head before it
can be read or written, and is a function of the angular distance between the
current position of the disk head and the location of the desired data, as well as
the rate at which platters spin. Once the disk head is positioned at the desired
disk block, the time to retrieve its contents is referred to as the data transfer
time, and is a function of the disk block size and data transfer rate of the disk.
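The sketch below expresses this three-component model in code. The linear seek model (start-up cost plus a per-track component) follows the description above, but every parameter value is an illustrative assumption rather than a measured figure.

def access_time_ms(tracks_to_cross, block_bytes,
                   seek_startup_ms=3.0, seek_per_track_ms=0.01,
                   rpm=5400, transfer_bytes_per_s=4e6,
                   rotation_fraction=0.5):
    seek = seek_startup_ms + seek_per_track_ms * tracks_to_cross
    rotation = rotation_fraction * (60_000.0 / rpm)   # ms per revolution
    transfer = block_bytes / transfer_bytes_per_s * 1000.0
    return seek + rotation + transfer

# e.g., a 64 KB block 200 tracks away, with half a rotation of latency:
print(f"{access_time_ms(200, 64 * 1024):.2f} ms")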
Clearly, the contiguous and random placement models represent two ends of a
spectrum: Whereas the former does not permit any separation between succes-
sive media blocks of a stream on disk, the latter does not impose any constraints
at all. Recently, an efficient generalization of these two extremes, referred to
as the constrained placement policy, have also been proposed [16]. The main
objective of constrained placement policy is to ensure continuity of retrieval,
as well as reduce the average seek time and rotational latency incurred while
accessing successive media blocks of a stream by bounding the size of each me-
dia block as well as the separation between successive media blocks on disk.
Such a placement policy is particularly attractive when the block size must be
small (e.g., when utilizing a conventional file system with block sizes tailored
for text). However, implementation of such a system may require elaborate al-
gorithms to ensure that the separation between blocks conforms to the required
constraints. Furthermore, for constrained placement to yield its full benefits, the
scheduling algorithm must retrieve all the blocks for a given stream at once
before switching to any other stream.
To effectively utilize a disk array, a multimedia server must interleave the stor-
age of each media stream among the disks in the array. The unit of data
interleaving, referred to as a media block, denotes the maximum amount of log-
ically contiguous data that is stored on a single disk (this has also been referred
to as the striping unit in the literature [7]). Successive media blocks of a stream
are placed on consecutive disks using a round-robin allocation algorithm.
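A minimal sketch of this round-robin mapping (the function name and the notion of a per-stream starting disk are illustrative):

def disk_for_block(block_index, num_disks, first_disk=0):
    """Disk of the array holding a stream's block_index-th media block."""
    return (first_disk + block_index) % num_disks

D = 8  # disks in the array
print([disk_for_block(b, D) for b in range(12)])
# -> [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3]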
Each media block may contain either a fixed number of media units or a fixed
number of storage units (e.g., bytes) [5, 10, 26]. If each media stream stored on the array is encoded using a variable bit rate (VBR) compression technique, the storage space requirement may vary from one media unit to another. Hence, a media block containing a fixed number of media units will, in general, vary in size from one block to another.
The most appealing feature of the variable-size block placement policy is that,
regardless of the playback rate requirements, if we assume that a server services
clients by proceeding in terms of periodic rounds during which it accesses a fixed
number of media units for each client, then the retrieval of each individual video
stream from disk proceeds in lock-step. That is, each client accesses exactly one disk during each round, and consecutive disks in the array are accessed by the same set of clients during successive rounds. In such a scenario, the server can partition the set of clients into D logical groups (where D is the number
of disks in the array), and then admit a new client by assigning it to the most
lightly loaded group. Such a simple policy balances load across the disks in
the array, and thereby maximizes the number of clients that can be serviced
simultaneously by the server. However, the key limitations of the variable-
size block placement policy include: (1) the inherent complexity of allocating
and deallocating variable-size media blocks, and (2) the higher implementation
overheads. Thus, although the variable-size block placement policy is highly
attractive for designing multimedia storage servers for predominantly read-only
environments (e.g., video on-demand), it may not be viable for the design of
integrated multimedia file systems (in which multimedia documents are created,
edited, and destroyed very frequently).
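A sketch of this load-balancing admission step (the representation of group loads is an assumption):

def admit_client(group_loads):
    """Assign a new client to the most lightly loaded of the D logical
    groups and return the chosen group index."""
    g = min(range(len(group_loads)), key=lambda i: group_loads[i])
    group_loads[g] += 1
    return g

loads = [5, 3, 4, 3]                # current clients per group (D = 4 disks)
print(admit_client(loads), loads)   # -> 1 [5, 4, 4, 3]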
In the simplest case, such a hierarchical storage manager may utilize fast mag-
netic disks to cache frequently accessed data. In such a scenario, there are
several alternatives for managing the disk system. It may be used as a staging
area (cache) for the tertiary storage devices, with entire media streams being
moved from the tertiary storage to the disks when they need to be played back.
On the other hand, it is also possible to use the disks to only provide storage
for the beginning segments of the multimedia streams. These segments may
be used to reduce the startup latency and to ensure smooth transitions in the
playback [14].
In such a scenario, if a high percentage of clients access data stored in a local (or nearby) cache, the perceived performance will be sufficient to meet the demands of continuous media. On the other hand, if the user accesses are unpredictable or have poor reference locality, then most accesses will require retrieval of information from tertiary storage devices, thereby significantly degrading the
performance [6, 11, 19].
4 EFFICIENT RETRIEVAL OF
MULTIMEDIA OBJECTS
Due to the periodic nature of multimedia playback, a multimedia server can
service multiple clients simultaneously by proceeding in terms of rounds. Dur-
ing each round, the multimedia server retrieves a sequence of media blocks for
each stream, and the rounds are repeatedly executed until the completion of all the requests. The number of blocks of a media stream retrieved during a round depends on its playback rate requirement, as well as the buffer space
availability at the client [25]. Ensuring continuous retrieval of each stream re-
quires that the service time (i.e., the total time spent in retrieving media blocks
during a round) does not exceed the minimum of the playback durations of the
blocks retrieved for each stream during a round. Hence, before admitting a new client, the server must verify that the continuity requirements of the clients already in service will not be violated. Specifically, if $f_i$ denotes the number of frames of stream $S_i$ retrieved during each round and $R_i^{pl}$ its playback rate (in frames/s), then the round duration $\mathcal{R}$ is bounded by the minimum of the playback durations:

$$\mathcal{R} = \min_{i \in [1,n]} \left( \frac{f_i}{R_i^{pl}} \right)$$
Additionally, let us assume that media blocks of each stream are placed on disk
using the random placement policy, and that the multimedia server is employing
the SCAN disk scheduling policy, in which the disk head moves back and forth
across the platter and retrieves media blocks in either increasing or decreasing
order of their track numbers.
The service time $\tau$ of a round in which $k_i$ media blocks are retrieved for each of the $n$ clients can then be expressed as:

$$\tau = b \cdot C + \left( a + l^{max} \right) \cdot \sum_{i=1}^{n} k_i \qquad (4.1)$$

where $b \cdot C$ bounds the aggregate seek time of one SCAN sweep across the disk's $C$ cylinders, and $a + l^{max}$ bounds the per-block overhead (fixed seek start-up cost, rotational latency, and transfer time).
Ensuring continuous retrieval of each stream requires that the total service time per round does not exceed the minimum of the playback durations of the $k_1, k_2, \ldots, k_n$ blocks [3, 8, 17, 21, 25, 27]. We refer to this as the deterministic admission control principle, which can be formally stated as:
$$b \cdot C + \left( a + l^{max} \right) \cdot \sum_{i=1}^{n} k_i \le \mathcal{R} \qquad (4.2)$$
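A sketch of this deterministic test, using the service-time bound of Equation (4.1) as reconstructed above ($b \cdot C$ as the SCAN sweep seek bound and $a + l^{max}$ as the per-block overhead); all numeric values are illustrative assumptions.

def admissible(blocks_per_client, round_duration_s,
               b_seek_per_track_s=1e-5, cylinders=2000,
               per_block_overhead_s=0.015):
    """Admit only if the worst-case service time fits in the round."""
    service_time = (b_seek_per_track_s * cylinders +
                    per_block_overhead_s * sum(blocks_per_client))
    return service_time <= round_duration_s

# Can 20 clients, each needing 4 blocks per round, fit in a 1.5 s round?
print(admissible([4] * 20, 1.5))   # -> True (1.22 s worst case)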
Notice, however, that due to the human perceptual tolerances as well as the
inherent redundancy in continuous media streams, most clients of a multime-
dia server are tolerant to brief distortions in playback continuity as well as
occasional loss of media information. Therefore, providing deterministic ser-
vice guarantees to all the clients is superfluous. Furthermore, the worst-case
assumptions that characterize deterministic admission control algorithms need-
lessly constrain the number of clients that can be serviced simultaneously, and
hence, lead to severe under-utilization of server resources.
To clearly explain this algorithm, let us assume that the service requirements of client $i$ are specified as a percentage $p_i$ of the total number of frames that must be retrieved on time. Moreover, let us assume that each media stream may be encoded using a variable bit rate compression technique (e.g., JPEG, MPEG, etc.). In such a scenario, the number of media blocks that contain $f_i$ frames of stream $S_i$ may vary from one round to another. This difference, when coupled with the variation in the relative separation between blocks, yields different service times across rounds. In fact, while servicing a large number of clients, the service time may occasionally exceed the round duration (i.e., $\tau > \mathcal{R}$). We refer to such rounds as overflow rounds. Given that each client may have requested a different quality of service (i.e., different values of $p_i$), meeting all of their service requirements will require the server to delay the retrieval of or discard (i.e., not retrieve) media blocks of some of the more tolerant clients during overflow rounds¹. Consequently, to ensure that the statistical quality of service requirements of clients are not violated, a multimedia server must employ admission control algorithms that restrict the occurrence of such overflow rounds by limiting the number of clients admitted for service.
To precisely derive an admission control criterion that meets the above requirement, observe that for rounds in which $\tau \le \mathcal{R}$, none of the media blocks need to be discarded. Therefore, the total number of frames retrieved during such rounds is given by $\sum_{i=1}^{n} f_i$. During overflow rounds, however, since a few media blocks may have to be discarded or delayed to yield $\tau \le \mathcal{R}$, the total number of frames retrieved will be smaller than $\sum_{i=1}^{n} f_i$. Given that $p_i$ denotes the percentage of frames of stream $S_i$ that must be retrieved on time to satisfy the service requirements of client $i$, the average number of frames that must be retrieved during each round is given by $p_i \cdot f_i$. Hence, assuming that $q$ denotes the overflow probability (i.e., $P(\tau > \mathcal{R}) = q$), the service requirements of the clients will be satisfied if:

$$q \cdot \mathcal{F}_0 + (1 - q) \sum_{i=1}^{n} f_i \ge \sum_{i=1}^{n} p_i \cdot f_i \qquad (4.3)$$
where $\mathcal{F}_0$ denotes the number of frames that are guaranteed to be retrieved during overflow rounds. The left hand side of Equation (4.3) represents the lower bound on the expected number of frames retrieved during a round, and the right hand side denotes the average number of frames that must be accessed during each round so as to meet the service requirements of all clients. Clearly, the effectiveness of this admission control criterion, measured in terms of the number of clients that can be serviced simultaneously, depends on accurately deriving the overflow probability $q$ and the guaranteed frame count $\mathcal{F}_0$.
¹ The choice between delaying or discarding media blocks during overflow rounds is application dependent. Since both of these policies are mathematically equivalent, in this chapter we will analyze only the discarding policy.
The overflow probability $q$ can be computed as:

$$q = P(\tau > \mathcal{R}) = \sum_{k = k_{min}}^{k_{max}} P(\tau_k > \mathcal{R}) \cdot P(B = k) \qquad (4.4)$$

where $\tau_k$ denotes the service time of a round in which $k$ media blocks are retrieved, and $B$ denotes the total number of media blocks accessed during a round. Using empirically derived distribution functions, the probability $P(\tau_k > \mathcal{R})$, for various values of $k$, can be easily computed.
• Client load characterization: Since $f_i$ frames of stream $S_i$ are retrieved during each round, the total number of blocks $B$ required to be accessed is dependent on the frame size distributions of the streams. Specifically, if the random variable $B_i$ denotes the number of media blocks that contain $f_i$ frames of stream $S_i$, then the total number of blocks to be accessed during each round is given by:

$$B = \sum_{i=1}^{n} B_i$$
Since $B_i$ is only dependent on the frame size variations within stream $S_i$, the $B_i$'s denote a set of $n$ independent random variables. Therefore, using the central limit theorem, we conclude that the distribution function $g_B(b)$ of $B$ approaches a normal distribution [15]. Furthermore, if $\eta_{B_i}$ and $\sigma_{B_i}$ denote the mean and standard deviation of random variable $B_i$, respectively, then the mean and standard deviation for $B$ are given by:

$$\eta_B = \sum_{i=1}^{n} \eta_{B_i}, \qquad \sigma_B^2 = \sum_{i=1}^{n} \sigma_{B_i}^2 \qquad (4.5)$$
Consequently,

$$P(B \le b) = g_B(b) \approx N\left( \frac{b - \eta_B}{\sigma_B} \right) \qquad (4.6)$$
where $N$ is the standard normal distribution function. Additionally, since the $B_i$'s denote discrete random variables that take only integral values, they can be categorized as lattice-type random variables [15]. Hence, using the central limit theorem, the point probabilities $P(B = k)$ can be derived as:

$$P(B = k) \approx \frac{1}{\sigma_B \sqrt{2\pi}} \, e^{-\frac{(k - \eta_B)^2}{2 \sigma_B^2}} \qquad (4.7)$$
Finally, computing the overflow probability $q$ using Equation (4.4) requires the values of $k_{min}$ and $k_{max}$. If $b_i^{min}$ and $b_i^{max}$, respectively, denote the minimum and the maximum number of media blocks that may contain $f_i$ frames of stream $S_i$, then the values of $k_{min}$ and $k_{max}$ can be derived as:
$$k_{min} = \sum_{i=1}^{n} b_i^{min}, \qquad k_{max} = \sum_{i=1}^{n} b_i^{max} \qquad (4.8)$$
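Putting Equations (4.4) through (4.8) together, the sketch below estimates the overflow probability $q$. For $P(\tau_k > \mathcal{R})$ it substitutes the deterministic service-time bound of the reconstructed Equation (4.1); all parameter values are illustrative assumptions.

import math

def overflow_probability(eta_b, sigma_b, k_min, k_max, round_s,
                         b_seek_per_track_s=1e-5, cylinders=2000,
                         per_block_overhead_s=0.015):
    q = 0.0
    for k in range(k_min, k_max + 1):
        tau_k = (b_seek_per_track_s * cylinders +
                 per_block_overhead_s * k)               # Eq. (4.1)
        if tau_k > round_s:                              # an overflow round
            q += (math.exp(-((k - eta_b) ** 2) / (2 * sigma_b ** 2))
                  / (sigma_b * math.sqrt(2 * math.pi)))  # Eq. (4.7)
    return q

# 30 clients averaging ~3 blocks per round: eta_B = 90, sigma_B = 6.
print(f"q = {overflow_probability(90.0, 6.0, 60, 120, 1.5):.4f}")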
Determination of $\mathcal{F}_0$

The maximum number of frames $\mathcal{F}_0$ that are guaranteed to be retrieved during
an overflow round is dependent on: (1) the number of media blocks that are
guaranteed to be accessed from disk within the round duration 'R, and (2) the
relationship between the media block size and the maximum frame sizes.
Specifically, the service time for retrieving $k$ media blocks during a round is bounded by:

$$\tau = b \cdot C + \left( a + l^{max} \right) \cdot k$$
Since T ::; 'R, the number of media blocks, k d , that are guaranteed to be
retrieved during each round is bounded by:
$$k_d \le \frac{\mathcal{R} - b \cdot C}{a + l^{max}} \qquad (4.9)$$
Now, assuming that $f(S_i)$ denotes the minimum number of frames that may be contained in a block of stream $S_i$, the lower bound on the number of frames accessed during an overflow round is given by:

$$\mathcal{F}_0 = k_d \cdot \min_{i \in [1,n]} f(S_i) \qquad (4.10)$$

When a new client $n+1$ requests service, the admission control algorithm must determine:
1. The mean and the standard deviation of the number of media blocks that may contain $f_{n+1}$ frames of stream $S_{n+1}$ (denoted by $\eta_{B_{n+1}}$ and $\sigma_{B_{n+1}}$, respectively), to be used in Equations (4.5) and (4.6);
2. The minimum and the maximum number of media blocks that may contain $f_{n+1}$ frames of stream $S_{n+1}$ (denoted by $b_{n+1}^{min}$ and $b_{n+1}^{max}$, respectively), to be used in Equation (4.8); and

3. The minimum number of frames $f(S_{n+1})$ that may be contained in a media block of stream $S_{n+1}$, to be used in deriving $\mathcal{F}_0$.
Since all of these parameters are dependent on the distribution of frame sizes
in stream Sn+l, the server can simplify the processing requirements at the time
of admission by precomputing these parameters while storing the media stream
on disk.
These values, when coupled with the corresponding values for all the clients already being serviced, as well as the predetermined service time distributions, will yield new values for $q$ and $\mathcal{F}_0$. The new client is then admitted for service if the newly derived values for $q$ and $\mathcal{F}_0$ satisfy the admission control criterion:

$$q \cdot \mathcal{F}_0 + (1 - q) \sum_{i=1}^{n+1} f_i \ge \sum_{i=1}^{n+1} p_i \cdot f_i$$
4.3 Discussion
In addition to the deterministic algorithms (which provide strict performance
guarantees by making worst-case assumptions regarding the performance re-
quirements) and the statistical admission control algorithms (that utilize pre-
cise distributions of access times and playback rates), other admission control
algorithms have been proposed in the literature. One such algorithm is the
adaptive admission control algorithm proposed in [22, 23]. As per this algo-
rithm, a new client is admitted for service only if predictions based on measurements of the current performance characteristics of the server indicate that the service requirements of all the clients can be met satisfactorily. It is based on the assumption that the average amount of time spent in the retrieval of each media block (denoted by $\eta$) does not change significantly even after a new client is admitted by the server. In fact, to enable the multimedia server to accurately predict the amount of time expected to be spent retrieving media blocks during a future round, a history of the values of $\eta$ observed during the most recent $W$ rounds (referred to as the averaging window) may be maintained. If $\eta_{avg}$ and
$\sigma$ denote the average and the standard deviation of $\eta$ over the $W$ rounds, respectively, then the time required to retrieve a block in future rounds ($\hat{\eta}$) can be estimated as:

$$\hat{\eta} = \eta_{avg} + \epsilon \cdot \sigma \qquad (4.11)$$
where $\epsilon$ is an empirically determined constant. Clearly, a positive value of $\epsilon$ enables the estimation process to take into account the second moment of the random variable $\eta$, and hence makes the estimate reasonably conservative. Thus, if $\bar{k}_i$ and $\alpha_i$ denote the average number of blocks accessed during a round for stream $S_i$, and the percentage of frames of stream $S_i$ that must be retrieved on time so as to meet the requirements of client $i$, respectively, then the average number of blocks of stream $S_i$ that must be retrieved by the multimedia server during each round can be approximated by $\bar{k}_i \cdot \alpha_i$.
Consequently, given the empirically estimated average access time of a media block from disk, the requirements of tolerant clients will not be violated if:

$$\hat{\eta} \cdot \sum_{i=1}^{n} \bar{k}_i \cdot \alpha_i \le \mathcal{R} \qquad (4.12)$$
This is referred to as the adaptive admission control criterion. Notice that since the estimation of the service time of a round is based on the measured characteristics of the current load on the server, rather than on theoretically derived values, the key function of such an admission control algorithm is to accept enough clients to efficiently utilize the server resources, while not accepting clients whose admission may lead to the violation of service requirements.
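A sketch of this adaptive scheme, maintaining the averaging window and applying Equations (4.11) and (4.12); the class shape and every parameter value are illustrative assumptions.

import statistics
from collections import deque

class AdaptiveAdmission:
    def __init__(self, window=20, epsilon=1.0, round_s=1.5):
        self.history = deque(maxlen=window)  # eta over the last W rounds
        self.epsilon = epsilon
        self.round_s = round_s

    def observe(self, eta):
        """Record the measured per-block retrieval time of a round."""
        self.history.append(eta)

    def admit(self, avg_blocks_on_time):
        """avg_blocks_on_time[i] ~ k_i * alpha_i for each client,
        including the prospective new one (Eq. 4.12)."""
        if len(self.history) < 2:
            return False                     # too few measurements
        eta_hat = (statistics.mean(self.history) +
                   self.epsilon * statistics.stdev(self.history))  # Eq. 4.11
        return eta_hat * sum(avg_blocks_on_time) <= self.round_s

ac = AdaptiveAdmission()
for eta in [0.012, 0.013, 0.011, 0.014, 0.012]:
    ac.observe(eta)
print(ac.admit([4.0] * 25))   # 25 clients at ~4 on-time blocks per round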
The low-end servers are targeted at local area network environments; their clients are personal computers, equipped with video-processing hardware, connected to a LAN. They are designed for applications such as on-site training, information kiosks, etc., and the multimedia files generally consist of short video
information kiosks, etc., and the multimedia files generally consist of short video
clips. An example of such a low-end server is the IBM LANServer Ultimedia
product, which can serve 40 clients at MPEG-1 rates [4]. Other systems in this
class include FluentLinks, ProtoComm, and Starworks [21]. As the computing
power of personal computers increases, the number of clients that these servers
can support will also increase.
6 CONCLUDING REMARKS
Multimedia storage servers differ from conventional storage servers to the extent that significant changes in design must be effected. These changes are wide in scope, influencing everything from the selection of storage hardware to the choice of disk scheduling algorithms. This chapter has provided an overview of the problems involved in multimedia storage server design and of the various approaches to solving these problems.
REFERENCES
[1] Microsoft Unveils Video Software. AP News, May 17, 1994.
[2] Small Computer System Interface (SCSI-II). ANSI Draft Standard
X3T9.2/86-109, November 1991.
[3] D. Anderson, Y. Osawa, and R. Govindan. A File System for Continuous
Media. ACM Transactions on Computer Systems, 10(4):311-337, Novem-
ber 1992.
[4] M. Baugher et al. A multimedia client to the IBM LAN Server. ACM
Multimedia '93, pp. 105-112, August 1993.
[5] E. Chang and A. Zakhor. Scalable Video Placement on Parallel Disk Arrays. In Proceedings of IS&T/SPIE International Symposium on Electronic Imaging: Science and Technology, San Jose, February 1994.
[6] C. Federighi and L.A. Rowe. The Design and Implementation of the UCB Distributed Video On-Demand System. In Proceedings of the IS&T/SPIE 1994 International Symposium on Electronic Imaging: Science and Technology, San Jose, pages 185-197, February 1994.
[9] R. Haskin. The SHARK continuous media file server. Proc. CompCon,
pp. 12-15, 1993.
[10] K. Keeton and R. Katz. The Evaluation of Video Layout Strategies on
a High-Bandwidth File Server. In Proceedings of International Workshop
on Network and Operating System Support for Digital Audio and Video
(NOSSDAV'93), Lancaster, UK, November 1993.
[11] T.D.C. Little, G. Ahanger, R.J. Folz, J.F. Gibbon, F.W. Reeves, D.H.
Schelleng, and D. Venkatesh. A Digital On-Demand Video Service Sup-
porting Content-Based Queries. In Proceedings of the ACM Multimedia '93,
Anaheim, CA, pages 427-436, October 1993.
[19] L.A. Rowe, J. Boreczky, and C. Eads. Indexes for User Access to Large
Video Databases. In Proceedings of the IS&T/SPIE 1994 International
Symposium on Electronic Imaging: Science and Technology, San Jose,
pages 150-161, February 1994.
[20] T. Teorey and T. B. Pinkerton. A Comparative Analysis of Disk Scheduling
Policies. Communications of the ACM, 15(3):177-184, March 1972.
[21] F.A. Tobagi, J. Pang, R. Baird, and M. Gang. Streaming RAID: A Disk
Storage System for Video and Audio Files. In Proceedings of ACM Multi-
media '93, Anaheim, CA, pages 393-400, August 1993.
ABSTRACT
In a typical distributed multimedia application, multimedia data must be compressed,
transmitted over the network to its destination, and decompressed and synchronized
for playout at the receiving site. In addition, a multimedia information system must
allow a user to retrieve, store, and manage a variety of data types including images,
audio, and video. In this chapter we present fundamental concepts and techniques in
the area of multimedia networks. We first analyze the network requirements for trans-
mitting multimedia data and contrast traditional data communications with multime-
dia communications. We then present traditional networks (such as Ethernet, token
ring, FDDI, and ISDN) and show how they can be adapted for multimedia applications.
We also describe the ATM network, which is well suited for transferring multimedia
data. Finally, we discuss the network architectures for current and future information
superhighways.
1 INTRODUCTION
In today's communication market, there are two distinct types of networks:
local-area networks (LANs) and wide-area networks (WANs). LANs run on a
single premises and interconnect desktop and server resources, while WANs are
generally supported by public carrier services or leased private lines that link
geographically separate computing system elements. Figure 1 illustrates net-
work evolution from the 1980s to the present as a function of transmission speed.
Figure 1 Network evolution as a function of transmission speed (bits/s), from public networks at kilobit rates to private networks at gigabit rates.
Traditional LAN environments, in which data sources are locally available, can-
not support access to remote multimedia data sources for a number of reasons.
Table 2 contrasts traditional data transfer and multimedia transfer [Fur94].
Multimedia networks require a very high transfer rate or bandwidth even when
the data is compressed. For example, an MPEG-1 session requires a bandwidth
of about 1.5 Mbps.
Multimedia networks must provide the low latency required for interactive oper-
ation. Since multimedia data must be synchronized when it arrives at the
destination site, networks should provide synchronized transmission with low
jitter.
Figure 2 Channel utilization (logarithmic scale) for continuous media traffic.
2.1 Ethernet
Ethernet (IEEE 802.3 10Base-T standard) is a local-area network running at
10 Mbps. Ethernet uses heavy coaxial cable that forms a single bus or open
path to which all stations are connected, as illustrated in Figure 3.
Figure 3 Ethernet network configuration.
Because stations on Ethernet all share the same medium, each station listens
before sending data, and continues listening while sending, in order to avoid
interfering with other stations. Each station hears the traffic that is broadcast
on the network and copies the data that is addressed to it. If another signal is
heard, the station will either delay its sending, or stop it if sending is already
in progress. This strategy is called "Carrier Sense Multiple Access with Collision
Detection", or CSMA/CD.
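The following is a hedged sketch of these CSMA/CD rules: defer while the carrier is sensed, monitor while transmitting, and back off on collision. It is illustrative pseudologic, not the IEEE 802.3 state machine; all names are made up.

import random

def try_send(carrier_sensed, collision_detected, attempt):
    if carrier_sensed:
        return "defer"                 # listen before sending
    if collision_detected:             # keep listening while sending
        # binary exponential backoff: wait a random number of slot times
        slots = random.randrange(2 ** min(attempt, 10))
        return f"abort and retry after {slots} slot times"
    return "frame sent"

print(try_send(carrier_sensed=False, collision_detected=True, attempt=3))

The random backoff is what makes the transmission delay unpredictable, which is the drawback for multimedia traffic noted throughout this section.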
Isochronous Ethernet
Isochronous Ethernet (IEEE 802.9 standard) is an extension of traditional Eth-
ernet capable of supporting multimedia applications. In addition to the 10 Mbps
packet mode service, isochronous Ethernet offers a 6.144 Mbps circuit switched
isochronous data service. Each node has two channels: a 6 Mbps circuit
switched channel and a 10 Mbps packet switched channel. The circuit switched
channel is suitable for multimedia applications, because it has minimal jitter
and latency. Even though the physical resources are shared, this is equivalent to
two networks, one circuit-switched and another packet-switched.
Switched Ethernet
Switched Ethernet is a modification of traditional Ethernet, in which stations
are connected to the network medium through a communication hub instead of a
conventional tapped connection. Each station is connected to the hub by means
of an Ethernet transmission line. The hub reads the destination address of the
packets on an incoming line and switches them to the appropriate outgoing
line. The hub also contains a buffer for the case when the outgoing line is not free.
This connection is equivalent to having a dedicated 10 Mbps connection per
station. Since the connections are switched, each station can communicate as
if it were the only station on the network. In this way, switched Ethernet
provides the bandwidth needed for multimedia applications.
Fast Ethernet
Another approach to making Ethernet more suitable for multimedia traffic is to
increase its bandwidth to 100 Mbps (IEEE 802.3 100Base-T standard). Fast
Ethernet uses the same CSMA/CD access protocol, and therefore has the same
drawbacks as standard 10 Mbps Ethernet: unpredictable transmission delay
and latency. However, due to its improved bandwidth, fast Ethernet can support
a much larger number of multimedia streams than standard Ethernet, and there-
fore can be used for small or medium configurations of multimedia stations
[Stu95].
Priority Ethernet
Providing higher priorities to stations transferring multimedia data is another
way of adapting Ethernet for multimedia applications [AS94]. In the prior-
ity Ethernet network, nodes are assigned priorities and each node maintains a
priority queue. The priority queue consists of node names and the priorities of
the data they have to transmit. The priority information is exchanged during the
priority exchange phase. There are three ways of exchanging priority information:
synchronous, modified synchronous, and asynchronous.
2.2 Token Ring
Token ring is deterministic: the worst case network delay can be determined.
If there are n stations connected to a ring, the worst case delay is equal to
(n - 1) x longest_packet_transmission_time.
Since the delay is deterministic, token ring can be used for multimedia applica-
tions. Due to its relatively low bandwidth (16 Mbps), one way of adapting token
ring for multimedia applications is to use configuration control. In this method,
a restriction is placed on the number of stations that can be connected
to a ring. For example, a 16 Mbps ring can support up to 8 stations with an
average bandwidth of 2 Mbps per station. This solution is not practical, since
no more than 8 multimedia stations can be connected to a ring.
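A quick worked example of the two calculations above. The ring rate and station count come from the text; the longest packet size is an assumption chosen only to make the arithmetic concrete.

ring_rate = 16_000_000                  # 16 Mbps token ring
longest_packet = 16_000                 # assumed longest packet, in bits
n = 8                                   # stations on the ring

tx_time = longest_packet / ring_rate                 # seconds per packet
print("worst-case delay: %.1f ms" % ((n - 1) * tx_time * 1000))

# configuration control: stations supportable at 2 Mbps average each
print("max stations:", ring_rate // 2_000_000)       # -> 8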
Priority token ring instead assigns higher access priority to multimedia traffic
(e.g., one priority level for video and another for audio). Priority token ring
works on existing networks and does not require configuration control [FM95].
Crimmins [Cri93] evaluated three priority ring schemes for their applicability to
multimedia applications: (1) equal priority for video and asynchronous packets,
(2) permanent high priority for video packets and permanent low priority for
asynchronous packets, and (3) time-adjusted high priority for video packets
(based on their ages) and permanent low priority for asynchronous packets.
The first scheme, which entails direct competition between video conference
and asynchronous stations, achieves the lowest network delay for asynchronous
traffic. However, it reduces the video conference quality. The second scheme,
in which video conference stations have permanent high priority, produces
no degradation in conference quality, but increases the asynchronous network
delay. Finally, the time-adjusted priority system provides a trade-off between
the first two schemes. The conference quality of this scheme is better than in
the first scheme, while the asynchronous network delays are shorter than in the
second scheme [Cri93].
2.3 FDDI
The Fiber Distributed Data Interface (ANSI X3T9.5 standard) provides 100
Mbps bandwidth, which may be sufficient for multimedia applications [Ros86,
Jai93]. The FDDI topology consists of two counter-rotating independent fiber
optic rings, as shown in Figure 5. The ring can have a perimeter of up to 100 km.
Stations can be connected to both rings (class A stations), or to only one ring
(class B stations).
Figure 5 FDDI topology: two counter-rotating rings with class A and class B stations.
FDDI supports both synchronous and asynchronous traffic; the asynchronous
mode serves conventional applications such as File Transfer Protocol and mail.
One of the main characteristics of FDDI is its fault recovery mechanism: when
a component in the network fails, the other components can reorganize and
continue to work.
In the synchronous mode, FDDI has low access latency and low jitter. FDDI
also guarantees a bounded access delay and a predictable average bandwidth
for synchronous traffic. However, due to its high cost, FDDI networks are used
primarily as backbone networks, rather than as networks of workstations.
2.4 ISDN
Integrated Services Digital Network (ISDN) is an access and signaling extension
of the basic technology of the public switched telephone network, whose main
purpose is to support non-voice communications [Lea88, Tha93]. The local
loop connection between subscriber and switch is made in digital form, with
multiple multiplexed information channels supported per access line.
Present optical network technology can support the Broadband Integrated Ser-
vices Digital Network (B-ISDN) standard, expected to become the key network
for multimedia applications [Cla92, KJ91, Sak93]. ISDN access can be basic
or primary. Basic ISDN access supports 2B + D channels, where the transfer
rate of a B channel is 64 Kbps, and that of a D channel is 16 Kbps. Primary
ISDN access supports 23B + D channels in the US (1.544 Mbps), and 30B + D
channels in Europe (2.048 Mbps).
The two B channels of the ISDN basic access provide 2 x 64 Kbps, or 128
Kbps, of composite bandwidth. Three types of connections can be set up over a
B channel: (a) circuit switched, (b) packet switched, and (c) semipermanent.
The semipermanent connection is set up by prior arrangement and is equivalent
to a leased line. The D channel is used for common channel signaling.
ISDN is well suited for high-rate applications, which include both data
applications and videoconferencing. Videoconferencing applications
can use part of the ISDN capacity for wideband speech, saving the remainder for
purposes such as control, meeting data, and compressed video. Figure 6 shows
the composition of two B channels for multimedia applications [Cla92].
Figure 6 Three compositions of the two B channels: audio and H.261 video plus basic control; audio and video plus control and low-speed data; and audio plus control and high-speed data.
The ATM network provides a number of benefits for multimedia communica-
tions. It can carry integrated traffic because it uses small fixed-size
cells, while traditional networks use variable-length packets, which can be sev-
eral kilobytes in size. The ATM network uses a connection-oriented technology,
which means that before data traffic can flow between two points, a connection
must be established between these end points using a signaling protocol. An ATM
network architecture is shown in Figure 7.
Figure 7 ATM network architecture.
The two major types of interfaces in ATM networks are the User-to-Network In-
terface (UNI), and the Network-to-Network Interface, or Network-to-Node In-
terface (NNI).
Figure 8 ATM cell format: a 5-byte header containing the GFC, VPI, VCI, PT, CLP, and HEC fields, followed by a 48-byte payload.
The Generic Flow Control (GFC) field is used for congestion control at the User-
to-Network Interface to avoid overloading. The Virtual Path Identifier/Virtual
Channel Identifier (VPI/VCI) fields contain the routing information of the cell.
The Payload Type (PT) field indicates the type of information carried by the cell.
The CLP field indicates the cell loss priority, i.e., whether a cell can be dropped
in case of congestion. The Header Error Control (HEC) field is used to detect
and correct errors in the header. ATM does not have an error correction
mechanism for the payload.
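As a hedged illustration, the sketch below decodes the five header bytes into the fields just described (GFC 4 bits, VPI 8, VCI 16, PT 3, CLP 1, HEC 8, following the standard UNI packing). The function name and sample bytes are made up.

def parse_uni_header(h: bytes) -> dict:
    assert len(h) == 5
    return {
        "GFC": h[0] >> 4,
        "VPI": ((h[0] & 0x0F) << 4) | (h[1] >> 4),
        "VCI": ((h[1] & 0x0F) << 12) | (h[2] << 4) | (h[3] >> 4),
        "PT":  (h[3] >> 1) & 0x07,
        "CLP": h[3] & 0x01,       # 1 = cell may be dropped under congestion
        "HEC": h[4],              # header error control over the first 4 bytes
    }

print(parse_uni_header(bytes([0x12, 0x34, 0x56, 0x79, 0xAB])))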
ATM connections are made of Virtual Channel (VC) and Virtual Path (VP)
connections, which can be either point-to-point or point-to-multipoint.
The basic type of connection in an ATM network is the VC, a logical connection
between two switching points. A VP is a group of VCs with the same VPI value.
The VP and VC switches are shown in Figure 9.
Figure 9 Virtual connections in ATM: (a) Virtual Path (VP) switch, and (b)
Virtual Channel (VC) switch.
• Node A sends the cell to node B. B changes VC1 to VC2 and
sends VC2 to node C.
• At node C, VC2 is associated with VC3 and sent to node D.
• At node D, VC3 is associated with VC4. Node D checks whether the
UNI at the terminal node N2 is free. If the UNI is free, the cell with the
label VC4 is given to N2.
• The terminal node N2 now uses VC4 for its connection to node D.
• D sends this cell to C, which associates VC4 with VC3, and sends it to B.
B associates VC3 with VC2 and sends it to A. A associates VC2 with VC1
and delivers the cell to the terminal node N1.
• The connection between terminal nodes N1 and N2 is thus established.
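The forward path of this label-swapping walkthrough can be sketched as a set of per-switch tables mapping (incoming node, incoming label) to (outgoing node, new label). The table contents mirror the steps above; the data structure itself is illustrative.

tables = {
    "B": {("A", "VC1"): ("C", "VC2")},
    "C": {("B", "VC2"): ("D", "VC3")},
    "D": {("C", "VC3"): ("N2", "VC4")},
}

port, label = "A", "VC1"
for node in ["B", "C", "D"]:
    next_port, label = tables[node][(port, label)]
    print(f"{node}: forward to {next_port} with label {label}")
    port = node                    # the cell next arrives from this node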
During the call setup phase, a user negotiates network resources such as the peak
data rate, and parameters such as the cell loss rate, cell delay, and cell delay
variation. These parameters are called the Quality of Service (QOS) parame-
ters. A connection is established only if the network guarantees to support the
QOS requested by the user. A resource allocation mechanism for establishing
connections is shown in Figure 11 [SVH91]. Once the QOS parameters for a
connection are set up, both the user and the network stick to the agreement.
Figure 11 Resource allocation mechanism for establishing connections: a call request passes through call-level and connection-level controls, including a congestion control phase and a search over alternate paths, before the connection is established.
Even when every user/terminal honors the QOS parameters, congestion may
occur. The main cause of congestion is statistical multiplexing of bursty con-
nections. Two modes of traffic operation are defined for ATM: (a) statistical
and (b) non-statistical multiplexing [WK90]. In the general ATM node model
shown in Figure 12, if the sum of the peak rates of the input links does not
exceed the output link rate (sum of Pi <= L), the mode of operation is called
non-statistical multiplexing; if the sum of Pi exceeds L, it is called statistical
multiplexing. During connection setup, each connection is allocated the average
data rate of its channel instead of the peak data rate. Several such channels are
multiplexed in the expectation that not all channels burst at the same time. If
several connections burst simultaneously, congestion might occur.
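The mode test above is a single comparison; the sketch below states it directly, with illustrative rates in Mbps.

def multiplexing_mode(peak_rates, link_rate):
    # non-statistical if the peaks can never oversubscribe the output link
    return "non-statistical" if sum(peak_rates) <= link_rate else "statistical"

print(multiplexing_mode([40, 30, 50], 155))   # non-statistical
print(multiplexing_mode([80, 60, 70], 155))   # statistical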
Figure 12 General ATM node model: N input ports with peak rates P1, ..., PN feeding an output link of rate L.
The management plane is divided into Layer Management, which manages each
of the ATM layers, and Plane Management, for the management of all the other
planes.
Figure 13 The ATM protocol reference model: the management plane and the user plane above the ATM layer and the physical layer.
There are five AAL protocol types, each designed to optimally
carry one of the four classes of traffic, as illustrated in Figure 14.
Figure 14 AAL types (1, 2, 3/4, and 5) and the connection modes and traffic classes they support.
3.4 SONET
The Synchronous Optical Network (SONET) is often associated with ATM.
SONET is a set of physical layer standards, originally proposed by Bellcore in
the mid-1980s for optical fiber based transmission line equipment for
the telephone system. SONET defines a set of framing standards which dictate
how data is transmitted across links, together with ways of multiplexing existing
transmission line frames (e.g., DS-1, DS-3) into SONET.
The lowest SONET frame rate, known as STS-1, defines an 8 KHz frame con-
sisting of 9 rows of 90 bytes each. Its rate is 51.84 Mbps. The next higher
rate is STS-3, at 155.52 Mbps, and so on. STS-24 gives a rate of
1.244 Gbps. Table 3 shows various SONET frame rates.
Table 3 SONET frame rates and the corresponding optical carrier (OC) levels.
Corresponding to each frame rate are optical fiber medium standards. STS-1
corresponds to the OC-1 fiber standard, STS-3 corresponds to OC-3, and so on
(see Table 3). These standards define fiber types and optical power
levels. The SONET standards were designed to scale to the very high speeds
required for ATM.
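Since STS-n rates scale linearly from the 51.84 Mbps STS-1 base rate, the rates quoted above can be checked with a one-line loop (the particular levels printed are just examples):

STS1_MBPS = 51.84                       # STS-1 / OC-1 base rate
for n in (1, 3, 12, 24, 48):
    print(f"STS-{n} / OC-{n}: {n * STS1_MBPS:,.2f} Mbps")
# STS-3 -> 155.52 Mbps; STS-24 -> 1,244.16 Mbps (about 1.244 Gbps)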
4 SUMMARY OF NETWORK
CHARACTERISTICS
Table 4, adapted from [Stu95], evaluates key characteristics of the networks de-
scribed in Sections 2 and 3. It can be concluded that several networks provide
support for multimedia traffic; however, the ATM network is superior due to its
high bandwidth and low transmission delay.
In the next section, we compare the ATM technology with several other switch-
ing technologies (such as STM, SMDS, and Frame Relay) for multimedia ap-
plications.
5 COMPARISON OF SWITCHING
TECHNOLOGIES FOR MULTIMEDIA
COMMUNICATIONS
Besides the ATM technology, other switching technologies considered for mul-
timedia communications include:
• Synchronous Transfer Mode (STM)
• Switched Multimegabit Data Service (SMDS)
• Frame Relay.
Table 5 compares the properties of these four switching technologies for multi-
media communications [RKK94]. The properties compared include support for
multimedia traffic, connectivity, performance guarantees, bandwidth, support
for media synchronization, and congestion control.
6 INFORMATION SUPERHIGHWAYS
Information superhighways are large consumer information networks that al-
low millions of users to communicate, interact among themselves, and receive
various services. Today there are very few networks that can be classified as
information superhighways: the plain old telephone network, the worldwide
Internet, and Teletel, the French videotext network.
6.1 Internet
Internet is a loose connection of thousands of networks, and an estimated num-
ber of users on Internet is today about 30 million. Internet was developed
by researchers, and there is no global network administration. The following
devices provide the internetworking: repeaters. bridges, routers, and gateways.
Repeaters provide the cheapest and simplest solution for interconnecting LANs.
A repeater links identical LANs, for example two Ethernets, by simply
amplifying the signal received on one cable segment and retransmitting the same
signal onto another cable segment.
Routers, like bridges, can effectively extend the size of a network. They pro-
vide an even more intelligent solution than bridges. Routers can be connected to
each other via private leased lines, or can be connected to a switch (e.g., an SMDS
switch). The Internet routers serve as intermediate store-and-forward devices
which relay messages from source to destination.
Gateways provide the most intelligent, but slowest, connection service. They
provide translation services between different computer protocols, such as SNA,
TCP/IP, and DECnet. Gateways allow devices on a network to communicate
with devices on a dissimilar network.
Figure 15 shows how six LANs (both Ethernet and token ring networks) can be
linked together by five routers, and how three Ethernet LANs can be connected
by two bridges.
A major cultural change began in 1992, with the advent of the World Wide
Web (WWW), and today the Internet can be regarded as the major information
superhighway. There are about 6.6 million Internet host computers in 106
countries, with 100 million Internet hosts projected by the end of this decade.
The Internet is growing at an average rate of over 40 percent in most regions.
However, the Internet has many drawbacks, including poor support for video
transmission. Due to high load, the Internet often experiences slowed perfor-
mance. Users trying to access sites on the WWW and other services are often
frustrated by long connect times or lost data.
Figure 15 Interconnecting LANs with routers and bridges.
7 CONCLUSION
In this chapter, we presented basic concepts in multimedia networking and
communication. It is clear that current local area networks, such as Ethernet
and token ring, provide neither the bandwidth nor the transmission times
needed for multimedia applications. Therefore, in order to protect investments
in existing networks, they have been adapted to support multimedia traffic.
Adapted traditional networks, such as switched and fast Ethernet and priority
token ring, are capable of transmitting multimedia data; however, they still
have many limitations and drawbacks.
New networking technologies, such as ATM, are much better suited to support
multimedia traffic. They provide both the high bandwidth and the low, predictable
transmission delay needed for multimedia applications. Future information su-
perhighways, with the ambitious goal of transporting multimedia data throughout
the world, will be built using these new technologies.
There are still many challenges in the field of multimedia communication and
networking. Once multimedia networks are widely established, the Quality of
Service requirements of multimedia applications will become evident and exist-
ing service models must be refined [Lie95]. Future research will concentrate
on various aspects of multimedia communications, including: (a) developing ef-
ficient network management techniques necessary for high-speed networks, (b)
developing models and techniques for traffic management, (c) developing ad-
mission control algorithms to guarantee QOS requirements, and (d) developing
intelligent network switches to satisfy these QOS requirements. Network
traffic is becoming more complex, and therefore a great challenge is to
develop new models for traffic characterization, particularly for compressed
video sources [Lie95].
REFERENCES
[AE92] S. R. Ahuja and J. R. Ensor, "Coordination and Control of Multime-
dia Conferencing". IEEE Communications Magazine, Vol. 30, No.5, May
1992, pp. 38-43.
[AS94] F. Adlestein and M. Singhal, "Priority Ethernets: Multimedia Support
on Local Area Networks", Proceedings of the Int. Conference on Distributed
Multimedia Systems and Applications, Honolulu, Hawaii, August 1994, pp.
45-48.
[B+95] D.E. Blahut, T.E. Nichols, W.M. Schell, G.A. Story, and E.S.
Szurkowski, "Interactive Television", Proceedings of the IEEE, Vol. 83,
No.7, July 1995, pp. 1071-1088.
[Bou92] J. Y. L. Boudec, "The ATM: A Tutorial", Computer Networks and
ISDN Systems, Vol. 24, 1992, pp. 279-309.
[Jai93] R. Jain, "FDDI: Current Issues and Future Plans", IEEE Communica-
tions Magazine, September 1993, pp. 98-105.
[KJ91] M. Kawarasaki and B. Jabbari, "B-ISDN Architecture and Protocol",
IEEE Journal on Selected Areas in Communications, Vol. 9, No.9, De-
cember 1991, pp. 1405-1415.
[RKK94] R.R. Roy, A.K. Kuthyar, and V. Katkar, "An Analysis of Universal
Multimedia Switching Architectures", AT&T Technical Journal, Vol. 73,
No.6, November/December 1994, pp. 81-92.
1 INTRODUCTION
Multimedia information comprises different media streams such as text, im-
ages, audio, and video. The presentation of multimedia information to the
user involves spatial organization, temporal organization, delivery of the com-
ponents composing the multimedia objects, and allowing the user to interact
with the presentation sequence. The presentation of multimedia information
can be either live or orchestrated. In a live presentation, multimedia objects are
acquired in real-time from devices such as video cameras and microphones. In
an orchestrated presentation, the multimedia objects are typically acquired from
stored databases. The presentation of objects in the various media streams has
to be ordered in time. Multimedia synchronization refers to the task of coordi-
nating this ordering of the presentation of the various objects in the time domain.
In a live multimedia presentation, the ordering of objects in the time domain is
implied and dynamically formulated. In an orchestrated presentation, this
ordering is explicitly formulated and stored along with the multimedia objects.
Figure 1 Time instants t0, t1, t2, ..., tn on a timeline.
A time instant is a specific moment in time, whereas a time interval has a
duration associated with it. A time interval can be defined as an ordered pair
of instants with the first instant less than the second [5]. Time intervals can be
formally defined as follows [13].
Let [S, ≤] be a partially ordered set, and let a, b be any two elements of S such
that a ≤ b. The set {x | a ≤ x ≤ b} is called an interval of S, denoted by [a, b].
Temporal relations can then be defined based on the start and end time instants
of the involved intervals. Given any two time intervals, there are thirteen ways
in which they can relate in time [5]: they may overlap, meet, precede, etc.
Figure 2 shows a timeline representation of the temporal relations. Six of the
thirteen relations are inverses of the seven relations shown in Figure 2.
Figure 2 Timeline representation of the temporal relations: (i) a before b, (ii) a meets b, (iii) a overlaps b, (iv) b finishes a, (v) a starts b, (vi) b during a, and (vii) a equals b.
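The seven relations of Figure 2 can be decided from interval endpoints alone. The sketch below classifies two (start, end) pairs with start < end; the function name and test intervals are illustrative.

def relation(a, b):
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:              return "a before b"
    if a2 == b1:             return "a meets b"
    if (a1, a2) == (b1, b2): return "a equals b"
    if a1 == b1 and a2 < b2: return "a starts b"
    if a1 < b1 and a2 == b2: return "b finishes a"
    if a1 < b1 and b2 < a2:  return "b during a"
    if a1 < b1 < a2 < b2:    return "a overlaps b"
    return "an inverse of one of the seven"   # the other six relations

print(relation((0, 2), (2, 5)))   # a meets b
print(relation((0, 5), (1, 3)))   # b during a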
2 LANGUAGE BASED
SYNCHRONIZATION MODEL
Language based synchronization models view multimedia synchronization in
the context of operating systems and concurrent programming languages. Based
on an object model for a multimedia presentation, the synchronization charac-
teristics of the presentation are described using concurrent programming lan-
guages. As discussed in [9], the basic synchronization characteristics of a multi-
media presentation, viewed from the point of view of programming languages,
are the number of objects involved, the blocking behavior of the objects, the
combination and ordering of synchronization events, and real-time features.
The above characteristics influence the basic features required of a concur-
rent programming language in order to describe a multimedia presentation. For
example, the number of involved objects determines the communication among
the multimedia objects that must be described by the language. The blocking
behavior of a multimedia object describes the nature of the synchronization mech-
anism that is required, viz., blocked mode, non-blocked mode, and restricted
blocked mode. The non-blocked mode never forces synchronization but
allows objects to exchange messages at specified states of execution. Blocked
mode forces an object to wait till the cooperating object(s) reach the specified
state(s). Restricted blocked mode is a concept introduced in [9]. In the context
of multimedia presentation, restricted blocking allows an activity to be repeated
by an object till synchronization is achieved with the cooperating object(s).
The activity can be presenting the last displayed image, for an image object, or
playing out pleasant music, in the case of an audio object. Restricted blocking
basically allows the multimedia presentation to be carried out in a more user-
friendly manner.
It may also be necessary to allow more than two objects to combine many
synchronizing events. This is especially true in presentations where many
streams of information are presented concurrently. A programming language
should be able to describe this scenario. The ordering of synchronization events
becomes necessary when two or more synchronization events are combined by
some kind of relation. The ordering can be pre-defined, prioritized, or based on
certain conditions. As multimedia presentations deal with data streams conveying
information to be passed in real-time, synchronization of the events composing
the presentation incorporates real-time features.
• Timeave is the ideal delay between synchronization events, i.e., the syn-
chronization should be as close to timeave as possible, though it might
occur before or after timeave. The usual value of the timeave parameter is
also "0".
Table 1 Temporal expressions and their meanings.

Expression                                       Meaning
timemin(...)                                     Minimum acceptable delay between two synchronizing events.
timeave(...)                                     Ideal delay.
timemax(...)                                     Maximum acceptable delay.
timemin(...) AND timemax(...)                    Synchronization between timemin and timemax.
timemin(...) AND timeave(...)                    Synchronization closer to timeave, never before timemin.
timemax(...) AND timeave(...)                    Synchronization closer to timeave, never after timemax.
timemin(...) AND timemax(...) AND timeave(...)   Synchronization closer to timeave, between timemin and timemax.
The proposal in [9] also allows the above temporal operands to be
combined. The operands timemin and timemax are not commutative, thereby
allowing specification of a different delay depending on which multimedia ob-
ject executes its synchronizing operation first. The temporal operands can be
combined using the logical operator AND. Table 1 describes the possible temporal
expressions and their associated meanings.
Figure 3 Synchronization between two objects: the image object displays an image and waits for synchronization, the cooperating object signals that it is ready, and both then proceed further.
ahead of the image object. However, the target is no delay (with timemin being
0).
Extensions to the facilities offered by programming languages are needed for
describing certain features of multimedia presentation, such as real-time specifi-
cation and restricted blocking. However, language based specifications can become
complex when a multimedia presentation involves multiple concurrent activities
that synchronize at arbitrary time instants. Also, modifying the presentation
characteristics might even involve rewriting the program-based specification.
3 PETRI NETS BASED MODELS
A Petri net is defined as N = { T, P, A }, where T is a set of transitions, P is a
set of places, and A is a set of arcs.
The arcs represented by A describe the pre- and post-relations for places and
transitions. The set of places that are incident on a transition t is termed the
input places of the transition. The set of places that follow a transition t is
termed the output places of the transition. A Marked Petri Net is one in which
a set of tokens is assigned to the places of the net. Tokens, represented by
small dots inside the places, are used to define the execution of the Petri net,
and their number and position change during execution. The marking (M) of a
Petri net is a function from the set of places P to the set of nonnegative integers
I, M : P -> I. A marking is generally written as a vector (m1, m2, ..., mn) in
which mi = M(pi). Each integer in a marking indicates the number of tokens
residing in the corresponding place (say pi).
A transition t is enabled if each of its input places contains at least one token,
i.e., ∀ p ∈ InputPlace(t) : M(p) ≥ 1.
Firing t consists of removing one token from each of its input places and adding
one token to each of its output places; this operation defines a new marking.
An execution of a Petri net is described by a sequence of markings, beginning
with an initial marking M0 and ending in some final marking Mf. The marking of
a Petri net defines the state of the net. During execution, the Petri net moves
from a marked state Mi to another state Mj by firing any transition t that is
enabled in the state Mi.
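The enabling and firing rules just described translate directly into code. The sketch below is a minimal illustration; the net, place, and transition names are made up.

def enabled(t, marking, inputs):
    return all(marking[p] >= 1 for p in inputs[t])

def fire(t, marking, inputs, outputs):
    assert enabled(t, marking, inputs)
    m = dict(marking)                     # firing defines a new marking
    for p in inputs[t]:
        m[p] -= 1                         # remove a token from each input place
    for p in outputs[t]:
        m[p] += 1                         # add a token to each output place
    return m

inputs  = {"t1": ["p1", "p2"]}            # places incident on t1
outputs = {"t1": ["p3"]}                  # places following t1
m0 = {"p1": 1, "p2": 1, "p3": 0}          # initial marking M0
print(fire("t1", m0, inputs, outputs))    # {'p1': 0, 'p2': 0, 'p3': 1}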
For the purpose of modeling time-driven systems, the notion of time was in-
troduced into Petri nets, giving Timed Petri Nets (TPN) [4]. In TPN
models, the basic Petri net model is augmented by attaching an execution time
variable to each node in the net. A node in the Petri net can be either a
place or a transition. In a marked TPN, a set of tokens is assigned to places.
A TPN N is defined as N = { T, P, A, D, M }, where D maps the nodes of the
net to their execution time durations and M is the initial marking.
A transition is enabled for execution iff each of its input places contains at least
one token. The firing of a transition causes a token to be held in a locked state
for a specified time interval in the output place of the transition, if the time
intervals are assigned to places. If time intervals are assigned to transitions, the
token is in a transition state for the assigned duration. The execution of a Petri
net process might have to be interrupted in order to carry out another, higher
priority activity. Pre-emption of the on-going execution of a Petri net process
can be modeled using escape arcs to interrupt the execution of a process [6]. In
this section, we examine Petri net based models for the purpose of describing
the synchronization characteristics of the multimedia components.
where C is a set of contents, W is a set of windows, and B is a set of buttons.
In the above definition, the structure of the timed Petri net specifies the struc-
ture of the hypertext document. A marking in the hypertext hence represents
the possible paths through a hyperdocument from the browsing point it rep-
resents. The initial marking of a synchronous hypertext therefore describes a
particular browsing pattern. The definition also includes several sets of com-
ponents (contents, windows, and buttons) to be presented to the user going
through the document. Two collections of mappings or projections are also
defined: one from the Petri net to the user components and another from the
user components to the display mechanism.
The content elements from the set C can be text, graphics, still images, motion
video, audio information, or another hypertext. A button is an action selected
from the set B. A window is defined as a logically distinct locus of information
and is selected from the set W. Pl, the logical projection, provides a mapping
from the components of a Petri net (places and transitions) to the human-
consumable portions of a hypertext (contents, windows, and buttons). A content
element from the set C, and a window element for the abstract display of the
content from the set W, are mapped to each place in the Petri net. A logical
button from the set B is mapped to each transition in the Petri net. Pd,
the display projection, provides a mapping from the logical components of
the hypertext to the display mechanism.
• All the enabled transitions are identified. The enabled transitions set con-
sists of both timed out ones (i.e., have been enabled for their maximum
latency) and active ones (i.e., have not timed out).
• One transition from the group of active ones and a maximal subset from
the group of timed-out transitions will be chosen for firing.
The implication of the firing rules is that the document content elements
are displayed as soon as tokens arrive in places. The display is maintained for a
particular time interval before the next set of outgoing transitions is enabled.
After a transition t becomes enabled, its logical button is made visible on the
screen after a specified period of time. If the button remains enabled without
being selected by the reader, the transition t will fire automatically at its point
of maximum latency. The way the hypertext Petri net is structured, along with
its timings, describes the browsing actions that can be carried out. In effect,
windows displaying document contents can be created and destroyed without
explicit reader actions, control buttons can appear and disappear after periods
of inactivity, and at the same time user interactive applications can be created
with a set of active nodes. Petri nets being a concurrent modeling tool, multiple
tokens are allowed to exist in the net. These tokens can be used to effect the
concurrent display of multiple sets of windows.
We shall consider the guided tour example discussed in [7]. In a guided tour,
a set of related display windows is created by an author. All the windows in
a set are displayed concurrently. A tour is constructed by linking such sets
to form a directed graph. The graph can be cyclic as well. From any one set
of display windows, there may be several alternative paths. Figure 4 shows
a representation of a guided tour using the Trellis hypertext. Here, the set of
windows to be displayed at any instant of time is described by a Petri net
place. A place is connected by a transition to as many places as there are sets
of windows to be displayed. A token is placed in the place(s) representing the
first set of windows to be displayed. The actual path of browsing is determined
by the user going through the information. For example, when the token is in
p1, the information contents associated with p1 are displayed and the buttons
for the transitions t1 and t2 are selectable by the user.
User operations such as freeze and restart, scaling the speed of presentation (fast
forward or slow motion playout), and scaling the spatial requirements cannot be
modeled. Another aspect of the Trellis model is that the user can interact with the
hypertext only when the application allows him to do so (i.e., when the buttons
can be selected by the user). Random user behavior, where one can possibly
initiate operations such as skip or reverse presentation at any point in time,
is not considered in the Trellis hypertext.
An OCPN is defined as OCPN = { T, P, A, D, R, M }, where
R : P -> { r1, r2, ... } represents the mapping from the set of places to a set of
resources.
The execution of the OCPN is similar to that of TPNs, where transition
firing is assumed to occur instantaneously and the places are assumed to have
states. The firing rules for the OCPN are as follows:
1. A transition ti fires immediately when each of its input places contains an
unlocked token.
2. Upon firing the transition ti, a token is removed from each of its input
places and a token is added to each of its output places.
3. A place pi remains in an active state for a specified duration τi associated
with the place, after receiving a token. The token is considered to be in a
locked state for this duration. After the duration τi, the place pi becomes
inactive and the token becomes unlocked.
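A hedged sketch of these rules: tokens stay locked for the place duration τ, and the next transition fires only once every input token is unlocked. The staged net below (audio and image in parallel, then a caption) is an invented example.

def play(stages, tau, t=0.0):
    """stages: lists of places whose tokens become locked together."""
    for places in stages:
        for p in places:
            print(f"{p}: active from {t} to {t + tau[p]}")
        t += max(tau[p] for p in places)  # next transition fires only when
                                          # every input token is unlocked
    return t

tau = {"audio": 10, "image": 10, "caption": 4}
print("ends at t =", play([["audio", "image"], ["caption"]], tau))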
It has been shown in [8, 12] that the OCPN can represent all thirteen possible
temporal relationships between any two object presentations.
Petri nets have a hierarchical modeling property which states that subnets of
a Petri net can be replaced by equivalent abstract places, and this property is
applicable to the OCPN as well. Using subnet replacement, an arbitrarily
complex process model composed of temporal relations can be constructed with
the OCPN by choosing pairwise temporal relationships between process enti-
ties [8]. OCPNs are also provably deterministic, since no conflicts are
modeled, transitions are instantaneous, and tokens remain at places for known,
finite durations. OCPNs are also demonstrated to be live and safe, following
the Petri net definitions of these properties.
Figure 5 An example multimedia presentation with objects presented between time instants t0 and t3.
The OCPN model ignores the spatial considerations required for the composition
of multimedia objects. The spatial characteristics can be assigned to the resource
component of the OCPN model. Modeling the occurrence of spatial clashes
(when two processes require the same window on the screen), however, can-
not be done using the OCPN. Also, the OCPN model does not provide many
facilities for describing user inputs that modify the presentation sequence. For
instance, the user's wish to stop a presentation, reverse it, or skip a few frames
cannot be specified in the existing OCPN architecture. Also, user inputs for
freezing and resuming an on-going presentation, or for scaling the speed or spatial
requirements of a presentation, cannot be described by the OCPN model.
1. Deference of execution:
For the operation of pre-emption with deference of execution, the remain-
ing duration associated with the pre-empted Petri net place is changed
considering the time spent till its pre-emption.
2. Termination of execution:
The operation of pre-emption with termination is like premature ending
of the execution.
3. Temporary modification of remaining time duration:
The operation of pre-emption with modification of execution time is like
'setting' the time duration associated with the pre-empted Petri net place
to a 'new value', as appropriately determined by the type of user input.
For temporary modification of execution time, the remaining time duration
associated with the place is modified.
4. Permanent modification of execution time duration:
For permanent modification, the execution time duration associated with
the place is modified.
DTPN Structure
In the DTPN model, pre-emption is modeled using escape arcs. Escape
arcs are marked by dots instead of arrow heads, and they can interrupt an
active Petri net place. Modification of the execution time duration (temporary or
permanent) associated with a place is modeled by modifier arcs. Modifier arcs
are denoted by double-lined arcs with arrow heads. A Dynamic Timed Petri
Net is defined as
DTPN = { T, P, A, D, M, E, C, Cm }, where E is the set of escape arcs and C
and Cm describe the modifier arcs.
Figure: DTPN notation, showing escape arcs, modifier arcs, and modifier transitions.
The DTPN construction for describing the handling of user inputs, such as
reverse presentation, freeze, and restart, for a single object presentation is the
simplest case. These DTPN constructions for handling user inputs on single
object presentations can be used in a full multimedia presentation.
Figure: DTPN constructions for pre-emption and termination (P & T), pre-emption and deference (P & D), and pre-emption and temporary modification (P & TM).
Summary: The DTPN model, its structure, and the associated execu-
tion rules can be adopted by the OCPN, where resource utilizations are also
specified. This augmented model can describe multimedia presentations with
dynamic user participation.
1. A{B}: Interval(s) in B will start after all the intervals in A are finished.
2. <A>B: Interval(s) in B will start when one of the intervals (the first)
in A is finished.
1. X(d)Y(d)Z: All the other objects are displayed during the presentation of
the object with the longest presentation duration.
2. X(e)Y(e)Z: The display of all the objects is started simultaneously. The
presentation of all the other objects will be cut off when one of the objects
(the first) is finished.
For describing multimedia presentations, the model of interval vectors and the
involved temporal information are maintained in a Time Flow Graph (TFG). In
a TFG, intervals are described by nodes. A TFG is defined by the tuple TFG =
{ ΔN, Nt, Ed }, where ΔN is a set of nodes for the interval vectors, Nt is a set of
transit nodes, and Ed is a set of directed edges. The model Δx of an interval
vector nx is composed of an interval node Nx, representing the interval of nx,
and δ node(s), representing its parallel relations to other intervals. Nx may be
associated with none, one, or two δ nodes. An intermission node has no δ nodes.
The sequential specifications {A}B and <A>B are represented by the transit
nodes in Nt, the sq-node and the tri-node respectively.
Summary: The TFG model for multimedia synchronization can handle rel-
ative and imprecise temporal requirements in teleorchestra applications. The
advantage of the TFG model is that no accurate temporal information, such as
duration values or occurrence points, is required. The TFG can represent all
the possible temporal relations involved in a multimedia presentation scenario.
Compared with the TFG model, the earlier Petri net based models rely on
values for the duration of presentation in formulating the temporal specifica-
tion; hence, the Petri net based models cannot represent relative synchroniza-
tion requirements. However, the TFG model does not address the issue of
dynamic user participation during a multimedia presentation.
5 CONTENT-BASED INTER-MEDIA
SYNCHRONIZATION
Content-based inter-media synchronization is based on the semantic structure
of the multimedia data objects [21]. Here, a media stream is viewed as a hierar-
chical composition of smaller media objects which are logically structured based
on their contents. The temporal relationships are then established among the
logical units of the media objects. The logical units of a media stream are the
segments, events, and shots composing it.
Figure 12 Logical composition of a lecture on Distributed Multimedia Systems: the composite object is divided into segments, events, and shots, with leaf objects covering throughput requirements, delay requirements, QoS negotiation, multicasting, synchronization, ATM, FDDI, and Frame Relay.
1. Do a depth-first search on each subtree whose root node is a direct child of
the ROOT node and create an operation schedule for each.
if A runs out of it first, then just add each of B's object IDs to the end of A's
schedule with a : operator. }
else (B's begin[nodeID] is not self)
{ identify the segment that contains the object ID which is given as the
value of B's begin[nodeID];
if it is identified, then from the identified object ID do the same adding
operation as in the first part of step 4 (the if clause), i.e., the synchronization
schedule is created at this point; }
else (it fails to be identified; this means that a user is trying to bring a media
object from outside the current object hierarchy and to compose it
with the current objects) { error. }
5. Repeat steps 3, 4, and 5 until all the segments are chosen and operated on.
We can use the above algorithm to form the synchronization schedule for the
example lecture on Distributed Multimedia Systems shown in Figure 12. For
the segment on network requirements of distributed multimedia systems, the
synchronization schedule will be:
throughput requirements : delay requirements : QoS negotiation : mul-
ticasting : synchronization : ATM : FDDI : Frame Relay.
The scheme can also be used to adjust the temporal relations between media.
For example, the asynchrony between video and audio due to the differences
in the speeds of light and sound can be corrected using the content-based synchro-
nization scheme. In a similar manner, the scheme can be used to create a new
synchronization specification between video and a dubbed audio track during
movie editing.
6 MULTIMEDIA SYNCHRONIZATION
AND DATABASE ASPECTS
The approaches discussed so far describe effective ways of modeling the tem-
poral requirements of an orchestrated multimedia application. The multime-
dia objects have to be logically structured in a multimedia database and the
structure should reflect the synchronization characteristics of the multimedia
presentation. Also, multimedia data has to be delivered from a storage medium
based on a predefined retrieval schedule. The size and the real-time character-
istics of the multimedia objects necessitate different storage architectures. In
this section, we describe the database schema representation and physical stor-
age representation of multimedia objects with respect to the synchronization
characteristics.
In [12, 8], a hierarchical synchronization schema has been proposed for multi-
media database representation. Two types of nodes, terminal and nonterminal,
are defined in this approach. The terminal nodes in this model
represent base multimedia objects (audio, image, text, etc.) and point
to the location of the data for presentation. The nonterminal nodes have addi-
tional attributes defined to facilitate database access. The attributes include
timing information and node types (sequential and parallel), allowing the as-
sembly of multimedia objects during the presentation. The timing information
in the nonterminal node includes a time reference, playout time units, temporal
relationships and required time offsets for specific multimedia objects. Figure
13 shows the hierarchical database schema for the multimedia presentation ex-
ample described in Figure 5.
Figure 13 Hierarchical database schema for the example in Figure 5.
Multimedia data may have to be retrieved at a very high rate for HDTV video
objects, e.g., 2 Mbytes/second, from the disks.
Hence, the file system organization has to be modified for handling digital video
and audio files. The aim is to handle multiple huge files, as well as simultane-
ous access to different files, given the real-time constraint of data rates of up to 2
Mbytes/second. This problem has been studied in detail in [18]. Most of the
existing storage architectures allow unconstrained allocation of blocks on disks.
Since there is no constraint on the separation between the disk blocks storing a
chunk of digital video or audio, bounds on access and latency times cannot
be guaranteed. Contiguous allocation of blocks can guarantee continuous ac-
cess, but has the familiar disadvantage of fragmentation of useful disk space.
Constrained block allocation can help in guaranteeing bounds on access times
without encountering the above problems. For constrained allocation, factors
like the size of the blocks (granularity) and the separation between successive
blocks (scattering parameter) have to be determined to ensure guaranteed bounds
[18]. The retrieval schedule of data for multimedia objects can be affected when
the system becomes busy with other tasks. The allocation or data
placement schemes also have to take into consideration the factor of contention
with other processes in the system.
7 MULTIMEDIA SYNCHRONIZATION
AND OBJECT RETRIEVAL
SCHEDULES
Orchestrated multimedia presentations might be carried out over a computer
network, thereby rendering the application distributed. In such distributed
presentations, the required multimedia objects have to be retrieved from the
server(s) and transferred over the computer network to the client. The commu-
nication network can introduce delays in transferring the required multimedia
objects. Other conditions, such as congestion of a database server at a given
time and locking of the data objects by some other application, also have to
be considered. Retrieval of the multimedia objects has to be carried out keeping
in mind the delays that might be introduced during the presentation. A
retrieval scheduling algorithm has to be designed based on the synchroniza-
tion characteristics of the orchestrated presentation, incorporating features for
accommodating the delays that might be encountered during the presentation.
In [12, 15], a multimedia object retrieval scheduling algorithm has been pre-
sented based on the synchronization characteristics represented by the OCPN.
Characterizing the properties of the multimedia objects and the communica-
tion channel, the total end-to-end delay for a packet can consist of the following
components:
• Propagation delay, Dp
• Transfer delay, Dt, proportional to the packet size
• Variable delay, Dv, a function of the end-to-end network traffic.
The multimedia objects can be very large and hence can consist of many pack-
ets. If an object consists of r packets, the end-to-end delay for the object is:
De = Dp + r*Dt + sum(j = 1 to r) Dvj
The control time Ti is defined as the skew between putting an object i onto the
communication channel and playing it out. Considering the end-to-end delay
De, the control time Ti should be greater than De. The various timing param-
eters, the playout time (π), the control time (T), and the retrieval time (φ), are
as shown in Figure 14. The retrieval time for an object i (φi), or the object
production time at the server, is defined as φi = πi - Ti.
The above retrieval schedule is for a single object in a media stream. When
multiple objects are retrieved, the timing interactions of the objects have to be
considered. An optimal schedule for multiple objects is defined as one which
minimizes their control times [12, 15]. The constraints associated with the
determination of the minimum control time for a set of objects are as follows:
• An object cannot be played out before arrival, i.e., πi >= φi + Ti.
• The minimum retrieval time between successive objects must be respected,
i.e., φi-1 <= φi - Ti-1 + Dp.
Figure 14 Timing parameters for object retrieval: the retrieval time φ, the control time T, and the playout time π.
The retrieval times for the other objects can be worked out backwards based on
the schedule φm for the final object. The scheduling algorithm given in [12, 15]
is summarized as follows:
φm = πm - Tm
for i = 0 to m-2
  if (φm-i < πm-i-1 - Dp) then φm-i-1 = φm-i - Tm-i-1 + Dp
  else φm-i-1 = πm-i-1 - Tm-i-1
end.
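A sketch of this backward computation, under the assumption that the garbled pseudocode above has been reconstructed correctly: pi holds the playout instants, T the control times, Dp the propagation delay, and phi the retrieval times. The numbers are made up.

def retrieval_schedule(pi, T, Dp):
    m = len(pi)
    phi = [0.0] * m
    phi[m - 1] = pi[m - 1] - T[m - 1]       # schedule the final object
    for i in range(m - 2, -1, -1):          # work backwards
        if phi[i + 1] < pi[i] - Dp:
            phi[i] = phi[i + 1] - T[i] + Dp
        else:
            phi[i] = pi[i] - T[i]
    return phi

print(retrieval_schedule(pi=[2.0, 4.0, 6.0], T=[1.0, 1.0, 1.5], Dp=0.1))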
The scheduling algorithm works on the fact that these playout time instants are
to be satisfied given the end-to-end delay De involved in the transfer of multimedia
objects. In the TFG model, the playout instants (π) for the multimedia
objects are described in a fuzzy or relative manner. These fuzzy temporal re-
lations have to be converted into absolute schedules for retrieving the objects
from a server. In [20], retrieval scheduling algorithms have been designed for
the TFG specification of the synchronization characteristics.
8 MULTIMEDIA SYNCHRONIZATION
AND COMMUNICATION
REQUIREMENTS
In a distributed orchestrated multimedia presentation, the communication net-
work provides support for the transfer of data in real-time. The synchronization
characteristics of an orchestrated presentation define a predictable sequence of
events that happen in the time domain. This sequence of events can be used by
the network service provider to understand the behavior of the orchestrated ap-
plication in terms of the Quality of Service (QoS) requirements and the network
load that might be generated [17, 22, 23]. The QoS is character-
ized by a set of parameters such as the end-to-end throughput, the end-to-end
delay, and packet loss probabilities. The offered network load is characterized
by the traffic that might be generated by the application.
The B1 strategy minimizes the buffer space requirements of the client, and the
QoS derived based on this strategy gives the preferred values for the client.
The QoS requirements derived based on the B2 strategy specify the accept-
able values for the client. In a distributed orchestrated presentation, the multi-
media objects composing the presentation are retrieved from the server. The
retrieval of objects does not impose any real-time demands on the network
service provider. Hence, for orchestrated presentations, it is sufficient if the
network service provider guarantees the required throughput.
For objects such as still images, we can determine the average throughput re-
quired by the application with the B1 buffering strategy, assuming that at most
one object in each multimedia stream is buffered by the application on the
client system before the presentation. For objects such as video, which consist
of a set of frames to be presented at regular intervals, an application follow-
ing the B1 strategy can buffer at most one frame of the video object. Let us
consider a multimedia stream with an object Oi at a playout time instant ti,
and let the object size be Zoi. Since only one object is assumed to be buffered
before the playout time instant for the stream, the retrieval of the object can be
started only after the immediately preceding playout time instant ti-1 and has
to be completed before ti. Hence, the preferred throughput Cpref_app is [23]:
Cpref_app[ti-1, ti] >= Zoi / (ti - ti-1)
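A worked example of this formula with invented numbers: a 1,200,000-bit object played out 0.5 seconds after its predecessor needs at least 2.4 Mbps.

def preferred_throughput(z_bits, t_prev, t_i):
    return z_bits / (t_i - t_prev)     # minimum rate over [t_{i-1}, t_i]

print(preferred_throughput(1_200_000, 10.0, 10.5) / 1e6, "Mbps")   # 2.4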
For calculating the acceptable QoS parameters, we should find the minimum
values required by the application. A minimum value of throughput
(Cmin_app) implies that the application will be following the B2 buffering strategy
of buffering more than one object (in the stream for which the throughput is
being calculated) before its presentation. When more than one object in a stream
is buffered by the application, the start of the presentation is delayed by the time
required for retrieving Bmax bits using the acceptable throughput Cmin_app. If
Bmax is the maximum buffer space available and sum(i = 1 to n) Zoi is the total
size of all objects to be retrieved, then the acceptable throughput is [23]:
Figure 15 Acceptable and preferred throughput for stream V in Figure 5.
Figure 16 Buffer occupancy under the B1 strategy.
ABSTRACT
Infoscopes will be the microscopes and telescopes of the information systems of the fu-
ture. The emergence of information highways and multimedia computing has resulted
in exponential growth in the availability of multimedia data. Most information in
computers used to be alphanumeric; increasingly, information has been appearing in
graphic, image, audio, and video forms. Many approaches are being proposed for stor-
ing, retrieving, assimilating, harvesting, and prospecting information from disparate
data sources. Infoscopes will allow users to access information independent of the lo-
cations and types of data sources and will provide a unified picture of information to
a user. Due to their ability to represent information at different levels of abstraction,
these systems must recover and assimilate information from disparate sources. In this
chapter, we discuss the requirements of these emerging information systems. We discuss
basic architectures and data models for these systems. Finally, we briefly present a
few examples of early infoscopes.
1 INTRODUCTION
Most information on computers used to be alphanumeric. Technological changes
have brought a major change in the nature of computer-based information sys-
tems in the last few years. Increasingly the information has started appearing
in graphic, image, audio, and video forms. The increased capability of com-
puters to deal with multimedia is resulting in an exponential growth in the
availability of data in every form. The World Wide Web clearly demonstrates
what advances in communications, networking, and computing power can do.
Just a few years ago, it was difficult to imagine that net surfing would become
so common and it would be possible to navigate through so many places so
easily while sitting at your own desk! Now the WWW is the fastest growing
aspect of computing. It is clear that the WWW has become a model of modern
information systems.
We commonly mistake data for information. Information starts with data, but
it must be recovered. Data is NOT information; it is a source of information.
Data represents facts or observations. Data comes in many forms. The form,
or type, of data depends on the source used to acquire facts or observations.
Text, images, video, and sound, are all examples of data. Information is task
dependent and is derived from the data in a particular context using knowledge.
Multimedia computing is the ability of computers to deal with data of many
disparate types.
Better tools to produce and manage data, combined with the human desire to use information, have resulted in a tremendous data explosion. This data explosion has resulted in high information anxiety in modern society. In most cases,
including while surfing on the WWW, we suffer from data overload and become
confused, disoriented, and inefficient. Commonly people think that they have
information overload; what they really have is data overload. When there is
too much data, the human cognitive processes to recover information out of
that ocean of data become overloaded. In the last decade, techniques for the production, communication, storage, and even display of data have advanced significantly, but progress in techniques for recovering information out of data has been slow.
A picture is worth a thousand words. This statement is true for all forms
of data. The information extracted from the data depends on the observer
and the context in addition to the data itself. The same data can provide
different, sometimes conflicting, information. Current database systems have
mechanisms that result in very rigid semantics. These semantics come from the database designers and the users. In both cases, the tools used in current database systems to associate semantics with the data are more or less fixed at the time of database design. Such databases were
satisfactory in the early days of the information revolution. Now tools must be
developed to provide a rich semantic environment for users.
It would be impossible to cope with the explosion of multimedia data, unless the
data is organized in such a way that we can retrieve the information rapidly on
demand. A similar situation occurred for numeric and other structured data,
and led to the creation of computerized database management systems. In
databases, large amounts of data are organized into fields and key fields are used
to index the databases, making searches very efficient. These database systems
have revolutionized modern society. These systems, however, are limited by the
fact that they work well only with numeric data and short alphanumeric strings.
Since so much information is in non-alphanumeric form (such as images, video,
and speech), as a natural extension to the ideas in databases, researchers started
exploring the design and implementation of image databases. But creation of
mere image repositories, as practised in current image databases, is of little
value unless there are methods for rapid retrieval of images based on their
content, ideally with an efficiency that we find in today's databases. We should
be able to search image databases with image-based queries, in addition to
alphanumeric queries. The fundamental problem is that images, video, and
other similar data differ from numeric data and text in format, and hence
they require totally different techniques of organization, indexing, and query
processing. We need to consider the issues in visual information management,
rather than simply extending the existing database technology to deal with
images. We must treat images as one of the central sources of information
rather than as an appendix to the main database.
In this paper, we discuss some of these issues and then present the basic architecture, data model, and interaction environment for the information systems that will be essential in the coming information age.1 We call these systems infoscopes. These systems allow a closer and more detailed view of the data to an observer who wants to extract information, just as telescopes and microscopes do for the physical world.
1 We demonstrate these in the context of visual information, but they apply to other forms of data also.
A representation is always task dependent. The bit stream is good for storage
and communication of data because our present day computers and communi-
cation devices are truly adept at dealing with bits. A bit stream without the
knowledge of its structure is just a bit stream, however. It is useless to hu-
mans. For our use, the bit stream must be converted to a form that one of our
senses can understand. A picture displayed as a sequence of bits, a sequence
of digits, or a wave form is usually useless to us. It must be displayed as a
picture. The same is true with any other form of information. Most of us have 'seen' sound waves; few, if any, can make sense of those waveforms, even though the same sounds, presented in an appropriate form, are obvious to most. Thus, to make sense of a data stream, it should be presented in the proper format on a proper device. To accomplish this, information or metadata related to the source and format must be tagged with every bit stream. For interfacing with humans, the computer must know how
to send a stream to the proper device and adjust the parameters of the device
for 'impedance matching' with human senses. In other words, with every bit
stream there should be information that enables the computer to interpret the
bit stream in the context of interfacing with humans.
Images, video, audio, and other information representations involve large volumes
of data. Technology is progressing rapidly to deal with the required storage and
bandwidth problems. These information sources represent low-level informa-
tion. When considered as a bit stream with the meta information, the explicit
semantic information content in these sources is very low. This poses a serious
problem in accessing these information sources. Humans are very efficient at abstracting information and then interacting with other humans and devices at a high level. This allows high-bandwidth interactions among humans and between humans and machines. Multimedia systems currently suffer from this semantic bottleneck. Traditionally, most attention has been focused on storage and
communication of multimedia information. We are soon reaching the point
where multimedia systems will add a great deal to the data overload on peo-
ple. Techniques must be developed to add semantics to the data acquired from
disparate sources in disparate forms.
[Figure: multimedia databases relate entities in the real world to images and other media objects in the computer's world.]
Commonly there are two different places in database systems where semantics
is introduced. While designing the database, the designer includes semantics
in the form of relationships and attributes. These relationships form the first
level of semantics in databases. The second, and possibly more important, level
of semantics is provided by the user, or the application programmer. Based on
the knowledge of the semantics introduced by the designer and the knowledge
of the domain, a user or application programmer develops procedures to get
the information required from the database.
A user refers to entities in images as if they were real entities. This means that these
information systems must deal with one added level of data abstraction, and
this must be done transparently to users.
3 OPERATIONS IN INFOSCOPES
An information system should allow storage, communication, organization, pro-
cessing, and envisioning of information. It should facilitate natural interactions, which include multimedia input and output devices and the use of high-level domain knowledge by a user.
The system should provide powerful navigation tools. Unlike early database users, users of infoscopes will not articulate their queries in well-defined, crisp languages like SQL. These users will use vague natural language, which the system should understand well enough to let a user navigate through it. The nature of queries will be fuzzy not due to the laziness of the user, but due to the nature of the information and the size of the database. A general query environment will be like the one shown in Figure 2. A user looking for certain information, say about a person whom he vaguely recalls, will go to an infoscope and specify whatever important things he remembers about the person. This specification may be that she has big eyes, a wide mouth, long hair, and a small forehead. Based on this information, candidate people's pictures may be shown. The user can then select the closest person and modify his query by modifying the photo, either by specifying features or by using graphical and image editing tools. The refined query is sent to the system, which provides new candidates to satisfy it. Thus, a query is formulated incrementally, starting with the original vague idea. This process terminates when the user is satisfied.
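A minimal sketch of this incremental refinement loop, assuming a toy record structure and a naive match-counting similarity (the attribute names, records, and scripted edits are all hypothetical, standing in for interface actions):

    # Sketch: incremental query formulation as in Figure 2.
    def similarity(query, record):
        # Fraction of the query's features that the record matches.
        hits = sum(1 for k, v in query.items() if record.get(k) == v)
        return hits / max(len(query), 1)

    def refine(db, query, edits_per_round):
        # Each round: rank candidates, then fold in the user's next edit
        # (scripted here; in a real system it comes from the interface).
        best = None
        for edits in edits_per_round:
            ranked = sorted(db, key=lambda r: similarity(query, r), reverse=True)
            best = ranked[0]
            query.update(edits)
        return best

    people = [
        {"name": "A", "eyes": "big", "mouth": "wide", "hair": "long"},
        {"name": "B", "eyes": "big", "mouth": "small", "hair": "short"},
    ]
    print(refine(people, {"eyes": "big", "mouth": "wide"},
                 edits_per_round=[{"hair": "long"}, {}]))   # picks "A"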
Due to the nature of data, several levels of abstraction in the data, and temporal
changes in the data, the types and nature of interactions in such systems will
be richer than those in a database or image processing system. We loosely refer
to all interactions initiated by a user as queries. The types of queries in such
systems can be defined in the following classes:
[Figure 2 panels: the user describes the target face in general descriptive terms (sex, age, hair color); a new query is generated; the user chooses a face and alters a specific feature.]
Figure 2 This figure shows that queries in infoscopes will be incremental in nature. These queries will facilitate navigation and browsing of data.
1. Search. A user may know what he is looking for and articulate a query to search for it. A major difference in these queries will be the fact that similarity will become a central operation, rather than conventional matching. Techniques to evaluate similarity are an active research topic in many fields of science and technology [22]. Many approaches have been proposed to compare several attributes to evaluate the similarity of two objects. In addition to the decision of what attributes to select, a very difficult decision is how to combine those attributes. Methods to combine attributes are domain dependent and subjective. It is clear, however, that in dealing with images and similar data sets, similarity rather than matching will be a key function in searching.
2. Browse. A user may have a vague idea about the attributes of an entity,
relationships among entities in an image, or overall impression of an image.
Such ideas are formed due to the overall appearance of the image rather
than very specific objects and relations among them. In such cases, the user
may be interested in browsing the database based on an overall impression
or appearance of images, rather than searching for a specific entity. The
system should allow formulation of fuzzy queries to browse through the
database. In browsing mode, there is no specific entity that a user is
looking for. The system should provide datasets that are representative
of all data in the system. The system should also keep track of what has
been shown to the user. Some mechanism to judge interest of the user in
the data displayed should be developed and this interest should be logged
to determine what to display next.
3. Construct Solutions or Design. Most databases and information systems are designed assuming that a user will articulate his queries in one attempt. Many applications require an environment in which a user can incrementally introduce constraints and use the sequence of constraints to articulate his query. Each constraint reduces the search space, and the user can browse through this reduced space to decide what constraint to introduce next. The system interface should facilitate sequential introduction of constraints and evaluation of the results of each constraint introduction. These constraints may be symbolic or pictorial. In some cases, a user may want to select an image from the database and modify it to specify his requirements. Many design problems, including artistic design, are based on constructing solutions by introducing constraints sequentially, as illustrated in the sketch following this list.
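The sketch below illustrates the sequential introduction of constraints on a hypothetical image collection; each predicate stands in for one symbolic or pictorial constraint, and the user could browse the shrinking candidate set between steps:

    # Sketch: constructing a solution by introducing constraints one at a time.
    def apply_constraints(records, constraints):
        space = list(records)
        for pred in constraints:
            space = [r for r in space if pred(r)]   # each constraint prunes
            print(len(space), "candidates remain")
        return space

    images = [{"color": "red", "objects": 3},
              {"color": "red", "objects": 1},
              {"color": "blue", "objects": 2}]
    apply_constraints(images, [
        lambda r: r["color"] == "red",   # first constraint
        lambda r: r["objects"] >= 2,     # second constraint
    ])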
4 INFOSCOPE ARCHITECTURE
As is clear from the queries and the nature of the data and information, info-
scopes must combine features of databases, image understanding systems, and
knowledge-based systems. The interfaces for these systems will require careful consideration and will depend on the type of data and queries. In many cases,
information from multiple disparate sources must be combined to synthesize
the answer and then adequate visualization methods must be used to present
the answer.
[Figure: high-level infoscope architecture comprising an interactive query module, a data (image) processing module, and a knowledge module.]
These ideas have been used in our group to develop several systems for the retrieval of image and video information [7, 3, 24, 25, 10]. Here we discuss each of the system components briefly. In the following discussion, we present this architecture in the context of images and video, but our concepts are applicable to any kind of data.
[Figure: the architecture separates domain knowledge from the domain-independent components of the system.]
4.1 Database
This component provides the storage mechanism for the actual data as well as
the features of the data. Features evaluated at the time of insertion as well
as meta features are stored in the database with their value and a reference
to the image containing it. Similarly, every image in the database references
the features which it contains. This corresponds to the VIMSYS data model
discussed in [7]. Data is represented in the database at different levels of the
hierarchy, which allows well-defined relationships between the actual images,
image objects, and the real-world domain objects which they represent. In
addition to storing the actual image data, segmented regions of the image
which pertain to domain objects are also identified and stored. This provides an
effective mechanism for the computation of an additional feature of a domain
object, since it is not necessary to relocate the object in the image. Issues
related to compression, storage management, indexing, and other database
aspects are relevant in the design of the database.
Computer vision techniques are required to analyze the data and input it into
the system. Unfortunately, computer vision techniques can automatically ex-
tract features only in certain limited cases. In many applications, it may be
required to develop semiautomatic techniques for extracting features. Domain
knowledge plays a very important role in defining the processes used for auto-
matic feature extraction. Most research in computer vision has been concerned
with development of general purpose and automatic techniques. In infoscope ap-
plications, one may require techniques tuned to particular applications. These
techniques may be based on some basic image processing tools, but should use
domain knowledge to define and compute domain-dependent features. Also,
these techniques may be designed to assist a user rather than do everything
automatically.
4.3 Interface
This module is used interactively by the user to retrieve information from the
database. A user will articulate his request using symbolic and visual tools
provided by the system. Also, the system must decide the best display meth-
ods. In these systems, the role of the system is not just to provide the requested information, but to provide it in the most appropriate format.
During the retrieval process, a similarity value is assigned to data which satisfies
the constraints of the generated query. This value can then be used to rank the
results to be displayed to the user. Several factors can be incorporated into the
calculation of this similarity value. After the results of the query are displayed,
the user can generate a new query by using either the contents of these images,
newly-specified feature values, or both.
A query will usually be specified by giving several features and their relative weights. The system must use this weighted feature distance to judge the similarity of the example image with the target images in the database. A serious problem in evaluating similarity distances is the very subjective and application-dependent nature of similarity functions. Many studies have been performed to find the nature of similarity functions, and many alternatives exist. It is not very clear which function should be used in which situation [22].
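As one minimal sketch (the feature names, values, and weights below are illustrative, and a real similarity function would be chosen per domain), a weighted feature distance can be computed and used to rank target images:

    # Sketch: rank target images by weighted feature distance to an example.
    def weighted_distance(example, target, weights):
        # Smaller distance means more similar under the chosen weights.
        return sum(w * abs(example[f] - target[f]) for f, w in weights.items())

    example = {"color": 0.8, "texture": 0.3, "structure": 0.5}
    targets = [{"color": 0.7, "texture": 0.4, "structure": 0.5},
               {"color": 0.1, "texture": 0.9, "structure": 0.2}]
    weights = {"color": 2.0, "texture": 1.0, "structure": 1.0}  # user-set
    ranked = sorted(targets,
                    key=lambda t: weighted_distance(example, t, weights))
    print(ranked[0])   # the first target is the closer one here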
5 KNOWLEDGE ORGANIZATION
Although specific processes and information will differ greatly between differ-
ent domains, the types of information and the tasks required will be similar in
many applications of infoscopes. The implementation of the knowledge module
provides a consistent method for incorporating this common information into
the system, while allowing a more general architecture to be developed. By
recognizing this separation of knowledge, we will develop a system architecture
which will not be limited to a specific domain.
In any application, the domain objects are those objects (possibly unrelated)
that can be considered to be a unique entity type in the real world. Each of
these objects may be composed of other objects (a composite object), or itself
make up another object. These are the elements which the system and the user
must be able to identify and manipulate in order to represent the information
portrayed in the image. For example, in a face-identification application, the most obvious object is a FACE, which is composed of several other objects. The objects that comprise a FACE are LEFT_EYE, RIGHT_EYE, NOSE, etc. This relationship of objects corresponds to the Domain Event layer of the VIMSYS model. Regardless of the domain or the objects chosen to represent the domain, certain aspects of how to process these objects must be maintained. Domain objects and events must be related to image objects. Image objects are deter-
mined by considering what can be computed from images and what is required
for the given application. The set of image objects represents the alphabet used
by the system.
The functions that make use of this knowledge can be divided into three categories: assigning a value to each feature, together with a confidence in that assignment; altering feature values at the user's request, where each feature value may change in a unique manner based on the global statistics of that feature; and, once images are retrieved, ranking them by order of similarity, which makes use of information about the importance of each individual feature in differentiating between objects and images. The features themselves fall into the three sets listed below:
1. F_t. This set contains the features which are commonly referred to as meta-features. Some of these features can be automatically acquired from the information associated with images. These features may include the size of the image, the photographer, the date taken, the resolution, and similar information. This group also contains other features that can be called user-specified; values are assigned to these by the user at the time of insertion. Many of these features can be read by the system from the header, filename, or other similar sources. These features cannot be directly extracted from images.
2. F_d. This set contains the features which are derived directly from the image data at the time of insertion of the images into the database. Values are calculated for these features using automatic or semiautomatic functions. These features are called derived features and include features that are commonly required in answering queries. These features are stored in the database.
3. F_c. This set contains the features whose values are not calculated until they are needed. Routines must be provided to calculate these values when they become necessary. These features may be computed from the data at query time. These features are called query-only features or computed features.
The first two types of features are actually stored in the database. Metadata can frequently be read from other sources or must be entered manually. The system interface encourages users to formulate their queries using metadata and derived features as much as possible. It reluctantly allows the use of computed features. To access data, the system can prune the search space significantly using metadata and derived features and then apply computed features to only this reduced set of images. This strategy allows flexibility while maintaining a reasonable response time. The system may be able to predict the wait time using the number of images from which computed features must be extracted.
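A sketch of this prune-then-compute strategy, with hypothetical predicates standing in for the stored (metadata and F_d) filter and for an expensive computed (F_c) feature:

    # Sketch: cheap stored features prune the search space; expensive
    # computed features run only on the survivors.
    def answer_query(images, stored_pred, computed_feature, threshold):
        cheap_pass = [im for im in images if stored_pred(im)]
        # Wait time could be predicted from len(cheap_pass) at this point.
        return [im for im in cheap_pass if computed_feature(im) >= threshold]

    images = [{"date": 1994, "avg_color": 0.2},
              {"date": 1996, "avg_color": 0.9}]
    print(answer_query(images,
                       stored_pred=lambda im: im["date"] >= 1995,
                       computed_feature=lambda im: im["avg_color"],  # stand-in
                       threshold=0.5))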
6 INTERFACES
We discussed the types of operations that will be required in an infoscope.
Many of these operations cannot be conveniently performed using traditional
interfaces. Here we discuss some general issues in designing interfaces for in-
foscopes. Interactions with infoscopes are likely to be multimedia. Due to the
nature of the data and several abstraction levels, it is expected that users will
require multimodal interface mechanisms.
• Query by Pictorial Example (QPE): A user may present an example image and ask the system to retrieve all images similar to the example. Effectively, the system must rank all data with respect to the example and then display the pictures that are closest to the example. Interestingly, this has been a very popular approach in the image databases that are being designed [17].
In QPE, features and similarity measures must be clearly defined for use in retrieving images. Similarity judgement has been a difficult problem and continues to attract the attention of several researchers [22]. The most interesting fact about similarity measures is that they are domain dependent and very subjective. Assuming that we have identified a measure that is acceptable to a user for his or her domain, we face some interesting problems in QPE. All images are compared to the example to evaluate their similarity. This is possible in those cases where the size of the database is such that the computations can be done in reasonable time. When the database grows so large that it is not possible to accommodate all data in main memory and such computations become impractical, one must resort to indexing techniques.
Indexing techniques for spatial data have been developed [11, 17, 21].
These techniques are very limited when it comes to addressing the problem
of similarity indexing. Techniques like TV-trees [15] are a good step in the right direction but lack several important features [26].
• Query Canvas: Queries may be formulated by starting with an exist-
ing picture, scanning a new picture and modifying these by using visual
and graphical tools available in common picture editing programs, such as
Adobe Photoshop. One may cut-and-paste from several images to artic-
ulate a query in the form of an image. It is also possible to start from a
clean image and then draw an image using different tools. The basic idea
in this approach is to provide a tool to define a picture that may be used
in a QPE. This approach allows a user to define the picture that he or she is looking for.
• Image Part Queries: In many cases, a user may point to an object,
or circle an area in an image and request all images that contain similar
regions. These queries appear very easy, and will be very easy, if complete
segmentation of images is performed and then all region properties are
stored. Most image database systems store only global characteristics of images. In these cases, one is looking for all images that are a superset
of the region attributes. Once all such images are retrieved, some other
filtering techniques could be developed to solve this problem.
• Semantic Queries: All the above queries were based on image attributes.
In most applications, an image database is likely to be prepared for a spe-
cific domain-dependent application, such as human faces, icefloe images, or
retinal images. It is important that users can then interact using domain-
dependent terms. It is common that people may describe a person using
terms like big eyes, wide mouth, small ears, rather than the corresponding
image objects. An infoscope must be able to respond to these queries.
Semantic queries require extensive use of domain knowledge. Domain
knowledge is required both in defining features that will be used by the
system and in interpreting user queries. Most image database systems either considered domain knowledge implicitly by defining features or ignored
it [5]. The role of explicit knowledge in image databases is discussed in
[7, 24, 25].
• Object Related Queries: These queries are semantic queries that ask
for presence of an object. These queries may deal with three-dimensional
objects. Since three-dimensional objects are difficult to recognize using
automated techniques, these queries may become very complex. Three-
dimensional object recognition is a very active research area in machine
vision. Queries based on recognizing objects in a query image may be,
therefore, very difficult to execute.
• Spatio-temporal Queries: In video sequences, and in many other applications where pictures are obtained over a long period, a user may want to get answers about spatio-temporal events and concepts. Answers to such questions may require complete analysis of all video sequences and storage of some important features from them. Considering the fact that methods to represent temporal events are not yet well developed, this area requires much research before one can design a system that deals with spatio-temporal queries at the natural language level.
7 EXAMPLE SYSTEMS
In this section we discuss different levels in infoscopes and point to some existing
systems that have been implemented.
Many systems have been designed using image-only attributes. QBIC from
IBM [5] uses color, texture, and manually segmented shapes. QBIC was the first
complete system to demonstrate the efficacy of simple attributes in appearance-
based retrieval of images from a reasonably sized database. The use of shape
in QBIC is problematic, however. Shape is defined for individual segments
which must be obtained manually. This also creates an artificial situation in
the database because for each manually obtained segment, one must consider a
separate record in the database. Thus if an image has N objects, the database
must contain N+1 records; one for the image, and one each for N segments.
Shape measures on complete images are not satisfactory because shape is de-
fined for an image region. Some heuristics have been proposed, but much
remains to be done in this area.
Texture poses a somewhat more difficult problem. Most systems use global measures of
texture and try to assign some texture attribute to images. These attributes are
then used for evaluating similarity of texture in images. Most images contain
different types of texture in different parts of the image. The global texture
attributes, therefore, could be misleading. These systems use only the first two levels of the VIMSYS data model. Since both these levels, IR and IO, are domain independent, these image databases are domain independent. Users of these systems must supply the semantics. These semantics can be provided by using the color and texture attributes of objects of interest. One
may filter using these attributes and then use domain-dependent features on
remaining images to retrieve desired information.
Pinpoint retrieves images by combining several image features using weighted distance functions. This system treats keywords also as features, by using a thesaurus to compute distances between keywords in the query and stored images. The weights of the features can be changed to retrieve similar images using different similarity functions. We show a screen shot of this system in Figure 5. This shot shows all images retrieved as similar to the example image, which is the best-matching image and hence appears as the first image among the similar images. If images are created using the query canvas, shown in Figure 6, then one can articulate a query by cutting and pasting and some other image manipulation operations. It must be mentioned that this system has no domain-level knowledge.
Interestingly, even without any domain knowledge in this system, users very
quickly learn to retrieve images of their choice by using an example image and
appropriate weights of the features provided in the system. The system uses color, texture, and structure as features of an image. For color, both global colors and automatically segmented regions with their locations, defined as composition, are used. For texture, several properties are computed using standard texture features and are combined into an overall measure of the texture. Structure addresses the shapes and locations of edge segments. It is interesting to see that these purely image-based features, when combined with hand-drawn queries on a canvas or an image selected for QPE, perform quite
effectively in retrieving semantically relevant objects. This strongly suggests
that by defining a pictorial alphabet and suitable rules to use this alphabet, it
may be possible to develop powerful domain dependent systems.
Xenomania was an interactive system for the retrieval of face images and in-
formation. It allows a user to locate a specific person in the database and retrieve the person's image and other information. The user can describe the
target face in general terms, like shape of eyes, nose, length of hair, to begin
the location process and retrieve the initial results. After that, the target face
may be described using these general terms, or by using the actual image con-
tents of the retrieved faces. All aspects of the architecture described above are
Figure 5 A screen shot of Pinpoint showing the query window and all images retrieved using QPE. Notice that a user can adjust the weights of features, and the feedback to the user is instantaneous.
Figure 6 The query canvas enables a user to articulate a query using visual means. One can cut and paste from images and use image manipulation programs to articulate a query.
We chose the interactive face identification problem because of the lack of well-defined image objects and features, and the heavy dependence on both pre-
defined domain knowledge and extensive user participation. Although much
work has been done towards modeling of facial features, it is still very difficult
to accurately extract and evaluate these features over a variety of faces and sit-
uations, and even more difficult to assign semantic attributes to these features
which are meaningful to human users. This application demands both extensive, predefined domain knowledge and user-incorporated knowledge at every step of processing.
Xenomania relied very heavily on previous research in the field of face recognition. Much work has been done on the psychological aspects, which provided a basis for our initial implementation. Many automatic face recognition systems have also been developed. The Xenomania project is not a face recognition system, but rather an image database system used for interactive face retrieval.
Some face-recognition systems have approached the problem from a strictly image processing point of view, with little or no emphasis on descriptive representation of faces. These systems do not incorporate the user for describing the face, or for guiding the query refinement once the recognition process has been initiated. The most successful face recognition system is based on eigenfaces [19, 20]. This system is also influenced by image recognition approaches. In an eigenface-based system, one can specify an image and the system will retrieve all images that are similar to it. It may be interesting to combine eigenfaces
with the descriptive approach used in Xenomania.
Domain Knowledge
As in any image management application, we are faced with the difficulty of
determining which attributes are important for each domain object, and how
to accurately represent these attributes in the system. However, this is an
attractive problem from our point of view, because it gives us the opportu-
nity to investigate different types of object and feature representations. For
instance, there are several attributes about an eye that may be important. In-
dividual eye attributes such as area and width will be necessary, as will relative
attributes such as the width of the eye compared to the height of the eye. Spa-
tial attributes such as distance between the left eye and the right eye are also
important and must be incorporated into the system. Other objects, such as
eyebrows, may require entirely different attributes than those for eyes to be
maintained in the system. We have based much of our initial implementation
on research that has been done to evaluate which facial features and attributes
are best suited for face identification and differentiation.
Many image databases are likely to be for specific applications and hence will
require strong domain knowledge. The domain objects should be described
using the image alphabet or image objects in the VIMSYS model. This task will
require close interactions among database designers, image processing experts,
and domain experts.
Video is the most impressive medium for communicating and recording events
in our life. Its use is limited, however, by its basically sequential nature. To
access a particular segment of interest on a tape, one must spend significant time in searching for the segment. Video databases have the potential to change
the way we access and use video.
By storing each individual shot in the database, one can access any individual frame based on the content of the shot. Each shot can be analyzed to find what it contains, and the frames in each shot can be analyzed to find events in it. By segmenting videos into shots and analyzing those shots, one can extract information that can be put into a database. This database can then be searched to find sequences of interest.
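As one hypothetical illustration (the chapter does not prescribe a particular method), a simple histogram-difference cut detector can split a frame sequence into shots whose features are then stored and searched:

    # Sketch: start a new shot when consecutive frame histograms differ a lot.
    def segment_into_shots(frame_histograms, threshold=0.4):
        shots, start = [], 0
        for i in range(1, len(frame_histograms)):
            diff = sum(abs(a - b) for a, b in
                       zip(frame_histograms[i - 1], frame_histograms[i]))
            if diff > threshold:
                shots.append((start, i - 1))
                start = i
        shots.append((start, len(frame_histograms) - 1))
        return shots

    frames = [[0.9, 0.1], [0.88, 0.12], [0.2, 0.8], [0.22, 0.78]]
    print(segment_into_shots(frames))   # [(0, 1), (2, 3)]: a cut at frame 2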
The details of the system, the role of knowledge in such a system, and all other aspects have been presented in [23, 24, 9, 8]. It must be mentioned here that many other systems of this type are being implemented in other places.
The architecture for the video database is composed of the four major components discussed above. The input module is further divided into two major components: a sequence segmentation subsystem and a feature detection subsystem. The knowledge module has a video object schema definition subsystem to help a user enter knowledge into the system for a specific application. The video object schema definition subsystem provides tools to model the video object schema for an application based on the operators available in the input and query processing systems. Based on the video object schema, the feature detection subsystem analyzes a video frame sequence to extract structure and semantic information about each object of interest in the video. The extracted objects and related semantic information are then stored in the feature database. According to the video object schema definition, a user query interface is automatically customized. A user can also navigate the video object schema defined by the video object schema definition subsystem, as well as its associated video object data, through the user query interface.
Conventional video systems provide you only one choice for a particular segment of video: you can see it or skip it. In the case of TV broadcast, viewers have essentially no control. Even in those events where multiple cameras are used, a viewer has no choice except the obvious one of viewing the channel or using the remote control and going channel surfing. We believe that with the increased bandwidth, and advances in several areas of technology, the time has come to address the issues involved in providing truly interactive video and TV systems. Incidentally, the most serious limitation of modern television pointed out by George Gilder [6] is that viewers really have no choice. We believe that MPI (Multiple Perspective Interactive) video goes in the direction of liberating video and TV from the traditional single-source broadcast model and puts the viewer in the driver's seat.
We demonstrate our concepts in the context of a sporting event [14]. Our model
allows viewers to be active; they may request their preferred camera positions
and angles or they may ask questions about contents described in the video.
Our system will automatically determine the best camera and view to satisfy
the demands of the viewer. We believe these new functions are the key to making MPI video a revolutionary new medium. To make such functions possible, however, much advancement is required in the fields of computer vision, multimedia databases, and human interface design; see [23, 24, 9, 8] and [27, 28, 16, 2, 4, 1].
Let us assume that an episode is being recorded, or being viewed in real time. In the simplest and most obvious case, the episode can be recorded using multiple cameras strategically located at different points. These cameras may provide different perspectives of the episode. One camera view is very limited. Using computer vision and related techniques, it may be possible to take individual camera views and reconstruct the scene. These individual camera scenes can then be assimilated into a model that represents the complete episode. We call this model an environment model. The environment model has a global view of the episode and also knows where the individual cameras are. The environment model can be used by the system to allow users to view what they want and from where they want.
1. Specific Perspective: One may want to view the episode from a specific
perspective. The user may even specify the individual camera, or may
specify a general location for the camera.
4. Virtual Camera: It is possible that the viewer may request to view the
event from a perspective that is not provided by any real camera situated
to acquire the episode. In such cases, using environment model and simu-
lation techniques, one may create a virtual camera to generate the episode
from this specified perspective.
The high level architecture for this system is shown in Figure 7. Each camera
perspective is converted to its camera scene. Multiple camera scenes are then
assimilated into the environment model. A viewer can select his perspective
and that perspective is communicated to the environment model. The reasoning
system in the environment model decides what to send to the display of the
user.
[Figure 7: video streams 1 through n feed video data analyzers, whose camera scenes are assimilated into the environment model; a visualizer and virtual view builder drives the display manager serving the user.]
These projects are currently very active. We believe that we have only started scratching the surface of these systems. Research in all aspects of these systems continues in our laboratory and at many other laboratories. We believe that the next few years will see exponential growth in infoscopes. In this paper, our focus was on image and video systems. We expect most infoscopes to be able to deal with different modes of information, ranging from alphanumeric to sound and others, seamlessly. Due to their ability to represent information at different levels of abstraction, these systems can recover and assimilate information from disparate sources. We did not discuss assimilation and display of information here, but those will be equally important topics.
Acknowledgements
The research and ideas presented in this paper evolved during collaborations
with several people in the InfoScope project. I am thankful to everyone who
actively participated in the project. I want to particularly thank Jeff Bach,
Shankar Chatterjee, Amarnath Gupta, Arun Hampapur, Bradley Horowitz,
Arun Katkere, Don Kuramura, Saied Moezzi, Simone Santini, Chiao-Fe Shu,
Bo Smorhay, Deborah Swanberg, and Terry Weymouth for collaboration in
different aspects of this work.
REFERENCES
[1] A. Akutsu, Y. Tonomura, H. Hashimoto, and Y. Ohba, "Video indexing
using motion vectors", Proceedings of SPIE: Visual Communications and
Image Processing 92, November 1992.
[2] F. Arman, A. Hsu, and M.-Y. Chiu, "Image processing on compressed data
for large video databases", Proceedings of the ACM Multimedia, pages 267-
272, California, USA, June 1993.
[3] J. Bach, S. Paul, and R. Jain, "An interactive image management system
for face information retrieval", IEEE Transaction on Knowledge and Data
Engineering, Special Section on Multimedia Information Systems, June
1992.
[4] G. Davenport, T. A. Smith, and N. Pincever, "Cinematic primitives for
multimedia", IEEE Computer Graphics & Applications, pages 67-74, July
1991.
[5] C. Faloutsos, R. Barber, M. Flickner, J. Hafner, W. Niblack, D. Petkovic,
and W. Equitz, "Efficient and effective querying by image content", J. of
Intelligent Information Systems, Vol. 3, No. 3/4, pages 231-262, July 1994.
[6] G. Gilder, "Life After Television: The Coming Transformation of Media and American Life", W. W. Norton & Co., 1994.
[7] A. Gupta, T. Weymouth, and R. Jain, "Semantic queries with pictures:
the VIMSYS model", Proceedings of the 17th International Conference on
Very Large Data Bases, September 1991.
[11] H. V. Jagadish. "A retrieval technique for similar shapes", Proc. ACM
SIGMOD Conference, pages 208-217, May 1991.
[12] R. Jain, "NSF workshop on visual information management systems" SIG-
MOD Record, 22(3):57-75, December 1993.
[14] R. Jain and K. Wakimoto, " Multiple perspective interactive video", IEEE
Multimedia Computing Systems, pages 202-211, May 1995.
[15] K.-I. Lin, H. V. Jagadish, and C. Faloutsos, "The TV-tree: an index structure for high-dimensional data", VLDB Journal, 3:517-542, October 1994.
[16] A. Nagasaka and Y. Tanaka, "Automatic video indexing and full-video
search for object appearances", 2nd Working Conference on Visual
Database Systems, pages 119-133, Budapest, Hungary, October 1991.
[17] W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, D. Petkovic,
P. Yanker, C. Faloutsos, and G. Taubin. "The QBIC project: Querying
images by content using color, texture and shape", SPIE 1993 Inti. Sym-
posium on Electronic Imaging: Science and Technology, Storage and Re-
trieval for Image and Video Databases, February 1993. Also available as
IBM Research Report RJ 9203 (81511), February 1993.
[25] D. Swanberg, C.-F. Shu, and R. Jain, "Knowledge guided parsing in video
databases", Electronic Imaging: Science and Technology, San Jose, Cali-
fornia, February 1993.
[26] D. White and R. Jain, "Similarity indexing using the SS-tree", IEEE Data Engineering, submitted.
[27] H. J. Zhang, A. Kankanhalli, and S. W. Smoliar, "Automatic partitioning
of video", Multimedia Systems, 1(1):10-28, 1993.
[28] H. J. Zhang, Y. Gong, S. Smoliar, and S. Y. Tan, "Automatic parsing of
news video" , Proceedings of the IEEE Conference on Multimedia Comput-
ing Systems, Boston, Massachusetts, May 1994.
8
SCHEDULING IN MULTIMEDIA
SYSTEMS
A. L. Narasimha Reddy
IBM Almaden Research Center,
650 Harry Road, K56/802,
San Jose, CA 95120, USA
ABSTRACT
In video-on-demand multimedia systems, data has to be delivered to the consumer at regular intervals to ensure smooth playback of video streams. A video-on-demand server has to schedule the service of individual streams to ensure such smooth delivery of video data. We will present scheduling solutions for individual components of service in a multiprocessor-based video server.
1 INTRODUCTION
Several telephone companies and cable operators are planning to install large
video servers that would serve video streams to customers over telephone lines
or cable lines. These projects envision supporting several thousands of cus-
tomers with the help of one or several large video servers. These projects aim
to store movies in a compressed digital format and route the compressed movie
to the home where it can be uncompressed and displayed. These projects aim
to compete with the local video rental stores by offering better service: the ability to watch any movie at any time (avoiding the situation where all copies of the desired movie are already rented out) and a wider selection of movies.
Providing a wide selection of movies requires that a large number of movies
be available in digital form. Currently, with MPEG-l compression, a movie of
roughly 90 minute duration takes about 1 GB worth of storage. For a video
server storing about 1000 movies (a typical video rental store carries more), we
would then have to spend about $500,000 just for storing the movies on disk at
a cost of $0.5/MB. This requirement of large amounts of storage implies that
the service providers need to centralize the resources and provide service to a
large number of customers to amortize costs. Hence the requirement to build large video servers that can provide service to a large number of customers. See [2], [5], [14], [15] for some of the projects on video servers.
If such a large video server serves about 10,000 MPEG-1 streams, the server
has to support 10,000 * 1.5 Mbits/sec or about 2 GBytes/sec of I/O band-
width. Multiprocessor systems are suitable candidates for supporting such
large amounts of real-time I/O bandwidth required in these large video servers.
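The arithmetic behind these figures can be spelled out directly:

    # Back-of-the-envelope sizing from the numbers above: 1000 movies of
    # about 1 GB each at $0.5/MB, and 10,000 concurrent 1.5 Mbit/s streams.
    movies, movie_gb, dollars_per_mb = 1000, 1.0, 0.5
    storage_cost = movies * movie_gb * 1000 * dollars_per_mb
    print(storage_cost)             # 500000.0 dollars

    streams, mbits_per_stream = 10000, 1.5
    gbytes_per_sec = streams * mbits_per_stream / 8 / 1000
    print(gbytes_per_sec)           # 1.875, i.e., about 2 GBytes/sec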
We will assume that a multiprocessor video server is organized as shown in
Figure 1. A number of nodes act as storage nodes. Storage nodes are responsi-
ble for storing video data either in memory, disk, tape or some other medium
and delivering the required I/O bandwidth to this data. The system also has
network nodes. These network nodes are responsible for requesting appropriate
data blocks from the storage nodes and routing them to the customers. Both
these functions can reside on the same multiprocessor node, i.e., a node can
be a storage node, or a network node or both at the same time. Each request
stream would originate at one of the several network nodes in the system and
this network node would be responsible for obtaining the required data for this
stream from the various storage nodes in the system and delivering it to the
consumer. The data transfer from the network node to the consumer's monitor
would depend on the medium of delivery, telephone wire, cable or the LAN.
We will assume that the video data is stored on disk. Storing the video data on current tertiary media such as tape has been shown to be unattractive from a price-performance analysis [3]. Storing the video in memory may be attractive
for frequently accessed video streams. We will assume that the video data is
stored on disk to address the more general problem. The work required to
deliver a video stream to the consumer can then be broken down into three
components: (1) the disk service required to read the data from the disk into
the memory of the storage node, (2) the communication required to transfer
the data from the storage node memory to the network node's memory and (3)
the communication required over the delivery medium to transfer the data from
the network node memory to the consumer's monitor. These three phases of
service may be present or absent depending on the system's configuration. As pointed out already, if the video data is stored in memory, the service in phase 1 is not needed. If the video server does not employ a multiprocessor system, the service in phase 2 is not needed. If the consumer's monitor is attached to the network node directly, the service in phase 3 is not needed. Service in phase 3 depends on the delivery medium and we will not address it here. In this chapter, we will deal with the scheduling problems in phases 1 and 2.
The organization of the system and the data distribution over the nodes in
the system impact the overall scheduling strategy. In the next section, we will
describe some of the options in distributing the data and their impact on the
different phases of service. In Section 3, we will discuss scheduling algorithms
for (phase 1) disk service. In Section 4, we will describe a method for scheduling
the multiprocessor network resources. Section 5 concludes this chapter with
some general remarks and future directions.
2 DATA ORGANIZATION
If a movie is completely stored on a single disk, the number of streams of that movie that can be supported will be limited by the bandwidth of a single disk. As shown earlier by [12], a 3.5" 2-GB IBM disk can support up to 20 streams. A popular movie
may receive more than 20 requests over the length of the playback time of
that movie. To enable serving a larger number of streams of a single movie,
each movie has to be striped across a number of nodes. As we increase the
number of nodes for striping, we increase the bandwidth for a single movie.
If all the movies are striped across all the nodes, we also improve the load
balancing across the system since every node in the system has to participate
in providing access to each movie.
The width of striping (the number of disks a movie may be distributed on)
determines a number of characteristics of the system. The wider the striping,
the larger the bandwidth for any given movie and the better the load balancing.
A disk failure affects a larger number of movies when wider striping is employed.
When more disk space is needed in the system, it is easier to add a number of
disks equal to the width of striping. Hence, wider striping means a larger unit
of incremental growth of disk capacity. All these factors need to be considered
in determining the width of striping. For now, we will assume that all the
movies are striped across all the disks in the system. In a later section, we will
discuss the effects of employing smaller striping widths. The unit of striping
across the storage nodes is called a block.
Even though movies are striped across the different disks to provide high bandwidth for a movie, it is to be noted that a single MPEG-1 stream, with a bandwidth of 1.5 Mbits/sec, can be supported by the bandwidth of one disk. Requests of a movie stream can be served by fetching individual blocks at a time
from a single disk. Striping provides simultaneous access to different blocks of
the movie from different disks and thus increases the bandwidth available to
a movie. Higher stream rates of MPEG-2 can also be supported by requests
to individual disks. We will assume that a single storage node is involved in
serving a request block.
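A round-robin layout is one simple way to realize such striping (the layout policy below is an assumption for illustration, not the chapter's prescription):

    # Sketch: block k of a movie lives on storage node
    # (first_node + k) % num_nodes, so consecutive blocks hit different disks
    # while each individual block request is served by a single node.
    def storage_node(first_node, block_index, num_nodes):
        return (first_node + block_index) % num_nodes

    num_nodes = 8
    print([storage_node(3, k, num_nodes) for k in range(10)])
    # [3, 4, 5, 6, 7, 0, 1, 2, 3, 4]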
Data organization has an impact on the communication traffic within the system. During the playback of a movie, a network node responsible for delivering that movie stream to the user has to communicate with all the storage nodes where this movie is stored. This results in a point-to-point communication from
all the storage nodes to the network node (possibly multiple times depending
on the striping block size, the number of nodes in the system and the length of
[Figure 2: timeline of disk service for a stream, showing release events a_i, scheduling events s_i, and consumption events d_i at times t_0 through t_8.]
the movie) during the playback of the movie. Since each network node will be
responsible for a number of movie streams, the resulting communication pat-
tern is random point-to-point communication among the nodes of the system.
It is possible to achieve some locality by striping the movies among a small set of nodes and requiring that the network nodes for a movie be among this smaller set of storage nodes.
3 DISK SCHEDULING
A real-time request can be denoted by two parameters (c, p), where p is the period at which the real-time requests are generated and c is the service time required in each period. The earliest-deadline-first (EDF) analysis [10] showed that tasks can be scheduled by EDF if and only if the task utilization $\sum_{i=1}^{n} c_i/p_i \leq 1$. We will specify the real-time requests by specifying the required data rate in kbytes/sec. The time at which a periodic request is started is called the release time of that request. The time at which the request is to be completed is called the deadline for that request. Requests that do not have real-time requirements are termed aperiodic requests.
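A sketch of the corresponding admission test (the service times and periods below are illustrative):

    # Sketch: admit a new stream (c, p) only if total utilization stays <= 1.
    def admit(streams, new_stream):
        # streams: list of (service_time, period) pairs already admitted.
        c, p = new_stream
        utilization = sum(ci / pi for ci, pi in streams) + c / p
        return utilization <= 1.0

    admitted = [(0.2, 1.28), (0.3, 1.28)]   # e.g., disk seconds per period
    print(admit(admitted, (0.4, 1.28)))     # True: utilization ~ 0.70
    print(admit(admitted, (0.9, 1.28)))     # False: utilization ~ 1.09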
Figure 2 shows the progress of disk service for a stream. The request for block 0 is released at time $t_0$; the request is actually scheduled at time $t_1$, denoted by event $s_0$. Block 0 is consumed (event $d_0$) beginning at time $t_2$. The time between the consumption of successive blocks of this stream, $d_{i+1} - d_i$, has to be kept constant to provide glitch-free service to the user. For example, when 256 Kbyte blocks are employed for MPEG-1 streams, this is equal to about 1.28 seconds. The time between the scheduling events of successive blocks need not be constant. The only requirement is that the blocks be scheduled sufficiently in advance to guarantee that $d_{i+1} - d_i$ can be kept constant. This is shown in Figure 2. The vertical bars in the figure represent the size of the request block.
In real-time systems, algorithms such as earliest-deadline-first and least-slack-time-first are used. As pointed out earlier, strict real-time scheduling policies such as EDF may not be suitable candidates because of the random disk service times and the overheads associated with seeks and rotational latency.
The scheduling algorithm should be fair. For example, shortest-seek-time-first is not a fair scheduling algorithm, since requests at the edges of the disk surface may get starved. If the scheduling algorithm is not fair, an occasional request in the stream may get starved of service and hence will miss its deadline.
EDF is the earliest-deadline-first policy. SCAN-EDF is a hybrid algorithm that incorporates the real-time aspects of EDF and the seek-optimization aspects of SCAN. CSCAN and EDF are well-known algorithms and we will not elaborate on them further.
SCAN-EDF applies seek optimization to only those requests that have the same
deadline. Its efficiency depends on how often these seek optimizations can be
applied, or on the fraction of requests that have the same deadlines. SCAN-
EDF serves requests in batches or rounds. Requests are given deadlines at the
end of a batch. Requests within a batch then can be served in any order and
SCAN-EDF serves the requests within a batch in a seek optimizing order. In
other words, requests are assigned deadlines that are multiples of the period p.
When the requests have different data rate requirements, SCAN-EDF can be
combined with a periodic fill policy [16] to let all the requests have the same
deadline. Requests are served in a cycle with each request getting an amount
of service time proportional to its required data rate, the length of the cycle
being the sum of the service times of all the requests. All the requests in the
current cycle can then be given a deadline at the end of the current cycle.
The scan direction can be chosen in several ways. In Step 2, if the tasks are ordered by track number such that $N_1 \leq N_2 \leq \ldots \leq N_l$, then we obtain a CSCAN type of scheduling where the scan takes place only from the smallest track number to the largest track number. If the tasks are ordered such that $N_1 \geq N_2 \geq \ldots \geq N_l$, then we obtain a CSCAN type of scheduling where the scan takes place only from the largest track number to the smallest track number. If the tasks can be ordered in either of the above forms depending on the relative position of the disk arm, we get an (elevator) SCAN type of algorithm.
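The resulting ordering rule can be sketched as a sort keyed primarily by deadline, with deadline ties broken by track number in the chosen scan direction (the requests below are hypothetical):

    # Sketch: SCAN-EDF ordering of a batch of disk requests.
    def scan_edf_order(requests, ascending=True):
        return sorted(requests,
                      key=lambda r: (r["deadline"],
                                     r["track"] if ascending else -r["track"]))

    batch = [{"deadline": 2, "track": 500}, {"deadline": 1, "track": 900},
             {"deadline": 1, "track": 100}, {"deadline": 2, "track": 50}]
    for r in scan_edf_order(batch):
        print(r)   # deadline 1: tracks 100, 900; deadline 2: tracks 50, 500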
[Figure 3: progress of disk service for two streams providing the same stream rate; (a) deadlines equal to the period p, (b) deadlines deferred to 2p, showing release events a_i, scheduling events s_i, and consumption events d_i.]
If the I/O system issues requests of size S, then the buffer space requirement for each stream is 2S. If the I/O system supports n streams, the total buffer space requirement is 2nS.
There is another tradeoff that is possible. The deadlines of the requests need
not be chosen equal to the periods of the requests. For example, we can defer
the deadlines of the requests by a period and make the deadlines of the requests
equal to 2p. This gives more time for the disk arm to serve a given request and
may allow more seek optimizations than are possible when the deadlines
are equal to the period p. Figure 3(b) shows two streams providing the same
constant stream rate, but with different characteristics of progress along the
time scale. The stream with the deferred deadlines provides more time for the
disk to service a request before it is consumed. This results in a scenario where
the consuming process is consuming buffer 1, the producing process (disk) is
reading data into buffer 3, and buffer 2 has been filled earlier by the producer and
is awaiting consumption. Hence, this raises the buffer requirement to 3δ for each
request stream. The extra time available for serving a given request allows
more opportunities for it to be served in the scan direction. This results in
more efficient use of the disk arm and, as a result, a larger number of request streams
can be supported on a single disk. A similar technique called work-ahead is
utilized in [1]. Scheduling algorithms for real-time requests when the deadlines
are different from the periods are reported in [8][13].
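The buffer arithmetic above can be summarized in a small sketch (Python; the block size and stream count are arbitrary example values):

def buffer_bytes(block_size, n_streams, deferred=False):
    # Deadlines equal to the period p need two buffers per stream (2*delta);
    # deferring deadlines to 2p adds a third buffer (3*delta).
    per_stream = (3 if deferred else 2) * block_size
    return n_streams * per_stream

print(buffer_bytes(64 * 1024, 10))                  # 1,310,720 bytes
print(buffer_bytes(64 * 1024, 10, deferred=True))   # 1,966,080 bytes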
Both these techniques, larger requests with larger periods and delayed deadlines,
increase the latency of service at the disk. When the deadlines are delayed,
the data stream cannot be consumed until two buffers are filled, as opposed to
waiting for one filled buffer when deadlines are equal to periods. When larger
requests are employed, a longer time is taken for reading a larger block, and hence
a longer time passes before the multimedia stream can be started. Larger requests
increase the response time for aperiodic requests as well, since the aperiodic
requests will have to wait for a longer time behind the current real-time request
that is being served. The improved efficiency of these techniques needs
to be weighed against the higher buffer requirements and the higher latency for
starting a stream.
Simulation model
A disk with the parameters shown in Table 8.1 is modeled. It is assumed
that the disk uses split-access operations, or zero-latency reads. In split-access
operation, the request is satisfied by two smaller requests if the read-write
head happens to be in the middle of the requested data at the end of the seek
operation. The disk starts servicing the request as soon as any of the requested
blocks comes under the read-write head. For example, if a request asks for
reading blocks numbered 1, 2, 3, 4 from a track of eight blocks 1, 2, ..., 8, and the
read-write head happens to get to block number 3 first, then blocks 3 and 4 are
read, blocks 5, 6, 7, 8 are skipped over, and then blocks 1 and 2 are read. In such an
operation, a disk read/write of a single track will not take more than a single
revolution. Split-access operation is shown to improve the request response
time considerably in [11]. Split-access operation, besides reducing the average
service time of a request, also helps in reducing the variability in service time.
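The read order in the example can be reproduced with a short sketch (Python; the 1-based block numbering and the 8-block track follow the example above):

def split_access_order(requested, head, track_size=8):
    # Zero-latency read: serve requested blocks in the order they pass
    # under the head, wrapping around the track at most once.
    order = []
    for off in range(track_size):
        blk = ((head - 1 + off) % track_size) + 1
        if blk in requested:
            order.append(blk)
    return order

print(split_access_order({1, 2, 3, 4}, head=3))   # -> [3, 4, 1, 2]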
The service policy for aperiodic requests depends on the scheduling policy
employed. In EDF and SCAN-EDF, they are served using the immediate server
approach [9], where the aperiodic requests are given higher priority over the
periodic real-time requests. The service schedule of these policies allows a
certain number of aperiodic requests each period, and when a sufficient number
of aperiodic requests are not present, the real-time requests make use of the
remaining service period. This policy of serving aperiodic requests is employed
so as to provide reasonable response times for both aperiodic and periodic
requests. This is in contrast to earlier approaches, where the emphasis has been
only on providing real-time performance guarantees. In CSCAN, aperiodic
requests are served in the CSCAN order.
Each aperiodic request is assumed to ask for a track of data. The request size
for the real-time requests is varied among 1, 2, 5, or 15 tracks. The effect
of request size on the number of supportable streams is investigated. The period
between two requests of a request stream is varied depending on the request
size to support a constant data rate of 150 kB/sec. The requests are assumed
to be uniformly distributed over the disk surface.
Two systems are modeled: one with deadlines equal to the request periods and a
second with deadlines equal to twice the request periods. A comparison
of these two systems gives insight into how performance can be improved by
deferring the deadlines.

Each experiment involved running 50,000 requests of each stream. The maximum
number of supportable streams n is obtained by increasing the number
of streams incrementally until, at n + 1, the deadlines can no longer be met. Twenty
experiments were conducted, with different seeds for random number generation,
for each point in the figures. The minimum among these values is chosen
as the maximum number of streams that can be supported; each point in the
figures is obtained in this way. The minimum is chosen (instead of the average)
in order to guarantee the real-time performance.
3.4 Results
When deadlines are deferred, CSCAN has the best performance. SCAN-EDF
has performance very close to CSCAN. EDF has the worst performance. EDF
scheduling results in random disk arm movement, and this is the reason for its poor
performance.
[Figure 4: maximum number of streams supported (6-26) as a function of request size for EDF, CSCAN, and SCAN-EDF, with extended deadlines (dotted lines) and nonextended deadlines (dashed lines).]
Figure 4 also presents the improvements that are possible by increasing the
request size. As the request size is increased from 1 track to 15 tracks, the
number of supportable streams keeps increasing. The knee of the curve seems
to be around 5 tracks, or 200 kbytes. At larger request sizes, the different
scheduling policies make relatively less difference in performance: the transfer
time dominates the service time, and when the seek time overhead is a smaller
fraction of the service time, the different scheduling policies have less scope for
optimizing the schedule. Hence, all the scheduling policies perform equally well
at larger request sizes.
At smaller request sizes, deferring the deadlines has a greater impact on performance
than increasing the request size. For example, at a request size of 1
track and deferred deadlines (with buffer requirements of 3 tracks), EDF supports
13 streams. When deadlines are not deferred, at a larger request size of
2 tracks and buffer requirements of 4 tracks, EDF supports only 12 streams. A
similar trend is observed with the other policies as well, and a similar observation
can be made when request sizes of 2 and 5 tracks are compared.
From Figures 4 and 5, it is seen that SCAN-EDF performs well under both measures
of performance. CSCAN performs well in supporting real-time requests
but does not have very good performance in serving the aperiodic requests.
EDF does not perform very well in supporting real-time requests but offers
good response times for aperiodic requests. SCAN-EDF supports almost as
many real-time streams as CSCAN and at the same time offers the best response
times for aperiodic requests. When both performance measures are
considered, SCAN-EDF has the better characteristics.
[Figure 5: aperiodic request response times for EDF, CSCAN, and SCAN-EDF.]
CSCAN treats all requests equally, and hence a higher aperiodic request arrival rate only
reduces the time available for the real-time request streams and does not alter
the schedule of service. In the other policies, since aperiodic requests are given
higher priorities, a higher aperiodic request arrival rate results in less efficient
arm utilization due to more random arm movement. Hence, the other policies see
more impact on performance due to a higher aperiodic request arrival rate.
A more detailed performance study can be found in [12] where several other
factors such as the impact of a disk array are considered.
[Figure: maximum number of streams supported as a function of the aperiodic request inter-arrival time (20-200 ms) for EDF, CSCAN, and SCAN-EDF, with extended and nonextended deadlines.]
[Figure: number of streams supported (15-30) for EDF, CSCAN, and SCAN-EDF.]
The time taken to serve a batch of requests through a sweep, using SCAN-EDF,
has little variance. The variances of individual seek times could add
up to a large variance if the requests were served by a strict EDF policy. SCAN-EDF
reduces this variance by serving all the requests in a single sweep across the
disk surface. By reducing the variance, SCAN-EDF reduces the time taken
for serving a batch of requests and hence supports a larger number of streams.
This reduction in the variance of the service time for a batch of requests has a
significant impact on improving the service time guarantees. Larger request
sizes and the split-access operation of the disk arm also reduce the variance in
service time by limiting the random, variable components of the service time to
a smaller fraction.
Figure 8 compares the predictions of analysis with results obtained from simulations
for extended deadlines. For this experiment, aperiodic requests were
not considered; hence the small difference in the number of streams supportable
by SCAN-EDF from Figure 4. It is observed that the analysis is very
close to the simulation results; the error is within one stream.
[Figure 8: maximum number of streams supported by SCAN-EDF as a function of request size, simulations vs. analysis (dashed).]
The SCSI bus is a priority-arbitrated bus. If more than one disk tries to transfer
data on the bus, the disk with the higher priority always gets the bus. Hence, it is
possible that real-time streams being supported by the lower-priority disks
may get starved if the disk with higher priority continues to transmit data.
Better performance may be obtained with other arbitration policies, such as a
round-robin policy. For multimedia applications, other channels, such as IBM's
proposed SSA, which operates as a time-division multiplexed channel, are more
suitable.
Figure 9 shows the impact of SCSI bus contention on the number of streams
that can be supported. The number of streams supported is less than three
times the real-time request capacity of an individual disk. This is mainly due
to the contention on the bus. At a request size of 5 tracks, the ratio of the
number of streams supported in a three-disk configuration to that of a single-disk
configuration varies from 2.1 in the system with extended deadlines to 1.8
in the system without extended deadlines. This again shows that deadline
extension increases the chances of meeting deadlines, in this case by smoothing
over the bus contention delays. Figure 9 assumes that the numbers of streams
on the three disks differ by at most one. If the higher-priority disk is allowed
to support more real-time streams, the total throughput of real-time streams
out of the three disks would be lower. We observed a sharp reduction in the
number of streams supported at the second and third disks when the number
of streams supported at the first disk is increased even by one. For example, at
a request size of 5 tracks and extended deadlines, SCAN-EDF supported 15, 14,
and 14 streams at the three disks, but only 7 streams at the second
and third disks when the number at the first disk is raised to 16.
[Figure 9: number of streams supported as a function of request size for a single disk and for three disks sharing one SCSI bus, with extended (dotted) and nonextended (dashed) deadlines.]
Another key difference is that, with SCSI bus contention, there is
a peak in the number of supportable request streams as the request size is increased. With
larger blocks of transfer, the SCSI bus can be busy for longer periods of time
when a disk with lower priority wants to access the bus, causing it to
miss a deadline. From the figure it is found that the optimal request size for a
real-time stream is roughly around 5 tracks.

The optimal request size is mainly related to the relative transfer speeds of the
SCSI bus and the raw disk. When a larger block size is used, disk transfers
are more efficient, but, as explained earlier, disks with lower priority see larger
delays and hence are more likely to miss deadlines. When a shorter block is
used, disk transfers are less efficient, but the latency to get access to the SCSI bus
is shorter. This tradeoff determines the optimal block size.
Most modern disks have a small buffer on the disk arm for storing the
data currently being read by the disk. Normally, the data is filled into this buffer
by the disk arm at the media transfer rate (in our case, 3.8 MB/sec) and
transferred out of this buffer at the SCSI bus rate (in our case, 10 MB/sec).
If this arm buffer is not present, the effective data rate of the SCSI bus is
reduced to the media transfer rate or lower. When the disk arm buffers are
present, SCSI transfers can be initiated by the individual disks in an intelligent
fashion such that the SCSI data rate is kept high while the individual transfers
complete across the SCSI bus as they complete at the disk surface. IBM's
Allicat drive utilizes this policy for transferring in and out of its 512-kbyte arm
buffer, and this is what is modeled in our simulations. Without this arm buffer,
when multiple disks are configured on a single SCSI bus, the real-time performance
will be significantly lower.
4 NETWORK SCHEDULING
We will assume that time is divided into a number of 'slots'. The length of a
slot is roughly equal to the average time taken to transfer a block of movie over
the multiprocessor network from a storage node to a network node. The average
delivery time itself is not enough for choosing a slot; we will comment later on
how to choose the size of a slot. Each storage node starts transferring a block
to a network node at the beginning of a slot, and this transfer is expected to
finish by the end of the slot. It is not necessary for the transfer to finish strictly
within the slot, but for ease of presentation we will assume that a block transfer
completes within a slot.
The time taken for the playback of a movie block is called a frame. The length
of the frame depends on the block size and the stream rate. For a block size of
256 Kbytes and a stream rate of 200 Kbytes/sec, the length of a frame equals
256/200 = 1.28 seconds. We will assume that a basic stream rate of MPEG-1
quality, at 1.5 Mbits/sec, is supported by the system. When higher stream rates
are required, multiple slots are assigned within a frame to achieve the required
delivery rate for that stream. It is assumed that all the required rates are
supported by transferring movie data in a standard block size (which is also
the striping size).
For a given system, the block size is chosen first. For a given basic stream
rate, the frame length is then determined. The slot width is then approximated by
dividing the block size by the average achievable data rate between a pair of
nodes in the system. This value is adjusted for variations in communication
delay. Also, we require that the frame length be an integer multiple of the slot
width. From here on, we will refer to the frame length in terms of the number of
slots per frame, F.
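The sizing procedure can be sketched as follows (Python; the node-to-node rate is an assumption borrowed from the 40 MB/sec link speed mentioned later in this section, and the 20% padding for delay variations is likewise an assumed margin):

BLOCK_SIZE = 256 * 1024            # bytes (also the striping size)
BASIC_STREAM_RATE = 200 * 1024     # bytes/sec (roughly MPEG-1 at 1.5 Mbits/sec)
AVG_NODE_RATE = 40 * 1024 ** 2     # bytes/sec between a pair of nodes (assumed)

frame_len = BLOCK_SIZE / BASIC_STREAM_RATE     # 1.28 seconds
slot = (BLOCK_SIZE / AVG_NODE_RATE) * 1.2      # pad for delay variations
F = int(frame_len // slot)                     # slots per frame
slot = frame_len / F                           # make the frame exactly F slots
print(F, round(slot * 1000, 2), "ms")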
When the system is running at its capacity, each column of the schedule table has an entry for
each storage node. The schedule in slot j can be represented by a set (n_ij, s_ij),
a set of network node and storage node pairs involved in a block transfer in
slot j. If we specify F such sets for the F slots in a frame (j = 1, 2, ..., F), we
completely specify the schedule. If a movie stream is scheduled in slot
j of a frame, then it is necessary to schedule the next block of that movie in
slot j of the next frame (or in slot (j + F) mod FN) as well. Once the movie
distribution is given, the schedule of transfers (n_ij, s_ij) in slot j of one frame
automatically determines the pairs in the next frame, s_{i,(j+F) mod FN}
being the storage node storing the next block of this movie and n_{i,(j+F) mod FN}
= n_ij. Hence, given a starting entry in the table (row and column specified), we can
Movie/Block    0   1   2   3
A              0   1   2   3
B              1   3   0   2
C              2   0   3   1
D              3   2   1   0
E              2   1   0   3

Stream/Slot (slots 0-11; entries are movie.block)
0:   E.2  E.1  E.0  E.3
1:   C.2  C.0  C.3  C.1
2:   B.1  B.3  B.0  B.2
3:   E.2  E.1  E.0  E.3
immediately tell what other entries are needed in the table. It is observed that
the F slots in a frame are not necessarily correlated to each other. However,
there is a strong correlation between two successive frames of the schedule, and
this correlation is determined by the data distribution. It is also observed that
the length of the table, FN, is equal to the number of streams that the whole
system can support.
Now, the problem can be broken into two pieces: (a) Can we find a data
distribution that, given an assignment (n_ij, s_ij) that is source and destination
conflict-free, produces a source and destination conflict-free schedule in the
same slot j of the next frame? and (b) Can we find a data distribution that,
given an assignment (n_ij, s_ij) that is source, destination, and network conflict-free,
produces a source, destination, and network conflict-free schedule in the
same slot j of the next frame? The second part of the problem, (b), depends
on the network of the multiprocessor, and that is the only reason for addressing
the problem in two stages. We will propose a general solution that addresses
(a). We then tailor this solution to suit the multiprocessor network to address
problem (b).
Part (a)
Assume that all the movies are striped among the storage nodes starting at node
0 in the same pattern, i.e., block i of each movie is stored on the storage node given
by i mod N, N being the number of nodes in the system. Then, a movie stream
accesses storage nodes in a fixed sequence once it is started at node 0. If we can start
the movie stream, it implies that the source and the destination do not collide
in that time slot. Since all the streams follow the same sequence of source
nodes, when it is time to schedule the next block of a stream, all the streams
scheduled in the current slot will request a block from the next storage node
in the sequence and hence will not have any conflicts. In our notation, a
set (n_ij, s_ij) in slot j of a frame is followed by the set (n_ij, (s_ij + 1) mod N) in
the same slot j of the next frame. It is clear that if (n_ij, s_ij) is source and
destination conflict-free, (n_ij, (s_ij + 1) mod N) is also source and destination
conflict-free.
This simple distribution, however, has several problems: (i) only one movie stream can be started in
any given slot. Since every movie stream has to start at storage node 0, node 0
becomes a serial bottleneck for starting movies. (ii) When short movie clips
are played along with long movies, the short clips increase the load on the first
few nodes in the storage node sequence, resulting in non-uniform loads on the
storage nodes. (iii) As a result of (i), the latency for starting a movie may be
high if the request arrives at node 0 just before a long sequence of scheduled
busy slots.
The proposed solution addresses all of the above issues (i), (ii), and (iii) as well as the
communication scheduling problem. The proposed solution uses one sequence
of storage nodes for storing all the movies, but it does not stipulate that every
movie start at node 0. We allow movies to be distributed across the storage
nodes in the same sequence, but with different starting points. For example,
movie 0 can be distributed in the sequence 0, 1, 2, ..., N-1; movie 1 can be
distributed in the sequence 1, 2, 3, ..., N-1, 0; and movie k (mod N) can be
distributed in the sequence k, k+1, ..., N-1, 0, ..., k-1. We can choose any
such sequence of storage nodes, with different movies having different starting
points in this sequence.
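A minimal sketch of this distribution (Python; the node count is an arbitrary example value):

N = 16   # number of storage nodes (example value)

def storage_node(movie_id, block_index):
    # All movies follow the same node sequence 0, 1, ..., N-1, but movie
    # k starts at node k mod N, so block i of movie k lives on (k + i) mod N.
    return (movie_id + block_index) % N

# Movies 0 and 1 have different starting nodes, so both can be started
# in the same slot without a source conflict:
print(storage_node(0, 0), storage_node(1, 0))   # -> 0 1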
When movies are distributed this way, we achieve the following benefits: (i)
Multiple movies can be started in a given slot. Since different movies have
different starting nodes, two movie streams can be scheduled to start at their
starting nodes in the same slot. We no longer have a serial bottleneck at the
starting node (we actually do, but only for 1/Nth of the content on the server). (ii)
Since different movies have different starting nodes, even when the system has
short movie clips, all the nodes are likely to see similar workloads, and hence
the system is likely to be better load-balanced. (iii) Since different movies have
different starting nodes, the latency for starting a movie is likely to be lower,
since the requests are likely to be spread out more evenly.
The benefits of the above approach can be realized on any network. Again, if
the set (n_ij, s_ij) is source and destination conflict-free in slot j of a frame, then
the set (n_ij, (s_ij + 1) mod N) is guaranteed to be source and destination conflict-free
in slot j of the next frame, whether or not all the movies start at node 0. As
mentioned earlier, it is possible to find many such distributions. In the next
section, it will be shown that we can pick a sequence that also solves problem
(b), i.e., guarantees freedom from conflicts in the network.
Part (b)
The issues addressed in this section are specific to the network of the system.
We will use IBM's SP2 multiprocessor with an Omega interconnection network
as an example multiprocessor. The solution described is directly applicable
to hypercube networks as well, and the same technique can be employed to find
suitable solutions for other networks. We will show that the movie distribution
sequence can be carefully chosen to avoid communication conflicts in the
multiprocessor network. The approach is to choose an appropriate sequence of
storage nodes such that if movie streams can be scheduled in slot j of a frame
without communication conflicts, then the consecutive blocks of those streams
can be scheduled in slot j of the next frame without communication conflicts.
First, let us review the Omega network. Figure 11 shows a multiprocessor system
with 16 nodes which are interconnected by an Omega network constructed
out of 4x4 switches. To route a message from a source node whose address is
given by p0p1p2p3 to a destination node whose address is given by q0q1q2q3,
the following procedure is employed: (a) shift the source address left circularly
by two bits (the log of the switch size) to produce p2p3p0p1, (b) use the switch
in that stage to replace p0p1 with q0q1, and (c) repeat the above two steps for
the next two bits of the address. In general, steps (a) and (b) are repeated as
many times as the number of stages in the network. Network conflicts arise in step (b) of the
above procedure when messages from two sources need to be switched to the
same output of a switch.
Now, let us address our problem of guaranteeing freedom from network conflicts
for the set (n_ij, (s_ij + 1) mod N) given that the set (n_ij, s_ij) is conflict-free. Our
result is based on the following theorem of Omega networks.

Theorem: If a set of nodes (n_i, s_i) is network conflict-free, then the set of
nodes (n_i, (s_i + a) mod N) is network conflict-free, for any a.

Proof: Refer to [7].
The above theorem states that, given a network conflict-free schedule of communication,
a uniform shift of the source nodes yields a network conflict-free
schedule.
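The routing procedure and the theorem can be checked by brute force (Python; this models the 16-node, two-stage Omega network of 4x4 switches from Figure 11, and treats two transfers as conflicting when they use the same switch output in the same stage):

def route_links(src, dst):
    # Stage-output labels used when routing src -> dst: circularly shift
    # the current 4-bit address by 2 bits, then substitute 2 destination bits.
    t, labels = src, []
    for stage in range(2):
        dig = (dst >> (2 * (1 - stage))) & 0x3   # next 2 destination bits
        t = ((t << 2) & 0xF) | dig
        labels.append((stage, t))
    return labels

def conflict_free(pairs):
    # pairs = set of (network node, storage node); transfers run storage -> network.
    used = set()
    for net, sto in pairs:
        for link in route_links(sto, net):
            if link in used:
                return False
            used.add(link)
    return True

# The identity schedule S1 = {(i, i)} is conflict-free, and so is every
# uniform shift of the storage nodes, as the theorem asserts:
S1 = {(i, i) for i in range(16)}
assert conflict_free(S1)
assert all(conflict_free({(n, (s + a) % 16) for n, s in S1}) for a in range(16))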
[Figure 11: a 16-node multiprocessor interconnected by a two-stage Omega network of 4x4 switches; nodes 00 (0000) through 15 (1111) appear on both sides of the network.]
There are several possibilities for choosing a storage sequence that guarantees
the above property. The sequence 0, 1, 2, ..., N-1 is one of the valid sequences:
a simple solution indeed! Let us look at an example. The set S1 = {(0,0),
(1,1), (2,2), ..., (14,14), (15,15)} of network-storage node pairs is conflict-free over
the network (identity mapping). From the above theorem, the set S2 = {(0,1),
(1,2), (2,3), ..., (14,15), (15,0)} is also conflict-free, and can be so verified. If S1
is the conflict-free schedule in a slot j, S2 will be the schedule in slot j of the
next frame, which is also conflict-free.
Now, the only question that remains is how we schedule a movie stream in the
first place, i.e., in which slot a movie should be started.
When the request arrives at a node n_i, we first determine its starting storage node
s_0 based on the movie distribution. We look at each available slot j (where
n_i is free and s_0 is free) to see whether the set of already scheduled movies
conflicts for communication with this pair. We search until we find such a slot
and schedule the movie in that slot. Then, the complete length of that movie
is scheduled without any conflicts.
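A minimal admission sketch (Python; it reuses the conflict_free predicate from the Omega sketch above, and assumes the schedule is kept as a list of FN slot sets):

def find_start_slot(schedule, n_i, s_0):
    # Find a slot where network node n_i and starting storage node s_0
    # are both free and the added transfer keeps the slot conflict-free.
    for j, busy in enumerate(schedule):
        if any(n == n_i or s == s_0 for n, s in busy):
            continue                             # source or destination busy
        if conflict_free(busy | {(n_i, s_0)}):
            return j                             # admit the stream in slot j
    return None                                  # reject: no admissible slot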
[Figure: measured results as a function of request inter-arrival time (ms).]
It is clear from the above description that we need to carry out some experiments
in choosing the optimal slot size. Both the average and the maximum
delays in transferring a block over the network need to be considered. As mentioned
earlier, the slot size is then adjusted such that a frame is an integer
multiple of the width of the slot. Since the block transfers are carefully scheduled
to avoid conflicts, it is expected that the variations in communication
times will be lower in our system.
[Figure: latency (log scale, 10-10000) with and without latency reduction.]
Node failures
Before we can deal with the subject of scheduling after a failure, we need to discuss
how the data on the failed node is duplicated elsewhere in the system.
There are several ways of handling data protection, RAID and mirroring being
two examples. RAID increases the load on the surviving disks by 100%, and this
will not be acceptable in a system that has to meet real-time guarantees unless
the storage system can operate well below its peak operating point. Mirroring
may be preferred because the required bandwidths from the data stored in the
system are high enough that the entire storage capacity of a disk drive may not
be utilized; the unutilized capacity can be used for storing a second copy of
the data. We will assume that the storage system does mirroring. We will also
assume that the mirrored data of a storage node is evenly spread among some
set of K, K < N, storage nodes.
Let the data on the failed node f_0 be mapped to nodes m_0, m_1, ..., m_{K-1}. Before
the failure, a stream may request blocks from nodes 0, 1, 2, ..., f_0, ..., N-1
in a round-robin fashion. The mirrored data of a movie is distributed
among m_0, m_1, ..., m_{K-1} such that the same stream would request blocks in
the following order after a failure: 0, 1, 2, ..., m_0, ..., N-1, 0, 1, 2, ..., m_1, ..., N-1,
..., 0, 1, 2, ..., m_{K-1}, ..., N-1, 0, 1, 2, ..., m_0, ..., N-1. The blocks that would
have been requested from the failed node are requested from the set of mirror
nodes of that failed node in a round-robin fashion. With this model, a failure
increases the load on the mirrored set of nodes by a factor of (1 + 1/K), since for
every request to the failed node, a node in the set of mirror nodes observes
1/K requests. This implies that K should be as large as possible to limit the
load increase on the mirror nodes.
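The block-to-node mapping after a failure can be sketched as follows (Python; the node count and the mirror set are arbitrary example values):

def node_for_block(i, N, failed=None, mirrors=None):
    # Block i normally lives on node i mod N; after node 'failed' dies,
    # its blocks are fetched round-robin from its K mirror nodes.
    node = i % N
    if failed is None or node != failed:
        return node
    return mirrors[(i // N) % len(mirrors)]

# Blocks 2, 10, 18 of a movie on an 8-node system normally live on the
# failed node 2; with K = 3 mirrors each mirror absorbs 1/K of that load:
print([node_for_block(i, 8, failed=2, mirrors=[5, 6, 7]) for i in (2, 10, 18)])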
Suppose there are l free slots in the schedule table. If the blocks previously served
by the failed node can be scheduled within these free slots, we can serve all the
streams that we could serve before the failure. Now, let us examine the conditions
that enable us to do this.

Given that the data on the failed node is now supported by K other nodes, the
total number of blocks that can be communicated in l slots is given by K * l.
The failed node could have been busy during (FN - l) slots before the failure.
This implies that K * l >= FN - l, or l >= FN/(K + 1)   (1).
It is noted that no network node n_i can require communication from the failed
node f_0 in more than (FN - l)/N slots. Under the assumption of system-wide
striping, once a stream requests a block from a storage node, it does not
request another block from the same storage node for another N - 1 frames.
Since each network node can support at most (FN - l)/N streams before the
failure, no network node requires communication from the failed node f_0 in
more than (FN - l)/N slots. Since every node is free during the l free slots,
the network nodes require that l >= (FN - l)/N, or l >= FN/(N + 1)   (2). The
above condition (1) is more stringent than (2).
Ideally, we would like K = N - 1, since this minimizes the load increase on the
mirror nodes. Also, we would like to choose the mirror data distribution such
that if a block transfer from the mirror nodes is guaranteed to be conflict-free
during a free slot j, then it will also be conflict-free in slot j + FN (the
same free slot in the next schedule table), when the transfers will require
data from the next node in the mirror set. In our notation, if the set (n_i, m_i) is
conflict-free in a free slot j, then we would like the set (n_i, m_{(i+1) mod K}) to be
conflict-free in slot j + FN.
The schedule of block transfers during the free slots is constructed as follows. A maximal
set of block transfers is found that has no conflicts in the network;
this set is assigned one of the free slots. With the remaining set of required
block transfers, the above procedure is repeated until all the communication is
scheduled. This algorithm is akin to the problem of finding a minimal set of
matchings of a graph such that the union of these matchings yields the graph.
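A greedy sketch of this procedure (Python; it again leans on the conflict_free predicate from the Omega sketch, and makes no claim of minimizing the number of slots used):

def schedule_free_slots(transfers):
    # Pack pending (network, storage) transfers into free slots: each
    # pass moves a maximal conflict-free subset into one slot.
    slots, pending = [], list(transfers)
    while pending:
        chosen, rest = set(), []
        for pair in pending:
            if conflict_free(chosen | {pair}):
                chosen.add(pair)
            else:
                rest.append(pair)
        slots.append(chosen)
        pending = rest
    return slots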
We can show an upper bound on the number of free slots required. When the
Omega network is built out of 4x4 switches, at least 4 blocks can always be
transferred without network conflicts as long as the sources and destinations
have no conflicts. If a set of four destinations is chosen such that they
differ in the most significant 2 bits of the address, it can be shown that, as long
as the sources and destinations are distinct, the block transfers do not collide
in the network. The proof is based on the procedure for switching a block from
a source to a destination: if the destinations are so chosen, it can be shown
that these four transfers use different links in the network. Since at most FN - l
blocks need to be transferred during the free slots, l <= (FN - l)/4. This gives
l <= FN/5. This implies that if the network nodes requiring communication
from the failed node are equally distributed over all the nodes in the system,
we can survive a storage node failure with about 20% overhead.
Network node failures can be handled in the following way. The movie streams
at the failed node are rerouted (redistributed) evenly to the other network nodes
in the system. This assumes that the delivery site can be reached through any
one of the network nodes. The redistributed streams are scheduled as if the
requests for these streams (with a starting point somewhere within the length of
the movie, not necessarily at the beginning) were new requests.

If a combo node fails, both of the above procedures, for handling the failure of a
storage node and of a network node, need to be invoked.
Clock Synchronization
Throughout this section, it is assumed that the clocks of all the nodes in the
system are somehow synchronized and that block transfers can be started
at the slot boundaries. If the link speeds are 40 MB/sec, a block transfer of 256
Kbytes requires 6.4 ms, quite a large period of time compared to the precision of
the node clocks, which tick every few nanoseconds. If the clocks are synchronized
to drift by at most, say, 600 us, the nodes observe the slot boundaries within ±10%.
During this time, it is possible that the block transfers experience collisions in
the network; but during the remaining 90% of the transfer time, the block transfers
take place without any contention over the network. This shows that the clock
synchronization requirements are not very strict. It is possible to synchronize
clocks to such a coarse level by broadcasting a small packet of data at regular
intervals to all the nodes through the switch network.
With the proposed data distribution (a single storage node sequence with different
starting nodes), a conflict-free schedule in one slot guarantees that the set of
transfers required a frame later will also be conflict-free.
For other, lower-degree networks, such as a mesh or a two-dimensional torus,
it can be shown that similar guarantees cannot be provided. For example, in
a two-dimensional n x n torus, the average path length of a message is 2 * (n/4)
= n/2. Given that the system has a total of 4n^2 unidirectional links, the
average number of transmissions that can be in progress simultaneously is given
by 4n^2/(n/2) = 8n, which is less than the number of nodes, n^2, in the system
for n > 8. However, n^2 simultaneous transfers are possible in a two-dimensional
torus when each node sends a message to a node along a ring. If this is the
starting position of data transfer in one slot, data transfer in the following
frames cannot be sustained, because of the above limitation on the average
number of simultaneous transfers through the network. In such networks, it
may be advantageous to limit the data distribution to a part of the system so
as to limit the average path length of a transfer and thus increase the number
of sustainable simultaneous transfers.
Incremental growth
How does the system organization change if we need to add more disks for
putting more movies in the system? In our system, all the disks are filled
nearly to the same capacity, since each movie is distributed across all the
nodes. If more disk capacity is required, we would require that at least one
disk be added at each of the nodes; if the system has N nodes, this requires
N disks. The newly added disks can be used as a set to distribute
movies across all the nodes, to obtain similar guarantees for the new movies
distributed across these nodes. If the system size N is large, this may pose a
problem. In such a case, it is possible to organize the system such that movies
are distributed across a smaller set of nodes. For example, the movies can be
distributed across the two sets 0, 2, 4, 6 and 1, 3, 5, 7 in an 8-node machine
to provide guarantees similar to those obtained when the movies are distributed
across all 8 nodes in the system. (This result is again a direct consequence of the
above theorem.) In this example, we only need to add 4 new disks for expansion,
as opposed to adding 8 disks at once. This idea can be generalized to provide
a unit of expansion of K disks in an N-node system, where K is a factor of N.
This shows that the width of striping has an impact on the system's incremental
expansion: the wider the movies are striped across the nodes of the system,
the larger the bandwidth to a single movie, but also the larger the unit of
incremental growth.
5 GENERAL DISCUSSION
When the system is expanded, the newly added disks may have different performance
characteristics than the already installed disks. How do we handle
the different performance characteristics of different disks?
Providing fast-forward and rewind operations has not been discussed in this
chapter. Depending on the implementation, these operations may place
varying demands on the system. It is possible to store a second version of the
movie, sampled at a higher (fast-forward) rate and then compressed, on the disk
for handling these operations. Then, fast-forward and rewind operations will
not place any extra demands on the system resources, but will introduce the
problem of scheduling the proper version of the movie at the right time. These
strategies remain to be evaluated.
Acknowledgements
The work reported here has benefited significantly from discussions and inter-
actions with Jim Wyllie and Roger Haskin of IBM Almaden Research Center.
REFERENCES
[1] D.P. Anderson, Y. Osawa, and R. Govindan, "Real-Time Disk Storage
and Retrieval of Digital Audio/Video data", Technical Report UCB/CSD
91/646, University of California, Berkeley, August 1991.
[2] D.P. Anderson, Y. Osawa, and R. Govindan, "A File System for Continuous
Media", ACM Transactions on Computer Systems, November 1992,
pp. 311-337.
[3] A. Chervenak, "Tertiary Storage: An Evaluation of New Applications",
Ph.D. Dissertation, University of California, Berkeley, 1994.
[4] H.M. Deitel, "An Introduction to Operating Systems", Addison Wesley,
1984.
[5] R. Haskin, "The Shark Continuous-Media File Server", Proceedings of
IEEE COMPCON, February 1993.
[6] K. Jeffay, D.F. Stanat, and C.D. Martel, "On Non-Preemptive Scheduling
of Periodic and Sporadic Tasks", Proceedings of the Real-Time Systems
Symposium, December 1991, pp. 129-139.
[7] D.H. Lawrie, "Access and Alignment of Data in an Array Processor",
IEEE Transactions on Computers, Vol. 24, No. 12, December 1975, pp.
1145-1155.
[8] J.P. Lehoczky, "Fixed Priority Scheduling of Periodic Task Sets with Arbitrary
Deadlines", Proceedings of the Real-Time Systems Symposium, December
1990, pp. 201-212.
[9] T.H. Lin and W. Tarng, "Scheduling Periodic and Aperiodic Tasks in Hard
Real-Time Computing Systems", Proceedings of SIGMETRICS, May 1991,
pp. 31-38.
[10] C.L. Liu and J.W. Layland, "Scheduling Algorithms for Multiprogramming
in a Hard Real-Time Environment", Journal of the ACM, 1973, pp. 46-61.
[11] A.L. Narasimha Reddy, "A Study of I/O System Organizations", Proceedings
of the Int. Symposium on Computer Architecture, May 1992.
[13] W.K. Shih, J.W. Liu, and C.L. Liu, "Modified Rate Monotonic Algorithm
for Scheduling Periodic Jobs with Deferred Deadlines", Technical Report,
University of Illinois, Urbana-Champaign, September 1992.
[14] F.A. Tobagi, J. Pang, R. Baird, and M. Gang, "Streaming RAID: A Disk
Storage System for Video and Audio Files", Proceedings of the ACM Multimedia
Conference, August 1993, pp. 393-400.
[15] H.M. Vin and P.V. Rangan, "Designing File Systems for Digital Video
and Audio", Proceedings of the 13th ACM Symposium on Operating Systems
Principles, 1991.
[16] J. Yee and P. Varaiya, "Disk Scheduling Policies for Real-Time Multimedia
Applications", Technical Report, University of California, Berkeley,
August 1992.
[17] P.S. Yu, M.S. Chen, and D.D. Kandlur, "Grouped Sweeping Scheduling
for DASD-Based Multimedia Storage Management", Multimedia Systems,
Vol. 1, 1993, pp. 99-109.
9
VIDEO INDEXING AND
RETRIEVAL
Stephen W. Smoliar and HongJiang Zhang
Institute of Systems Science, National University of Singapore
Singapore
1 INTRODUCTION
1.1 Motivation
Video technology has developed thus far as a technology of images, but little
has been done to help us use those images effectively. We can buy a camera
that "knows" how to focus itself properly or compensate for our inability to
hold it steady without a tripod; but no camera knows "where the action is"
during a football game or even a press conference. A camera shot can give us
a clear image of the ball going through the goal posts, but only if we find the
ball for it.
The effective use of video is beyond our grasp because the effective use of its
content is still beyond our grasp. In this Chapter we shall address four areas
in which software can make the objects of video content more accessible:
classifying agent [K+91], which makes it very likely that any given video
may be classified according to multiple ontologies.
Indexing and retrieval: One way to make video content more accessible is
to store it in a database [SSJ93]. Thus, there are also problems concerned
with how such a database should be organized, particularly if its records
are to include images as well as text. Having established how material can
be put into a database, we must also address the question of how that
same material can be effectively retrieved, either through directed queries
which must account for both image and text content or through browsing
when the user may not have a particularly focused goal in mind.
Interactive tools: Most of our experiences with video involve sitting and
watching it passively. For video to be an information resource, we shall
need tools which facilitate interacting with it. These tools will make it
more likely that the functionality of the other three areas in this list will
actually be employed.
If the word is the fundamental syntactic unit of language, then the fundamental
unit of video and film is the shot, defined to be "one uninterrupted run of
the camera to expose a series of frames" [BT93]. The shot thus consists of a
sequence of image units, which are the frames. Often, it is desirable to represent
a shot by one or more of its frames which capture its content; such frames are
known as key frames [O'C91].
Much of the analysis of frames and shots is concerned with the extent to which
they are perceptually similar. This means that it is necessary to define some
quantitative representation of qualitative differences. This representation is
called a difference metric [ZKS93].
Queries are processed most efficiently if the images are indexed according to
quantitative representations of their content features, an organization known
as content-based indexing. However, content-based queries can rarely be processed
as exactly as conventional alphanumeric queries. The result is more likely
to be a set of suitable candidates than a set of images which exactly match what
the user has specified [F+94]. If the number of candidates is sufficiently large,
the user will also need browsing facilities [ZSW95] to review them quickly and
select those which are closest to what he had in mind.
2 PARSING
2.1 Techniques
Temporal Segmentation
Assuming that our basic indexing unit is a single uninterrupted camera shot,
temporal segmentation is the problem of detecting boundaries between consecutive
camera shots. As was observed in Section 1.2, the general approach
to a solution has been the definition of a suitable quantitative difference metric
which represents significant qualitative differences between frames. A segment
boundary can then be declared whenever that metric exceeds a given threshold.
One of the most successful of these metrics uses a histogram of intensity levels,
since two frames with similar content will show little difference in their respective
histograms. The histogram is represented as a function H_i(j), where i is
the frame number and j is the code for a specific histogram bin. The simplest
way to define histogram bins is as ranges of intensity values. However, it is also
possible to define bins which correspond to intensity ranges of color components,
making the histogram a somewhat richer representation of color content
[NT92]. Regardless of how the bins are defined, the difference between the ith
frame and its successor may be computed as a discrete L1-norm [Rud66] as
follows:

    SD_i = \sum_{j=1}^{G} |H_i(j) - H_{i+1}(j)|                (9.1)

where G is the number of histogram bins. A variant which enhances the
difference is the chi-square comparison:

    SD_i = \sum_{j=1}^{G} (H_i(j) - H_{i+1}(j))^2 / H_{i+1}(j)   (9.2)

However, experimental results reported in [ZKS93] showed that this also increases
the difference due to camera or object movements. Therefore, the overall
performance is not necessarily better than that achieved by using Equation
9.1, while Equation 9.2 also requires more computation time.
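Both metrics are a few lines of code (Python; the epsilon guard against empty bins is an implementation convenience, not part of the published definitions):

def l1_histogram_diff(h1, h2):
    # Equation 9.1: discrete L1-norm between consecutive frame histograms.
    return sum(abs(a - b) for a, b in zip(h1, h2))

def chi_square_diff(h1, h2, eps=1e-9):
    # Equation 9.2: chi-square comparison, which amplifies differences,
    # including those caused by camera or object motion.
    return sum((a - b) ** 2 / (b + eps) for a, b in zip(h1, h2))

def is_shot_boundary(h1, h2, threshold):
    return l1_histogram_diff(h1, h2) > threshold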
We can say that a particular block has changed across the two frames if its
difference exceeds a given threshold T_b:

    D(i, i + φ) > T_b                (9.7)
Motion vectors will fail to match most, if not all, blocks across a camera shot
boundary. If M is the number of valid motion vectors for each P frame, or the
smaller of the numbers of valid forward and backward motion vectors for each
B frame, and n is a threshold value close to zero, then

    M < n                (9.8)

will be an effective indicator of a camera boundary before or after (depending
on whether interpolation is forward or backward) the B/P frame [ZLS95].
For MPEG sources it is possible to combine data from both DCT coefficients
and motion vectors in a hybrid technique. The first step is to apply a DCT
comparison, such as the one defined by Equation 9.7, to the I frames with a large
skip factor φ to detect regions of potential gradual transitions, breaks, camera
operations, or object motion. The large skip factor reduces processing time by
comparing fewer frames. Furthermore, gradual transitions are more likely to be
detected as potential breaks, since the difference between two more "temporally
distant" frames can be larger than the break threshold T_b. The drawbacks
of using a large skip factor, false positives and low temporal resolution for
shot boundaries, are then remedied by a second pass, applied only to the
neighborhoods of the potential breaks and transitions, in which the difference
metric of Equation 9.8 is applied to all the B and P frames of those selected
sequences to confirm and refine the break points and transitions detected by
the DCT comparison.
The number of key frames needed to represent each shot depends on features of
the shot content, variations in those features, and the camera operations involved
[ZSW95]. The technique is based on the same sort of difference metric used for
temporal segmentation; in this case the computation takes place within a single
shot. The first frame is proposed as a key frame, and consecutive frames are
compared against that candidate. A two-threshold technique, similar to twin-comparison,
identifies a frame significantly different from the candidate; that
frame is proposed as another candidate, against which successive frames are
compared. Users can specify the density of the detected key frames by adjusting
the two threshold values.
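A simplified sketch of this selection loop (Python; it collapses the two-threshold test into a single threshold, so it illustrates the candidate-promotion idea rather than the exact twin-comparison rule):

def extract_key_frames(frames, diff, threshold):
    # Within one shot: the first frame is the initial candidate key frame;
    # any frame differing from the current candidate by more than
    # 'threshold' is promoted to the next candidate.
    keys, candidate = [frames[0]], frames[0]
    for f in frames[1:]:
        if diff(candidate, f) > threshold:
            keys.append(f)
            candidate = f
    return keys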
Model-Based Parsing
Automatic extraction of the "semantic" information of general video programs is
outside the capability of current signal analysis technologies [PPS94]. On the
other hand, "content parsing" may be possible with a structure model based on
domain knowledge. Such a model may represent spatial order within individual
images and/or temporal order across a sequence of shots.
Algorithms used                   Nd   Nm   Nf   Np   Nz
Gray-level comparison            101    8    9    2    1
χ² gray-level comparison          93   16    9    2    3
2.2 Examples
Temporal Segmentation
Some experimental results which summarize the efficacy of twin-comparison
are presented in Table 1 [ZKS93]. These figures are based on applying three
difference metrics to a documentary video. The first two rows of Table 1 show
the results of applying difference metrics 9.1 and 9.2, respectively, to gray-level
histograms. The third row shows the result of applying difference metric 9.1 to
histograms of a six-bit color code.
Table 1 shows that histogram comparison, whether the bins are based on gray-level
or color intensity, is highly accurate in detecting both camera breaks and
gradual transitions. Color gives the best results: besides the high accuracy, it
is also the fastest of the three algorithms. (This is primarily because the color
histogram requires only 64 histogram bins, since the code is based on the two
high-order bits of the red, green, and blue intensities; gray level, on the other
hand, is represented as an 8-bit value, yielding a 256-bin histogram. In other
words, six bits of a color code appear to provide more effective information than
eight bits of gray level.) Finally, the rightmost two columns in Table 1 account
for transitions detected by twin-comparison which are actually due to camera
operation and are detected by analysis of motion content.
This documentary was also compressed in MPEG, and Figure 3 shows sequences
of the three difference metrics discussed in Section 2.1. A good example of the
advantage of the hybrid approach is break 2 in Figure 3(a), which is missed
by the DCT correlation algorithm but is clearly recognized by motion vector
comparison. The images compared across this break (as well as those across
break 3, for control) are illustrated in Figure 4. Figure 5, on the other hand, shows
the camera pan corresponding to segment T1 in Figure 3(b), which is identified
as a gradual transition by twin-comparison.

Figure 3 (a) Video parsing results on a test video compressed in MPEG with
1 intra-picture frame, 1 predictively coded frame, and 4 bi-directionally predictively
coded frames in every 6 frames: DCT coefficient correlation using Arman's
difference value; difference values are computed between successive intra-picture
frames. T_b is the threshold for detecting a camera break; T_t is the
second threshold used in twin-comparison.
Figure 3 (continued)
(b) Pair-wise block comparison of DCT coefficients based on difference value
D(i, i + φ); difference values are computed between successive intra-picture
frames.
two "Stock footage" videos consisting of unedited raw shots of various lengths,
covering a wide range of scenes and objects. "Singapore," the second test, is
travelogue material from an elaborately produced documentary which draws
upon a variety of sophisticated editing effects. Finally, the "Dance" video is
the entirety of Changing Steps, a "dance for television" conceived by choreog-
rapher Merce Cunningham and video designer Elliot Caplan, which contains
shots with fast moving objects (dancers), complex and fast changing lighting
and camera operations. and highly imaginative editing effects, many of which
are far less conventional than those in "Singapore." As shown in Table 2, the
temporal segmentation algorithms perform very well with an accuracy over 95%
(counting both missing and false detection as errors) for the first three video
sequences. Most of the missed detections of shot boundaries in "Dance" are
due to special editing effects in which techniques such as fades and dissolves
are combined with lighting changes in manners which often leave the viewer
unaware that a shot change is taking place.
For all shots which are correctly detected, at least one key frame has been
detected to represent the shot. As was observed in Section 2.1, the density
of key frames is under user control; these trials yielded an average of
between two and three key frames per shot. The test results also demonstrate
the abstraction of camera operations (both panning and zooming) by key frames.
Figure 6 illustrates a "soft" VCR which displays the extracted key frames for
the camera shot currently being viewed. In this particular example those key
frames summarize a shot which zooms out.
Program   N    Ns   Nm   Nf
SBC1      20   18    2    0
SBC2      19   18    1    0
Model-Based Parsing
Table 3 [Z+95a] lists the numbers of video shots and anchorperson shots identified
by the news parsing system discussed in Section 2.1. Over 95% of the
anchorperson shots were detected correctly. The missing anchorperson shot in
SBC1 was due to the anchorperson not facing the camera. There are a few
false positives in both programs, where an interview scene was falsely detected
as an anchorperson shot.
Table 4 [Z+95a] lists the numbers of news items identified by the news parsing
system and the numbers manually identified by watching the programs.
The system identifies news items with over 95% accuracy. The missed news
items resulted from the assumption that each news item in the program starts
with an anchorperson shot followed by a sequence of news shots. However,
this condition can be violated both by news items which are only read by an
anchorperson without news shots and by news items which start without an
anchorperson shot. Those limitations are difficult to overcome with only image
analysis techniques, and content analysis would benefit from the texts of either
teleprompter scripts or closed captions.
3 REPRESENTATION AND
CLASSIFICATION
3.1 Techniques
Image Features
While key frame extraction constitutes an approach to representation (the
abstraction of motion video into static images), there remains the question of
how the content of those images may be represented. One of the most appealing
approaches is to work with descriptions based on properties which are inherent
in the images themselves: the patterns, colors, textures, and shapes of image
objects, and related layout and location information. For many applications
such queries may be either supplemental or preferable to text, whether because
they are necessary or simply easier to formulate. For example, when it is necessary to verify
that a trademark or logo has not been used by another company, the easiest
way is to query an image database system for all images similar to the proposed
pattern [K+91]. Search is driven by identifying specific features of that pattern
which need to match images in the database.
There are two ways in which retrieval based on image features may be approached.
We call the first model-based, because it assumes some a priori
knowledge (the model) of how images are structured. As we saw in Section
2.1, models can be very useful in classifying anchorperson shots in news broadcasts
or similar corpora of highly stereotyped material. However, the most
important content of a news program is the collection of news stories being
reported.
The second approach requires a more general model of which features should be
examined and how different instances of those features should be compared for
proximity. On the basis of results which have been established to date, the
features which have been seen to be most effective are color, texture,
the shapes of component objects, and relationships among edges which may be
expressed in terms of line sketches. Extensive research has been carried out
to address how those features may be represented; currently the most
comprehensive results may be found in the designs of IBM's QBIC [F+94], the
Photobook project at the MIT Media Laboratory [PPS94], and the SWIM
(Show What I Mean) project at the Institute of Systems Science [G+94].
Color
Color histograms may be defined for key frames just as they are defined for
frame comparison in temporal segmentation, and they may again be compared
with an L1-norm. If Q and I are histograms corresponding, for example, to
a query image and an image in a database, then a suitable L1-norm is the
following:

    \sum_{i=1}^{N} \min(I_i, Q_i)                (9.9)

This value may then be normalized by the total number of pixels in one of the
two histograms (here Q):

    \frac{\sum_{i=1}^{N} \min(I_i, Q_i)}{\sum_{i=1}^{N} Q_i}                (9.10)

This value will always range between 0 and 1. Previous work has shown that
this metric is fairly insensitive to changes in image resolution, histogram size,
occlusion, depth, and view point [SB91].
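The normalized intersection is straightforward to compute (Python; normalizing by the query histogram Q is an assumed choice of "one of the two histograms"):

def histogram_intersection(query, image):
    # Equations 9.9 and 9.10: histogram intersection normalized by the
    # total pixel count of the query histogram; the result lies in [0, 1].
    overlap = sum(min(q, i) for q, i in zip(query, image))
    return overlap / sum(query)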
Texture

Texture has long been recognized as being as important a property of images
as color, if not more so, since textural information can be conveyed
as readily with gray-level images as it can in color. Nevertheless, there is an
extremely wide variety of opinion concerned with just what texture is and how
it may be quantitatively represented [TJ93]. The so-called "Tamura features"
[TMY76] are particularly notable for their quantification of psychological attributes.
These features are usually known by the names coarseness, directionality,
and contrast. Coarseness is a measure of the granularity of the texture. It
is derived from moving averages computed over windows of different sizes; one
of these sizes gives an optimum fit in both the horizontal and vertical directions
and is used to calculate the coarseness metric. Directionality is computed
from the distribution of the directions of local edges.
Shape
Color and texture are useful features in representing both scenes and the objects
within them. However, there are also properties which may only be defined over
individual objects. For purposes of indexing and retrieval, the most important
of these features are geometric properties of objects, including shape, size, and
location. Quantitative representations of these properties may be based on
standard techniques in digital image processing [G+94].
Sketch Features
One of the more intuitive approaches to describing an image in visual terms for
the sake of retrieving it is to provide a sketch of what is to be retrieved. The
richness of features in such a sketch will, of course, depend heavily on the skill
of the user who happens to be doing the sketching, since a well-drawn sketch
can make good use of techniques such as coloring and shading. However, as
sort of a "lowest common denominator ," we may assume that a sketch is a
simple line drawing which captures some basic information about the shapes
and orientations of at least some of the objects in the image.
Under this assumption, the features of an image which would guide any attempt at retrieval would be some set of edges associated with a reasonably abstract representation of the image. One technique for constructing such an edge-based representation has been developed by the Electrotechnical Laboratory at MITI in Japan [K+92]. The measurement of similarity is then applied to an edge-based representation of each image, compared against a pixel representation of a sketch.
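Such matching is typically done block by block, so that a stroke need only fall near the corresponding image edge to count as a match. The following sketch is loosely inspired by that idea (the block size and the scoring rule are assumptions, not the method of [K+92]):

```python
import numpy as np

def block_similarity(sketch_edges, image_edges, block=16):
    """Compare two binary edge maps of equal shape, block by block.

    Scoring each block locally gives tolerance to small displacements
    of the user's strokes relative to the true edge positions.
    """
    h, w = sketch_edges.shape
    scores = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            s = sketch_edges[y:y + block, x:x + block].astype(float)
            t = image_edges[y:y + block, x:x + block].astype(float)
            if s.sum() == 0:        # the user drew nothing in this block
                continue
            # Fraction of sketched edge pixels matched by image edges.
            scores.append((s * t).sum() / s.sum())
    return float(np.mean(scores)) if scores else 0.0
```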
Whatever features are extracted, efficient search requires mapping them into a lower-dimensional feature space X' in which distances never exceed those in the actual search space [F+94]. This mapping will guarantee that a query based on X' space will not miss any actual hit, but may contain some false hits. This reduces the problem to defining a suitable X' space.
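The standard way to exploit such a mapping is a filter-and-refine search, sketched below under the assumption of Euclidean distances and an illustrative projection (a generic sketch of the scheme described in [F+94], not QBIC's actual index structure):

```python
import numpy as np

def filter_and_refine(query, database, projection, radius):
    """Range query using a lower-bounding low-dimensional filter.

    `projection` maps a full feature vector into X'.  Because distances
    in X' never exceed distances in the full space, the filter step can
    never discard a true hit; it only lets false hits through to be
    weeded out by the (more expensive) refine step.
    """
    q_low = projection(query)
    candidates = [v for v in database
                  if np.linalg.norm(projection(v) - q_low) <= radius]
    # Refine: verify each candidate with the true full-space distance.
    return [v for v in candidates if np.linalg.norm(v - query) <= radius]

# Orthogonal projection onto the first k coordinates is one valid
# lower-bounding map for the Euclidean distance.
def project_k(v, k=4):
    return np.asarray(v)[:k]
```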
Drawing a Query
A natural way to specify a visual query is to let the user paint or sketch an
image. A feature-based query can then be formed by extracting the visual
features from the painted image. The sketch image shown in Figure 8 is an
example of such a query: the user draws a sketch with a pen-based device or
photo manipulation tool. The query may also be formed by specifying objects
in target images with approximate location. size, shape, color and texture on a
drawing area using a variety of tools, similar to the template maps, to support
such paintings. This is one of the interface approaches currently being pursued
in QBIC [Sey94]. In most of the cases, coarse specification is sufficient, since a
query can be refined based on retrieval results. This means it is not necessary
for the user to be a highly skilled artist. Also. a coarse specification of object
features can absorb many errors between user specifications and the actual
target images.
Figure 9 (continued): A query image composed from selections from the menu.
Incomplete Queries
Supporting incomplete queries is important for image databases, since user
descriptions often tend to be incomplete. Indeed, asking for a complete query
is often impractical. To accommodate such queries, search should be confined
to a restricted set of features (a single feature value, if necessary). All query
formation environments presented above can provide the option of specifying
which features will be engaged in the search process. For instance, a query can
be formed based on template selection that will retrieve all images containing
a red car, regardless of whether the car is in the center or any other part of
the image and whatever size it may be. Furthermore, the user should have the
option of modifying the feature vector which drives the query (modifying the
shade of red of the car, for example) [F+94].
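One simple way to support such partial specifications (a hedged sketch; the feature layout and the masking scheme are assumptions rather than the mechanism of [F+94]) is to attach to each query a mask of active features and to ignore the rest when computing distances:

```python
import numpy as np

def masked_distance(query, target, active):
    """Euclidean distance restricted to the features the user specified.

    `active` is a boolean array marking which components of the feature
    vector (color, location, size, ...) take part in the comparison.
    """
    active = np.asarray(active, dtype=bool)
    diff = (np.asarray(query) - np.asarray(target))[active]
    return float(np.linalg.norm(diff))

# A "red car, anywhere, any size" query activates only the color
# components.  Hypothetical layout: features 0-2 are color, 3-4 are
# location, and 5 is size.
active = np.array([True, True, True, False, False, False])
```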
5 INTERACTIVE TOOLS
5.1 Micons
An icon is a visual representation of some unit of information. If that infor-
mation is time-related, then representing it as an icon requires being able to
account for the temporal dimension. The video icon (also called a "micon" or
"movie icon") is such a representation of a video source [A +92] . It is constructed
according to a very simple principle: If a single frame of video is represented by
a two-dimensional array of pixels, then the video itself may be represented by
a three-dimensional volume, where the third dimension represents the passage
of time. Thus, a micon may be constructed by stacking successive frames of
a video, one behind the other, enabling the user to see the frame on the front
and the pixels along the upper and side faces (Figure 10).
It is important to observe the extent to which the pixel traces recorded on the
top and side faces of a micon can bear information as useful as frame images.
The video source material for Figure 10 is a brief clip of the ascent of the
rocket which launched the Apollo 11 mission to the moon. As the rocket rises,
it cuts across the top row of pixels in each frame. The sequence of pixels in
this row in successive frames thus contains a "trace" of the horizontal layers
of the rocket which cross in each frame. The top face of the micon thus serves
as a "tomogram" [AT94] in which the body of the rocket is reconstructed in a
spatiotemporal image.
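The faces of a micon are easy to extract once a video is represented as a three-dimensional array (a minimal sketch; the frame shape and axis conventions are assumptions):

```python
import numpy as np

def micon_faces(video):
    """Extract the visible faces of a micon from a video volume.

    `video` has shape (T, H, W): gray-level frames stacked along the
    time axis.  The front face is the first frame; the top and side
    faces are spatiotemporal slices, the "tomograms" in which an object
    crossing a fixed row or column leaves its trace over time.
    """
    front = video[0]         # shape (H, W): the frontmost frame
    top = video[:, 0, :]     # shape (T, W): top row of every frame
    side = video[:, :, -1]   # shape (T, H): rightmost column of every frame
    return front, top, side
```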
If a shot includes camera action, such as panning and zooming, then the spatial coordinates of a micon will no longer correspond to those of the physical space being recorded. However, a Helmert transform [A+92] may be used to construct a micon which is more like "extruded plastic" than a rectangular solid: as the camera zooms in and out, the frame images shrink and expand, respectively; and if the camera pans to the right, then the entire frame is displaced to the right a corresponding distance. The resulting micon is then accurately embedded in its two spatial coordinates.
6 CONCLUSION
Audio
As any film-maker knows, the audio track provides a very rich source of in-
formation to supplement our understanding of any video [BT93]. Thus, any
attempt to work with video content must take the sound-track into account
as well. Logging a video requires decomposing our auditory perceptions into
objects just as we do with our visual perceptions. Unfortunately, we currently
know far less about the nature of "audio objects" than we do about correspond-
ing visual objects [Smo93]. Nevertheless, there are definitely the beginnings of
models of audio events, some of which are similar to models used in image-based
content parsing. For example, in a sports video, very loud shouting followed by
a long whistle might indicate that someone has scored a goal, which should be
recognized by content analysis as an "event." Clearly, however, audio analysis
is a new frontier which needs considerable exploration within the discipline of
multimedia.
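Even crude audio features can flag candidate events of this kind. The following is a speculative sketch (the window length, the threshold, and the goal heuristic itself are assumptions, not a published method) which locates the loud stretches that would precede the whistle:

```python
import numpy as np

def loud_segments(samples, rate, window=0.05, threshold=0.7):
    """Find stretches of high short-time energy in a mono audio signal.

    Returns (start, end) times in seconds where the windowed RMS energy
    exceeds `threshold` times the maximum observed energy.  A long loud
    stretch followed closely by a sustained tone (the whistle) would be
    the cue for a candidate "goal" event.
    """
    n = int(window * rate)
    frames = samples[:len(samples) // n * n].reshape(-1, n).astype(np.float64)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    loud = rms > threshold * rms.max()
    events, start = [], None
    for i, flag in enumerate(loud):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            events.append((start * window, i * window))
            start = None
    if start is not None:
        events.append((start * window, len(loud) * window))
    return events
```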
Parsing Models
In retrospect it is not surprising that the news program material discussed in
Section 2.1 is relatively easy to parse. Its temporal and spatial predictability
make it easier for viewers to pick up those items which are of greatest interest.
However, because more general video program material tends to be concerned
with entertainment, which, in turn, is a matter of seizing and holding viewers'
attention, the success of a program often hinges on the right balance between
the predictable and the unpredictable. Thus, the very elements which often
make a program successful are those which would confound attempts to au-
tomate parsing. On the other hand, those unpredictable elements can only be
detected and appreciated in the context of knowledge of what is predictable or
anticipated. What this means is that a variety of other video programs should
also admit of parsing models. Some of these models may not be as detailed as
those which characterize news programs, but they will still be based on spatial
or temporal features which assist the viewer in appreciating the nature of the
content. Consequently, a major area of research will involve identifying those
tools and techniques which may be employed in modeling different kinds of
video program material.
Acknowledgments
Figures 1, 2, and 11 are reproduced with the permission of Springer-Verlag, the
first two having appeared in Volume 2, Number 6 of the journal Multimedia Sys-
tems and the third having appeared in the book Advances in Digital Libraries.
Figures 3, 4, and 5 are reproduced with the permission of Kluwer Academic
Publishers, having first appeared in Volume 1, Number 1 of the journal Mul-
timedia Tools and Applications. The images in Figures 7 and 12 have been
used with the permission of the Cunningham Dance Foundation, and those in
Figure 8 were provided with the kind permission of the Television Corporation
of Singapore.
REFERENCES
[A +92] A. Akutsu et al. Video indexing using motion vectors. In Visual Com-
munications and Image Processing '92, pages 1522-1530, Boston, MA,
November 1992. SPIE.
[AHC93] F. Arman, A. Hsu, and M.-Y. Chiu. Image processing on compressed
data for large video databases. In Proceedings: ACM Multimedia 93, pages
267-272, Anaheim, CA, August 1993. ACM.
[AT94] A. Akutsu and Y. Tonomura. Video tomography: An efficient method
for camerawork extraction and motion analysis. In Proceedings: ACM
Multimedia 94, pages 349-356, San Francisco, CA, October 1994. ACM.
[G+94] Y. Gong et al. An image database system with content capturing and
fast image indexing abilities. In Proceedings of the International Confer-
ence on Multimedia Computing and Systems, pages 121-130, Boston, MA,
May 1994. IEEE.
[K+91] T. Kato et al. A cognitive approach to visual interaction. In Interna-
tional Conference on Multimedia Information Systems '91, pages 109-120,
Singapore, January 1991. ACM, McGraw Hill.
[K+92] T. Kato et al. A sketch retrieval method for full color image database:
Query by visual example. In Proceedings: 11th International Conference on
Pattern Recognition, pages 530-533, Amsterdam, The Netherlands, September
1992. IAPR, IEEE.
[MCW92] M. Mills, J. Cohen, and Y. Y. Wong. A magnifier tool for video data.
In Proceedings: CHI'92, pages 93-98, Monterey, CA, May 1992. ACM.