1. INTRODUCTION
   Startup point
   Network streaming
2. TECHNOLOGIES
   MPEG
   RTP
      About realtime networking
      Multimedia over Internet and other TCP/IP networks
      Some solutions
      RTP - Realtime Transport Protocol
      Development
      RTP operation
      RTP fixed header fields
      RTP features
An MPEG archival and live streaming system over RTP/UDP networks
   The Approach
   Introduction
7. SOURCES
8. ANNEX
1. Introduction
There has recently been a flood of interest in the delivery of multimedia services over digital networks, in particular the Internet. The growing popularity of Internet telephony and of streaming audio and video services (such as those provided by Real Audio from Real Networks, Windows Media from Microsoft or QuickTime from Apple) is a clear indicator of this trend.
With the advent of the digital era, breakthroughs in video compression and in network protocols and technologies have given the term 'streaming' new significance. In the digital world, streaming denotes the action of transmitting multimedia content, in particular audio and video, over a network. The flow of digital data is streamed through the network following different patterns: multicast (one transmitter, multiple receivers) or unicast (one transmitter, one receiver).
On the other hand, MPEG [1] is the unquestioned standard for digital video compression. It is generally accepted that no previous standard has enjoyed such wide acceptance and quick deployment, and MPEG still has a long way to go. Even though the standard dates back to the beginning of the 90s and multiple applications are already using it (digital satellite broadcasts, Digital Video Disks (DVD), digital video cameras and others), new technologies in other areas keep opening up new uses for MPEG.
In particular, the high speeds at which digital networks operate these days make it possible to transmit MPEG content over those networks, enabling a whole new range of applications such as personal teleconferencing, surveillance and distance learning, among many others.
This project, developed between September 2000 and March 2001 at Innovacom Inc. [2], works in that direction, showing an actual implementation of new applications of MPEG technology, in particular MPEG archival and transmission, as well as RTP/UDP MPEG streaming.
[1] MPEG, referring to the ISO/IEC 11172 and ISO/IEC 13818 standards, also known as MPEG-1 and MPEG-2.
[2] Innovacom Inc., 3400 Garrett Drive, Santa Clara, CA 95054, USA. www.transpeg.com
Startup point
The work carried out in this project was an addition to the MediaWEB™ series of products from Innovacom Inc. The system in place was completely based on MPEG technology, both in its software components and in its hardware parts.
Some of the already implemented features of the system were:
- Live MPEG video encoding. Using state-of-the-art MPEG hardware encoders, the system converts analog video and audio into a compressed MPEG stream, and can tweak its most common parameters, such as bitrate, GOP structure and MPEG format (MPEG-1 or MPEG-2).
- Live MPEG video decoding. Relying on high quality hardware MPEG decoders, the system can output uncompressed video through analog and digital interfaces.
- Network transmitting & receiving. The system had support for two different types of networking: ATM and TCP. The basic functionality was the ability to send and receive MPEG data over ATM and TCP networks.
Some important pieces were missing from the system in place. The first was some kind of archival feature: the ability to store MPEG streams into files and, at the same time, to stream those stored files either to the network or to an MPEG decoder (either hardware or software).
There was also the need for a more adequate system for transmitting MPEG data over TCP/IP networks: some kind of protocol used over the network that would make it possible to track problems in the transmission (packet loss, changes in packet order and so on) and react to them.
After analyzing the existing features, new specific requirements for the system were drawn up; they are outlined next:
MPEG Archival
- The system should be able to store live incoming streams from an unspecified source.
- The saved MPEG streams should be usable in most MPEG editing applications, and should be MPEG compliant streams.
- A high compatibility ratio should be obtained with files stored by the system: any MPEG compliant device should be able to play them flawlessly.
- No assumption should be made about the way the structure of the MPEG stream is set up.
- The system should be able to detect interruptions in the MPEG stream and react accordingly.
- The system should support System Streams, Program Streams, Transport Streams and Elementary Streams [3].
Network streaming
- The new network streaming protocol should support packet loss detection.
- It should support multiplexing and checksum services in some layer of the network structure.
- The system should support System Streams, Program Streams, Transport Streams and Elementary Streams.
[3] Elementary streams include MPEG-1 and MPEG-2 audio and video.
2. Technologies
This project is based on two main technologies: the MPEG standard [4] and RTP, the real-time transport protocol. It is desirable that the reader be familiar with both in order to understand the concepts and features deployed in this work.
MPEG
Although the MPEG standard was developed at the beginning of the 90s, it was not until recently that MPEG technology flourished and matured. All current digital video standards for broadcasting applications are based on MPEG, like DVD (Digital Versatile Disk), DSS, DVB, DAB and an astounding number of other applications.
However, it was interesting to find that a feature as basic as MPEG archival is not widespread, at least in the sense in which it is treated here. MPEG archival from existing MPEG content is a rare feature that we have tried to bring to life from scratch, given the lack of existing references we have found. MPEG file streaming is not a very prominent feature either, so an unbiased approach has been taken.
[4] ISO/IEC 11172 and ISO/IEC 13818.
We overview here some basic facts about MPEG systems. Although most of the explanations are MPEG-2 specific, they can very easily be applied to the case of MPEG-1 if one thinks of MPEG-2 program streams.
MPEG-2 Systems is an ISO/IEC standard (13818-1) that defines the syntax and
semantics of bitstreams in which digital audio and visual data are multiplexed. Such
bitstreams are said to be MPEG-2 Systems compliant. The MPEG specification does not
mandate, however, how equipment that produces, transmits, or decodes such bitstreams
should be designed. As a result, the specification can be used in a diverse array of
environments, including local storage, broadcast (terrestrial and satellite), as well as
interactive environments.
MPEG-2 Systems provides a two-layer multiplexing approach. The first layer is dedicated to ensuring tight synchronization between video and audio. It is a common way of presenting all the different materials which require synchronization (video, audio, and private data). This layer is called the Packetized Elementary Stream (PES). The second layer depends on the intended communication medium. The specification for error-free environments such as local storage is called the MPEG-2 Program Stream, while the specification addressing error-prone environments is called the MPEG-2 Transport Stream.
The differences between the MPEG-2 Program Stream and MPEG-1 Systems are mild. MPEG-2 Systems mandated compatibility with MPEG-1 Systems, and the MPEG-2 Program Stream is designed for that purpose. MPEG-2 Systems also addresses error-prone environments, and provides all the hooks for Conditional Access systems. The major difference lies in the signalling which is present in MPEG-2 Program Streams and was absent in MPEG-1 Systems. A minor difference also exists in the PES format.
MPEG-2 Transport Streams carry transport packets. These packets carry two
types of information: the compressed material and the associated signalling tables. A
transport packet is identified by its PID (Packet Identifier). Each PID is assigned to carry
data belonging either to one particular compressed data source (and only this data source)
or one particular signaling table. The ordered sequence of packets with a given PID may
be considered as one data stream. The compressed data source may be derived from either
video, audio or data elementary streams. These elementary streams may be tightly
synchronized (as it is usually necessary for Digital TV programs, or for Digital Radio
programs), or not synchronized (in the case of programs offering downloading of
software or games, as an example).
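To make the PID mechanism concrete, here is a minimal sketch (in Python, with illustrative helper names) that extracts the 13-bit PID from a transport packet header and splits a packet sequence into one substream per PID:

```python
from collections import defaultdict

TS_PACKET_SIZE = 188  # every transport packet is exactly 188 bytes
SYNC_BYTE = 0x47      # fixed first byte of every transport packet

def ts_pid(packet: bytes) -> int:
    """Extract the 13-bit PID from a transport packet header."""
    if len(packet) != TS_PACKET_SIZE or packet[0] != SYNC_BYTE:
        raise ValueError("not a valid transport packet")
    # The PID occupies the low 5 bits of byte 1 and all 8 bits of byte 2.
    return ((packet[1] & 0x1F) << 8) | packet[2]

def demux(stream: bytes) -> dict:
    """Split a transport stream into an ordered packet list per PID."""
    substreams = defaultdict(list)
    for off in range(0, len(stream), TS_PACKET_SIZE):
        packet = stream[off:off + TS_PACKET_SIZE]
        substreams[ts_pid(packet)].append(packet)
    return dict(substreams)
```

Each resulting per-PID list is then one data stream (compressed material or a signalling table), as described above.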
Synchronization
Synchronization inside MPEG systems is achieved through timestamps. There are two types of time stamps:
- The first type is usually called a reference time stamp. This time stamp is a sample of the clock of the encoder that was used to generate the mux stream. Reference time stamps are to be found in the PES syntax (ESCR), in the program syntax (SCR), and in the transport syntax (PCR).
- The second type of time stamp is called DTS (Decoding Time Stamp) or PTS (Presentation Time Stamp). They indicate the exact moment when a video frame or an audio frame has to be decoded or presented to the user, respectively.
Although timestamps are not mandatory some applications like Digital TV
broadcast, where tight synchronization is required, will make an extensive use of them. In
that case both reference time stamp and DTS/PTS are used. In other cases (game or
software downloading for example) neither reference nor DTS/PTS time stamps are
necessary. DTS and PTS time stamps are not relevant if reference time stamps are not
present.
PTS and DTS are inserted as close as possible to the portions of compressed
video, audio, or data to which they apply. Precisely, this means that they are inserted in
the PES packet headers, a syntax which is common to all data sources.
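PTS and DTS values are samples of a 90 kHz clock, stored as 33-bit counters (the same 1/90000-second granularity appears later in the vbv_delay definition). A small sketch of working with them:

```python
PTS_CLOCK_HZ = 90_000   # PTS/DTS count ticks of a 90 kHz clock
PTS_WRAP = 1 << 33      # the counters are 33 bits wide and wrap around

def pts_to_seconds(ticks: int) -> float:
    """Convert a PTS/DTS value to seconds."""
    return ticks / PTS_CLOCK_HZ

def pts_delta(later: int, earlier: int) -> int:
    """Tick difference between two stamps, tolerating one 33-bit wrap."""
    return (later - earlier) % PTS_WRAP
```

At 25 frames per second, for instance, consecutive pictures are 3600 ticks (40 ms) apart.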
STD Model
A system target decoder (STD) model is a virtual decoder. There are two models, one within the MPEG-2 program syntax (the P-STD), the other within the MPEG-2 transport syntax (the T-STD). A model defines buffer sizes, their input and output rates, and timing constraints related to time stamp values.
The STD model was devised so as not to be implementation dependent. The first model comes from MPEG-1 Systems. Some of the assumptions in the T-STD are not realistic at all: buffers, for instance, are supposed to be emptied instantaneously when decoding occurs.
Next we depict the syntax of MPEG-2, highlighting those elements that are used later in this project and are key to the way our systems work.
[Figure: PES packet syntax — packet start code prefix (24 bits), stream id (8 bits), PES packet length (16 bits), optional PES header, and PES packet data bytes.]
[Figure: an MPEG-2 transport stream is a sequence of 188-byte transport packets, each consisting of a transport packet header followed by its payload.]
[Figure: system/pack layer syntax. A pack consists of the pack start code (32 bits), a marker, the System Clock Reference (SCR), the program mux rate (22 bits) and stuffing, followed by an optional system header and PES packets. The system header carries the header start code (32 bits), header length (16 bits), rate bound (22 bits), audio bound (6 bits), fixed and CSPS flags (1 bit each), audio and video lock flags (1 bit each) and video bound (5 bits), followed by, for each stream, the stream id (8 bits), a '11' marker (2 bits), the P-STD buffer bound scale (1 bit) and the P-STD buffer size bound (13 bits).]
For simplicity, an explanation of the MPEG-1 syntax has been chosen. MPEG-2 inherits MPEG-1's syntax and adds some extensions, but for our purposes familiarity with the MPEG-1 syntax will be enough. A brief explanation of the bitstream structure is presented, and some of the fields relevant to our study are explained.
Overview
A sequence is the top video level of coding. It begins with a sequence header
which defines important parameters needed by the decoder. The sequence header is
followed by one or more groups of pictures. Groups of pictures, as the name suggests,
consist of one or more individual pictures. The sequence may contain additional sequence
headers. A sequence is terminated by a sequence_end_code. The MPEG standard allows
considerable flexibility in specifying application parameters such as bit rate, picture rate,
picture resolution, and picture aspect ratio. All these parameters are specified in the
sequence header.
Sequence
The encoder may set such parameters as the picture size and aspect ratio in the
sequence header, to define the resources that a decoder requires. In addition, user data
may be included.
A coded sequence begins with a sequence header and the header starts with the
sequence start code. Its value is:
hex: 00 00 01 B3
This is a unique string of 32 bits that cannot be emulated anywhere else in the
bitstream, and is byte-aligned, as are all start codes. To achieve byte alignment the
encoder may precede the sequence start code with any number of zero bits. These can
have a secondary function of preventing decoder input buffer underflow. This procedure
is called bit stuffing, and may be done before any start code. The stuffing bits must all be
zero. The decoder discards all such stuffing bits.
The sequence start code, like all video start codes, begins with a string of 23 zeros.
The coding scheme ensures that such a string of consecutive zeros cannot be produced by
any other combination of codes, i.e. it cannot be emulated by other codes in the bitstream.
This string of zeros can only be produced by a start code, or by stuffing bits preceding a
start code.
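Because the 23-zero prefix cannot be emulated, a decoder can locate start codes with a plain byte search. A minimal sketch:

```python
SEQUENCE_CODE, GOP_CODE, PICTURE_CODE = 0xB3, 0xB8, 0x00  # start code values

def find_start_codes(data: bytes):
    """Yield (offset, code) for every byte-aligned 00 00 01 xx start code.

    The offset points at the first byte of the 00 00 01 prefix; the code is
    the byte that follows it.
    """
    i = data.find(b"\x00\x00\x01")
    while i != -1:
        if i + 3 < len(data):
            yield i, data[i + 3]
        i = data.find(b"\x00\x00\x01", i + 4)
```

Stuffing zeros before a start code are skipped naturally, since the search locates the prefix itself rather than counting zeros.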
Vertical Size
This is a 12-bit number representing the height of the picture in pels, i.e. the
vertical resolution. It is an unsigned integer with the most significant bit first. A value of
zero is not allowed (to avoid start code emulation) so the legal range is from 1 to 4095. In
practice values are usually a multiple of 16. At 1.5 Mbps, a popular vertical resolution is
240 to 288 pels. Values of 240 pels are convenient for interfacing to 525-line NTSC
systems, and values of 288 pels are more appropriate for 625-line PAL and SECAM
systems.
If the vertical resolution is not a multiple of 16 lines, the encoder must fill out the
picture at the bottom to the next higher multiple of 16 so that the last few lines can be
coded in a macroblock. The decoder will discard these extra lines before display.
Horizontal Size
This is a 12-bit number representing the width of the picture in pels, i.e. the
horizontal resolution. It is an unsigned integer with the most significant bit first. A value
of zero is not allowed (to avoid start code emulation) so the legal range is from 1 to 4095.
In practice values are usually a multiple of 16. At 1.5 Mbps, a popular horizontal
resolution is 352 pels. The value 352 is derived from half the CCIR 601 horizontal
resolution of 720, rounded down to the nearest multiple of 16 pels. Otherwise the encoder
must fill out the picture on the right to the next higher multiple of 16 so that the last few
pels can be coded in a macroblock. The decoder will discard these extra pels before
display.
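The horizontal and vertical sizes are the first two fields after the sequence start code: 12 bits each, packed into three bytes. A sketch of reading them, together with the round-up-to-macroblock rule described above:

```python
def sequence_size(header: bytes):
    """Return (horizontal_size, vertical_size) from a sequence header."""
    if header[:4] != b"\x00\x00\x01\xb3":
        raise ValueError("missing sequence start code")
    width = (header[4] << 4) | (header[5] >> 4)     # first 12 bits
    height = ((header[5] & 0x0F) << 8) | header[6]  # next 12 bits
    return width, height

def padded_to_macroblocks(size: int) -> int:
    """Round a dimension up to the next multiple of 16 (whole macroblocks)."""
    return (size + 15) // 16 * 16
```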
Picture Rate
The allowed picture rates correspond to commonly available analog or digital sources. One advantage of not allowing greater flexibility in picture rates is that standard techniques may be used to convert to the display rate of the decoder if it does not match the coded rate.
Bit Rate
The bit rate is an 18-bit integer giving the bit rate of the data channel in units of
400 bps. The bit rate is assumed to be constant for the entire sequence. The actual bit
rate is rounded up to the nearest multiple of 400 bps. For example, a bit rate of 830100
bps would be rounded up to 830400 bps giving a coded bit rate of 2076 units.
If all 18 bits are 1 then the bitstream is intended for variable bit rate operation.
The value zero is forbidden.
For constant bit rate operation, the bit rate is used by the decoder in conjunction
with the vbv_delay parameter in the picture header to maintain synchronization of the
decoder with a constant rate data channel. If the stream is multiplexed using Part 1 of this
standard, the time-stamps and system clock reference information defined in Part 1
provide a more appropriate tool for performing this function.
The buffer size is a 10-bit integer giving the minimum required size of the input
buffer in the model decoder in units of 16384 bits (2048 bytes). For example, a buffer
size of 20 would require an input buffer of 20 x 16384 = 327680 bits (= 40960 bytes).
Decoders may provide more memory than this, but if they provide less they will probably
run into buffer overflow problems while the sequence is being decoded.
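The unit conversions for these two fields can be sketched as follows; the 830100 bps example from the text is reproduced:

```python
import math

VBR_FIELD = (1 << 18) - 1  # all ones: variable bit rate operation

def encode_bit_rate(bps: int) -> int:
    """Encode a channel bit rate into the 18-bit field (units of 400 bps),
    rounding up as the standard requires; zero is forbidden."""
    units = math.ceil(bps / 400)
    if units == 0 or units >= VBR_FIELD:
        raise ValueError("bit rate out of range for constant-rate coding")
    return units

def buffer_bits(vbv_buffer_size: int) -> int:
    """Minimum model-decoder input buffer: the 10-bit field counts units
    of 16384 bits."""
    return vbv_buffer_size * 16384
```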
Group of Pictures
Two distinct picture orderings exist, the display order and the bitstream order (as
they appear in the video bitstream). A group of pictures (gop) is a set of pictures which
are contiguous in display order. A group of pictures must contain at least one I picture.
This required picture may be followed by any number of I and P pictures. Any number of
B pictures may be interspersed between each pair of I or P pictures, and may also precede
the first I picture.
A group of pictures, in bitstream order, must start with an I picture and may be
followed by any number of I, P or B pictures in any order.
Another property of a group of pictures is that it must begin, in display order, with
an I or a B picture, and must end with an I or a P picture. The smallest group of pictures
consists of a single I picture, whereas the largest size is unlimited.
The original concept of a group of pictures was a set of pictures that could be
coded and displayed independently of any other group. In the final version of the MPEG
standard this is not always true, and any B pictures preceding (in display order) the first I
picture in a group may require the last picture in the previous group in order to be
decoded. Nevertheless encoders can still construct groups of pictures which are
independent of one another. One way to do this is to omit any B pictures preceding the
first I picture. Another way is to allow such B pictures, but to code them using only
backward motion compensation.
Some legal groups of pictures, in display order:

  I
  I P P
  I B P B P
  B B I B P B P
  B B I B B P B B P B B P
  B I B B B B P B I B B I I
The group of pictures header starts with the Group of Pictures start code. This
code is byte-aligned and is 32 bits long. Its value is:
hex: 00 00 01 B8
It may be preceded by any number of zeros. The encoder may have inserted some
zeros to get byte alignment, and may have inserted additional zeros to prevent buffer
underflow. An editor may have inserted zeros in order to match the vbv_delay parameter
of the first picture in the group.
Time Code
A time code of 25 bits immediately follows the group of pictures start code. This
time code conforms to the SMPTE time code [6].
The time code can be broken down into six fields as shown in the following table:

  field                 number of bits
  drop_frame_flag       1
  time_code_hours       5
  time_code_minutes     6
  marker_bit            1
  time_code_seconds     6
  time_code_pictures    6
The time code refers to the first picture in the group in display order, i.e. the first
picture with a temporal reference of zero.
Closed GOP
A one bit flag follows the time code. It denotes whether the group of pictures is
open or closed. Closed groups can be decoded without using decoded pictures of the
previous group for motion compensation, whereas open groups require such pictures to be
available.
  I B B P B B P B B P B B P
  0 1 2 3 4 5 6 7 8 9 10 11 12
  (a closed group)

  B B I B B P B B P B B P B B P
  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
A less typical example of a closed group is shown in this last example. In it, the B
pictures which precede the first I picture must use backward motion compensation only,
i.e. any motion compensation must be based only on picture number 2 in the group.
If the closed_gop flag is set to 0 then the group is open. The first B pictures in the
group may have been encoded using the last picture in the previous group for motion
compensation.
Broken Link
A one bit flag follows the closed gop flag. It denotes whether the previous group
of pictures can be used to decode the current group. Encoders normally set this flag to 0
indicating that the previous group can be used for decoding. If the sequence has been
edited so that the original group of pictures no longer precedes the current group, then this
flag must be set to 1 by the editor.
If the group is closed, then the flag is less useful. It is suggested that encoders still
set it to zero, and editors set it to 1 so that decoders can detect if the bitstream has been
edited at that point.
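The GOP header fields described in the last three sections (time code, closed_gop, broken_link) occupy 27 bits following the start code. A sketch of unpacking them; the bit packing below follows the MPEG-1 video syntax:

```python
import struct

def parse_gop_header(data: bytes) -> dict:
    """Unpack the fields after the GOP start code 00 00 01 B8: the 25-bit
    SMPTE time code, then the closed_gop and broken_link flags (1 bit each)."""
    if data[:4] != b"\x00\x00\x01\xb8":
        raise ValueError("missing group of pictures start code")
    (v,) = struct.unpack(">I", data[4:8])  # 27 meaningful bits, MSB first
    return {
        "drop_frame":  bool(v >> 31),
        "hours":       v >> 26 & 0x1F,
        "minutes":     v >> 20 & 0x3F,   # the marker bit at position 19 is skipped
        "seconds":     v >> 13 & 0x3F,
        "pictures":    v >> 7 & 0x3F,
        "closed_gop":  bool(v >> 6 & 1),
        "broken_link": bool(v >> 5 & 1),
    }
```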
Picture
The picture layer contains all the coded information for one picture. The header
identifies the temporal reference of the picture, the picture coding type, the delay in the
video buffer verifier (VBV) and, if appropriate, the range of the vectors used.
The syntax
A picture begins with a picture header. The header starts with a picture start code.
This code is byte-aligned and is 32 bits long. Its value is:
hex: 00 00 01 00
Temporal Reference
The Temporal Reference is a ten-bit number which can be used to define the order in which the pictures must be displayed. It is useful since pictures are not transmitted in display order, but in the order in which the decoder needs to decode them. The first picture, in display order, in each group must have a Temporal Reference equal to zero, which is incremented by one for each subsequent picture in the group.
Some example groups of pictures with their Temporal Reference numbers are given below:

  Example (a) in display order:    I B P B P
                                   0 1 2 3 4
  Example (a) in decoding order:   I P B P B
                                   0 2 1 4 3

  Example (b) in display order:    B B I B B P B B P B B P
                                   0 1 2 3 4 5 6 7 8 9 10 11
  Example (b) in decoding order:   I B B P B B P B B P B B
                                   2 0 1 5 3 4 8 6 7 11 9 10

  Example (c) in display order:    B I B B B B P B I B B I I
                                   0 1 2 3 4 5 6 7 8 9 10 11 12
  Example (c) in decoding order:   I B P B B B B I B I B B I
                                   1 0 6 2 3 4 5 8 7 11 9 10 12
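The mapping from display order to decoding order in these examples follows one rule: every I or P picture is transmitted before the B pictures that immediately precede it in display order. A sketch:

```python
def decoding_order(display: str):
    """Reorder a group of pictures given in display order (e.g. "IBPBP")
    into bitstream order, returning (picture_type, temporal_reference) pairs."""
    out, pending_b = [], []
    for ref, kind in enumerate(display):
        if kind == "B":
            pending_b.append(("B", ref))   # B pictures wait for their next anchor
        else:
            out.append((kind, ref))        # the I or P anchor goes first...
            out.extend(pending_b)          # ...followed by the B pictures it anchors
            pending_b.clear()
    return out
```

Applied to example (a), this yields Temporal References 0 2 1 4 3, matching the decoding order given above.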
If there are more than 1023 pictures in a group, the Temporal Reference is reset to zero and then increments anew.
A three bit number follows the temporal reference. This is an index into the following table defining the type of picture:

  picture_coding_type   picture type
  001                   I (intra coded)
  010                   P (predictive coded)
  011                   B (bidirectionally predictive coded)
  100                   D (DC intra coded)
VBV Delay
For constant bit rate operation, vbv_delay defines the current state of the VBV
buffer (VBV is an acronym for Video Buffer Verifier - the model decoder). It specifies
how many bits it should contain when bits for all previous pictures have been removed,
and the model decoder is about to start decoding the current picture.
Its purpose is to allow the decoder to synchronize its clock with the encoding
process, and to allow the decoder to determine when to start decoding a picture after
random access in order not to run into future problems of buffer overflow or underflow.
The buffer fullness is not specified in bits but rather in units of time. The
vbv_delay is a 16-bit number defining the time needed in units of 1/90000 second to fill
the input buffer of the model decoder from an empty state to the current state at the bit
rate specified in the sequence header.
For example, suppose the vbv_delay had a decimal value of 30000; then the time delay would be:

  30000 / 90000 = 1/3 second

If the channel bit rate were 1.2 Mbps, then the contents of the buffer before the picture is decoded would be:

  1200000 bits/s x 1/3 s = 400000 bits
If the decoder determined that its actual buffer fullness differed significantly from
this value, then it would have to adopt some strategy for regaining synchronization.
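The vbv_delay arithmetic above can be captured in one line:

```python
def vbv_fullness_bits(vbv_delay: int, bit_rate_bps: float) -> float:
    """Buffer occupancy implied by vbv_delay: the delay counts 1/90000-second
    units, during which the buffer fills at the channel bit rate."""
    return bit_rate_bps * vbv_delay / 90_000
```

For the example in the text (vbv_delay of 30000 at 1.2 Mbps) this gives 400000 bits.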
RTP
Multimedia networking [5] faces many technical challenges: real-time data over non-realtime networks, high data rates over limited network bandwidth, and unpredictable availability of network bandwidth.
Almost all multimedia applications require real-time traffic, which is very different from non-real-time data traffic. If the network is congested, the only effect on non-real-time traffic is that the transfer takes longer to complete. In contrast, real-time data becomes obsolete if it doesn't arrive in time. As a consequence, real-time applications deliver poor quality during periods of congestion.
On the other hand, bandwidth is not the only problem. For most multimedia applications the receiver has a limited buffer: if data arrives too fast, the buffer overflows and some data is lost, also resulting in poor quality.
Therefore, protocols for realtime applications must be worked out to achieve real multimedia networking.
Because of their shared nature, at first glance datagram networks do not seem suitable for real-time traffic. Packets are routed independently across shared networks, so transit times vary significantly (jitter [6]). A class of real-time applications called playback applications aims to solve the jitter problem by buffering data at the receiver. Adaptive applications adapt to changing delays and
[5] Multimedia networking is understood as building multimedia on networks and distributed systems, so that different users on different machines can share image, sound, video and voice, and communicate with each other through these tools.
[6] Jitter: variations in transit delay.
work well on moderately loaded datagram networks. They can deal with jitter caused by
short-lived bursts, and they can tolerate occasional lost packets during brief periods of
congestion.
However, parts of the Internet are often heavily loaded. The price tag attached to shared bandwidth is congestion, leading to jitter and packet loss. At certain times of the day, some MBone audio multicasts are unintelligible because of more than 30% packet loss. While real-time traffic contributes heavily to congestion because of its large bandwidth requirements, it also suffers more from congestion than non-real-time traffic does.
To cope with congestion, several approaches have been proposed in which the
application adapts to the available bandwidth by switching to a different encoding.
Adaptive encoding mechanisms help to keep up useful service during congestion, but
they are not a general solution. Real-time applications are useless when the available
bandwidth drops below a certain minimum bandwidth or when transit times vary so much
that interactivity is impossible.
Some solutions
The Integrated Services working group in the IETF (Internet Engineering Task Force) developed an enhanced Internet service model that includes best-effort service and real-time service. The Resource Reservation Protocol (RSVP), together with the Realtime Transport Protocol (RTP), the Real-Time Control Protocol (RTCP) and the Real-Time Streaming Protocol (RTSP), provides a working foundation for this architecture: a comprehensive approach to providing applications with the type of service they need, in the quality they choose.
In the framework of our project, where MPEG transmission using some kind of streaming transport protocol was required, RTP stood out as the clear option for streaming MPEG over TCP/IP networks. RTP provides a thin layer between the actual MPEG stream data and the TCP/IP framing, and at the same time provides basic features such as sequencing for detecting data loss, time stamping and payload type identification. Because of the importance of RTP for this project, let's review the protocol in detail.
Because of their unpredictable delay and availability, TCP and plain UDP are not suitable on their own for applications with a realtime character. The realtime transport protocol (RTP) is a thin protocol providing support for applications with real-time properties, including timing reconstruction, loss detection, security and content identification. RTP can be used without RTCP if desired. RTP is transport-independent, so it can be used over CLNP (Connectionless Network Protocol), IPX (Internetwork Packet Exchange) or other protocols; RTP is currently also in experimental use directly over AAL5/ATM.
Development
After some initial experiments, which go back to the early 70's, research in the
field of audio transmission over the Internet increased enormously. Voice experiments
within the DARTnet (ARPA network) in 1991 formed the groundwork for RTP based
MBone transmissions.
RTP was finally approved on Nov 22, 1995 by the IESG as a proposed standard. By that time several non-backward-compatible changes had been made, resulting in RTP version 2, which has been published as RFC 1889.
The latest extensions have been made by an industry alliance around Netscape Inc., which uses RTP as the basis of its Real Time Streaming Protocol (RTSP).
RTP operation
There are two transport layer protocols in the Internet protocol suite, TCP and
UDP. TCP provides a reliable flow between two hosts. It is connection-oriented and thus
can't be used for multicast. UDP provides a connectionless, unreliable datagram service.
To use UDP as a transport protocol for real-time traffic, some functionality has to be
added. The functionality needed by many real-time applications is combined into RTP,
the real-time transport protocol. RTP is standardized in RFC 1889. Applications typically
run RTP on top of UDP as part of the transport layer protocol.
The first twelve octets are present in every RTP packet, while the list of CSRC
(contributing source) identifiers is present only when inserted by a mixer. The fields have
the following meaning:
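As a sketch, the 12-byte fixed header can be modeled like this (field names follow RFC 1889; the struct is an illustrative in-memory view, not the exact wire layout, which packs these fields bit by bit in network byte order):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative view of the RTP fixed header (RFC 1889).
// Real implementations serialize these fields manually in
// network byte order rather than relying on struct layout.
struct RtpHeader {
    uint8_t  version;         // V: 2 bits, always 2 for RTP version 2
    bool     padding;         // P: 1 bit, padding octets at end of payload
    bool     extension;       // X: 1 bit, a header extension follows
    uint8_t  csrc_count;      // CC: 4 bits, number of CSRC identifiers
    bool     marker;          // M: 1 bit, profile-defined (e.g. frame boundary)
    uint8_t  payload_type;    // PT: 7 bits, media type/encoding of the payload
    uint16_t sequence_number; // increments by one per packet; detects loss
    uint32_t timestamp;       // sampling instant of the first payload octet
    uint32_t ssrc;            // synchronization source identifier
    // followed by 0..15 CSRC entries when inserted by a mixer
};

// Pack the first two octets to show the bit layout.
uint8_t first_octet(const RtpHeader& h) {
    return static_cast<uint8_t>((h.version << 6) | (h.padding << 5) |
                                (h.extension << 4) | (h.csrc_count & 0x0F));
}
uint8_t second_octet(const RtpHeader& h) {
    return static_cast<uint8_t>((h.marker << 7) | (h.payload_type & 0x7F));
}
```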
RTP features
RTP provides end-to-end delivery services for data with real-time
characteristics, such as interactive audio and video.
Applications typically run RTP on top of UDP to make use of its
multiplexing and checksum services. But efforts have been made to make
RTP transport-independent so that it could be used on other protocols.
RTP itself does not provide any mechanism to ensure timely delivery or
provide other quality-of-service guarantees, but relies on lower-layer
services to do so. It does not assume that the underlying network is
reliable or that it delivers packets in sequence.
RTP is a protocol framework that is deliberately not complete. A complete
specification of RTP for a particular application requires a profile
specification and/or a payload format specification.
RTP doesn't assume anything about the underlying network, except that it
provides framing. Its original design target was the Internet, but it is
intended to be protocol-independent. For example, test runs of RTP
transmissions over ATM AAL5 and IPv6 are in progress.
The PT (payload type) field of the RTP header identifies in seven bits the
media type and encoding/compression format of the payload. At any given
time an RTP sender is supposed to send only a single payload type,
although the payload type may change during transmission (e.g. in
reaction to bad receiving-rate feedback from the receiver via RTCP
packets).
RTP provides functionality suited for carrying real-time content, e.g. a
timestamp and control mechanisms for synchronizing different streams
with timing properties. Because RTP/RTCP is responsible for controlling
the flow of a single media stream, it will not automatically synchronize the
various streams; this has to happen at the application level.
The basis for flow and congestion control is provided by RTCP sender and
receiver reports. We distinguish transient congestion from persistent congestion. By
analyzing the interarrival jitter field of the sender report, we can measure the
jitter over a certain interval and detect congestion before it becomes persistent and
results in packet loss.
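The interarrival jitter mentioned above is defined in RFC 1889 as a running estimate smoothed with a 1/16 gain; a minimal sketch (all times in RTP timestamp units):

```cpp
#include <cassert>
#include <cstdint>
#include <cmath>

// Interarrival jitter estimator as defined in RFC 1889, section 6.3.1:
// D(i-1,i) = (R_i - R_{i-1}) - (S_i - S_{i-1});  J += (|D| - J) / 16
// where R is the arrival time and S the RTP timestamp of each packet.
struct JitterEstimator {
    double jitter = 0.0;
    bool have_prev = false;
    int64_t prev_arrival = 0, prev_ts = 0;

    void on_packet(int64_t arrival, int64_t rtp_ts) {
        if (have_prev) {
            int64_t d = (arrival - prev_arrival) - (rtp_ts - prev_ts);
            jitter += (std::fabs(static_cast<double>(d)) - jitter) / 16.0;
        }
        prev_arrival = arrival;
        prev_ts = rtp_ts;
        have_prev = true;
    }
};
```

Perfectly paced packets keep the estimate at zero; any deviation between arrival spacing and timestamp spacing pushes it up, which is exactly the early-congestion signal described above.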
3. MPEG Archival
Before proceeding with this section, reading the 'MPEG Systems, an
overview' section is highly recommended.
Introduction
By MPEG archival we mean the act of storing an MPEG stream into a
file. Although it seems a trivial problem when we first approach it, some questions and
problems arise when we consider how to bring this feature to life.
The first thing to establish is our requirements for the saved file:
Probably the most important requirement is that the saved stream be compliant
with the MPEG standard. Compliance here means generating a stream that follows
the syntax defined in the international standards ISO/IEC 11172 and
ISO/IEC 13818, also known as MPEG-1 and MPEG-2.
However, MPEG compliance alone is not enough to meet our requirements. We
also have to make sure that certain constraints are met in the stream. For example, we
could have an MPEG stream where the embedded clock resets to zero from time to time.
Because the embedded clock is usually used by the player to report a global position in
the stream, such behavior would fool the player and the results would be totally
unpredictable. In video streams, for example, it would be desirable to start with a closed
GOP when possible, so that references to unavailable pictures are avoided.
Accordingly, because the system has to work at the different levels of the
MPEG standard, either with multiplexed streams or with elementary streams, we have to
review what steps should be taken to be compliant in all cases.
In order for the following actions to be true we assume that the original stream is
already MPEG compliant.
MPEG audio elementary streams have a simple structure formed by audio frames.
An audio frame is the smallest portion of MPEG audio data that can be decoded.
When saving this kind of stream, one has to make sure that the stream starts and
finishes with an MPEG audio frame.
The audio elementary stream must start with a valid MPEG audio frame.
The audio elementary stream must end with a valid MPEG audio frame.
In order for an MPEG decoder to learn what kind of stream it is going
to decode, a sequence header has to appear at the beginning of the stream. That is what all
existing decoders look for to start the decoding process. The sequence header
carries information as important as video resolution, frame rate, VBV buffer size and so on.
However, the MPEG standard does not require that the sequence header be
transmitted repeatedly; once at the beginning of the stream is enough. Many MPEG-1
streams are like this and only present one sequence header when the stream starts.
The video elementary stream must start with a valid sequence header
It is also desirable that no video glitches appear when the
decoder starts its work. To achieve that, we could make sure that our stream starts with a
closed GOP, that is, one where no references to previous GOPs are needed to decode the
pictures in the current GOP. However, that behavior is optional, so we can't rely
on the appearance of a closed GOP as a good start point.
In the worst case, we should set the broken_link flag in the GOP
header to indicate that the GOP has been extracted from an existing stream and some
references are missing. The decoder can use that information to skip presenting the first
damaged frames.
Last, it is also necessary to end the video stream with the last picture of the last
GOP received. This ensures that we have all the information needed to decode all the
frames. We should also add a sequence_end_code at the end of the stream.
The video elementary stream must end with the last frame of the last
complete GOP received.
A sequence_end_code must be added to the end of the stream
Due to the similarity between the syntax of MPEG-2 and MPEG-1 we can
treat both as a single case.
A program/system stream is the multiplex format that MPEG streams usually use.
That implies that audio and video elementary streams are carried inside this kind of mux
stream and, as a consequence, all the requirements we have seen for video and audio
elementary streams also apply in this case.
Additionally, we have to meet further restrictions that are exclusive to
program and system streams.
The stream must start with a pack header with the correct value in the
program_mux_rate field.
The stream must include a system header that is valid for the MPEG
data that follows, or valid until a new system header is found. A valid
system header has to include the correct fields for the stream we are handling.
The stream must end with a program_stream_end_code
Due to the similarity between Program Streams (MPEG-2) and System Streams
(MPEG-1) we can treat both as a single case.
Transport Streams
A transport stream is another multiplex format, so again audio and video
elementary streams are carried inside it and, as a consequence, all the requirements we
have seen for video and audio elementary streams apply in this case as well.
Additionally, we have to meet further restrictions that are exclusive to
transport streams.
All transport streams are formed by transport packets of 188 bytes each. In terms
of MPEG compliance it is enough to ensure that the stream starts with a transport packet
and ends with one.
The stream must start with a transport packet.
The stream must end with a transport packet.
Special care has to be taken in detecting the transport packet, because the
sync byte of the transport packet, 0x47, is very easily emulated in any part of
the stream.
The Approach
Now that we have seen the kinds of problems that need to be solved, we are in a
position to propose different approaches to building the system.
The first approach one could think of is, of course, saving the incoming raw
bitstream directly to a file. However, this poses serious problems for meeting the
requirements.
Suppose, for example, that we set out to save a video elementary stream. By starting to
save the stream at a random point we are very likely to miss the sequence_start_code,
which is usually what any player looks for to start operating. Also, without further
analysis, it is very unlikely that the saved video will start with a closed GOP, and even
less likely that it will start with a picture header. Most decoders will reject such a
stream as invalid. If the decoder knew beforehand that the incoming stream was a video
elementary stream, there is a chance that decoding would start. However, some
artifacting would probably occur at the beginning of the decoding process, because no
one took care of starting the stream with a closed GOP when available, and the
references are lost.
A different approach must be taken. Although there are many possible solutions to
this problem, we present here the one that we opted for. The problem was attacked
with two different processes.
In the first, or analysis, process the incoming stream is analyzed and all
its important features are labeled and indexed. Important information from the
stream is also retrieved for later use.
In the second, or decision, process all the information from the
analysis phase is taken, decisions are made and actions are performed.
It is important to note that both processes run concurrently, due to the realtime nature
of the project: the data extracted by the analysis process is continuously fed to the
decision process.
[Figure: the incoming MPEG stream feeding the concurrent analysis and decision processes]
Analysis process
During the analysis process, key data from the incoming stream is retrieved. The
stream is fully indexed and labeled so that appropriate decisions can be taken from that
information.
For each type of stream we can have (audio and video elementary streams,
program and transport streams) we have to define what the analysis consists of. In the next
pages we discuss the options taken in this respect and how some of the
problems were solved.
[Flowchart: audio analysis. Scan for the sync word; once a candidate is confirmed, label its position as a valid sync word, otherwise keep scanning.]
The first thing we do is wait for the arrival of the MPEG audio frame sync
word. Because the sync word is composed of only 12 bits (0xFFF), we have no guarantee
that the sync word we just detected is the beginning of a real audio frame.
To figure out whether we have a valid start of an audio frame, we must further
analyze the data that follows the detected sync word.
Let’s take a look at the MPEG audio frame header:
mpeg_audio_header()
{
    syncword             12 bits
    ID                    1 bit
    layer                 2 bits
    protection_bit        1 bit
    bitrate_index         4 bits
    sampling_frequency    2 bits
    padding_bit           1 bit
    private_bit           1 bit
    mode                  2 bits
    mode_extension        2 bits
    copyright             1 bit
    original/home         1 bit
    emphasis              2 bits
}
We can quickly see that the data following the sync word we found should
conform to the pattern above. Not all fields can take all possible values, which
adds even more reliability to our method of detecting emulated sync words.
Besides, the position of consecutive sync words, that is, the audio frame size, can
be calculated from the header information following the sync word (bitrate index,
sampling frequency and padding bit): the bitstream is subdivided into slots. The distance
between the starts of two consecutive sync words is constant and equals N slots, where
the value of N depends on the layer.
If this calculation does not give an integer, the result is truncated and
'padding' is required. In that case the number of slots in a frame varies between N and
N+1. The padding bit is set to '0' if the number of slots equals N, and to '1' otherwise. This
knowledge of the position of consecutive sync words greatly facilitates the task of finding
valid audio frame beginnings.
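The slot arithmetic above can be sketched as follows (constants from ISO/IEC 11172-3; the helper assumes MPEG-1 audio, with a 4-byte slot in Layer I and a 1-byte slot in Layers II and III):

```cpp
#include <cassert>

// Frame length in bytes for MPEG-1 audio (ISO/IEC 11172-3).
//   Layer I:        N = 12  * bitrate / sampling_rate slots of 4 bytes
//   Layers II, III: N = 144 * bitrate / sampling_rate slots of 1 byte
// The division is truncated; the padding bit adds one extra slot.
int frame_length(int layer, int bitrate, int sampling_rate, bool padding) {
    if (layer == 1)
        return (12 * bitrate / sampling_rate + (padding ? 1 : 0)) * 4;
    return 144 * bitrate / sampling_rate + (padding ? 1 : 0);
}
```

For a 128 kbit/s Layer III stream at 44.1 kHz this gives 144 * 128000 / 44100 = 417 bytes, or 418 when the padding bit is set, which is exactly the distance at which the next sync word is expected.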
Once we have finally determined a valid sync word, we label its position so we can
return to it later if we decide to start saving from that point.
[Flowchart: video analysis. Wait for a sequence_header_startcode, then repeatedly wait for the next startcode (gop_header_startcode or picture_header_startcode) and label it.]
The first thing we have to do in the case of an elementary video stream is wait for
a sequence header. Because of the way MPEG video elementary streams are
constructed, if the stream was originally MPEG compliant no startcode emulation can
appear. For this reason, we do not have to double-check that the sequence header we found
is actually a sequence start code.
The process from that point involves labeling all the sequence header, GOP header and
picture header appearances so we can decide later what actions to take.
We also retrieve and store specific information, like the quantization matrices and
resolutions from the sequence header, or the broken_link and closed_gop flags from the
GOP header.
Program/System Streams
[Flowchart: program/system stream analysis. Label the pack header and store its info; if the next startcode is a system header, label it and retrieve its info, otherwise keep waiting.]
The first code that we wait for when dealing with program streams is the
pack_start_code. We also require that a system header follow the pack header.
The system header carries vital information about the stream, like the number of streams
and the sizes of the buffers for the demultiplexing operation, so it is very important to
retrieve it before continuing.
From that point we label and retrieve the information from all the pack headers
that pass by. Especially important is the SCR (System Clock Reference) that can be found
in every pack header. That information will be helpful later to identify disruptions in the
stream and to modify the timestamps inside the PES headers.
When we come across a PES header we also label its position and retrieve the
timestamps included in it, if any.
Also, when we have received a PES header, we must analyze the payload and deal
with that stream. For example, if we have a video PES header, we must take the
payload of the PES packet and analyze it as stated in the video elementary stream part.
This process is equivalent to demultiplexing the program stream and dealing with the
individual streams separately, which indicates that we can build the system in
a modular fashion, with different parsers for the different elementary streams
and a system that controls them.
Transport Streams
[Flowchart: transport stream analysis. Scan for the sync byte; once confirmed, label the position as a valid sync byte and analyze the packet data.]
In the case of transport streams it is very important to identify the correct
beginning of a transport packet.
A transport packet begins with the sync byte 0x47, and it is very likely that this
byte will be emulated somewhere in the stream.
The algorithm chosen to synchronize with a transport packet is the following:
because all transport packets have a fixed length of 188 bytes, a new sync byte should
appear 188 bytes after the detected one. To make the detection more robust, we
repeat this check several times, requiring that the sync byte be found at the expected
position every time. If any of the checks fails, we start over from scratch. This algorithm
proved to be really robust for synchronizing with transport stream packets.
On the other hand, the transport stream case can be seen like the program stream case:
it is a multiplex that basically wraps a PES stream. So, once we have synchronized
with the stream, we must analyze the data inside the PES packets as elementary
streams. Again, this process is equivalent to demultiplexing the transport stream and
dealing with the individual streams separately, which further indicates that we can
build the system in a modular fashion with different parsers for the
different elementary streams.
Decision process
The decision process is carried out concurrently with the analysis process. At this
stage, taking the information from the analysis, the algorithm determines the final
course of action for the given scenario.
Let's recall what kind of information we retrieve from the analysis process:
Labels. They specify the position of all the important features in the
stream. With them we know where sequence headers, GOP headers and
picture headers begin, so we have quick access to that information when
we need it.
Feature info. We retrieved the contents of the sequence headers, GOP
headers and picture headers and can use this information now. For
example, if we want to terminate the stream we can use the
temporal_reference field to determine which picture is last in the
stream and can be used correctly.
All the decisions depend on the type of stream we are dealing with. Let's
review each case in particular to see the decision flow that is taken.
[Flowchart: audio decision process. Wait until enough audio frames have accumulated, then save them.]
For this case we wait until we have an audio frame label. Finding an audio frame label is
equivalent to having found an MPEG audio frame.
Avoiding saving every frame individually helps efficiency, so we wait until we
have accumulated a number of audio frames before writing them to disk.
[Flowchart: video decision process. Wait for a sequence header label; when a complete GOP of pictures is available, modify the GOP headers and save the GOP.]
For video elementary streams the following approach was taken.
At the beginning of the process we wait for a sequence header label to appear.
Remember that the sequence header is a must at the beginning of any VES, due to the
key information it carries.
Once we have the sequence header, we track the GOP labels.
The basic idea is that we have a valid piece of the stream to save
between GOP header marks: the pictures inside the GOP start just after the first GOP header,
and all of them are contained between the two GOP header labels. In that case
we are set to save that portion of the stream to disk.
An interesting problem arises when trying to determine the cutoff point at the last
picture of the stream. Because the picture header does not contain the size of
the picture, we can't know beforehand where the picture ends, and consequently we don't
know the last byte that we will save.
The first option would be to parse the whole picture data to get to the
end of it. This option is unattractive due to the complex structure of the MPEG
video frame data, which would consume more computational resources than
strictly needed.
On the other hand, we can make a safe assumption about the MPEG video
elementary stream: before a sequence header or GOP header startcode we will
have the last byte of the previous picture coded in the stream. Taking a look at the
bitstream syntax:
video_sequence() {
    next_start_code()
    sequence_header()
    do {
        extension_and_user_data( 0 )
        do {
            if (nextbits() == group_start_code) {
                group_of_pictures_header()
                extension_and_user_data( 1 )
            }
            picture_header()
            picture_coding_extension()
            extensions_and_user_data( 2 )
            picture_data()
        } while ( (nextbits() == picture_start_code) ||
                  (nextbits() == group_start_code) )
        if ( nextbits() != sequence_end_code ) {
            sequence_header()
            sequence_extension()
        }
    } while ( nextbits() != sequence_end_code )
    sequence_end_code
}
Analyzing the syntax, we can derive that after the picture data
(picture_data()) we can only have sequence header start codes, picture start codes or
group start codes. Our assumption is therefore correct, and we can use this fact to
determine the end of the last picture coded in the GOP.
If the first GOP we are saving is an open GOP, we have lost all the
references needed to decode its first pictures. In this case, we indicate it by
setting the broken_link flag in the first GOP header that we are going to save.
[Flowchart: mux decision process. When the payload has been validated for saving, modify the SCR and timestamps and save the mux stream portion.]
As we stated, the program stream must start with a pack header / system header pair,
so the decision process waits for those two at the beginning. From that point the mux
process enters a loop looking for PES headers; from the mux point of view, all data inside a
PES can be saved.
We saw during the analysis process how, when dealing with multiplexed
streams, we have to merge the analysis of the individual elementary streams with that
of the mux stream.
For example, suppose we are receiving a program stream with one video elementary
stream inside it. The analysis process marked all the important features at both levels: the
elementary stream and the mux stream. Now we have to take into account whatever
decisions are necessary for the video elementary stream. We can have a portion
of the program stream ready to be saved, yet still be waiting to have a complete group of
pictures inside the payload of the mux before doing the actual saving.
In any case, we have to look at the labels and information from the analysis
of the video elementary stream to be able to make a decision. After joining the
two sets of labels we can determine which parts of the stream are ready to be saved.
[Figure: a program stream as a sequence of PES Header | Payload | PES Header | Payload | PES Header | Payload | PES Header | Payload | PES Header]
In this example, the grey parts are those that can be saved. In the last PES packet,
the payload has been only partially validated for saving; thus we can't save that part of the
payload, or even its PES header. Only those PES packets whose payload has been
fully validated for saving can actually be saved for now.
When the moment arrives to finish the operation, we can be left with
PES packets whose payloads are not fully validated. One possible course of action is to pad
the non-validated areas with zeroes, so no more data is seen there, and then validate the
whole PES.
Implementation
The system was coded in the C++ language, which lets the programmer write
code very modularly, something that was especially important in this project. C++ is
an object-oriented language: code can be written as objects that act
independently and have a life of their own.
Parsers. The core of the system; they perform the analysis and
decision processes for each type of stream. A common interface was created for the
parser objects, so combining them was very straightforward. The basic idea
is that the stream is presented to the parser, and it is the parser that
validates the areas of it to be saved. For example, the program stream parser does
its work on the PES headers and then passes the payload to a video elementary
stream parser. The latter returns the validated payload to the program stream
parser, making the work done inside the elementary payload transparent to
the program stream parser.
The most convoluted case is the transport stream, where the TS
parser calls the PS parser, which in turn does the same with
the ES parser.
Flush File Map. The file map object is a copy of the incoming stream in
system memory. The parsers label the stream features in terms of offsets into the
flush file map, modify the stream in it and validate chunks of it.
It is also useful for buffering reasons: because we don't know when we will be
able to start storing the file, we need a sizable piece of the stream buffered before
we do.
The flush file map is also an abstraction of the file being saved; as time goes
by, all the data in the file map ends up in the saved file. All the offsets are kept,
so all labels remain valid.
Fully validated areas of the map are no longer represented in memory, so they
can no longer be accessed.
Introduction
To understand the concept of MPEG file streaming, consider the following
scenario.
We have an archived MPEG file stored on a computer connected to a
TCP/IP network. We want to multicast, or stream, the file to multiple
recipients, emulating a broadcast station where only one emitter is present and multiple
receivers are watching the same content.
The first problem to solve is how to deliver the MPEG file to the network.
Although a trivial problem at first sight, some questions arise.
It is very important to note here that we are talking about file streaming where no
feedback from the recipients is received. That means our system has to work
independently of the number of receivers.
The system
The system has to emulate the behavior of an MPEG encoder, trying to avoid
overflows or underflows in the buffering system of the decoder. To achieve that, the
sending of the data has to be precisely timed; otherwise glitches will appear in the
video.
[Flowchart: audio streaming. Synchronize with an audio frame, retrieve sampling rate and layer info, send the frame, wait until the frame duration has elapsed, compensate timing, repeat.]
To stream MPEG audio data, the most reasonable approach seems to be
sending audio frames one after the other, keeping the pace at which they have to be
played back.
Two options can be used to time the outgoing pace of the frames: either we
use the bitrate information in the frame, or we exploit the fact that every
frame represents a fixed number of samples.
The second option was chosen. The bitrate information depends on how
the encoder built the frame and can quickly introduce inaccuracies; the number of
samples per frame, on the other hand, is fixed, which makes the second option much
more robust.
For example, in the case of 44.1 kHz Layer III audio, each frame carries 1152
samples, so the frame duration is 1152 / 44100 ≈ 26.12 ms.
Given that the timing granularity (the precision with which the operating system can
time our operations) on the Windows platform is 10 ms, we are going to need a
mechanism to compensate for the errors introduced by the operating system.
Timing Compensation
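A sketch of the compensation idea: instead of chaining relative waits, deadlines are kept on an absolute clock, so an oversleep on one frame automatically shortens the wait for the next and the coarse OS granularity does not accumulate into drift (the exact mechanism used by the project may differ):

```cpp
#include <cassert>

// Pace frames against absolute deadlines rather than by summing
// individual sleeps: any oversleep on one frame shortens the wait
// for the next one, so 10 ms OS granularity never accumulates.
struct FramePacer {
    double next_deadline_ms;
    double frame_duration_ms;

    FramePacer(double start_ms, double frame_ms)
        : next_deadline_ms(start_ms), frame_duration_ms(frame_ms) {}

    // Given the current clock reading, return how long to wait before
    // sending the next frame (0 when already late), then advance the
    // deadline by one frame duration.
    double wait_for_next(double now_ms) {
        double wait = next_deadline_ms - now_ms;
        if (wait < 0.0) wait = 0.0; // running late: send immediately
        next_deadline_ms += frame_duration_ms;
        return wait;
    }
};
```

With the 26.12 ms Layer III frames above, oversleeping to t = 30 ms simply yields a shorter wait of about 22.24 ms for the following frame, keeping the long-run rate exact.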
[Flowchart: video streaming. Synchronize with a video frame, parse the picture data, send it, wait until the frame duration has elapsed, compensate timing, repeat.]
For video elementary streams the mechanism is no different from the audio one: we
grab whole video frames and then send them.
Because we need to send an entire frame, we scan ahead to the beginning of the next
picture and treat the data between picture start codes as the picture data.
The duration of a video frame, i.e. the frame rate, can be learned from the
sequence header of the video sequence. Special care has to be taken with MPEG-2 field
pictures, where the wait time is half the frame period. Because an MPEG-2 sequence can
interleave field and frame pictures, we have to extract this information
from the picture headers to know the exact time we will have to wait.
Again, the frame/field periods are of the order of 10 ms, which turns out to be the
granularity of our operating system. A compensation mechanism is therefore required;
it is exactly the same one we used for audio elementary streams, so we will not repeat it
here.
Program/System Stream
[Flowchart: program/system stream streaming. Synchronize with a pack header and grab the SCR stamp; if SCR < STC send the pack, otherwise wait a lapse and re-check.]
An interesting problem arises in this case. In many sample streams it was seen
how the SCR stamp inside the stream was reset to zero from time to time. Because our
system compares the SCR with a monotonically increasing STC (System Time
Clock), some kind of adaptation must be done before the comparison.
So, just after grabbing the SCR stamp, if we detect that it jumps by more than one
second from the previous value, we assume there has been a discontinuity in the
stream, and from that point on we adjust the SCR so the comparison with the STC
remains meaningful.
The algorithm is then very simple. If the SCR in the stream lags behind the STC,
we just send the pack. If the SCR is still in the future, we wait a little and try again.
Wait lapses of 10 ms proved small enough to make the system
work.
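The decision above can be sketched as follows (names are illustrative; both clocks are assumed to be in 90 kHz ticks, the unit of the SCR base):

```cpp
#include <cassert>
#include <cstdint>

// Outcome of comparing a pack's SCR against the local System Time Clock.
struct SendDecision {
    bool send_now;
    int wait_ms;
};

// Decide whether a pack whose SCR stamp is `scr` may be sent when the
// STC reads `stc` (both in 90 kHz ticks). When the SCR is still in the
// future we wait a 10 ms lapse and re-check, which proved small enough
// in practice.
SendDecision pace_pack(int64_t scr, int64_t stc) {
    if (scr <= stc)
        return {true, 0};   // SCR lags the STC: send immediately
    return {false, 10};     // SCR in the future: wait a bit and re-check
}
```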
Overall, the system's output proved to closely resemble that of a real-time encoder,
which was what we were looking for. In tests with a decoder no flaws were observed, a good
indication that neither overflows nor underflows were occurring in the stream.
Transport stream
[Flowchart: transport stream streaming. Lock onto a PCR stream; synchronize with a transport packet; when a PCR is available, grab the stamp; if PCR < STC send the packet, otherwise wait a lapse and re-check.]
The first difference is that the embedded clock, the PCR (Program Clock Reference),
is not included in every transport packet. Moreover, several programs can travel
inside a transport stream, each with its associated PCR.
The option taken to solve this issue was to lock onto the first PCR stream found in the
transport. From that point on, we track only that PCR stream: we look at every transport
packet and find out whether its adaptation field contains a PCR. Otherwise, the
mechanism is the same as we saw for program streams.
One of the attractive sides of MPEG over RTP is that a well-defined way of
carrying MPEG over it is set out in RFC 2250 (see annex).
Due to time constraints, it was not possible to implement the full specification of
RFC 2250, and some shortcuts were taken to get to a working final product with a limited
set of features.
We now describe the exact steps taken to packetize MPEG over RTP.
In all these cases the streams were inserted into the RTP encapsulation as a
packetized stream of bytes. Although this is correct for program and system streams,
audio and video elementary streams require considerably more work, as reflected in
RFC 2250. Please refer to that document for the details of audio and video encapsulation.
In particular, RTP packets with a maximum payload of 1460 bytes were chosen.
The incoming buffers from the system were sliced into packets of 1460 bytes, and the
remaining bytes were sent as a smaller RTP packet. Although this is not an optimal
solution, some features such as sequence numbering in the RTP packets were used.
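The slicing can be sketched as follows (an illustrative Python sketch with hypothetical names; the real sender builds full RTP headers, which are omitted here):

```python
# Sketch of the byte-stream packetizer (hypothetical names): slice
# each incoming buffer into 1460-byte payloads, sending any
# remainder as a smaller final packet.

MAX_PAYLOAD = 1460  # bytes per RTP packet

def packetize(buffer, seq):
    """Yield (sequence_number, payload) pairs; seq is the next RTP
    sequence number and increases by one per packet sent."""
    for off in range(0, len(buffer), MAX_PAYLOAD):
        yield seq, buffer[off:off + MAX_PAYLOAD]
        seq = (seq + 1) & 0xFFFF  # RTP sequence numbers are 16-bit
```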
The timestamping capability inside the RTP header was not used.
Transport Streams
One of the selling points of the system was supposed to be the Transport Stream
support over RTP, so special care was taken to be more compliant with the RTP standards
here.
The option taken was to packetize exactly seven transport packets in each RTP
packet. Because every transport packet is 188 bytes long, that adds up to a total of 1316
bytes.
In this case too, only sequence numbers were used in the RTP header.
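The grouping can be sketched as follows (an illustrative Python sketch with hypothetical names; incomplete groups simply wait for more packets):

```python
# Sketch of transport-stream packetization (hypothetical names):
# exactly seven 188-byte transport packets per RTP payload, giving
# 7 * 188 = 1316 bytes, an integral number of packets as RFC 2250
# requires.

TS_PACKET_SIZE = 188
TS_PER_RTP = 7

def group_ts_packets(ts_packets):
    """Group 188-byte transport packets into 1316-byte RTP payloads,
    yielding only complete groups of seven."""
    for i in range(0, len(ts_packets) - TS_PER_RTP + 1, TS_PER_RTP):
        yield b"".join(ts_packets[i:i + TS_PER_RTP])
```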
Notably less complex, the work of the receiver merely consisted of taking the
payload of the RTP packets and delivering it to whatever system the receiver was hooked
to: a real-time hardware decoder, our archival system, or a network transceiver.
However, one important feature was implemented to really justify the use of RTP.
Because we encode the sequence numbers at the transmitting side (recall that the
sequence number increases by one with each packet sent), we can detect at the receiving
side whether some packets have been lost or delivered out of order. Because UDP runs
under the RTP layer and gives no such guarantees, this feature is genuinely useful.
A simple window algorithm was chosen to reorder the incoming RTP packets. An
interesting point to note is that the bigger the reordering window, the more latency we
add to the system. For this reason, small reordering windows of 3 or 4 packets were
chosen.
[Figure: receiver reordering loop. Buffer N packets, reorder them, and deliver the oldest packet.]
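The window algorithm can be sketched as follows (an illustrative Python sketch with hypothetical names; for simplicity it ignores 16-bit sequence-number wraparound, which a real receiver must handle):

```python
import heapq

# Sketch of the receive-side reordering window (hypothetical names):
# buffer up to `window` packets keyed by RTP sequence number and
# always deliver the oldest. A larger window reorders more but adds
# more latency, hence the small 3-4 packet windows chosen.

WINDOW = 4

def reorder(packets, window=WINDOW):
    """Reorder (seq, payload) pairs with a small sliding window."""
    heap, out = [], []
    for seq, payload in packets:
        heapq.heappush(heap, (seq, payload))
        if len(heap) > window:
            out.append(heapq.heappop(heap))  # deliver oldest packet
    while heap:                               # drain at end of stream
        out.append(heapq.heappop(heap))
    return out
```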
6. In the works
Although almost all the main goals of the project were achieved, some features
could have been improved, and some work was left open for continuation. Let us discuss
those parts individually, one per section of this paper.
MPEG Archival
Almost all of the goals were achieved, although one important task was left to do:
MPEG restamping. One of the requirements of the project was the ability for the system
to generate multiplexed streams whose embedded clock (SCR or PCR) increases
monotonically with time. This behavior would make an MPEG file player see the file as
continuous in time, and no seeking problems would occur.
To get to that point, one has to modify all the timestamps in the stream (SCR,
PCR, ESCR, PTS and DTS) in a two-step process. First the disruption must be detected;
then a compensation time must be calculated and applied to the stream in real time.
Although some attempts were made in this area, it was not possible to obtain solid
and consistent behavior in terms of performance. This feature therefore remains work in
progress.
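The two-step idea can be sketched as follows (an illustrative Python sketch of the detect-then-compensate scheme; names and the exact compensation policy are hypothetical, since this part remains unfinished in the thesis):

```python
# Sketch of restamping (hypothetical names): detect a disruption,
# compute a compensation offset, and apply it to every later
# timestamp so the embedded clock increases monotonically. The same
# offset would be applied to SCR/PCR, ESCR, PTS and DTS alike.

JUMP_THRESHOLD = 90000  # one second at the 90 kHz MPEG clock

def restamp(stamps):
    """Return stamps adjusted so they never decrease or jump forward
    by more than JUMP_THRESHOLD between consecutive values."""
    offset, prev, out = 0, None, []
    for s in stamps:
        if prev is not None and (s < prev or s - prev > JUMP_THRESHOLD):
            # Step 1: disruption detected.  Step 2: compensate so the
            # stream continues one tick after the previous stamp.
            offset += prev + 1 - s
        out.append(s + offset)
        prev = s
    return out
```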
One interesting feature was brought up after the work started: the need to pause
the file streaming indefinitely and the ability to resume it from the stopping point. The
feature would require some changes in the timing part of the system, and it does not
seem too problematic to implement.
As stated previously in this paper, most of the specifications in the RFC 2250
document were left out of the implementation.
Particularly important is the packetization of audio and video elementary streams
as per the RTP specification. The spec ensures that small decodable units of the MPEG
video stream (namely MPEG video slices) are individually packetized in RTP packets.
The first and most important advantage of doing this is that if an RTP packet is lost, the
rest of the picture can still be decoded without big artifacts appearing in the decoding
process. Again, this feature was left out, so it would be extremely interesting to have
it in.
Some other features, such as using the timestamping inside the RTP headers for
program and system streams, would also have been interesting, although this last feature
is not as important as the audio/video packetization.
7. Sources
ISO/IEC IS 13818, ITU-T Recommendation H.262, "Information technology -
Generic coding of moving pictures and associated audio information".
ISO/IEC IS 11172, "Information technology - Coding of moving pictures and
associated audio for digital storage media at up to about 1,5 Mbit/s".
H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson, "RTP: A Transport Protocol
for Real-Time Applications", RFC 1889, January 1996.
D. Hoffman, G. Fernando, V. Goyal, M. R. Civanlar, "RTP Payload Format for
MPEG1/MPEG2 Video", RFC 2250, January 1998.
8. Annex
RFC 2250
Copyright Notice
Abstract
This memo describes a packetization scheme for MPEG video and audio
streams. The scheme proposed can be used to transport such a video
or audio flow over the transport protocols supported by RTP. Two
approaches are described. The first is designed to support maximum
interoperability with MPEG System environments. The second is
designed to provide maximum compatibility with other RTP-encapsulated
media streams and future conference control work of the IETF.
1. Introduction
Much interest in the MPEG community is in the use of one of the MPEG
System encodings, and hence, in Section 2 we propose encapsulations
of MPEG1 System streams and MPEG2 Transport and Program Streams with
RTP. This profile supports the full semantics of MPEG System and
offers basic interoperability among all four end-system types.
Each RTP packet will contain a timestamp derived from the sender's
90KHz clock reference. This clock is synchronized to the system
stream Program Clock Reference (PCR) or System Clock Reference (SCR)
and represents the target transmission time of the first byte of the
packet payload. The RTP timestamp will not be passed to the MPEG
decoder. This use of the timestamp is somewhat different than
normally is the case in RTP, in that it is not considered to be the
media display or presentation timestamp. The primary purposes of the
RTP timestamp will be to estimate and reduce any network-induced
jitter and to synchronize relative time drift between the transmitter
and receiver.
For MPEG2 Transport Streams the RTP payload will contain an integral
number of MPEG transport packets. To avoid end system
inefficiencies, data from multiple small MTS packets (normally fixed
in size at 188 bytes) are aggregated into a single RTP packet. The
number of transport packets contained is computed by dividing RTP
payload length by the length of an MTS packet (188).
For MPEG2 Program streams and MPEG1 system streams there are no
packetization restrictions; these streams are treated as a packetized
stream of bytes.
MPEG1 Audio can be distinguished from MPEG2 Audio from the MPEG
ancillary_data() header. For either MPEG1 or MPEG2 Audio, distinct
Presentation Time Stamps may be present for frames which correspond
to either 384 samples for Layer-I, or 1152 samples for Layer-II or
Layer-III. The actual number of bytes required to represent this
number of samples will vary depending on the encoder parameters.
This header shall be attached to each RTP packet after the RTP fixed
header.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   MBZ   |T|        TR         | |N|S|B|E|  P  | | BFC | | FFC |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                 AN              FBV     FFV
FBV: full_pel_backward_vector
BFC: backward_f_code
FFV: full_pel_forward_vector
FFC: forward_f_code
Obtained from the most recent picture header, and are
constant for each RTP packet of a given picture. For I frames
none of these values are present in the picture header and
they must be set to zero.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|X|E|f_[0,0]|f_[0,1]|f_[1,0]|f_[1,1]| DC| PS|T|P|C|Q|V|A|R|H|G|D|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The extension start code (32 bits) and the extension start
code ID (4 bits) are included. Therefore the extensions are
self identifying.
T: top_field_first (1 bit)
P: frame_pred_frame_dct (1 bit)
C: concealment_motion_vectors (1 bit)
Q: q_scale_type (1 bit)
V: intra_vlc_format (1 bit)
A: alternate_scan (1 bit)
R: repeat_first_field (1 bit)
H: chroma_420_type (1 bit)
G: progressive_frame (1 bit)
D: composite_display_flag (1 bit). If set to 1, the next 32 bits
following this one contain 12 zeros followed by 20 bits
of composite display information.
These values are copied from the most recent picture coding
extension and are constant for each RTP packet of a given
picture. Their meanings are as explained in the MPEG-2 standard.
This header shall be attached to each RTP packet at the start of the
payload and after any RTP headers for an MPEG1/2 Audio payload type.
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| MBZ | Frag_offset |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Frag_offset: Byte offset into the audio frame for the data
in this packet.
4. Security Considerations
[Garbled example reproduced from RFC 2250: pictures in stream order 2I 0B 1B 5P 3B 4B 8P 6B 7B followed by a GOP header and 2I 0B 1B ..., showing how the receiver's expected temporal references (2 3 4 5 6 7 8 9 10 11) match until the GOP header is dropped, after which ref_pic_temp mismatches the stream.]
3. expected loss rates are low enough that missed frames are not a
concern, or
If T=1 and E=0, there may be extensions present in the original video
bitstream that are not included in the current packet. The
transmitter may choose not to include extensions in a packet when
they are not necessary for decoding or if one of the cases listed
above for not including the MPEG-2 video specific header extension in
a packet applies only to the extension data.
If N=0, then the Picture Header from a previous picture of the same
type (I,P or B) may be used so long as at least one packet has been
received for every intervening picture of the same type and that the
N bit was 0 for each of those pictures. This may involve:
Any time an RTP packet is lost (as indicated by a gap in the RTP
sequence number), the receiver may discard all packets until the
Beginning-of-slice bit is set. At this point, sufficient state
information is contained in the stream to allow processing by an MPEG
decoder starting at the next slice boundary (possibly after
reconstruction of the GOP_header and/or Picture_Header as described
above).
References
[4] Schulzrinne, H., "RTP Profile for Audio and Video Conferences
with Minimal Control", RFC 1890, January 1996.
Authors' Addresses
Gerard Fernando
Sun Microsystems, Inc.
Mail-stop UMPK14-305
2550 Garcia Avenue
Mountain View, California 94043-1100
USA
Phone: +1 415-786-6373
EMail: gerard.fernando@eng.sun.com
Vivek Goyal
Precept Software, Inc.
1072 Arastradero Rd,
Palo Alto, CA 94304
USA
Phone: +1 415-845-5200
EMail: goyal@precept.com
Don Hoffman
Sun Microsystems, Inc.
Mail-stop UMPK14-305
2550 Garcia Avenue
Mountain View, California 94043-1100
USA
Phone: +1 503-297-1580
EMail: don.hoffman@eng.sun.com
M. Reha Civanlar
AT&T Labs - Research
100 Schulz Drive, 3-213
Red Bank, NJ 07701-7033
USA
Phone: +1 732-345-3305
EMail: civanlar@research.att.com
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.