
Multiview video compression and display

To perceive three dimensions, a person's eyes must see different, slightly offset images. In the real world, the spacing between the eyes makes that happen naturally; a display somehow has to present a different, separate view to each eye. Recent technological advances have made possible a number of new applications in the area of 3D video. One of the enabling technologies for many of these 3D applications is multiview video coding. This project examines signal processing issues related to the coded representation, reconstruction, and rendering of multiview video for 3D display using a Panda board. This technology sheds the clunky 3D eyeglasses that were once needed to view a 3D image. An experimental analysis of multiview video compression for various temporal and inter-view prediction structures is presented in this project. The compression method is based on the multiple reference picture technique in the H.264/AVC video coding standard. The idea is to exploit the statistical dependencies from both temporal and inter-view reference pictures for motion-compensated prediction.




Chapter 1: Introduction
1.1 Video compression
1.2 History of video compression standards
1.3 Literature survey
1.4 Motivation
1.5 Objective

Chapter 2: Overview of MVC
2.1 Rendering
2.2 Requirements of MVC

Chapter 3: Ubuntu
3.1 Introduction
3.2 Features
3.3 System requirements
3.4 Variants
3.5 Terminal in Ubuntu



Chapter 1

1.1 Video Compression
Video compression refers to reducing the quantity of data used to represent digital video images, and is a combination of spatial image compression and temporal motion compensation. Most video compression is lossy: it operates on the premise that much of the data present before compression is not necessary for achieving good perceptual quality. Video compression is a tradeoff between disk space, video quality, and the cost of the hardware required to decompress the video in a reasonable time. However, if the video is over-compressed in a lossy manner, visible (and sometimes distracting) artifacts can appear.

Video data contains spatial and temporal redundancy. Similarities can thus be encoded by merely registering differences within a frame (spatial) and/or between frames (temporal). Spatial encoding takes advantage of the fact that the human eye cannot distinguish small differences in color as easily as it can perceive changes in brightness, so very similar areas of color can be "averaged out." With temporal compression, only the changes from one frame to the next are encoded, since a large number of pixels are often the same across a series of frames.

There are two types of video compression:
1. Lossless: Lossless compression preserves all the data, but makes it more compact. The movie that comes out is exactly the same quality as what went in. Lossless compression produces very high quality digital audio or video, but requires a lot of data. Its drawback is that it is inefficient when trying to maximize storage space or network and Internet delivery capacity (bandwidth).
2. Lossy: Lossy compression eliminates some of the data. Most images and sounds have more detail than the eye and ear can discern; by eliminating some of these details, lossy compression can achieve smaller files than lossless compression. However, as the files get smaller, the reduction in quality can become noticeable. The smaller file sizes make lossy compression ideal for placing video on a CD-ROM or delivering video over a network or the Internet. Most codecs in use today are lossy codecs.
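To make the temporal-redundancy idea concrete, here is a small sketch (the frames and values are invented for illustration) that stores a frame as the difference from its predecessor:

```python
import numpy as np

# Two consecutive 8x8 "frames": only a small 2x2 region changes between them.
frame1 = np.zeros((8, 8), dtype=np.int16)
frame2 = frame1.copy()
frame2[2:4, 2:4] = 50  # an object appears in four pixels

# Temporal compression encodes only the frame-to-frame difference.
residual = frame2 - frame1

# Most residual samples are zero, so the difference is far more compressible
# than transmitting the whole second frame.
print(np.count_nonzero(residual), "of", residual.size, "samples changed")
```

An entropy coder then spends almost no bits on the long runs of zeros in the residual.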

1.2 History of video compression standards

| Year | Standard            | Publisher        | Popular implementations |
|------|---------------------|------------------|-------------------------|
| 1984 | H.120               | ITU-T            |                         |
| 1990 | H.261               | ITU-T            | Videoconferencing, telephony |
| 1993 | MPEG-1 Part 2       | ISO, IEC         | Video-CD                |
| 1995 | H.262/MPEG-2 Part 2 | ISO, IEC, ITU-T  | DVD Video, Blu-ray, Digital Video Broadcasting, SVCD |
| 1996 | H.263               | ITU-T            | Videoconferencing, video telephony, video on mobile phones (3GP) |
| 1999 | MPEG-4 Part 2       | ISO, IEC         | Video on Internet (DivX, Xvid, ...) |
| 2003 | H.264/MPEG-4 AVC    | ISO, IEC, ITU-T  | Blu-ray, Digital Video Broadcasting, iPod Video, HD DVD |
| 2008 | VC-2 (Dirac)        | ISO, BBC         | Video on Internet, HDTV broadcast, UHDTV |




1.3 Literature survey: 3D video formats

3D depth perception of an observed visual scene can be provided by 3D display systems that ensure the user sees a specific, different view with each eye. Such a stereo pair of views must correspond to the human eye positions; the brain can then compute the 3D depth perception. The history of 3D displays dates back almost as far as classical 2D cinematography. In the past, users had to wear special glasses (anaglyph, polarization, or shutter) to ensure separation of the left and right views, which were displayed simultaneously. Together with limited visual quality, this is regarded as the main obstacle to the wide success of 3D video systems in home user environments.

1.3.1 Simulcast
The most obvious and straightforward means to represent stereo or multiview video is simulcast, where each view is encoded independently of the others. This solution has low complexity since dependencies between views are not exploited, thereby keeping computation and processing delay to a minimum. It is also a backward compatible solution, since one of the views can be decoded for legacy 2D displays. With simulcast, each view is assumed to be encoded at full spatial resolution. However, studies of asymmetrical coding of stereo, whereby one of the views is encoded with less quality, suggest that substantial savings in bit rate for the second view can be achieved. In this approach, one of the views is more coarsely quantized than the other, or coded at a reduced spatial resolution, with an imperceptible impact on the stereo quality.

1.3.2 Stereo Interleaving


There is a class of formats for stereo content that we collectively refer to as stereo interleaving. This category includes both time-multiplexed and spatially multiplexed formats. In the time-multiplexed format, the left and right views are interleaved as alternating frames or fields. With spatial multiplexing, the left and right views appear in either a side-by-side or over/under format. As is often the case with spatial multiplexing, the respective views are squeezed in the horizontal or vertical dimension to fit within the size of an original frame. To distinguish the left and right views, some additional out-of-band signaling is necessary. For instance, the H.264/AVC standard specifies a Stereo SEI message that identifies the left view and right view; it can also indicate whether the encoding of a particular view is self-contained, i.e., whether frames or fields of the left view are predicted only from other frames or fields of the left view. Inter-view prediction for stereo is possible when the self-contained flag is disabled. A similar type of signaling would be needed for spatially multiplexed content.

1.3.3 2D + Depth
Another well-known representation format is the 2D plus depth format. The inclusion of depth enables a display-independent solution for 3D that supports generation of an increased number of views, as needed by any stereoscopic display. A key advantage is that the main 2D video provides backward compatibility with legacy devices. Also, this representation is agnostic of the coding format, i.e., the approach works with both MPEG-2 and H.264/AVC. ISO/IEC 23002-3 (also referred to as MPEG-C Part 3) specifies the representation of auxiliary video and supplemental information. In particular, it enables signaling of depth map streams to support 3D video applications.

1.4 Motivation
The literature survey reveals that existing solutions for rendering a 3D view have a number of limitations:
1. Coding efficiency is not maximized, since redundancy between views is not exploited.
2. The above techniques are not backward compatible.
3. The 2D+depth technique is capable of rendering only a limited depth range and has problems with occlusions.
Thus, in order to exploit inter-image similarities and overcome the above limitations, an efficient algorithm must be developed.

1.5 Objective
The proposed technique in this project attempts to 1. Improve coding efficiency of multiview video. 2. Provide better results compared to simple AVC-based simulcast for the same bit rate. 3. Provide backward compatibility.

We have used the Linux platform for the implementation, since the Panda board runs on the Linux kernel. Chapter 3 provides a brief introduction to Linux and its usage.



Chapter 2

Overview of MVC
2.1 Rendering
Multiview video rendering belongs to the broad research field of image-based rendering and has been studied extensively in the literature. Here, we focus on one particular form of multiview video: multi-stereoscopic video with depth maps. We assume we are given a number of video sequences captured from different viewpoints.

Figure 2.1: The rendering process from multiview video.

In the following, we briefly describe the process of rendering an image from a virtual viewpoint given the image set and the depth map. As shown in Fig. 2.1, given a virtual viewpoint, we first split the view to be rendered into light rays. For each light ray, we trace it to the surface of the depth map, obtain the intersection, and re-project the intersection into the nearby cameras. The intensity of the light ray is then the weighted average of the projected light rays in Cam 3 and Cam 4. The weight can be determined by many factors. In the simplest form, we can use the angular difference between the light ray to be rendered and the light ray being projected, assuming the capturing cameras are at roughly the same distance from the scene objects. Care must be taken when performing such rendering: as the virtual viewpoint moves away from Cam 4 (where the depth map is given), there will be occlusions and holes when computing the light-ray/geometry intersection. In our algorithm, we first convert the given depth map into a 3D mesh surface, where each vertex corresponds to one pixel in the depth map. The mesh surface is then projected to the capturing cameras to compute any potential occlusions in the captured images. Finally, the mesh is projected to the virtual rendering point with multi-texture blending. Each vertex being rendered is projected to the nearby captured images to locate the corresponding texture coordinate. This process takes into consideration the occlusions computed earlier; that is, if a vertex is occluded in a nearby view, its weight for that camera is set to zero. With that information, multiple virtual renderings within the estimated range can be conducted to compute a combined weight map for compression. In addition, if the user's viewpoint does not change significantly, we may achieve a similar effect by simply smoothing the computed weight maps. During adaptive multiview video compression, the weight map is converted into a coarser one for macroblock-based encoding, which effectively smoothes the weight map as well.
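As an illustration of the angular-difference weighting described above, here is a small sketch; the function name and the inverse-angle weighting are our own illustrative choices, not the project's actual code:

```python
import numpy as np

def blend_weights(ray_dir, cam_ray_dirs):
    """Weight each camera's projected ray by its angular closeness to the
    ray being rendered: the smaller the angle, the larger the weight."""
    ray_dir = ray_dir / np.linalg.norm(ray_dir)
    weights = []
    for d in cam_ray_dirs:
        d = d / np.linalg.norm(d)
        angle = np.arccos(np.clip(np.dot(ray_dir, d), -1.0, 1.0))
        weights.append(1.0 / (angle + 1e-6))  # small epsilon avoids div by zero
    w = np.asarray(weights)
    return w / w.sum()  # normalize so the blend is a weighted average

# A virtual ray lying closer to the first camera's ray gets the larger weight.
w = blend_weights(np.array([0.1, 0.0, 1.0]),
                  [np.array([0.0, 0.0, 1.0]), np.array([0.5, 0.0, 1.0])])
```

The rendered intensity would then be the dot product of these weights with the intensities of the projected rays.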

2.2 MVC Requirements:

2.2.1 Compression Related Requirements
1. Compression efficiency: MVC shall provide high compression efficiency relative to independent coding of each view of the same content. Some overhead, such as camera parameters, may be necessary for facilitating view interpolation, i.e., trading coding efficiency for functionality. However, the overhead data should be limited in order to increase acceptance of new services.
2. View scalability: MVC shall support a scalable bit stream structure to allow access to selected views with minimum decoding effort. This enables the video to be displayed on a multitude of different terminals and over networks with varying conditions.
3. Free viewpoint scalability: MVC shall support a scalable bit stream structure to allow access to partial data from which new views can be generated, i.e., not the original camera views but views generated from them. Such content can be delivered to various types of displays. This enables free viewpoint navigation on a scalable basis.
4. Spatial/temporal/SNR scalability: SNR scalability, spatial scalability, and temporal scalability should be supported.
5. Backward compatibility: At any instant in time, the bitstream corresponding to one view shall conform to AVC.
6. Resource consumption: MVC should be efficient in terms of resource consumption, such as memory size, memory bandwidth, and processing power.
7. Low delay: MVC shall support low encoding and decoding delay modes. Low delay is very important for real-time applications such as streaming and broadcasting using multiview video.



8. Robustness: Robustness to errors, also known as error resilience, should be supported. This enables the delivery of multiview video content on error-prone networks, such as wireless networks.

9. Resolution, bit depth, chroma sampling format: MVC shall support spatial resolutions from QCIF to HD. MVC shall support the YUV 4:2:0 format. MVC shall support 8 bits per pixel component. Future applications may require higher bit depths and higher chroma sampling formats.
10. Picture quality among views: MVC should enable flexible quality allocation over different views. For instance, consistent quality might be required for some applications.
11. Temporal random access: MVC shall support random access in the time dimension. For example, it shall be possible to access a frame at a given time with minimal decoding of frames in the time dimension.
12. View random access: MVC shall support random access in the view dimension. For example, it shall be possible to access a frame in a given view with minimal decoding of frames in the view dimension.
13. Spatial random access: MVC should support random access to a spatial area in a picture. This may be treated as view random access if a view is composed of several spatially smaller views.



14. Resource management: MVC shall support efficient management of decoder resources. For instance, the output timing of multiple pictures requires efficient management. In particular, pictures whose time stamps are the same across all views shall be available from the decoder at the same time or sequentially.

15. Parallel processing: MVC shall support parallel processing of different views or segments of the multiview video to facilitate efficient encoder and decoder implementations.

2.2.2 System Support Related Requirements

1. Synchronization: MVC shall support accurate temporal synchronization among the multiple views.
2. View generation: MVC should enable robust and efficient generation of virtual or interpolated views.
3. Non-planar imaging and display systems: MVC should support efficient representation and coding methods for 3D displays, including integral photography and non-planar image (e.g., dome) display systems.
4. Camera parameters: MVC should support transmission of camera parameters.




Block diagram of MVC system

The overall structure of MVC, defining the interfaces, is illustrated above. The MVC encoder receives temporally synchronized video streams and generates one video stream. The decoder receives the bit stream, decodes it, and provides a separate view to each eye. The raw YUV 4:2:0 frames are provided as input; they are encoded and compressed using the various algorithms of MVC. The output of the encoder is given as input to the decoder, where the frames are decompressed and decoded using a PANDA board, which runs on the Linux platform. The PANDA board is interfaced with a suitable 3D device, e.g., a 3D TV or 3D mobile phone.


Chapter 3

Ubuntu

3.1 Introduction
Ubuntu is a computer operating system based on the Debian Linux distribution and distributed as free and open source software. It is named after the Southern African philosophy of ubuntu ("humanity towards others"). Ubuntu packages are based on packages from Debian's unstable branch: both distributions use Debian's deb package format and package management tools (APT and Synaptic). Debian and Ubuntu packages are not necessarily binary compatible with each other, however, and sometimes .deb packages may need to be rebuilt from source to be used in Ubuntu. Many Ubuntu developers are also maintainers of key packages within Debian. Ubuntu cooperates with Debian by pushing changes back to Debian, although there has been criticism that this does not happen often enough. In the past, Ian Murdock, the founder of Debian, expressed concern about Ubuntu packages potentially diverging too far from Debian to remain compatible.




3.2 Features
Ubuntu is composed of many software packages, the vast majority of which are distributed under a free software license. The only exceptions are some proprietary hardware drivers. The main license used is the GNU General Public License (GNU GPL), which, along with the GNU Lesser General Public License (GNU LGPL), explicitly declares that users are free to run, copy, distribute, study, change, develop, and improve the software. There is also proprietary software available that can run on Ubuntu.

Ubuntu focuses on usability, security, and stability. The Ubiquity installer allows Ubuntu to be installed to the hard disk from within the Live CD environment, without the need to restart the computer prior to installation. Ubuntu also emphasizes accessibility and internationalization to reach as many people as possible. Beginning with 5.04, UTF-8 became the default character encoding, which allows for support of a variety of non-Roman scripts. As a security feature, the sudo tool is used to assign temporary privileges for performing administrative tasks, allowing the root account to remain locked and preventing inexperienced users from inadvertently making catastrophic system changes or opening security holes. PolicyKit is also being widely implemented into the desktop to further harden the system through the principle of least privilege.

Ubuntu comes installed with a wide range of software that includes OpenOffice, Firefox, Empathy (Pidgin in versions before 9.10), Transmission, GIMP (in versions prior to 10.04), and several lightweight games (such as Sudoku and chess). Additional software that is not installed by default can be downloaded and installed using the Ubuntu Software Center or the package manager Synaptic, which come pre-installed. Ubuntu allows networking ports to be closed using its firewall, with customized port selection available.
End users can install Gufw (a GUI for Uncomplicated Firewall) and keep it enabled. GNOME (the current default desktop) offers support for more than 46 languages. Ubuntu can also run many programs designed for Microsoft Windows (such as Microsoft Office) through Wine or using a virtual machine (such as VMware Workstation or VirtualBox). For the upcoming 11.04 release, Canonical intends to drop the GNOME Shell as the default desktop environment in favor of Unity, a graphical interface it first developed for the notebook edition of Ubuntu. Ubuntu, unlike Debian, compiles its packages using gcc features such as PIE and buffer overflow protection to harden its software. These extra features greatly increase security, at a performance expense of 1% on 32-bit and 0.01% on 64-bit systems.


3.3 System requirements
The desktop version of Ubuntu currently supports the x86 32-bit and 64-bit architectures. Unofficial support is available for the PowerPC, IA-64 (Itanium), and PlayStation 3 architectures. A supported GPU is required to enable desktop visual effects.

3.4 Variants
The variants recognized by Canonical as contributing significantly towards the Ubuntu project are the following:
- Edubuntu: a GNOME-based subproject and add-on for Ubuntu, designed for school environments and home users.
- Kubuntu: a desktop distribution using the KDE Plasma Workspaces desktop environment rather than GNOME.
- Mythbuntu: designed for creating a home theater PC with MythTV; uses the Xfce desktop environment.
- Ubuntu Studio: a distribution made for professional video and audio editing; comes with higher-end free editing software and is a DVD .iso image, unlike the Live CD the other Ubuntu distributions use.
- Xubuntu: a distribution based on the Xfce desktop environment instead of GNOME, designed to run more efficiently on low-specification computers.


3.5 Terminal in Ubuntu
All commands in Linux are typed in the terminal. To open the terminal, go to Applications in the toolbar, select Accessories, then click on Terminal; a window appears, as shown in Figure 3.1.

Figure 3.1: Opening a terminal in Linux.

Table 3.1: Linux commands and their descriptions

| Command        | Description                                   |
|----------------|-----------------------------------------------|
| cd filename    | Opens the specified directory                 |
| cd Desktop     | Opens a folder on the Desktop                 |
| ls             | Lists the files inside the folder             |
| make clean     | Deletes the previously generated object files |
| make           | Builds executable and object files            |
| ./configure    | Builds the configuration file                 |
| ./filename.exe | Runs the executable file on Linux             |
| exit           | Closes the terminal                           |
| gtkterm        | Opens GTKTerm for serial communication        |



Exploiting similarities among the multiview video images is the key to efficient compression. When considering temporally successive images of one view sequence, i.e., one row of the matrix of pictures (MOP), the same viewpoint is captured at different time instances. Usually, the same objects appear in successive images, but possibly at different pixel locations. If so, objects are in motion, and practical compression schemes use motion compensation techniques to exploit these temporal similarities. On the other hand, spatially neighboring views captured at the same time instant, i.e., images in one column of the MOP, show the same objects from different viewpoints. Similar to the previous case, the same objects appear in neighboring views but at different pixel locations. Here, the objects in each image are subject to parallax, and practical compression schemes use disparity compensation techniques to exploit these inter-view similarities.

4.1.1. Temporal Similarities



Consider temporally successive images of one view sequence, i.e., one row of the MOP. If objects in the scene are subject to motion, the same objects appear in successive images but at different pixel locations. To exploit these temporal similarities, sophisticated motion compensation techniques have been developed in the past. Frequently used are so-called block matching techniques, where a motion vector establishes a correspondence between two similar blocks of pixels chosen from two successive images. Practical compression schemes signal these motion vectors to the decoder as part of the bit stream. Variable block size techniques improve the adaptation of the block motion field to the actual shape of the object. Lately, so-called multi-frame techniques have been developed. Classic block matching techniques use a single preceding image when choosing a reference for the correspondence. Multi-frame techniques, on the other hand, permit choosing the reference from several previously transmitted images; a different image can be selected for each block. Finally, superposition techniques are also widely used. Here, more than one correspondence per block of pixels is specified and signaled as part of the bit stream. A linear combination of the blocks resulting from the multiple correspondences is used to better match the temporal similarities. A special example is the so-called bidirectionally predicted picture, where blocks resulting from two correspondences are combined: one correspondence uses a temporally preceding reference, the other a temporally succeeding reference. The generalized version is the so-called bi-predictive picture, in which the two correspondences are chosen from an arbitrary set of available reference images.
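The superposition idea can be illustrated with a toy example (pixel values invented): a bidirectionally predicted block is simply the equally weighted combination of its two motion-compensated references.

```python
import numpy as np

# Motion-compensated blocks from a preceding and a succeeding reference picture.
block_past   = np.array([[10.0, 12.0], [14.0, 16.0]])
block_future = np.array([[14.0, 16.0], [18.0, 20.0]])

# A bipredictive block is a linear combination of the two correspondences;
# equal weights give the classic bidirectional prediction.
bi_pred = 0.5 * block_past + 0.5 * block_future
print(bi_pred)  # [[12. 14.] [16. 18.]]
```

In general, the weights need not be equal; H.264/AVC, for example, also allows weighted prediction.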

4.1.2. Inter-View Similarities

Consider spatially neighboring views captured at the same time instant, i.e., images in one column of the MOP. Objects in each image are subject to parallax and appear at different pixel locations. To exploit these inter-view similarities, disparity compensation techniques are used. The simplest approach to disparity compensation is block matching, similar to that used for motion compensation. These techniques offer the advantage of not requiring knowledge


of the geometry of the underlying 3D objects. However, if the cameras are sparsely distributed, the block-based translatory disparity model fails to compensate accurately. More advanced approaches to disparity compensation are depth-image-based rendering algorithms. They synthesize an image as seen from a given viewpoint by using a reference texture and depth image as input data. These techniques offer the advantage that the given viewpoint image is compensated more accurately, even when the cameras are sparsely distributed. However, they rely on accurate depth images, which are difficult to estimate. Finally, hybrid techniques that combine the advantages of both approaches may also be considered. For example, if the accuracy of a depth image is not sufficient for accurate depth-image-based rendering, block-based compensation techniques may be used on top for selective refinement.


The vast amount of multiview data is a huge challenge not only for capturing and processing but also for compression. Efficient compression exploits statistical dependencies within the multiview video imagery. Usually, practical schemes accomplish this either with predictive coding or with subband coding. In both cases, motion compensation and disparity compensation are employed to make better use of statistical dependencies. Note that predictive coding and subband coding have different constraints for efficient compression.

Predictive Coding
Predictive coding schemes encode multiview video imagery sequentially. Two basic types of coded pictures are possible: intra and inter pictures. Intra pictures are coded independently of any other image. Inter pictures, on the other hand, depend on one or more reference pictures that have been encoded previously. By design, an intra picture does not exploit the similarities among the multiview images, but an inter picture is able to make use of these similarities by choosing one or more reference pictures and generating a motion- and/or disparity-compensated image for efficient predictive coding. The basic ideas of motion-compensated predictive coding are summarized in the box Motion-Compensated Predictive Coding. When choosing the encoding order of images, various constraints should be considered. For example, high coding efficiency as well as good temporal multiresolution properties may be desirable.

Motion-compensated predictive coding of image sequences is accomplished with intra and inter pictures. As depicted in Figure 4.1(a), the input image xk is independently encoded into the intra picture Ik. The intra decoder is used to independently reconstruct the image xk. Alternatively, the input image xk is predicted by the motion-compensated (MC) reference image xr. The prediction error, also called the displaced frame difference (DFD), is encoded and constitutes, in combination with the motion information, the inter picture Pk. The inter-picture decoder reverses this process but requires the same reference image xr to be present at the decoder side. If the reference picture differs at the encoder and decoder sides, e.g., because of network errors, the decoder is not able to reconstruct the same image xk that the encoder has encoded. Note that reference pictures can be either reconstructed intra pictures or other reconstructed inter pictures. Figure 4.1(b) shows the basic inter picture (predictive picture), which chooses only one reference picture for compensation. More advanced are bipredictive pictures, which use a linear combination of two motion-compensated reference pictures. Bidirectional motion-compensated prediction is a special case of bipredictive pictures and is widely employed in standards like MPEG-1, MPEG-2, and H.263.
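The inter-picture path described above can be sketched as follows (a zero-motion toy example with invented pixel values; a real coder would also transform and quantize the DFD before transmission):

```python
import numpy as np

# Reference picture, already reconstructed at both encoder and decoder.
x_ref = np.array([[100, 102], [104, 106]], dtype=np.int16)
# Input image to be coded as an inter picture.
x_k   = np.array([[101, 103], [104, 107]], dtype=np.int16)

# Encoder: prediction error (displaced frame difference, DFD); here the
# motion-compensated prediction is x_ref itself (zero motion vector).
dfd = x_k - x_ref

# Decoder: reverses the process using the SAME reference picture.
x_rec = x_ref + dfd
print(np.array_equal(x_rec, x_k))  # True: reconstruction matches the input
```

If the decoder held a different x_ref (e.g., after a network error), x_rec would no longer match x_k, which is the drift problem noted above.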




Fig. 4.1: Motion-compensated predictive coding.


The block diagram showing the various steps in encoding is shown in Fig. 4.2. The pictures captured by the various cameras, denoted view i pictures, are given as input to the MVC encoder. The various steps in encoding are described below.

[Figure 4.2 depicts the MVC encoder: the view i picture enters a prediction loop in which mode decision selects among intra prediction, motion compensation, and disparity/illumination compensation (with corresponding motion and disparity/illumination estimation stages); the prediction residual passes through transform, quantization, and entropy coding, while inverse quantization, inverse transform, and a deblocking filter feed reference picture stores for view i and for the other views.]


Figure 4.2: Block diagram of the MVC encoder

4.3.1 Video Format
YUV is a color space typically used as part of a color image pipeline. It encodes a color image or video taking human perception into account, allowing reduced bandwidth for the chrominance components; transmission errors or compression artifacts are thereby masked more efficiently by human perception than with a "direct" RGB representation. Other color spaces have similar properties, and the main reason to implement or investigate properties of Y'UV would be for interfacing with analog or digital television or photographic equipment that conforms to certain Y'UV standards. The raw YUV frames used here are in 4:2:0 format, i.e., for every four Y samples, one Cb and one Cr sample are transmitted. This format is commonly used in video broadcasting because full spatial and temporal resolution is retained for the luma component.

4.3.2 Transform
The transform used in multiview video compression is the DCT. A discrete cosine transform (DCT) expresses a sequence of finitely many data points in terms of a sum of cosine functions oscillating at different frequencies. The use of cosine rather than sine functions is critical for compression: cosine functions turn out to be much more efficient. The DCT is applied on 8x8 blocks. The DCT equation (Eq. 1) computes the (i,j)th entry of the DCT of an image:

D(i,j) = (1/sqrt(2N)) C(i) C(j) sum_{x=0..N-1} sum_{y=0..N-1} p(x,y) cos[(2x+1)i*pi/(2N)] cos[(2y+1)j*pi/(2N)],
where C(u) = 1/sqrt(2) for u = 0 and C(u) = 1 otherwise.   (Eq. 1)




p(x,y) is the (x,y)th element of the image represented by the matrix p. N is the size of the block on which the DCT is performed. The equation calculates one entry, (i,j), of the transformed image from the pixel values of the original image matrix. For the standard 8x8 block that JPEG compression uses, N equals 8 and x and y range from 0 to 7. Therefore, D(i,j) would be as in Equation (3):

D(i,j) = (1/4) C(i) C(j) sum_{x=0..7} sum_{y=0..7} p(x,y) cos[(2x+1)i*pi/16] cos[(2y+1)j*pi/16]   (Eq. 3)

Because the DCT uses cosine functions, the resulting matrix depends on the horizontal, diagonal, and vertical frequencies. Therefore, an image block with a lot of change has a very random-looking resulting matrix, while an image matrix of just one color has a resulting matrix with a large value for the first element and zeroes for the other elements.
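The flat-block property can be checked numerically. The sketch below is a naive, unoptimized 2-D DCT-II with the 1/sqrt(2N)*C(i)C(j) normalization; for a constant 8x8 block, only the DC coefficient D(0,0) is nonzero:

```python
import numpy as np

def dct2(block):
    """Naive 2-D DCT-II of an NxN block: D(i,j) = (1/sqrt(2N)) C(i) C(j)
    * sum_x sum_y p(x,y) cos((2x+1)i*pi/2N) cos((2y+1)j*pi/2N)."""
    n = block.shape[0]
    c = lambda u: 1.0 / np.sqrt(2.0) if u == 0 else 1.0
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x, y]
                          * np.cos((2 * x + 1) * i * np.pi / (2 * n))
                          * np.cos((2 * y + 1) * j * np.pi / (2 * n)))
            out[i, j] = (1.0 / np.sqrt(2.0 * n)) * c(i) * c(j) * s
    return out

# A single-color 8x8 block: all the energy lands in the first (DC) element.
coeffs = dct2(np.full((8, 8), 100.0))
```

For the constant block of value 100, D(0,0) = (1/4)(1/2)(64*100) = 800 and every AC coefficient is zero, which is exactly what makes such blocks cheap to code.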

4.3.3 Quantization
Quantization is the process of mapping a large set of input values to a smaller set, such as rounding values to some unit of precision. A device or algorithmic function that performs quantization is called a quantizer. Quantization is involved to some degree in nearly all digital signal processing, as the process of representing a signal in digital form ordinarily involves rounding. Quantization also forms the core of essentially all lossy compression algorithms.
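A minimal uniform quantizer (illustrative step size and values) shows the many-to-few mapping:

```python
import numpy as np

# Uniform quantization with step size Q maps many inputs to one level.
Q = 10
coeffs = np.array([-23, -4, 0, 4, 7, 23, 27])

levels = np.round(coeffs / Q).astype(int)  # encoder: forward quantization
recon  = levels * Q                        # decoder: inverse quantization

print(levels)  # [-2  0  0  0  1  2  3]
print(recon)   # [-20   0   0   0  10  20  30]
```

Note that -4, 0, and 4 all map to the same level, so the original values cannot be recovered; this is the irreversibility that makes quantization lossy.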



Because quantization is a many-to-few mapping, it is an inherently non-linear and irreversible process: since the same output value is shared by multiple input values, it is in general impossible to recover the exact input value given only the output value.

4.3.4 MOTION ESTIMATION

Motion estimation is the process of determining motion vectors that describe the transformation from one 2D image to another, usually between adjacent frames in a video sequence. It is an ill-posed problem, since the motion is in three dimensions but the images are a projection of the 3D scene onto a 2D plane. The motion vectors may relate to the whole image (global motion estimation) or to specific parts, such as rectangular blocks, arbitrarily shaped patches, or even individual pixels. The vectors may be represented by a translational model or by richer models that approximate the motion of a real video camera, such as rotation and translation in all three dimensions and zoom. Closely related to motion estimation is optical flow, where the vectors correspond to the perceived movement of pixels; in motion estimation, an exact 1:1 correspondence of pixel positions is not required. Applying the motion vectors to an image to synthesize the transformation to the next image is called motion compensation. The combination of motion estimation and motion compensation is a key part of video compression as used by MPEG-1, -2 and -4 as well as many other video codecs. Among the many motion estimation techniques, the Enhanced Predictive Zonal Search (EPZS) achieves high efficiency. EPZS, like other predictive algorithms, mainly comprises three steps: initial predictor selection chooses the best motion-vector predictor from a set of potentially likely predictors; adaptive early termination allows the search to stop at given stages of the estimation if certain rules are satisfied; and prediction


refinement employs a refinement pattern around the best predictor to improve the final prediction. All these features are vital to the performance of EPZS.

4.3.5 MOTION COMPENSATION

Motion compensation is an algorithmic technique employed in the encoding of video data: it describes a picture in terms of the transformation of a reference picture to the current picture. The reference picture may be earlier in time or even from the future. When images can be accurately synthesized from previously transmitted or stored images, compression efficiency improves. Motion compensation exploits the fact that, for many frames of a movie, the only difference between one frame and the next is the result of either the camera moving or an object in the frame moving. In a video file, this means much of the information that represents one frame is the same as the information used in the next frame; this is temporal redundancy. A detailed explanation of motion compensation is given in Section 2.2.
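A minimal full-search block-matching sketch (in Python, with illustrative block and search sizes) shows what EPZS accelerates: exhaustive search evaluates every candidate displacement, while EPZS checks only a few predicted ones.

```python
import numpy as np

def full_search(ref, cur, bx, by, bsize=8, srange=4):
    """Exhaustive SAD block matching; EPZS reaches similar quality
    while testing far fewer candidates."""
    block = cur[by:by + bsize, bx:bx + bsize].astype(int)
    best_sad, best_mv = None, (0, 0)
    for dy in range(-srange, srange + 1):
        for dx in range(-srange, srange + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + bsize > ref.shape[0] or x + bsize > ref.shape[1]:
                continue
            sad = np.abs(ref[y:y + bsize, x:x + bsize].astype(int) - block).sum()
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dx, dy)
    return best_mv, best_sad

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (32, 32), dtype=np.uint8)
cur = np.roll(ref, shift=(1, 2), axis=(0, 1))   # scene shifted 2 right, 1 down
mv, sad = full_search(ref, cur, bx=8, by=8)
print(mv, sad)   # (-2, -1) 0 : the block in cur matches ref displaced by (-2, -1)
```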

4.3.6 DEBLOCKING FILTER

A deblocking filter is applied to blocks in decoded video to improve visual quality and prediction performance by smoothing the sharp edges that can form between macroblocks when block coding techniques are used. The filter aims to improve the appearance of decoded pictures. In H.264 the deblocking filter is not an optional additional feature of the decoder: it operates on both the decoding path and the encoding path, so that the in-loop effects of the filter are taken into account in the reference macroblocks used for prediction. When a stream is encoded, the filter strength can be selected, or the filter can be switched off entirely. Otherwise, the filter strength is determined by the


coding modes of adjacent blocks, the quantization step size, and the steepness of the luminance gradient between blocks. The filter operates on the edges of each 4x4 or 8x8 transform block in the luma and chroma planes of each picture. Each small block's edge is assigned a boundary strength based on whether it is also a macroblock boundary, the coding (intra/inter) of the blocks, whether references (in motion prediction and reference frame choice) differ, and whether it is a luma or chroma edge. Stronger levels of filtering are assigned by this scheme where there is likely to be more distortion. The filter can modify as many as three samples on either side of a given block edge (in the case where an edge is a luma edge that lies between different macroblocks and at least one of them is intra coded). In most cases it modifies one or two samples on either side of the edge, depending on the quantization step size, the tuning of the filter strength by the encoder, the result of an edge detection test, and other factors.
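A toy, non-conformant sketch of the idea: the real H.264 filter adapts per-edge boundary strength and clips against quantization-dependent thresholds, all of which this illustration omits. It simply pulls the two samples nearest a block edge toward each other, reducing the visible step.

```python
import numpy as np

def smooth_edge(row, edge, strength=1):
    """Toy in-loop deblocking: average the two samples adjacent to a
    block edge. Illustrative only, not the H.264 filter."""
    out = row.astype(float).copy()
    p0, q0 = row[edge - 1], row[edge]        # samples either side of the edge
    delta = (q0 - p0) / 4.0 * strength
    out[edge - 1] = p0 + delta
    out[edge] = q0 - delta
    return out

# A blocking artifact: flat 100s meet flat 140s at the block boundary.
row = np.array([100, 100, 100, 100, 140, 140, 140, 140], dtype=float)
print(smooth_edge(row, edge=4))   # the step of 40 shrinks to 20 across the edge
```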

4.3.7 ENTROPY ENCODER

An entropy encoding is a lossless data compression scheme that is independent of the specific characteristics of the medium. One of the main types of entropy coding creates and assigns a unique prefix-free code to each unique symbol that occurs in the input. These entropy encoders then compress data by replacing each fixed-length input symbol with the corresponding variable-length prefix-free output codeword. The length of each codeword is approximately proportional to the negative logarithm of the probability, so the most common symbols use the shortest codes.
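The "length proportional to negative log probability" property can be seen in a small Huffman construction. Huffman coding is a classic prefix-free entropy code used here only as an illustration; CAVLC and CABAC below are the variants H.264 actually specifies.

```python
import heapq
from collections import Counter

def huffman_lengths(freqs):
    """Return the Huffman code length per symbol: roughly -log2(probability)."""
    # (frequency, tiebreak, {symbol: depth}); the counter avoids comparing dicts
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}  # one level deeper
        heapq.heappush(heap, (fa + fb, tick, merged))
        tick += 1
    return heap[0][2]

freqs = Counter("aaaaaaaabbbbccd")   # a:8  b:4  c:2  d:1
lengths = huffman_lengths(freqs)
print(sorted(lengths.items()))       # a -> 1 bit, b -> 2, c -> 3, d -> 3
```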


There are two types of entropy encoding:
1. CABAC (Context-Based Adaptive Binary Arithmetic Coding)
2. CAVLC (Context-Based Adaptive Variable Length Coding)

Context-Based Adaptive Binary Arithmetic Coding (CABAC)

The arithmetic coding scheme selected for H.264, Context-based Adaptive Binary Arithmetic Coding (CABAC) [3], achieves good compression performance by (a) selecting probability models for each syntax element according to the element's context, (b) adapting probability estimates based on local statistics, and (c) using arithmetic coding. Coding a data symbol involves the following stages.

1. Binarization: CABAC uses binary arithmetic coding, which means that only binary decisions (1 or 0) are encoded. A non-binary-valued symbol (e.g. a transform coefficient or motion vector) is binarized, i.e. converted into a binary code, prior to arithmetic coding. This process is similar to converting a data symbol into a variable-length code, but the binary code is further encoded (by the arithmetic coder) prior to transmission. Stages 2, 3 and 4 are repeated for each bit (or "bin") of the binarized symbol.

2. Context model selection: A context model is a probability model for one or more bins of the binarized symbol. This model may be chosen from a selection of available models depending on the statistics of recently coded data symbols. The context model stores the probability of each bin being 1 or 0.

3. Arithmetic encoding: An arithmetic coder encodes each bin according to the selected probability model. Note that there are just two sub-ranges for each bin (corresponding to 0 and 1).



4. Probability update: The selected context model is updated based on the actual coded value (e.g. if the bin value was 1, the frequency count of 1s is increased).
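The binarization stage (stage 1 above) can be illustrated with the zero-order Exp-Golomb ue(v) scheme, one of the binarizations H.264 defines: the codeword for k is floor(log2(k+1)) leading zeros followed by the binary representation of k+1.

```python
def exp_golomb(k):
    """Zero-order Exp-Golomb ue(v) binarization as used in H.264."""
    b = bin(k + 1)[2:]                 # binary digits of k+1
    return "0" * (len(b) - 1) + b      # prefix: one leading zero per extra digit

for k in range(5):
    print(k, exp_golomb(k))   # 0 -> 1, 1 -> 010, 2 -> 011, 3 -> 00100, 4 -> 00101
```

Small symbol values get short bin strings, which the arithmetic coder then compresses further using the per-bin context models.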


A low-complexity mode decision algorithm is proposed to reduce the complexity of motion estimation (ME) and disparity estimation (DE). An experimental analysis is performed to study inter-view correlation in coding information such as the prediction mode and rate-distortion (RD) cost. Based on this correlation, four efficient mode decision techniques are proposed: early SKIP mode decision, adaptive early termination, fast mode size decision, and selective intra coding in inter frames. Experimental results show that the proposed algorithm significantly reduces the computational complexity of MVC while maintaining almost the same RD performance.

4.4 MVC DECODER

The exact reverse process of the encoder takes place in the decoder. The block diagram of the MVC decoder is shown in Figure 4.3.




Fig 4.3: MVC decoder.

The coded bitstream is applied to the entropy decoder, and the decoded bits are subjected to inverse quantization and inverse transformation to obtain the decoded YUV. There are two modes of decoding: intra prediction and inter prediction. Intra pictures are coded independently, whereas inter pictures depend on one or more reference pictures that have been decoded previously. By design, an intra picture does not exploit the similarities among the multiview images, but an inter picture is able to make use of these similarities by choosing


one or more reference pictures and generating a motion- and/or disparity-compensated image for efficient predictive coding. The signal obtained by inverse quantization and the inverse DCT is summed with the output of intra prediction or inter prediction. The mode is an algorithm-based switch used to select either the inter or the intra prediction signal. The summed signal is passed to the deblocking filter, which is applied to blocks in the decoded video to improve visual quality and prediction performance by smoothing the sharp edges that can form between macroblocks when block coding techniques are used; the filter aims to remove discontinuities in the picture blocks. The filter output is stored in the picture memory for further computation, and the reference pictures stored in picture memory are addressed by the reference picture index obtained from the entropy decoder. The decoded and reconstructed signal is finally obtained from the deblocking filter. This chapter discussed the coding and decoding of the YUV frames; the next chapter covers the experimentation and test results on the Linux platform.
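The dequantize, inverse-transform, add-prediction path of Figure 4.3 can be sketched as follows. This assumes the orthonormal DCT from the transform section (whose inverse is its transpose); the real decoder uses H.264's integer transform and per-coefficient scaling.

```python
import numpy as np

def idct2(coeffs):
    """Inverse 2-D DCT-II, orthonormal scaling (inverse = transpose)."""
    n = coeffs.shape[0]
    k = np.arange(n)
    basis = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n)) * np.sqrt(2.0 / n)
    basis[0, :] = np.sqrt(1.0 / n)
    return basis.T @ coeffs @ basis

def reconstruct(levels, step, prediction):
    """Decoder path: dequantize -> inverse transform -> add prediction -> clip."""
    residual = idct2(levels * step)                       # inverse quantization + IDCT
    return np.clip(np.rint(prediction + residual), 0, 255).astype(np.uint8)

pred = np.full((8, 8), 100.0)            # intra/inter prediction signal
levels = np.zeros((8, 8))
levels[0, 0] = 4                         # only a DC residual survived quantization
rec = reconstruct(levels, step=8, prediction=pred)
print(int(rec[0, 0]))                    # 104: DC residual of 4*8 spread as +4 per pixel
```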




4.5 Flowchart

1. MVC encoder




2. MVC decoder










Step 1: CROSS COMPILATION. Cross compilation is done by pointing CC in the makefile to the GNU ARM toolchain.
Step 2: Any previously built object files are deleted using the command make clean.
Step 3: The make command is used to build the object and executable files.
Step 4: The required executable files, input files and configuration files are copied to the SD card.
Step 5: The SD card is inserted into the Panda board, a 5V power supply is connected to the board, the serial port of the computer is connected to the Panda board, and a gtkterm window is opened to communicate over the serial port.
Step 6: The baud rate is set to the maximum; we used a baud rate of 115200.
Step 7: The Panda board boots from the 5V supply, and the executable files are run on the board using the command ./filename.exe.
Step 8: The output obtained is verified and the compression ratio is calculated.
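Steps 1 to 3 amount to a makefile change along the following lines. This is a sketch: the toolchain prefix, flags, and object list are assumptions that depend on the installed GNU ARM toolchain and the JM build layout, not values taken from the project files.

```make
# Hypothetical cross-compilation fragment (Steps 1-3).
# The prefix below is an assumption; substitute the installed GNU ARM toolchain.
CROSS  = arm-none-linux-gnueabi-
CC     = $(CROSS)gcc
CFLAGS = -O2 -mcpu=cortex-a9 -mfpu=neon

# OBJS is a placeholder for the encoder's object files.
lencod.exe: $(OBJS)
	$(CC) $(CFLAGS) -o $@ $^

clean:
	rm -f $(OBJS) lencod.exe
```

With CC redirected this way, `make clean` followed by `make` produces ARM executables that can be copied to the SD card in Step 4.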




Figure 5.1: Output of Steps 2 and 3.




TEST 1: Number of frames to be coded: 3

Output of the encoder:

Parsing Configfile encoder_stereo.cfg ...
Warning: Hierarchical coding or Referenced B slices used. Make sure that you have allocated enough references in reference buffer to achieve best performance.
------------------------------- JM 17.2 (FRExt) -------------------------------
Input YUV file              : left_432x240.yuv
Input YUV file 2            : right_432x240.yuv
Output H.264 bitstream      : test.264
Output YUV file             : test_rec.yuv
Output YUV file 2           : test_rec2.yuv
YUV Format                  : YUV 4:2:0
Frames to be encoded        : 3
Freq. for encoded bitstream : 30.00
PicInterlace / MbInterlace  : 0/0



Transform8x8Mode            : 1

ME Metric for Refinement Level 0 : SAD
ME Metric for Refinement Level 1 : Hadamard SAD
ME Metric for Refinement Level 2 : Hadamard SAD
Mode Decision Metric             : Hadamard SAD

Motion Estimation for components : Y
Image format                     : 320x240 (320x240)
Error robustness                 : Off
Search range                     : 32
Total number of references       : 5
References for P slices          : 5
References for B slices (L0, L1) : 5, 1
Sequence type                    : Hierarchy (QP: I 28, P 28, B 30)
Entropy coding method            : CABAC
Profile/Level IDC                : (128,40)
Motion Estimation Scheme         : EPZS
EPZS Pattern                     : Extended Diamond
EPZS Dual Pattern                : Extended Diamond
EPZS Fixed Predictors            : All P + B
EPZS Temporal Predictors         : Enabled
EPZS Spatial Predictors          : Enabled
EPZS Threshold Multipliers       : (1 0 2)
EPZS Subpel ME                   : Basic
EPZS Subpel ME BiPred            : Basic
Search range restrictions        : none
RD-optimized mode decision       : used
Data Partitioning Mode           : 1 partition
Output File Format               : H.264/AVC Annex B Byte Stream Format

------------------------------------------------------------------------------------
Frame        View  Ref  Bit/pic  QP  SnrY    SnrU    SnrV    Time(ms)  MET(ms)  Frm/Fld
------------------------------------------------------------------------------------
00000(NVB)              480
00000(IDR)   0     3    189936   28  36.814  35.359  35.318  1549      0        FRM
00000( P )   1     2    135344   28  35.293  39.691  38.779  2384      344      FRM
00002( P )   0     2    112176   28  37.830  35.056  34.754  1995      361      FRM
00002( P )   1     2    91032    28  40.726  35.247  34.447  1761      512      FRM
00001( B )   0     0    147672   30  33.395  31.602  31.631  3232      1084     FRM
00001( B )   1     0    115024   30  34.741  32.077  33.265  3225      1259     FRM

------------------------------------------------------------------------------Total Frames: 6 Leaky BucketRateFile does not have valid entries.



Using rate calculated from avg. rate
Number Leaky Buckets: 8
     Rmin      Bmin      Fmin
  3955920    193416    193416
  4944900    189936    189936
  5933880    189936    189936
  6922860    189936    189936
  7911840    189936    189936
  8900820    189936    189936
  9889800    189936    189936
 10878780    189936    189936
------------------ Average data all frames ------------------
Total encoding time for the seq. : 14.148 sec (0.42 fps)
Total ME time for sequence       : 3.563 sec

Y { PSNR (dB), cSNR (dB), MSE } : { 36.467, 35.888, 16.76049 } U { PSNR (dB), cSNR (dB), MSE } : { 34.839, 34.125, 25.15187 } V { PSNR (dB), cSNR (dB), MSE } : { 34.699, 34.205, 24.69435 }

View0_Y { PSNR (dB), cSNR (dB), MSE } : { 36.013, 35.577, 18.00490 } View0_U { PSNR (dB), cSNR (dB), MSE } : { 34.006, 33.649, 28.06361 } View0_V { PSNR (dB), cSNR (dB), MSE } : { 33.901, 33.580, 28.51354 }



View1_Y { PSNR (dB), cSNR (dB), MSE } : { 36.920, 36.223, 15.51609 } View1_U { PSNR (dB), cSNR (dB), MSE } : { 35.671, 34.659, 22.24014 } View1_V { PSNR (dB), cSNR (dB), MSE } : { 35.497, 34.935, 20.87516 }

Total bits        : 791664 (I 189936, P 338552, B 262696, NVB 480)
View 0 Total bits : 450104 (I 189936, P 112176, B 147672, NVB 320)
View 1 Total bits : 341560 (I 0, P 226376, B 115024, NVB 160)

Bit rate (kbit/s) @ 30.00 Hz  : 7916.64
View 0 BR (kbit/s) @ 30.00 Hz : 4501.04
View 1 BR (kbit/s) @ 30.00 Hz : 3415.60

Bits to avoid Startcode Emulation : 28
Bits for parameter sets           : 480
Bits for filler data              : 0

real 0m0.271s
user 0m0.212s
sys  0m0.056s




OUTPUT OF DECODER
Input H.264 bitstream : test.264
Output decoded YUV    : test_dec.yuv
Input reference file  : test_rec.yuv

POC must = frame# or field# for SNRs to be correct
--------------------------------------------------------------------------
Frame        POC  Pic#  QP  SnrY     SnrU     SnrV     Y:U:V  Time(ms)
--------------------------------------------------------------------------
00000(IDR)   0    0     28   0.0000   0.0000   0.0000  4:2:0  24
00000( P )   0    0     28  13.8138  16.1082  15.1999  4:2:0  16
00002( P )   4    1     28   0.0000   0.0000   0.0000  4:2:0  15
00002( P )   4    1     28  18.0149  15.3684  14.0144  4:2:0  13
00001( b )   2    2     30   0.0000   0.0000   0.0000  4:2:0  17
00001( b )   2    2     30  15.6719  13.6850  13.0119  4:2:0  15

-------------------- Average SNR all frames ------------------------------
SNR Y(dB) : 7.92
SNR U(dB) : 7.53
SNR V(dB) : 7.04

Total decoding time : 0.102 sec (58.824 fps) [6 frm / 102 ms]
--------------------------------------------------------------------------
Exit JM 17 (FRExt) decoder, ver 17.2
Output status file : log.dec

real 0m0.870s



user 0m0.228s
sys  0m0.044s

Similarly, the encoder and decoder were successfully tried with 100 and 135 frames respectively. The results obtained in the various trials are tabulated below.

Table 5.1: Experimental results

Number of frames to be coded          3        100      135
Input YUV                             14.8 MB  14.8 MB  14.8 MB
test.264                              97 KB    1.6 MB   2.1 MB
Reconstructed YUV (encoder output)    338 KB   11 MB    14.8 MB
Decoded YUV (decoder output)          338 KB   11 MB    14.8 MB
Compression ratio                     0.6      5.4      7.09
real                                  0.870 s  12.25 s  17.23 s
user                                  0.228 s  9.16 s   13.16 s
sys                                   0.044 s  0.83 s   1.29 s
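The compression ratio in Table 5.1 is simply the raw input size divided by the coded bitstream size; for example, for the 135-frame trial (the small difference from the table's 7.09 presumably comes from the table's rounded megabyte figures versus exact byte counts):

```python
def compression_ratio(raw_size, coded_size):
    """Raw input size over coded size; any common unit, since the unit cancels."""
    return raw_size / coded_size

# 135-frame trial from Table 5.1: 14.8 MB raw YUV vs. a 2.1 MB bitstream
print(round(compression_ratio(14.8, 2.1), 2))   # 7.05
```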

As seen from Table 5.1, the greater the number of frames to be coded, the more time execution takes. Trial 3, i.e. coding all the frames of the given view, achieved the highest compression ratio. Note that the input YUV, reconstructed YUV and decoded YUV have the same size, showing that encoding and decoding are performed effectively. The applications and future enhancements of the proposed technique are discussed in the following chapter.

Application and Future enhancement




1. Applications

1. Free-viewpoint television.
2. The 3D technique using a cellophane sheet was applied to a laparoscope in order to expand the limited viewing capability of this minimally invasive surgical device. A unique feature of this 3D laparoscope is that it includes a virtual ruler to measure distances without physically touching affected areas.
3. 3D games designed using MVC can draw the world at any angle and can have the player walk in any increment of steps they choose.
4. Immersive teleconferencing.
5. 3D mobiles.
6. 3D television.

2. Future Enhancement

As the statistics in Chapter 5 show, the time required for coding is high. MVC can be enhanced by minimizing the time required for encoding and decoding so that it can be used for real-time applications.



The presented prediction structures for multiview video coding are based on the fact that multiple video bitstreams showing the same scene from different camera perspectives exhibit significant inter-view statistical dependencies. The corresponding evaluation showed that these correlations can be exploited for efficient coding of multiview video data. The multiview prediction structures have the advantage of achieving significant coding gains while remaining highly flexible in their adaptation to all kinds of spatial and temporal setups. These prediction structures are very similar to H.264/AVC and require only very minor syntax changes. Regarding coding efficiency, gains of up to 3.2 dB and an average gain of 1.5 dB were achieved.






Low-cost mobile software development platform
1080p video, WLAN, Bluetooth & more
Dual-core ARM Cortex-A9 MPCore benefits
Community-driven projects & support


HDMI v1.3 Connector (Type A) to drive HD displays
DVI-D Connector (can drive a 2nd display, simultaneous display; requires HDMI to DVI-D adapter)
LCD expansion header


Camera

Camera connector


3.5" Stereo Audio in/out
HDMI Audio out

Wireless Connectivity

802.11 b/g/n (based on WiLink 6.0)
Bluetooth v2.1 + EDR (based on WiLink 6.0)


1 GB low-power DDR2 RAM
Full-size SD/MMC card cage with support for High-Speed & High-Capacity SD cards

Connectivity

Onboard 10/100 Ethernet


1x USB 2.0 High-Speed On-the-Go port
2x USB 2.0 High-Speed host ports
General purpose expansion header (I2C, GPMC, USB, MMC)
Camera expansion header


Debug Board

10/100 BASE-T Ethernet (RJ45 connector)



Mini-AB USB port (for debug UART connectivity)
60-pin MIPI debug expansion connector
Debug LED
1 GPIO button


Height: 4.5" (114.3 mm)
Width: 4.0" (101.6 mm)
Weight: 2.6 oz (74 grams)

PandaBoard components

Function                     Vendor   Part ID
Application Processor        TI       OMAP4430
Memory                       Elpida   EDB8064B1PB-8D-F
Power Management IC          TI       TWL6030
Audio IC                     TI       TWL6040
Connectivity                 LSR      LS240-WI-01-A20
4 Port USB Hub/Ethernet      SMSC     LAN9514-JZX
DVI Transmitter              TI       TFP410PAP
3.5 mm Dual Stacked Audio    KYCON    STX-4235-3/3-N

[1] A. Kubota, A. Smolic, M. Magnor, M. Tanimoto, T. Chen, and C. Zhang,


"Multi-view imaging and 3DTV," IEEE Signal Processing Magazine, vol. 24, no. 6, pp. 10-21, 2007.
[2] Z. Yang, W. Wu, K. Nahrstedt, G. Kurillo, and R. Bajcsy, "ViewCast: View dissemination and management for multi-party 3D tele-immersive environments," in ACM Multimedia, 2007.

[3] H. Baker, D. Tanguay, I. Sobel, D. Gelb, M. Goss, W. Culbertson, and T. Malzbender, "The Coliseum immersive teleconferencing system," Tech. Rep., HP Labs, 2002.
[4] M. Flierl and B. Girod, "Multiview video compression," IEEE Signal Processing Magazine, vol. 24, no. 6, pp. 66-76, 2007.
[5] A. Smolic and P. Kauff, "Interactive 3-D video representation and coding technologies," Proceedings of the IEEE, vol. 93, no. 1, pp. 98-110, 2005.
[6] C. Zhang and J. Li, "Interactive browsing of 3D environment over the internet," in Proc. SPIE VCIP, 2001.
[7] C. Zhang and T. Chen, "A survey on image-based rendering: representation, sampling and compression," EURASIP Signal Processing: Image Communication, vol. 19, no. 1, pp. 1-28, 2004.
[8] C.L. Zitnick, S.B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, "High-quality video view interpolation using a layered representation," in ACM SIGGRAPH, 2004.
[9] C. Buehler, M. Bosse, L. McMillan, S. J. Gortler, and M. F. Cohen, "Unstructured lumigraph rendering," in ACM SIGGRAPH, 2001.



[10] ITU-T Rec. H.264 / ISO/IEC 14496-10, "Advanced Video Coding," Final Committee Draft, Document JVT-E022, September 2002.
[11] I. Richardson, Video CODEC Design, John Wiley & Sons, 2002.
[12] D. Marpe, G. Blättermann, and T. Wiegand, "Adaptive Codes for H.26L," ITU-T SG16/Q.6 document VCEG-L13, Eibsee, Germany, January 2001.