
DISTRIBUTED VIDEO CODING: BASICS, MAIN SOLUTIONS AND TRENDS

Fernando Pereira

Instituto Superior Técnico - Instituto de Telecomunicações, Lisbon, Portugal
E-mail: fernando.pereira@lx.it.pt

ABSTRACT

After the great success of the predictive video coding approach, which led to a number of widely deployed MPEG and ITU-T standards, the video coding research community has been working on a new video coding paradigm, so-called distributed video coding (DVC), which is based on two Information Theory results from the 1970s: the Slepian-Wolf and the Wyner-Ziv theorems. The first practical solutions emerged around 2002 at Stanford University and the University of California, Berkeley. This talk will address the basics, main solutions and trends in distributed video coding, with special emphasis on the Stanford DVC codec, which has attracted the larger research investment. The rate-distortion (RD) performance of a state-of-the-art Stanford-based DVC codec will be presented and benchmarked against the relevant alternative standard-based video coding solutions. Finally, some trends in DVC research will be discussed.

Index Terms: distributed video coding, Wyner-Ziv video coding, RD performance

1. THE BASICS

With the wide deployment of mobile and wireless networks, there is a growing number of applications where many senders deliver data to a central receiver. Contrary to television-like applications, these emerging applications typically require light encoding complexity while still demanding high compression efficiency, robustness to packet losses and, often, low latency/delay. To address these emerging needs, some research groups revisited the video coding problem in light of two Information Theory results from the 1970s: the Slepian-Wolf and the Wyner-Ziv theorems [1,2]. According to the Slepian-Wolf theorem, the minimum rate needed to independently encode two statistically dependent discrete random sequences, X and Y, is the same as for joint encoding. Moreover, under certain hypotheses on the joint statistics, the Wyner-Ziv theorem [2] adds that when the side information (i.e. the correlated source Y) is made available only at the decoder, there is no coding efficiency loss in lossy encoding X with respect to the case when joint encoding of X and Y is performed (both results are stated formally at the end of this section). Together, the Slepian-Wolf and the Wyner-Ziv theorems suggest that it is possible to encode two statistically dependent signals independently and decode them jointly, while approaching the coding efficiency of conventional predictive coding schemes, which rely instead on joint encoding and decoding. The new coding approach, known as Distributed Video Coding (DVC), avoids the computationally intensive temporal prediction loop at the encoder by shifting the exploitation of the temporal redundancy to the decoder. This may be a significant advantage for a large range of emerging application scenarios, including wireless video cameras, wireless low-power surveillance, and visual sensor networks.

Based on the theoretical foundations above, and following important developments in channel coding technology, the practical design of Wyner-Ziv (WZ) video codecs, a particular case of DVC related to lossy coding with side information available at the decoder, started around 2002. The first practical WZ solutions were developed at Stanford University [3,4] and UC Berkeley [5,6]. As of today, the most popular WZ video codec design in the literature is clearly the Stanford architecture, which works at the frame level and is characterized by a feedback channel based decoder rate control. On the other hand, the UC Berkeley architecture, known as PRISM (Power-efficient, Robust, hIgh compression Syndrome based Multimedia coding), works at the block level and is characterized by an encoder side rate control based on the availability of a reference frame. The two early WZ video coding solutions are compared in detail in [7]. After briefly reviewing the DVC basics, this talk will compare the Stanford and PRISM WZ video coding solutions, which are briefly described in Section 2. Next, the talk will examine the RD performance of an advanced WZ video codec based on the Stanford architecture, to establish the state-of-the-art performance for monoview DVC. While this talk will concentrate on monoview video coding, multiview DVC has also been addressed in the literature; multiview DVC has the architectural advantage of not requiring the various cameras to send data to a common encoder, instead using distributed encoding and exploiting the inter-view correlation only at the decoder; for a review on multiview DVC, see [8].
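For concreteness, the two theorems can be stated in rate terms as follows; this is the standard formulation from the Information Theory literature (not taken from [1,2] verbatim), with H denoting entropy and R(D) rate-distortion functions.

```latex
% Slepian-Wolf: achievable rate region for separate encoding and
% joint decoding of two correlated discrete sources X and Y
\begin{align}
R_X &\geq H(X \mid Y), \\
R_Y &\geq H(Y \mid X), \\
R_X + R_Y &\geq H(X, Y).
\end{align}
% Wyner-Ziv: for lossy coding of X with side information Y available
% at the decoder only, the rate-distortion function satisfies
\begin{equation}
R^{WZ}_{X \mid Y}(D) \geq R_{X \mid Y}(D),
\end{equation}
% with equality, e.g., for jointly Gaussian sources under a
% mean-squared-error distortion measure.
```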

2. THE MAIN WYNER-ZIV VIDEO CODING SOLUTIONS


The Stanford WZ video coding architecture was first proposed in 2002 for the pixel domain [3] and later extended to the transform domain [9], where DCT coefficients are WZ coded. The (more efficient) transform domain WZ video codec, whose architecture is shown in Fig. 1, starts by dividing the video sequence into Wyner-Ziv (WZ) frames and key frames; the key frames are encoded using a conventional intra-frame coding mode, e.g. H.264/AVC Intra, and inserted periodically, determining the GOP size and, indirectly, the difficulty of creating the so-called side information at the decoder.
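As a minimal sketch of this frame structure (an illustration only, not code from [9]), the following splits a sequence into key frames and WZ frames for a fixed GOP size; the function name and the convention of one key frame every gop_size frames are assumptions for illustration.

```python
def split_into_gops(num_frames: int, gop_size: int = 2):
    """Classify frame indices as key frames or WZ frames.

    With GOP size 2, frames 0, 2, 4, ... are key frames and the
    in-between frames are WZ coded, which matches the typical test
    condition used later in this paper.
    """
    key_frames = [i for i in range(num_frames) if i % gop_size == 0]
    wz_frames = [i for i in range(num_frames) if i % gop_size != 0]
    return key_frames, wz_frames

# Example: 10 frames, GOP size 2
keys, wz = split_into_gops(10, gop_size=2)
print(keys)  # [0, 2, 4, 6, 8]
print(wz)    # [1, 3, 5, 7, 9]
```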

Fig. 1 Stanford WZ video coding architecture [9]


For the WZ frames, a block-based transform, typically a DCT, is applied, and the DCT coefficients of the entire WZ frame are grouped together, according to the position occupied by each DCT coefficient within a block, forming DCT coefficient bands. Each DCT band is uniformly quantized with a number of levels that depends on the target quality [9]. For a given band, bits of the quantized symbols are grouped together, forming bitplanes, which are then independently channel encoded, typically using turbo or LDPC (low-density parity-check) codes. The channel encoding of each DCT band starts with the most significant bitplane (MSB). The parity information generated for each bitplane is stored in a buffer and sent in chunks/packets upon decoder requests, made through the feedback channel.

On the decoder side, the so-called side information (SI) is created for each WZ coded frame by performing a motion-compensated frame interpolation (or extrapolation) process using the closest already decoded frames. The side information for each WZ frame is taken as an estimate (noisy version) of the original WZ frame: the better the estimate, the smaller the number of errors the channel decoder has to correct and the lower the bitrate needed. The residual statistics between corresponding coefficients in the WZ frame and the side information are assumed to follow a Laplacian distribution, whose parameter was initially estimated using an offline training phase.

Once the side information DCT coefficients and the residual statistics for a given DCT coefficient band are known, each bitplane is channel decoded, starting from the MSB. The channel decoder receives from the encoder successive chunks of parity bits, following the requests made through the feedback channel; to decide whether or not more bits are needed for successful decoding of a certain bitplane, the decoder uses a request stopping criterion. After successfully channel decoding the MSB bitplane of a DCT band, the decoder proceeds in an analogous way with the remaining bitplanes associated with the same band; once all the bitplanes of a DCT band are successfully decoded, the channel decoder starts decoding the next band.

After channel decoding all the bitplanes associated with each DCT band, the bitplanes are grouped together to form the decoded quantized symbol stream associated with each band. Once all decoded quantized symbols are obtained, it is possible to reconstruct the matrix of DCT coefficients. The DCT coefficient bands for which no WZ bits were transmitted are replaced by the corresponding DCT bands of the side information. After all DCT bands are reconstructed, a block-based inverse transform, typically the IDCT, is performed and the decoded WZ frame is obtained. Finally, to get the decoded video sequence, decoded key frames and WZ frames are appropriately multiplexed.

Over the last few years, many improvements have been proposed for most of the modules in the initial Stanford WZ video codec, e.g. LDPC codes instead of turbo codes [10,11], better side information estimation [12], dynamic correlation noise modeling [13], and enhanced reconstruction [14]. Other proposed solutions introduced architectural changes, e.g. selective Intra coding of blocks in the WZ frame [15], selective transmission of hash signatures by the encoder [16,17], removal of the feedback channel [18], and provision of scalability [19,20,21] and error resilience [22,23,24] features.
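To make the band/bitplane organization concrete, here is a minimal NumPy sketch (an illustration, not the codec of [9]) that uniformly quantizes one DCT coefficient band and splits the quantized symbols into bitplanes, MSB first; the function names and the simple min/max uniform quantizer are assumptions.

```python
import numpy as np

def quantize_band(band: np.ndarray, num_levels: int) -> np.ndarray:
    """Uniformly quantize one DCT coefficient band to num_levels symbols."""
    lo, hi = band.min(), band.max()
    step = (hi - lo) / num_levels  # assumes a non-constant band
    symbols = np.floor((band - lo) / step).astype(np.int64)
    return np.clip(symbols, 0, num_levels - 1)

def to_bitplanes(symbols: np.ndarray, num_levels: int) -> list:
    """Split quantized symbols into bitplanes, most significant first.

    Each bitplane is then channel encoded independently; the MSB plane
    is decoded first since it conditions the remaining ones.
    """
    num_planes = int(np.ceil(np.log2(num_levels)))
    return [(symbols >> p) & 1 for p in range(num_planes - 1, -1, -1)]

# Example: one band of 16 coefficients, quantized to 8 levels (3 bitplanes)
rng = np.random.default_rng(0)
band = rng.normal(size=16)
symbols = quantize_band(band, num_levels=8)
planes = to_bitplanes(symbols, num_levels=8)
print(len(planes))  # 3 (MSB plane first)
```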
Almost at the same time, another WZ video coding approach was proposed at UC Berkeley, known in the literature as PRISM [5,6]. The PRISM codec architecture is shown in Fig. 2; it starts by dividing each video frame into 8×8 sample blocks and applying a DCT over each block. Next, a scalar quantizer is applied to the DCT coefficients, according to a certain target quality. Before encoding, each block is classified into one of several pre-defined classes, depending on the correlation between the current block and the predictor block in the reference frame.

Depending on the allowed complexity at the encoder, such a predictor can be either the co-located block or a motion-compensated block [6]. The classification stage decides the coding mode for each block of the current frame: no coding (skip class), traditional Intra-frame coding (entropy coding class) or syndrome coding (syndrome coding classes), depending on the estimated temporal correlation. The blocks classified in the syndrome coding classes are coded using a WZ coding approach, as described below. The coding modes are then transmitted to the decoder as header information.
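A minimal sketch of such a mode decision follows (an illustration only, not the actual PRISM classifier of [6]); the thresholds, the use of mean squared error as the correlation measure, and the two-class syndrome split are all assumptions.

```python
import numpy as np

def classify_block(current: np.ndarray, predictor: np.ndarray,
                   skip_thresh: float = 1.0, intra_thresh: float = 500.0) -> str:
    """Pick a PRISM-style coding mode from the block/predictor correlation.

    Very high correlation (tiny residual energy) -> skip;
    very low correlation -> conventional Intra (entropy) coding;
    in between -> one of the syndrome coding classes, where the class
    determines how many least significant bits are sent.
    """
    mse = float(np.mean((current.astype(np.float64) - predictor) ** 2))
    if mse < skip_thresh:
        return "skip"
    if mse > intra_thresh:
        return "intra"
    # Finer MSE ranges would map to the different syndrome classes.
    return "syndrome_low" if mse < 0.5 * intra_thresh else "syndrome_high"

# Example with an 8x8 block and a slightly perturbed predictor
rng = np.random.default_rng(1)
cur = rng.integers(0, 256, size=(8, 8))
pred = cur + rng.integers(-3, 4, size=(8, 8))
print(classify_block(cur, pred))  # "syndrome_low" for this small residual
```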

Fig. 2 PRISM video coding architecture [6]


For those blocks that fall in the syndrome coding classes, only the least significant bits of the quantized DCT coefficients in a block are encoded, since it is assumed that the most significant bits can be inferred from the side information. The number of least significant bits to be sent to the decoder depends on the syndrome class the block belongs to. Within the least significant bits, the lower part is encoded using a (run, depth, path, last) 4-tuple based entropy codec. The upper part of the least significant bits is coded using a coset channel code, in this case a BCH code, since it works well for small block lengths, as is the case here. In addition, for each block, the encoder sends a 16-bit CRC checksum as a signature of the quantized DCT coefficients; this is needed in order to select the best SI candidate block at the decoder, as explained below.

The decoder generates SI candidate blocks, which correspond to all half-pixel displaced blocks in the reference frame, within a window around the block to be decoded. Each of the candidate blocks plays the role of side information for syndrome decoding, which consists of two steps: one step deals with the entropy coded least significant bitplanes and the other with the coset channel coded bitplanes. Each candidate block leads to a decoded block, from which a hash signature is generated. In order to detect successful decoding, the latter is compared with the CRC hash received from the encoder, and candidate blocks are visited until decoding leads to a hash match. Once the quantized sequence is recovered, it is used along with the corresponding side information to get the best reconstructed block: the minimum mean squared error estimate is computed from the side information and the quantized block. In [7], a detailed comparison of the two main WZ video coding architectures is made. With time, some of the differences between the two main WZ video coding solutions have disappeared, e.g. there are nowadays Stanford based coding solutions with selective block based Intra coding [15], encoder transmitted hash signatures [16,17], and no feedback channel [18].
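The candidate-search decoding loop can be sketched as follows (a simplified illustration, not the PRISM implementation of [6]); the syndrome_decode callable, the crc_signature helper, and the candidate generation are assumed placeholders for the actual syndrome decoder and signature.

```python
import zlib

def crc_signature(block_symbols) -> int:
    """Stand-in 16-bit signature of the quantized DCT symbols.

    PRISM uses a 16-bit CRC; here CRC-32 is truncated for illustration.
    """
    data = bytes(int(s) & 0xFF for s in block_symbols)
    return zlib.crc32(data) & 0xFFFF

def prism_decode_block(candidates, syndrome_bits, target_crc, syndrome_decode):
    """Try each SI candidate block until syndrome decoding matches the CRC.

    candidates      -- iterable of candidate side information blocks
                       (e.g. all half-pixel displaced blocks in a window)
    syndrome_bits   -- least significant bits / coset index from the encoder
    target_crc      -- 16-bit CRC received from the encoder
    syndrome_decode -- function (candidate, syndrome_bits) -> quantized symbols
    """
    for candidate in candidates:
        decoded = syndrome_decode(candidate, syndrome_bits)
        if crc_signature(decoded) == target_crc:
            return decoded  # successful decoding detected via hash match
    return None  # no candidate matched the signature: decoding failure
```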

3. THE RD PERFORMANCE
In order to establish the state-of-the-art RD performance of an advanced WZ video codec, this section will adopt the IST-TDWZ codec recently proposed in [25], mainly characterized by:
- H.264/AVC key frame coding.
- Advanced side information creation process [12].
- Online correlation noise modeling [13].
- Optimal reconstruction process [14] (sketched in the equation after this list).
- Side information refinement (SIR) algorithm based on a learning approach, where the side information is successively improved as the decoding proceeds [25].
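As a reminder of what the optimal reconstruction step computes, the decoded coefficient can be written as the conditional expectation of the source given the decoded quantization bin and the side information; this is a standard formulation in the spirit of [14] (not the exact expression from that paper), with the correlation noise modeled as Laplacian.

```latex
% Optimal (MMSE) reconstruction of a DCT coefficient x, given that
% channel decoding identified the quantization bin [l, u) and the
% side information value is y:
\begin{equation}
\hat{x} = E\!\left[ X \mid X \in [l, u),\; Y = y \right]
        = \frac{\int_{l}^{u} x \, f_{X|Y}(x \mid y)\, dx}
               {\int_{l}^{u} f_{X|Y}(x \mid y)\, dx},
\end{equation}
% where f_{X|Y} follows from the Laplacian correlation noise model,
% i.e. X = y + N with N Laplacian distributed around zero.
```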

Fig. 3 IST-TDWZ video codec [25]

Fig. 4 RD performance for the Foreman (top) and Soccer (bottom) sequences, CIF, 30 Hz, GOP size 2 [25]


All the results presented here were obtained with the following test conditions:
- Test Sequences: Foreman, Soccer, Hall Monitor and Coastguard, since they represent different types of content; all tests use the full length of each sequence.
- Temporal and Spatial Resolution: QCIF at 15 Hz and CIF at 30 Hz.
- Rate-Distortion Points: Eight RD points are considered, corresponding to the eight 4×4 quantization matrices defined in [25]; the values within each 4×4 matrix indicate the number of quantization levels associated with the various DCT coefficient bands.
- Quality Balance: To improve the subjective impact on the user by reaching a smooth quality variation, key frames are H.264/AVC Intra encoded using a quantization parameter which allows reaching a quality similar to that of the WZ frames for each RD point. The quantization parameters for the key frames, as well as the quantization matrices for the WZ frames, are the same as used by the DISCOVER codec [10], to allow fair comparisons.
- Bitrate and PSNR: As usual for WZ video coding, only the luminance component of each frame is used to compute the bitrate and PSNR; both key frames and WZ frames are included in the RD results (a minimal PSNR sketch follows this list).
- Benchmarking Codecs: H.263+ Intra, H.264/AVC Intra and H.264/AVC No Motion, since they also have a rather low encoding complexity, and the DISCOVER WZ codec [10], which is another advanced Stanford based WZ video codec.
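For reference, luminance-only PSNR is computed as below; this is the standard definition for 8-bit video, not code from [25].

```python
import numpy as np

def psnr_luma(original: np.ndarray, decoded: np.ndarray) -> float:
    """PSNR (dB) between the luminance planes of two 8-bit frames."""
    mse = np.mean((original.astype(np.float64) - decoded.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(255.0 ** 2 / mse)

# Example: a CIF luminance plane and a slightly noisy copy
rng = np.random.default_rng(2)
y = rng.integers(0, 256, size=(288, 352))
noisy = np.clip(y + rng.integers(-2, 3, size=y.shape), 0, 255)
print(round(psnr_luma(y, noisy), 2))  # roughly 45 dB for this noise level
```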

Fig. 5 RD performance for the Coastguard (top) and Hall Monitor (bottom) sequences, QCIF, 15 Hz, GOP size 2 [25]
Comparing the RD performance of the IST-TDWZ video codec (with SIR) with the selected benchmark standard based video codecs, it can be concluded that the SIR based IST-TDWZ codec performs better than H.264/AVC Intra for the sequences with low motion content, notably for the lower GOP sizes. For QCIF, 15 Hz, it is only better than the H.264/AVC No Motion codec for the Coastguard sequence, where the side information is easier to create due to the well behaved motion; however, for CIF, 30 Hz, it also outperforms H.264/AVC No Motion for Foreman. As expected, the Soccer sequence is the most critical, even with the proposed SIR algorithm, with an RD performance which only roughly manages to stay above H.263+ Intra for GOP size 2, QCIF.

4. THE TRENDS
The RD performance results presented above allow concluding that, for quieter content and lower GOP sizes, state-of-the-art WZ video codecs can already overcome the RD performance of the best standard based low encoding complexity video codecs, while using a lower encoding complexity. For more complex content and larger GOP sizes, this is still not happening, showing that DVC research still has some way to go. However, there is also a growing feeling that low encoding complexity alone is a rather moving target for DVC, since Moore's law has been teaching that complexity becomes more affordable every single day. In this context, multiview DVC may offer more specific benefits, as mentioned in Section 1. In any case, further WZ video coding research should address issues such as side information creation, iterative decoding, correlation noise modeling, novel channel codes, rate control, and WZ block selective coding.

More recently, the increasing need for higher compression efficiency than that provided by the best predictive video coding standard, H.264/AVC, notably for very high spatial resolutions, has been raising interest in combining the predictive and distributed video coding approaches. While predictive video coding relies on complex encoders and simpler decoders, DVC takes the opposite approach, raising the question of whether it is possible, for some application scenarios, to reach higher compression factors by adopting a complex encoding plus complex decoding approach, where a DVC inspired decoder helps a predictive coding inspired encoder. The way this should happen is still not clear, but there is no strong reason for continuously increasing mainly the encoder complexity and not the decoder complexity, notably for the applications which are prepared to pay this price to reach higher compression factors. Moreover, it is presently more and more accepted that the Distributed Source Coding (DSC) principles, notably the Slepian-Wolf and Wyner-Ziv theorems, are inspiring a varied set of tools which may help to solve different problems, not only coding but also, e.g., authentication and secure biometrics [26,27]. While it is difficult to state, at this stage, whether any video coding product will ever use DSC principles, and in which way, it is most interesting to study and research towards this possibility.

5. REFERENCES
1. D. Slepian and J. Wolf, "Noiseless Coding of Correlated Information Sources", IEEE Trans. on Information Theory, vol. 19, no. 4, pp. 471-480, July 1973.
2. A. Wyner and J. Ziv, "The Rate-Distortion Function for Source Coding with Side Information at the Decoder", IEEE Trans. on Information Theory, vol. 22, no. 1, pp. 1-10, January 1976.
3. A. Aaron, R. Zhang and B. Girod, "Wyner-Ziv Coding of Motion Video", Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, November 2002.
4. B. Girod, A. Aaron, S. Rane and D. Rebollo-Monedero, "Distributed Video Coding", Proceedings of the IEEE, vol. 93, no. 1, pp. 71-83, January 2005.
5. R. Puri and K. Ramchandran, "PRISM: A New Robust Video Coding Architecture Based on Distributed Compression Principles", 40th Allerton Conference on Communication, Control and Computing, Allerton, USA, October 2002.
6. R. Puri, A. Majumdar and K. Ramchandran, "PRISM: A Video Coding Paradigm with Motion Estimation at the Decoder", IEEE Trans. on Image Processing, vol. 16, no. 10, pp. 2436-2448, October 2007.
7. F. Pereira, C. Brites, J. Ascenso and M. Tagliasacchi, "Wyner-Ziv Video Coding: a Review of the Early Architectures and Relevant Developments", Int. Conf. on Multimedia & Expo, Hannover, Germany, June 2008.
8. C. Guillemot, F. Pereira, L. Torres, T. Ebrahimi, R. Leonardi and J. Ostermann, "Distributed Monoview and Multiview Video Coding", IEEE Signal Processing Magazine, vol. 24, no. 5, pp. 67-76, September 2007.
9. A. Aaron, S. Rane, E. Setton and B. Girod, "Transform-domain Wyner-Ziv Codec for Video", Visual Communications and Image Processing, San Jose, CA, USA, January 2004.
10. X. Artigas, J. Ascenso, M. Dalai, S. Klomp, D. Kubasov and M. Ouaret, "The DISCOVER Codec: Architecture, Techniques and Evaluation", Picture Coding Symposium, Lisbon, Portugal, November 2007.
11. DISCOVER Page, http://www.img.lx.it.pt/~discover/home.html
12. J. Ascenso, C. Brites and F. Pereira, "Content Adaptive Wyner-Ziv Video Coding Driven by Motion Activity", Int. Conf. on Image Processing, Atlanta, GA, USA, October 2006.
13. C. Brites and F. Pereira, "Correlation Noise Modeling for Efficient Pixel and Transform Domain Wyner-Ziv Video Coding", IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 9, pp. 1177-1190, September 2008.
14. D. Kubasov, J. Nayak and C. Guillemot, "Optimal Reconstruction in Wyner-Ziv Video Coding with Multiple Side Information", Multimedia Signal Processing Workshop, Chania, Crete, Greece, October 2007.
15. A. Trapanese, M. Tagliasacchi, S. Tubaro, J. Ascenso, C. Brites and F. Pereira, "Intra Mode Decision Based on Spatio-Temporal Cues in Pixel Domain Wyner-Ziv Video Coding", Int. Conf. on Acoustics, Speech and Signal Processing, Toulouse, France, May 2006.
16. A. Aaron, S. Rane and B. Girod, "Wyner-Ziv Video Coding with Hash-Based Motion Compensation at the Receiver", Int. Conf. on Image Processing, Singapore, October 2004.
17. J. Ascenso and F. Pereira, "Adaptive Hash-Based Side Information Exploitation for Efficient Wyner-Ziv Video Coding", Int. Conf. on Image Processing, San Antonio, TX, USA, September 2007.
18. C. Brites and F. Pereira, "Encoder Rate Control for Transform Domain Wyner-Ziv Video Coding", Int. Conf. on Image Processing, San Antonio, TX, USA, September 2007.
19. H. Wang, N. Cheung and A. Ortega, "A Framework for Adaptive Scalable Video Coding Using Wyner-Ziv Techniques", EURASIP Journal on Applied Signal Processing, vol. 2006, Article ID 60971.
20. Q. Xu and Z. Xiong, "Layered Wyner-Ziv Video Coding", IEEE Trans. on Image Processing, vol. 15, no. 12, pp. 3791-3803, December 2006.
21. M. Ouaret, F. Dufaux and T. Ebrahimi, "Codec-Independent Scalable Distributed Video Coding", Int. Conf. on Image Processing, San Antonio, TX, USA, September 2007.
22. A. Sehgal, A. Jagmohan and N. Ahuja, "Wyner-Ziv Coding of Video: an Error-Resilient Compression Framework", IEEE Transactions on Multimedia, vol. 6, no. 2, pp. 249-258, April 2004.
23. S. Rane, P. Baccichet and B. Girod, "Modeling and Optimization of a Systematic Lossy Error Protection System based on H.264/AVC Redundant Slices", Picture Coding Symposium, Beijing, China, April 2006.
24. J. Pedro et al., "Studying Error Resilience Performance for a Feedback Channel based Transform Domain Wyner-Ziv Video Codec", Picture Coding Symposium, Lisbon, Portugal, November 2007.
25. R. Martins, C. Brites, J. Ascenso and F. Pereira, "Refining Side Information for Improved Transform Domain Wyner-Ziv Video Coding", IEEE Transactions on Circuits and Systems for Video Technology, to be published.
26. Y.-C. Lin, D. Varodayan and B. Girod, "Image Authentication and Tampering Localization using Distributed Source Coding", Multimedia Signal Processing Workshop, Chania, Crete, Greece, October 2007.
27. S. C. Draper, A. Khisti, E. Martinian, A. Vetro and J. S. Yedidia, "Using Distributed Source Coding to Secure Fingerprint Biometrics", Int. Conf. on Acoustics, Speech and Signal Processing, Honolulu, HI, USA, April 2007.

