
The On2 VP6 codec: how it works

Paul Wilkins, On2 Technologies 10/29/2008 12:00 PM EDT

This article assumes a basic understanding of video compression algorithms. For an introduction to video coders, see How video compression works.

This article is an introduction to On2's VP6 codec, one of the most widely used codecs on the Internet by virtue of its deployment in Adobe Flash Player. Over nine hundred User Generated Content (UGC) sites (including eight of comScore's Top Ten) use On2 software to encode hundreds of thousands of new Flash videos every day.

The purpose of a video compressor is to take raw video and compress it into a more manageable form for transmission or storage. A matching decompressor is then used to convert the video back into a form that can be viewed. Most modern codecs, including VP6, are "lossy" algorithms, meaning that the decoded video does not exactly match the raw source. Some information is selectively sacrificed in order to achieve much higher compression ratios. The art of the codec designer is to minimize this loss whilst maximizing the compression.

At first glance, VP6 has a lot in common with other leading codecs. It uses motion compensation to exploit temporal redundancy, a DCT transform to exploit spatial redundancy, a loop filter to deal with block transform artifacts, and entropy encoding to exploit statistical correlation. However, the "devil is in the details," so to speak, and in this paper I will discuss a few of the features that set VP6 apart.

When is a loop filter not a loop filter?

One of the problems with algorithms that use frequency-based block transforms is that the reconstructed video sometimes contains visually disturbing discontinuities along block boundaries. These "blocking artifacts" can be suppressed by means of post-processing filters. However, this approach does not address the fact that the artifacts reduce the value of the current decompressed frame as a predictor for subsequent frames.
An alternative or complementary approach is to apply a filter within the reconstruction loop of both the encoder and decoder. Such "loop filters" smooth block discontinuities in the reconstructed frame buffers that will be used to predict subsequent frames. In most cases this technique works well, but in some situations it can cause problems. Firstly, loop filtering a whole frame consumes a lot of CPU cycles. Secondly, when there is no significant motion in a region of the image, repeated application of a filter over several frames can lead to problems such as blurring.

VP6 takes an unusual approach to loop filtering. In fact, some would say that it is not a loop filter at all but rather a prediction filter. Instead of filtering the whole reconstructed frame, VP6 waits until a motion vector is coded that crosses a block boundary. At this point it copies the relevant block of image data and filters any block edges that pass through it, to create a filtered prediction block (see Figure 1 below).

Figure 1. VP6 prediction loop filter.

Because the reconstruction buffer itself is never filtered, there is no danger of cumulative artifacts such as blurring. Also, because the filter is only applied where there is significant motion, this approach reduces computational complexity for most frames. When we first implemented this approach in VP6, we saw an improvement of up to 0.25 dB over a traditional loop filter on some clips.
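The idea can be sketched in a few lines of NumPy. Everything below is an invented stand-in for illustration: the boundary test is simplified, only vertical edges are handled, and a 2-tap average replaces VP6's actual filter taps.

```python
import numpy as np

BLOCK = 8

def crosses_boundary(x, y, block=BLOCK):
    """True when the prediction block starting at (x, y) in the reference
    frame is not aligned to the block grid, i.e. a block edge of the
    reference passes through the copied data."""
    return (x % block) != 0 or (y % block) != 0

def smooth_vertical_edge(pred, col):
    """Illustrative 2-tap average across one vertical edge inside the
    prediction block (VP6's real filter taps differ)."""
    out = pred.astype(np.float32)
    avg = (out[:, col - 1] + out[:, col]) / 2.0
    out[:, col - 1] = avg
    out[:, col] = avg
    return np.round(out).astype(np.uint8)

def predict_block(ref, bx, by, mvx, mvy, block=BLOCK):
    """Copy the prediction block from the reference frame, and filter it
    only if the motion vector drags it across a block boundary. The
    reference buffer itself is never modified, so artifacts cannot
    accumulate across frames."""
    x, y = bx + mvx, by + mvy
    pred = ref[y:y + block, x:x + block].copy()
    if crosses_boundary(x, y, block):
        col = (-x) % block           # where a reference-grid edge falls
        if 0 < col < block:          # inside the copied block
            pred = smooth_vertical_edge(pred, col)
    return pred
```

The key property is visible in the last function: a zero (or block-aligned) motion vector yields a plain copy with no filtering, so static regions are never softened.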

Golden frames

In addition to the previous frame, some codecs retain additional frames that can be used as predictors. VP6 and other codecs in the VPx range support a special kind of second reference frame which we call a Golden Frame. This frame can be from the arbitrarily distant past (or at least as far back as the previous Golden Frame) and is usually encoded at higher than average quality.

Background / foreground segmentation

One use for Golden Frames is segmentation of the foreground and background in a video. For example, in most video conferencing applications the background is static. As the speaker moves around, parts of the background are temporarily obscured and then uncovered again. By creating and maintaining a high-quality image of the background in the Golden Frame buffer, it is possible to cheaply reinstate these regions as they are uncovered. This allows the quality of the background to be maintained even when there is rapid movement in the foreground. Furthermore, the cost savings can be used to improve the overall encoding quality.

The VP6 encoder also uses the Golden Frame to improve quality in certain types of scene. In slow-moving pans or zooms, for example, a periodic high-quality golden frame can improve image quality by restoring detail lost through repeated application of a loop filter or sub-pixel motion filters. This high-quality frame remains available as an alternate reference buffer until it is explicitly updated. As long as the motion is not too fast, it can help stabilize the image and improve quality for a significant number of frames after the update.

The VP6 encoder monitors various factors to determine the optimum frequency and quality boost for golden frame updates. These factors include the speed of motion, how well each frame predicts the next, and how frequently the golden frame is selected as the best-choice reference for encoding macroblocks. The results of this process can be quite dramatic for some clips, as Figure 2 shows.
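A toy illustration of the per-macroblock reference choice follows. It uses a zero-motion sum-of-absolute-differences comparison only; a real encoder also searches motion vectors and weighs the bit cost of signaling each choice.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two same-sized blocks."""
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def choose_reference(mb, prev_mb, golden_mb):
    """Pick whichever reference predicts this macroblock more cheaply.
    When a static background is briefly obscured by a moving foreground
    and then uncovered, the golden frame (which still holds the clean
    background) wins easily over the previous frame."""
    return "golden" if sad(mb, golden_mb) < sad(mb, prev_mb) else "prev"
```

This is the mechanism behind the background reinstatement described above: uncovered background blocks match the golden frame almost exactly, so they can be coded as cheap golden-frame references with little or no residual.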

Figure 2. Quality improvement with (left) vs. without (right) golden frames.

Context predictive entropy encoding

Some other advanced video codecs use an entropy coding technique known as "Context Adaptive Binary Arithmetic Coding" (CABAC). This technique, while quite efficient from a compression point of view, is expensive in terms of CPU cycles because the context needs to be recalculated each time a token is decoded.

VP6 employs a proprietary "Context Predictive Binary Arithmetic Coding" technique that relies upon sophisticated adaptive modeling at the frame level. This technique assumes that information from spatially correlated blocks is relevant when considering the likelihood of a particular outcome for the current block. For example, when considering the probability that a particular DCT coefficient is non-zero, information about the equivalent coefficient in neighboring blocks may be important. An important point here is that the encoder performs heuristic modeling at the frame level and passes relevant context information through to the decoder in the bitstream. This means that it is not necessary to compute contexts in the decoder on a token-by-token basis.

Bitstream partitions

VP6's bitstream is partitioned to provide flexibility in building a fast decoder. All of the prediction modes and motion vectors are stored in one data partition, and the residual error information is stored in another. The jobs of creating a predictor frame and decoding the residual error signal can thus be easily separated and run on different cores with minimal overhead. Alternatively, a VP6 decoder can decode and reconstruct macroblocks one at a time, pulling the mode and motion vector information from one substream and the residual error signal for that macroblock from the other. Any compromise between these two extremes is possible, allowing maximum flexibility when trying to optimize performance and minimize data and instruction cache misses.
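The two-partition layout can be sketched as follows. The byte formats here are invented stand-ins (the real partitions hold entropy-coded data); the point is that the two decode jobs share no state until reconstruction, so they parallelize trivially.

```python
from concurrent.futures import ThreadPoolExecutor

def decode_mode_partition(data):
    # Stand-in format: one byte per macroblock, read as a motion value.
    # (The real VP6 partition holds entropy-coded modes and vectors.)
    return [("inter", mv) for mv in data]

def decode_residual_partition(data):
    # Stand-in format: one biased byte per macroblock of residual error.
    return [b - 128 for b in data]

def decode_frame(mode_part, resid_part):
    """Because modes/motion vectors and residuals sit in separate
    partitions, the two decode jobs need no coordination until the
    per-macroblock reconstruction step, so they can run on different
    cores with minimal overhead."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        modes = pool.submit(decode_mode_partition, mode_part)
        resid = pool.submit(decode_residual_partition, resid_part)
    return list(zip(modes.result(), resid.result()))
```

The serial alternative described in the text simply interleaves one call to each decoder per macroblock instead of running them to completion up front.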

Dual mode arithmetic, sub-pixel motion estimation, and HD

Dual mode arithmetic and VLC encoding

In addition to its proprietary "Context Predictive Binary Arithmetic Coding" algorithm, VP6 also supports "Variable Length Coding" (VLC). As with the arithmetic coder, the VLC coder makes use of predictive contexts to improve compression efficiency.

The efficiency of the VLC method compared to the arithmetic coding method depends substantially on the data rate. At very high data rates, where most of the DCT coefficients in the residual error signal are non-zero, the difference between the VLC coder and the arithmetic coder is small (around 2%). However, at low data rates, the arithmetic coder may deliver a very substantial improvement in compression efficiency (>20%).

Because of the way the bitstream is partitioned between the prediction modes and motion vectors on the one hand and the residual error signal on the other, VP6 can support mixed VLC and arithmetic coding. Here one partition is encoded using arithmetic coding (typically the modes and motion vectors) while the other uses the VLC method. This allows the encoder to trade off decode complexity against quality in a very efficient way. Below we show how we used this approach in the recently announced VP6-S profile in Flash.

Adaptive sub-pixel motion estimation

One very unusual feature of VP6 is the way that it uses multiple 2- and 4-tap filters when creating the prediction block for sub-pixel motion vectors (for example, ½- and ¼-pixel vectors). Traditionally, codecs use a single filter for all blocks. In contrast, VP6 supports 16 different 4-tap filters, all with different characteristics, as well as a 2-tap bilinear filter. The encoder can either select a particular filter at the frame level, or signal that the choice should be made at the 8x8 block level according to a heuristic algorithm implemented in both the encoder and decoder.
This algorithm examines the characteristics of the reference frame at the selected location and attempts to choose an optimal filter for each block, one that will neither over-blur nor over-sharpen. The bitstream even allows the parameters of the filter selection algorithm to be tweaked, so a user can specify a preference for sharper video, or for less noisy and blocky video, at encode time. This feature is provided in recognition of the fact that attitudes to, and acceptance of, different types of compression artifact vary considerably from person to person and between cultures.

VP6-E and VP6-S encoder profiles

Adobe recently announced support for a new VP6 profile in Flash called VP6-S. (The new support is on the encoding side. On the decoding side, both VP6-S and the original profile, VP6-E, have been fully supported since the launch of VP6 video in Flash 8, so there are no backwards-compatibility problems.) The principal difference between the two profiles comes down to decisions made by the encoder in regard to sub-pixel motion estimation, loop filtering, and entropy encoding. As mentioned previously, VP6 allows considerable flexibility in all of these areas.

VP6-S targets HD content, which is characterized by high data rates. At these rates, the difference in compression efficiency between VP6's "Context Predictive Binary Arithmetic Coding" and its "Context Predictive VLC" coder is less pronounced. However, at high data rates the number of CPU cycles used in the entropy decoding stage rises substantially. To address this problem, VP6-S selectively uses the VLC method for the residual error partition (DCT coefficients) if the size of that partition rises above a pre-determined level. This compromise is made possible by VP6's use of two bitstream partitions, as described above.

In addition, VP6-S is restricted to using bilinear sub-pixel filters, whereas VP6-E automatically chooses an optimal 4-tap or 2-tap filter for each macroblock. This significantly reduces decode complexity for VP6-S. Although bilinear filtering can cause some loss of sharpness and detail, this is much less pronounced for HD video.
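The shape of the per-block filter decision described earlier can be sketched as below. The variance metric, the threshold, and the filter names are invented stand-ins; VP6's actual selection rule and its 16 filter sets are proprietary. The threshold plays the role of the tweakable bitstream parameter a user can adjust toward sharper or smoother output.

```python
import numpy as np

def pick_subpel_filter(ref_region, sharpness_threshold=100.0):
    """Hypothetical stand-in for the per-block sub-pixel filter choice:
    flat regions of the reference frame take the cheap 2-tap bilinear
    filter (nothing to over-sharpen), while detailed regions take a
    notional sharper 4-tap filter to preserve texture."""
    variance = float(np.var(ref_region.astype(np.float32)))
    return "bilinear-2tap" if variance < sharpness_threshold else "sharp-4tap"
```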
The loss of quality is more pronounced for smaller image formats, making VP6-E the better choice in such cases. A final important difference is that the loop filter is disabled in VP6-S, giving a further reduction in decode complexity. As with the use of bilinear filtering, the detrimental effect of this on quality is much less pronounced for HD video. However, it makes VP6-S much less suitable for smaller image formats such as QCIF and QVGA, where the lack of loop filtering may cause a very noticeable drop in perceived quality.

The tradeoffs described above make possible the smooth playback of HD video encoded with the VP6-S profile on much less powerful legacy computers, without too big a hit on quality. However, the original VP6-E profile should be used for smaller image formats and at low data rates, where it will deliver noticeably better quality.

Device ports and hardware implementations

In addition to implementations for Windows, Mac, and Unix-based PCs, VP6 has been ported to a wide variety of devices, chipsets, and processors from leading companies including ARM, TI (OMAP & DaVinci), Philips, Freescale, Marvell, C2, Videantis, Sony, Yamaha, and Archos. Furthermore, On2 is currently working on a highly optimized hardware implementation of VP6, due to start shipping later this year. This implementation will be used in SoCs for mobile handsets and other low-power applications, and will enable HD playback of VP6 video on mobile phones!

Conclusion

Thanks to the techniques described here, and others not discussed, VP6 has seen exceptional marketplace adoption. Outstanding quality, low decode complexity, and flexible licensing terms have combined to make it the codec of choice for delivery of video content over the Internet. VP6 has many other advantages not covered here and is set to remain a key technology in Internet and mobile video for some time to come.