
FAST FULL-SEARCH BLOCK MATCHING ALGORITHM MOTION ESTIMATION ALTERNATIVES IN FPGA

Joaquín Olivares, Ignacio Benavides
Department of Electrics and Electronics, University of Córdoba, Spain
email: olivares@uco.es


Javier Hormigo, Julio Villalba, Emilio Zapata
Department of Computer Architecture, University of Málaga, Spain

1-4244-0312-X/06/$20.00 © 2006 IEEE.

ABSTRACT

Block matching motion estimation accounts for a large part of the processing time in video encoding, so it must be accelerated to reach real-time video coding. The best motion vector is obtained by the full-search block matching algorithm, which usually has to be implemented in hardware. In recent years, several FPGA-based designs have been proposed, since these devices support a high number of processing elements working in parallel. This paper presents a survey of recent architectures that perform the full-search block matching algorithm on FPGAs, together with a comparison in terms of frames per second reached, hardware cost in CLB slices, and system frequency.

1. INTRODUCTION

The full-search block matching algorithm (FSBMA) is usually used in hardware implementations of motion estimation (ME) because of its simplicity, regularity, and optimum result. The metric most commonly used to determine the best match for FSBMA in hardware is the sum of absolute differences (SAD). The best match is the candidate block with the minimum SAD among all the candidate blocks; to find it, a search iteration is performed for each candidate block. The SAD adds up the absolute differences between corresponding elements of the candidate and reference blocks,

\[ \mathrm{SAD} = \sum_{i=1}^{N} \sum_{j=1}^{N} \left| c_{i,j} - r_{i,j} \right|, \qquad (1) \]

where $r_{i,j}$ are the elements of the reference block and $c_{i,j}$ the elements of the candidate block. Field programmable logic devices support a high number of processor elements (PE) in parallel mode. This property can be used to process, at the same time, all the SAD operations of an MPEG macroblock in a search area. In this way, real-time (RT) ME for video encoding can be reached.

The high performance of current FPGA technology makes it possible to implement new designs that solve the ME problem; this work presents a survey of recent ME designs.

2. RECENT ARCHITECTURES IN FPGA

In [1] two ways to parallelize the ME process are described: processing as many pels in parallel as the macroblock has, called type-1; and processing as many candidate locations as are available for the motion vector (MV), called type-2. Most authors implement different variants of type-1. Two main philosophies are used to implement a type-1 variant: shifting the pels through a systolic array of processing elements (PE), or running the PEs in parallel without shifting the pels. In this work the efficiency and the structure of several recent works that parallelize ME are analysed.

1. Type-1 architectures shifting the pels:

A. Ryszko et al. [8] implement the four architectures presented in [3]: AB1, AS1, AB2 and AS2.

AB1: the data already fetched from memory can be saved in delay lines for later reuse, which limits the high local memory bandwidth requirements of this architecture. N data are fetched on every clock cycle from both the search area and the reference block, where N is the block side.

AB2: based on preloading the intermediate sum registers of the absolute difference (AD) elements. The input data flow allows sequential computation of consecutive search area lines. AB2 is based on AB1, replicating the PE structure to reuse data, and uses a high bit rate of N data from the search area; it has a lower hardware cost than AS1 while increasing the frame rate. High complexity and bandwidth are the disadvantages of this architecture.

AS1: this systolic array requires only a sequential data input. This architecture requires the same number of

clock cycles as AB1 to obtain an MV. By replicating the AB1 structure, the high bit rate can be reduced to 2; data are shifted through internal registers to be reused. The hardware has to be multiplied by 10.

AS2: the 2D extension of the AB1 architecture, where N×N PEs are used. This architecture requires a large increase in hardware.

M. Mohammadzadeh et al. [5] also implement the AS1 architecture. An 8×8 solution is proposed; how to extend it to 16×16 blocks is not explained.

2. Type-1 architectures processing pels in parallel mode:

S. Wong et al. [9] implement a 16×1 SAD unit, called SAD16, which is equivalent to one macroblock row in MPEG. This design is inspired by the adder-tree model presented in [2]. The authors state briefly how to extend the design to compute a 16×16 SAD by reusing the original SAD16 unit to compute the remaining rows.

N. Roma et al. [7] present an innovative processing scheme based on a cylindrical structure and on the zig-zag processing sequence proposed in [1]. The cylinder is built from active and passive PEs: the active PEs process data while new data are being read into the passive PEs. In this way a regular data flow and a simple reuse of overlapped pels are achieved, although the hardware cost increases considerably. N active PEs process a new block line on every clock cycle, an adder tree computes the SAD value, and a final comparator evaluates whether it is appropriate to be the new MV.

J. Olivares et al. [6] present a novel design based on online arithmetic (OLA). OLA works in bit-serial mode, that is, each bit is processed in successive clock cycles, operating with the most significant digit (MSD) first. This model facilitates the absolute difference and comparison operations; moreover, bit-serial processing makes it simpler to process N² pels at once. In this architecture the absolute difference also performs the arithmetic representation conversion at no computational cost, an online adder tree allows one macroblock to be processed at once, and a final online comparator allows an early stop of the current SAD computation. Since this architecture is pipelined at bit level, the number of clock cycles needed to obtain an MV follows its own expression, shown in Table 2.

3. Type-2 architectures:

H. Loukil et al. [4] propose to process the SAD of 17×1 candidate locations at a time; N² clock cycles are required to obtain each one.
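As a software reference for what all of the architectures above compute in hardware, the following minimal Python sketch evaluates the SAD of Eq. (1) for every candidate block and keeps the minimum. It is an illustration only and does not come from any of the surveyed designs: the function names, the list-of-lists frame layout, and the assumption that the search area is an (N + 2p) × (N + 2p) window with displacements from −p to +p are choices made for this example.

```python
def sad(candidate, reference):
    """Sum of absolute differences between two equally sized blocks (Eq. 1)."""
    return sum(
        abs(c - r)
        for c_row, r_row in zip(candidate, reference)
        for c, r in zip(c_row, r_row)
    )


def block_at(area, top, left, n):
    """Extract the n x n block whose top-left corner is (top, left)."""
    return [row[left:left + n] for row in area[top:top + n]]


def full_search(reference, search_area, n, p):
    """Exhaustive search over all (2p + 1)^2 displacements.

    Assumption for this sketch: search_area is (n + 2p) x (n + 2p) and the
    zero displacement corresponds to the block at offset (p, p).
    Returns (minimum SAD, (dy, dx)); ties keep the first candidate found.
    """
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            candidate = block_at(search_area, p + dy, p + dx, n)
            cost = sad(candidate, reference)
            if best is None or cost < best[0]:
                best = (cost, (dy, dx))
    return best


if __name__ == "__main__":
    n, p = 4, 2
    size = n + 2 * p
    # Every pel value is unique, so exactly one candidate matches perfectly.
    area = [[y * size + x for x in range(size)] for y in range(size)]
    reference = block_at(area, p + 1, p - 2, n)  # true displacement (1, -2)
    print(full_search(reference, area, n, p))
```

Each hardware design in this section parallelizes a different part of these loops: type-1 architectures parallelize the work inside `sad`, while type-2 architectures evaluate several displacements at the same time.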

2.1. The processor elements structure

Since the PE is one of the main indicators describing each architecture, the structure of each one must be explained in order to understand the complexity of the element:

Wong PE: based on one carry generator, one inverter, and two XOR gates. Its function is to determine the smaller of the two pels, invert the smaller operand, and pass both operands to an adder tree.

Roma, Ryszko and Mohammadzadeh PE (called AD): this unit calculates the absolute difference of two pels and adds the result to a previously calculated partial sum of absolute differences. The partial sum is passed from one processor element to the next, and finally the complete SAD is calculated.

Loukil PE (called SAD unit): this unit accumulates the sum of all absolute differences until an external signal resets the accumulator of the SAD summation.

Olivares PE: this unit computes the absolute difference at bit level, by multiplexing each bit from both pels.

To estimate the number of clock cycles required by each architecture, two parameters, N and p, are used, where N is the block side and p is the maximum displacement allowed for the block within the search area. Typical MPEG values are N = 16, p = 8 and N = 8, p = 4.

3. COMPARISON

Section 2 presented several motion estimation architectures. Table 1 shows a comparison in terms of frames per second reached, hardware cost in CLB slices, and system frequency. The device family used for each implementation is also given. Because the device families differ, a qualitative comparison is necessary, and some remarks must be made.

Table 2 shows the structural parameters used to describe and compare the different architectures. These are: the number of clock cycles required to obtain the best MV; the number of 8-bit inputs, shown as I ports; the number of PEs for AD operations, shown in the field AD PEs; the number of adders (Adders); and the number of comparators (Comp.), shown in the last column.

Notice that all authors use 8-bit adders except [6], which uses 1-bit adders. Another difference between [6] and the other authors lies in the PE architecture: most authors use a PE based on carry-save adders, whereas in [6] only three 2-bit registers are used to implement the AD. This difference must be made explicit because, although the number of components can be used to compare architectures, all the OLA components in [6] have a lower hardware cost than their counterparts in conventional 8-bit arithmetic. The hardware cost (for an implementation on the ISE 6.2i platform and Spartan3 technology) of an 8-bit

adder is 8 LUTs or 4 slices, and the AD operation is performed with 30 LUTs or 17 slices; however, the OLA AD PE requires only 3 LUTs or 2 slices. This means that the cost of an OLA AD PE is about 10 times lower than that of a conventional AD PE.

Figure 1 shows graphically the relation between the frame rate and the cost in CLB slices, multiplied by 100 for normalization. Greater values can be interpreted as more efficient architectures. However, since this value is highly influenced by the technology, this parameter is not enough to evaluate the architectures in absolute terms. More information is necessary, and a set of structural parameters is shown in Table 2. In spite of this, [6] and [9] are the most competitive architectures for 16×16, but [9] does not reach RT, and increasing the hardware to 16 SAD16 units would multiply the hardware amount by 30. For the 8×8 model, [5] and [6] are the most efficient. Since the search area for a 16×16 block size is four times bigger than for 8×8, a 16×16 version of [5] would be interesting in order to know how its hardware cost increases. In [6] the hardware increases linearly: it is multiplied by 4 when the search area is multiplied by 4.

Using a different arithmetic representation can enable other features. Using OLA, [6] obtains the following advantages: due to its digit-serial nature it reduces the number of signal lines connecting modules, the MSD-first computation allows subsequent calculations to start at a much earlier stage, and it eliminates carry propagation chains, since it uses a redundant number representation system. Non-conventional arithmetic, SD OLA in this case, makes it possible to work with non-conventional architectures; in the design proposed in [6], this feature is used to operate at bit level. Operating at bit level may make no sense for common processors with registers of 8, 16, 32 or more bits, but it can be powerful in field programmable logic devices, where nets and registers are configurable.

Notice that the low frame rate reached in [8] was obtained with obsolete FPGA technology; a higher frame rate is to be expected with recent FPGA technology. In fact, the AS1 architecture is implemented in [5], reaching RT for 4CIF. The architectures [7], [6] and [5] offer RT ME for 4CIF. [7] obtains about 60% more frame rate than [6], but requires a hardware increase of about 1280%. The architecture presented in [7] reaches a high frame rate by drastically reducing the number of clock cycles: the number of clock cycles needed to obtain an MV is about 13% of that of [6], 8% of that of [9], or 25% of that of [4]. But this reduction also involves a large increase in hardware components, where about 40% of the PEs are passive. This increase in PEs involves more complex connections and a decrease in frequency.

2D architectures (those that have N² PEs) are faster and more complex than 1D ones (with N PEs), and the relation between the number of PEs and the clock cycle rate is non-linear. This is because 2D models require a high number of input ports and a complex memory interface, which must be accounted for in a global system in order to compare strictly with 1D models. Notice that the memory interface can contribute over 50-70% of the global system for 2D models. Therefore, 2D models cannot be compared appropriately with 1D models without the memory interface.

[Figure 1: bar charts. Panel a), 16×16 models: Wong, Roma, Loukil, Olivares. Panel b), 8×8 models: Ryszko AS1, Ryszko AB1, Ryszko AB2, Ryszko AS2, Mohammadzadeh, Olivares.]

Fig. 1. Relation between frame rate and hardware cost, (frame rate / hw cost) × 100, for 16×16 models in a) and 8×8 models in b).
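The efficiency figure plotted in Fig. 1 can be recomputed from the 4CIF frame rates and CLB slice counts of Table 1. The sketch below is only an illustration: the tuple layout and function names are invented for this example, and the numbers are the Table 1 values as read from the extracted table, so they inherit any transcription error in it.

```python
# (design, block side N, 4CIF frames per second, CLB slices), read from Table 1.
TABLE1 = [
    ("[9]", 16, 10.25, 955),
    ("[7]", 16, 47.61, 29430),
    ("[8] AB1", 8, 1.25, 184),
    ("[8] AS1", 8, 1.20, 1214),
    ("[8] AB2", 8, 12.50, 948),
    ("[8] AS2", 8, 18.25, 3732),
    ("[4]", 16, 11.70, 1654),
    ("[5]", 8, 31.25, 300),
    ("[6]", 16, 30.95, 2296),
    ("[6]", 8, 32.41, 657),
]


def efficiency(fps, slices):
    """Fig. 1 metric: (frame rate / hardware cost) x 100."""
    return fps / slices * 100


def ranking(block_side):
    """Designs for one block size, most efficient first."""
    rows = [
        (name, efficiency(fps, slices))
        for name, side, fps, slices in TABLE1
        if side == block_side
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for side in (16, 8):
        print(f"{side}x{side} models")
        for name, value in ranking(side):
            print(f"  {name:8s} {value:6.2f}")
```

With these inputs, [6] and [9] rank first for 16×16 and [5] and [6] rank first for 8×8, in line with the discussion above. As noted there, the metric mixes different FPGA families, so it should be read as a rough indicator rather than an absolute ranking.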

4. CONCLUSION

Recent FPGA implementations of FSBMA for ME in video coding have been analysed. Thanks to the high degree of parallelism that FPGA devices permit, FSBMA is accelerated to reach RT processing. Most authors prefer parallel computation of the pels corresponding to one candidate block. The performance and the hardware cost of the presented architectures are compared on the basis of the number of frames processed per second and the number of CLB slices, respectively. The results show that FPGAs are suitable for RT processing of 4CIF video sequences. The frame rate obtained for HDTV 720p is also given, and is intended as a reference point for future work.

Table 1. FPGA architectures performance comparison.

Design    Block size  4CIF (fps)  HDTV 720p (fps)  CLB slices  Freq. (MHz)  Device
[9]       16×16       10.25       4.51             955         197.0        Altera Flex20KE
[7]       16×16       47.61       20.98            29430       76.1         Xilinx XCV3200E
[8] AB1   8×8         1.25        0.55             184         25.0         Xilinx XC40250
[8] AS1   8×8         1.20        0.53             1214        24.0         Xilinx XC40250
[8] AB2   8×8         12.50       5.5              948         30.0         Xilinx XC40250
[8] AS2   8×8         18.25       8.03             3732        22.0         Xilinx XC40250
[4]       16×16       11.70       5.15             1654        103.8        Altera Stratix
[5]       8×8         31.25       13.75            300         191.0        Xilinx VirtexII
[6]       16×16       30.95       13.62            2296        366.8        Xilinx Spartan3
[6]       8×8         32.41       14.26            657         401.9        Xilinx Spartan3

Table 2. FPGA architectures functional comparison.

Design    Clock cycles                            I ports   AD PEs                            Adders                Comp.
[9]       (27 + N - 1) × (2p + 1)²                2N        N                                 243 8-bit             1
[7]       2 × (2p)² + N × (2p + N - 1)            3         N active, 2×N(2p - 1) passive     N - 1 8-bit           1
[8] AB1   N × (2p + 1) × (2p + N)                 2N        N                                 1 8-bit               1
[8] AS1   N × (2p + 1) × (2p + N)                 2         2p + 1                            2p + 1 8-bit          2p + 2
[8] AB2   (2p + 1) × (2p + N)                     N         N²                                N 8-bit               1
[8] AS2   N × (2p + N)                            2N + 2p   (2p + 1) × N                      2p + 1 8-bit          2p + 2
[4]       N² (2p + 1) + (2p + 1) + 1              3N        2p + 1                            2p + 1 8-bit          N + 1
[5]       N × (2p + 1) × (2p + N)                 2         2p + 1                            2p + 1 8-bit          2p + 2
[6]       (2 log2(N²) + 9) × (2p + 1)² + N²       2         N²                                2 × (N² - 1) 1-bit    1
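The clock-cycle expressions of Table 2 can be tied back to the frame rates of Table 1 with a short calculation. The sketch below evaluates the expressions for [9], [7] and [6] and converts them to frames per second. It rests on assumptions that are not stated in the paper: that 4CIF is 704×576 and HDTV 720p is 1280×720, that one MV is computed per N×N block, and that no cycles are spent outside the MV search. The function names are invented for this example.

```python
import math


def cycles_wong(n, p):
    """Clock cycles per MV for [9], as listed in Table 2."""
    return (27 + n - 1) * (2 * p + 1) ** 2


def cycles_roma(n, p):
    """Clock cycles per MV for [7], as listed in Table 2."""
    return 2 * (2 * p) ** 2 + n * (2 * p + n - 1)


def cycles_olivares(n, p):
    """Clock cycles per MV for [6], as listed in Table 2."""
    return (2 * round(math.log2(n * n)) + 9) * (2 * p + 1) ** 2 + n * n


def blocks_per_frame(width, height, n):
    """Number of N x N blocks in a frame (assumes one MV per block)."""
    return (width // n) * (height // n)


def frames_per_second(freq_hz, cycles_per_mv, width, height, n):
    """Frame rate if every clock cycle is spent on MV search (assumption)."""
    return freq_hz / (cycles_per_mv * blocks_per_frame(width, height, n))


if __name__ == "__main__":
    n, p = 16, 8
    designs = [
        ("[9]", cycles_wong(n, p), 197.0e6),
        ("[7]", cycles_roma(n, p), 76.1e6),
        ("[6]", cycles_olivares(n, p), 366.8e6),
    ]
    for name, cycles, freq in designs:
        fps_4cif = frames_per_second(freq, cycles, 704, 576, n)
        fps_720p = frames_per_second(freq, cycles, 1280, 720, n)
        print(f"{name}: {cycles} cycles/MV, {fps_4cif:.2f} fps 4CIF, {fps_720p:.2f} fps 720p")
```

For N = 16 and p = 8 the expressions give 12138, 1008 and 7481 cycles per MV. Under the stated assumptions, the resulting 4CIF frame rates come out close to the Table 1 values for the 16×16 rows of [9], [7] and [6], and the cycle counts agree with the ratios of about 8% and 13% quoted for [7] in Section 3.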

5. REFERENCES

[1] L. De Vos, M. Stegherr, "Parameterizable VLSI architectures for the full-search block-matching algorithm," IEEE Trans. Circuits Syst., vol. 36, no. 10, 1989.

[2] Y.-S. Jehng, L.-G. Chen, T.-D. Chiueh, "An efficient and simple VLSI tree architecture for motion estimation algorithms," IEEE Trans. Signal Processing, vol. 41, no. 2, pp. 889-900, Feb. 1993.

[3] T. Komarek, P. Pirsch, "Array architectures for block matching algorithms," IEEE Trans. Circuits Syst., vol. 36, no. 10, pp. 1301-1308, Oct. 1989.

[4] H. Loukil, F. Ghozzi, A. Samet, et al., "Hardware implementation of block matching algorithm with FPGA technology," 6th International Conference On Microelectronics, Proceedings, pp. 542-546, 2004.

[5] M. Mohammadzadeh, M. Eshghi, M. M. Azadfar, "Parameterizable implementation of full search block matching algorithm using FPGA for real-time applications," ICCDCS 2004: Fifth International Caracas Conference on Devices, Circuits and Systems, pp. 200-203, 2004.

[6] J. Olivares, J. Hormigo, J. Villalba, I. Benavides, E. L. Zapata, "SAD computation based on online arithmetic for motion estimation," Microprocessors and Microsystems, Accepted with ref IJB/2005/4, 2005.

[7] N. Roma, T. Dias, L. Sousa, "Customisable core-based architectures for real-time motion estimation on FPGAs," LNCS 2778, pp. 745-754, 2003.

[8] A. Ryszko, K. Wiatr, "An assessment of FPGA suitability for implementation of real-time motion estimation," EUROMICRO Symposium On Digital Systems Design, Proceedings, pp. 364-367, 2001.

[9] S. Wong, S. Vassiliadis, S. Cotofana, "A Sum of Absolute Differences Implementation in FPGA Hardware," 28th Euromicro Conference (EUROMICRO'02), pp. 183-188, Dortmund, Germany, 2002.
