A Vlsi Design For Full Search Block Matching Motion Estimation

A VLSI DESIGN FOR FULL SEARCH BLOCK MATCHING MOTION ESTIMATION
Seung Hyun Nam, Jong Seob Baek, Tae Young Lee, and Moon Key Lee
VLSI & CAD Lab., Dept. of Electronic Eng., Yonsei University, Seoul, Korea
ABSTRACT
In this paper, we describe a flexible VLSI architecture that achieves real-time processing of the full-search block matching algorithm (FBMA) for video applications. The proposed architecture uses a parallel algorithm based on the idea of partial result accumulation. The partial sum results of the candidate block distortions are individually accumulated into a cyclic storage buffer for each distortion measure. A parameterizable Motion Estimation Processor (MEP) is designed for both different reference block sizes and various search ranges. Moreover, for larger search ranges and high throughput rates, multiple MEPs can be cascaded. The MEP has serial data input but performs parallel processing. It can be easily and cost-effectively implemented in VLSI owing to its simple one-dimensional semi-systolic array architecture and control.

0-7803-2020-4/94 $4.00 © IEEE

I. INTRODUCTION
Motion estimation plays a key role in image data compression such as motion-compensated codecs [1][2]. The computationally intensive nature of FBMA and the demand for real-time processing make a VLSI implementation of FBMA necessary. Komarek and Pirsch [3] mapped FBMA to systolic arrays, and the authors of [3]-[5] presented two-dimensional systolic arrays. In [5], we removed the invalid cycles for useless output that occurred in [3][4] by efficient data input of the search window. In [6], Yang et al. developed a one-dimensional semi-systolic architecture compatible with the videoconference application but constrained by the computational power of the Processing Elements (PEs) when it confronts an increased tracking range: more PEs are necessary for the increased range. The PE array also has to be reconfigured to process different reference block sizes, with control logic overhead. Thus, we present a one-dimensional array architecture adaptable to dimensional changes of both the search range and the reference block size without additional processing elements for their increments.

Fig.1 Block Matching Process and Data Indices: a) Reference Block, b) Search window

II. DESIGN OF SYSTEM ARCHITECTURE
The frame to be compressed is segmented into equal-size blocks called reference blocks. The local area to be searched in the previous frame is called the search window. FBMA is a process that finds the best matched candidate block among all the candidate blocks included in the surrounding rectangular region (search window) in the previous frame. A typical criterion for finding the minimum distortion block, which corresponds to the best matched candidate block, is the Mean Absolute Difference (MAD). MAD is given by

$$\mathrm{MAD}(u,v) = \frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\left|a(i,j)-b(i+u,j+v)\right| \tag{1}$$

where N is the horizontal size and M the vertical size of the reference block, a(i,j) is the pixel data of the reference block, and b(i+u,j+v) that of the candidate block with displacement (u,v), which represents a vertical motion vector u and a horizontal one v in the search window as shown in Fig.1. To map FBMA onto an array architecture, let us rewrite expression (1) as follows.

$$\mathrm{MAD}(u,v) = \frac{1}{MN}\sum_{i=1}^{M} PS_i(u,v) \tag{2}$$

$$PS_i(u,v) = \sum_{j=1}^{N} AD_{i,j}(u,v) \tag{3}$$

$$AD_{i,j}(u,v) = \left|a(i,j)-b(i+u,j+v)\right| \tag{4}$$

where $-p \le u, v \le p-1$.
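As a software reference for the MAD criterion and its row-wise decomposition in expressions (1)-(3), the following is a minimal sketch, not the hardware data flow. It assumes the search window is stored as an (M+2p-1) x (N+2p-1) list of lists whose entry [i+u+p][j+v+p] plays the role of b(i+u,j+v) with zero-based i and j. The function names and the toy data are hypothetical and chosen only for illustration.

```python
def partial_sum(ref, win, i, u, v, p):
    """PS_i(u, v): sum of absolute differences along row i, expression (3)."""
    return sum(abs(ref[i][j] - win[i + u + p][j + v + p])
               for j in range(len(ref[0])))


def mad(ref, win, u, v, p):
    """MAD(u, v) as the normalized sum of the M row partial sums, expression (2)."""
    m, n = len(ref), len(ref[0])
    return sum(partial_sum(ref, win, i, u, v, p) for i in range(m)) / (m * n)


def full_search(ref, win, p):
    """Scan every displacement in [-p, p-1] x [-p, p-1]; return (MAD, u, v) of the minimum."""
    best = None
    for u in range(-p, p):          # vertical displacement
        for v in range(-p, p):      # horizontal displacement
            d = mad(ref, win, u, v, p)
            if best is None or d < best[0]:
                best = (d, u, v)
    return best


# Toy data (hypothetical): M = N = 4, p = 2, so the window is 7 x 7.
win = [[7 * r + 13 * c for c in range(7)] for r in range(7)]
ref = [row[0:4] for row in win[3:7]]   # block cut out at displacement (u, v) = (1, -2)
print(full_search(ref, win, p=2))      # (0.0, 1, -2)
```

The toy window is built so that every displacement yields a different distortion, which makes the single zero-distortion position easy to check by hand.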
AD_{i,j}(u,v) represents the absolute difference between a(i,j) and b(i+u,j+v). Its displacement arguments u and v range over [-p, p-1]. PS_i(u,v) represents the partial sum of the absolute differences at the ith row. Let us consider a time index t and rewrite expression (2).

$$\mathrm{MAD}(u,v) = \frac{1}{MN}\sum_{i=1}^{M} PS_i\bigl(u, v, t_r + 2p\,t_c\,(i-1)\bigr) \tag{5}$$

where 2p is the maximum allowable displacement in the horizontal direction, t_r is the reference time, and t_c is one clock cycle. Expression (5) means that PS_i(u,v) can be calculated at time t = t_r + 2p t_c (i-1). The basic objective of the algorithm decomposition with a time index is to derive the distortion measure by accumulating the partial sum results produced from each pixel value of the search area and the reference block. The partial sum result that contributes to the distortion measure of a candidate block is added immediately to the content of the cyclic storage buffer, which is designed to hold this distortion result, as the related pixel that produces this partial result is processed.

According to the above description, we have developed a generalized and flexible VLSI architecture with serial input but parallel processing, leading to an efficient adaptation to various reference block sizes and search window formats. The schematic diagram of the proposed architecture is shown in Fig.2. It is mainly composed of four pipelined blocks: input buffer, one-dimensional array of processing elements (PE array), parallel adder, and cyclic storage buffer. The main idea of the proposed architecture is to associate the PE array with the calculation of the partial sum results PS_i(u,v) at each clock cycle by the data sequences, and to accumulate the partial sum results individually into the cyclic storage buffer for calculating each distortion.

Fig.2 Internal Block Diagram of Motion Estimation Processor (MEP) (block labels: PARALLEL ADDER, ACCUMULATOR, Minimum Distortion Detector, MV(u,v))

To ease the comprehension of the proposed architecture, let us consider an example for a typical video application spec, M=N=16, p=8. Note that S and S' denote the sequences of the search window data from the different positions of the tracking area, and C that of the current reference block data, as shown in Fig.1. In the first 16 cycles, {a(1,j), b(1,j) | 1≤j≤16}, which have been fed into the P_j and R_j registers respectively, are loaded in parallel into each Q_j and S_j register and are then available to each PE_j. Thus at cycle t=17, {AD_{1,j}(-8,-8) | 1≤j≤16} can be calculated at once and are sent into the parallel adder to calculate PS_1(-8,-8), which is latched into the AQ_{-8} (=-p) accumulation register after being added to the initially cleared value of the AQ_7 (=p-1) register of the cyclic storage buffer. The contents of each AQ_v, -p ≤ v ≤ p-1, are shifted down simultaneously. At cycle t=18 (=N+2), {a(1,j), b(1,j+1) | 1≤j≤16} are available to each PE_j. Then {AD_{1,j}(-8,-7) | 1≤j≤16} can be calculated at once by the PE array, so PS_1(-8,-7) is completed by the parallel adder and stacked into the AQ_{-8} register. The previous content PS_1(-8,-8) of the AQ_{-8} register is shifted down into the AQ_{-7} (=-p+1) register.

In a similar way, at cycle t=33 (=N+2p+1), {a(2,j), b(2,j) | 1≤j≤16} are available to each PE_j. Thus PS_2(-8,-8) is completed and accumulated into AQ_{-8}, added to the content of the AQ_7 register, which holds PS_1(-8,-8). At t=257 (=N+2p(M-1)+1), PS_16(-8,-8) is completed and then stored into AQ_{-8} after being summed with the content of the AQ_7 register, which holds the accumulated value $\sum_{i=1}^{15(=M-1)} PS_i(-8,-8)$. In other words, the matching result for the first candidate block, i.e., MAD(-8,-8), is produced. At cycle t=258, PS_16(-8,-7) is completed, at the next cycle t=259, PS_16(-8,-6), and the operations repeat in the same way until PS_16(-8,7) is calculated at cycle t=272 (=N+2pM). During 257 (=N+2p(M-1)+1) ≤ t ≤ 272 (=N+2pM), the candidate block distortions from MAD(-8(=-p),-8(=-p)) to MAD(-8,7(=p-1)) at u=-8 are sequentially calculated, one per cycle: one MAD value is figured out in a single cycle, loaded into the AQ_{-8} register, and shifted down to AQ_{-7}, AQ_{-6}, AQ_{-5}, and so on.

During these clock times, the Minimum Distortion Detector (MDD) compares each distortion value and detects the minimum one and its position (u,v), i.e., the required horizontal motion vector v at the vertical displacement u=-8. Each consecutive output value MAD(u,v) from the accumulator shown in Fig.2 is compared to the current minimum distortion value. If the next challenger from the accumulator output is smaller than the current minimum value, it replaces the currently stored one, and the corresponding matching position from the counter is recorded. The data flows described above continue for u=-7 (=-p+1), -6, -5, ..., 7 (=p-1).

At cycle time t=4113, all possible 256 candidate blocks within the search window have been compared. The MDD now contains the best matched candidate block position (u,v). No additional cycles are consumed to refresh the consecutive pipeline processing for the next reference block, because the refreshment can be executed while the MDD is detecting the best matched candidate block.

III. FLEXIBILITY OF BLOCK SIZES AND SEARCH RANGES
It is worthwhile to design the hardware with the flexibility to adapt instantly to dimensional changes of the search area and the reference block size via simple control, on demand of different applications.

Fig.3 The cascade of MEPs for double tracking range, -16/+15, M×N=16×16
(a) The cascading structure of MEPs (output: motion vector (u,v))
(b) The band data a_i of reference block
(c) The band data b_i, b_i', b_i'' of sub-search windows S1, S2, S3

A. Extendable search range
Any vertical displacement can be easily computed, and the horizontal one can be calculated by setting the size of the cyclic storage buffer so as to feed back the output of the AQ_q register into the accumulator by multiplexing it, where q is the desired displacement size. When the maximum displacement 2p, which is the maximum size of the cyclic storage buffer in an already designed chip, is exceeded, the operation for this larger horizontal tracking range can be realized by using multiple MEPs. Simple cascading of MEPs with delay elements makes it possible to operate on a larger tracking area without loss of operation speed. For instance, computing -16/+15 displacements with MEPs able to compute -8/+7 displacements is done as in the block diagram shown in Fig.3(a). Each MEP computes the minimum distortion within the specified displacements assigned to it. By dividing the total search window equally into three sub-search window regions S1, S2, S3 (see Fig.3(c)), the calculations can be executed in the following way. During the previous N (=16) cycles, the band data a_i in the ith row are input and available to the PEs in both MEP1 and MEP3, and the data a_{i-1}, delayed by N cycles, are available to both MEP2 and MEP4. Simultaneously, the band b_i is available to the PEs of both MEP1 and MEP2, and the band b_i' to those of both MEP3 and MEP4. For the next N cycles, the band b_i' latched into BUF2 is serially input and available to the PEs of both MEP1 and MEP2, and the band b_i'' latched into BUF3 is serially input and available to the PEs in both MEP3 and MEP4. Each MEP_k calculates distortion measurements for the different search positions assigned to it. Thus the distortion measurements of 64 searches can be performed. MEP1 searches the best matched candidate block toward the horizontal direction from -16 to -1, and MEP3 from 0 to +15, both at "even" vertical displacements; MEP2 searches from -16 to -1 and MEP4 from 0 to +15, both at "odd" vertical ones. After calculating the distortion measurements, the MDD compares the minimum distortion measurements from the four MEP chips and detects the minimum one and its position vector, i.e., the motion vector (u,v) in the more extended search area.
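The four-MEP partition described above, together with the cyclic-buffer accumulation of Section II, can be sketched behaviorally as follows. This is an illustrative software model under stated assumptions, not the chip's timing or data path: it reads "even" and "odd" vertical displacement as the parity of u, it compares unnormalized sums of absolute differences (the minimizer is the same as for MAD), and it stores the window with an index offset equal to the extended half-range. The names `mep` and `cascaded_search` and the toy data are hypothetical.

```python
from collections import deque


def mep(ref, win, u_values, v_values, origin):
    """Behavioral model of one MEP: best (SAD, u, v) over its assigned positions."""
    m, n = len(ref), len(ref[0])
    best = None                                  # state of the local minimum detector
    for u in u_values:
        aq = deque([0] * len(v_values))          # cyclic storage buffer, cleared
        for i in range(m):                       # one reference row per sweep
            for v in v_values:                   # one clock cycle per position
                ps = sum(abs(ref[i][j] - win[i + u + origin][j + v + origin])
                         for j in range(n))      # PE array followed by parallel adder
                acc = aq.pop() + ps              # feedback from the bottom register
                aq.appendleft(acc)               # latch at the top, shift the rest down
                if i == m - 1 and (best is None or acc < best[0]):
                    best = (acc, u, v)           # distortion is complete on the last row
    return best


def cascaded_search(ref, win, p):
    """Four MEPs, each with 2p horizontal positions, cover the range [-2p, 2p-1]."""
    origin = 2 * p
    even_u = [u for u in range(-2 * p, 2 * p) if u % 2 == 0]
    odd_u = [u for u in range(-2 * p, 2 * p) if u % 2 != 0]
    left_v, right_v = list(range(-2 * p, 0)), list(range(0, 2 * p))
    local_minima = [
        mep(ref, win, even_u, left_v, origin),   # MEP1
        mep(ref, win, odd_u, left_v, origin),    # MEP2
        mep(ref, win, even_u, right_v, origin),  # MEP3
        mep(ref, win, odd_u, right_v, origin),   # MEP4
    ]
    return min(local_minima)                     # final comparison of the four minima


# Toy data (hypothetical): M = N = 4, p = 2, extended range -4/+3, 11 x 11 window.
win = [[7 * r + 13 * c for c in range(11)] for r in range(11)]
ref = [row[6:10] for row in win[1:5]]    # block cut out at displacement (u, v) = (-3, 2)
print(cascaded_search(ref, win, p=2))    # (0, -3, 2)
```

The deque rotation mirrors the register chain: the value leaving the bottom register is the running sum for the position being processed, so after M rows each position's distortion emerges in the same order in which the positions were scanned.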
B. Flexible Reference Block Size

A larger vertical block size can be handled by just delivering the new reference block data of the exceeded region, but for a larger horizontal block size the reference block has to be organized in several slices which are processed one after another. When the changed horizontal block size is a multiple of N, the reference block is equally divided into M×N size slices. By calculating slice by slice, the whole reference block can be processed, and no additional cycles are consumed to fill up the pipeline. For example, when switching the reference block size from 16×16 to 32×32 for a different system application, the reference block is sliced into two 32×16 blocks. The first 32×16 block is calculated, and the partial distortion results at each search position are stored into the cyclic storage buffer by the operation in Section II. Finally the second 32×16 block is calculated, and its partial sum results are accumulated individually onto the partial distortion results of the first block by shifting around the contents of the cyclic storage buffer at every cycle. Then all of MAD(u,v), {(u,v) | -p ≤ u, v ≤ p-1}, can be calculated for the 32×32 reference block. In this case, 4 times as many computations are required compared with the 16×16 block, but the number of blocks to be processed in a frame for the 32×32 size is 1/4 of that for the 16×16 block size. Thus the computational load for each frame is the same for different block sizes, without loss of internal dynamic range. The important point is to provide continuous data flows for various block sizes so that all PEs are 100% utilized.

IV. PERFORMANCE ANALYSIS
For a reference block size M×N and maximum allowable displacements P_h and P_v in the horizontal and the vertical direction respectively, the processing cycles of the proposed architecture are as follows: N cycles for preloading the first row data of the first reference block in a frame, plus [4 P_h P_v M] cycles for the array processors, plus log2 N cycles for the parallel adder, plus one cycle for the Minimum Distortion Detector (MDD). One of the critical operation time delays occurs in the parallel adder. For a reference block of 16×16, the parallel adder is constructed as an adder tree of 4 (=log2 16) levels. Consequently, the overall delay is the sum of a 12-bit-wide adder delay and a latch delay for the case of 8-bit grey level inputs. However, the most critical delay occurs in the 16-bit-wide adder (the accumulator in the cyclic storage buffer) that accumulates the partial sum results for the distortion value. Based on 0.8 µm CMOS technology, a 50 MHz clock is permissible for the proposed architecture. Therefore, for a typical codec for video conference applications given in Table I, the total processing time of a frame will be 20 ns × [4 P_h P_v M × (288/M) × (352/N) + N + log2 N + 1] = 31 ms, which is about 32 frames/sec. This surpasses the required frame rate shown in Table I. In other words, real-time processing can be obtained.

Table I. SPECIFICATION OF A VIDEO CODEC
IMAGE DATA                        352 × 288 pixels
FRAME RATE                        15 frames/sec
MOTION COMPENSATION BLOCK SIZE    16 × 16 pixels
TRACKING RANGE                    -8 to +7 pixels

V. CONCLUSION
Based on a parallel algorithm of partial sum result accumulation, we presented a flexible motion estimation processor VLSI architecture for the Full-Search Block Matching Algorithm. The partial sum results of the candidate block distortions are calculated at every cycle by the PE array and accumulated individually by the cyclic storage buffer for each distortion. The architecture allows serial data input but performs parallel processing, and it is flexible enough to adapt to different reference block sizes and search ranges. Furthermore, multiple MEPs can be cascaded for larger search ranges and faster pixel rates. For a typical codec for video conference applications, it meets the required performance with margin at a 50 MHz clock. The simple structure and the low control overhead of the proposed architecture, based on a one-dimensional semi-systolic array, enable the total system to be easily and cost-effectively implemented in VLSI.

REFERENCES
[1] J. R. Jain and A. K. Jain, "Displacement Measurement and its application in interframe image coding," IEEE Trans. Commun., Vol. COM-29, pp. 1799-1808, Dec. 1981.
[2] M. Gilge, "A high quality videophone coder using hierarchical motion estimation and structure coding of prediction error," in Proc. SPIE, pp. 864-874, 1988.
[3] T. Komarek and P. Pirsch, "Array architectures for block matching algorithms," IEEE Trans. on Circuits and Syst., Vol. 36, No. 10, pp. 1301-1308, Oct. 1989.
[4] Chaur-Heh Hsieh and Ting-Pang Lin, "VLSI architecture for Block Matching Motion Estimation Algorithm," IEEE Trans. on Circuits and Syst. for Video Tech., Vol. 2, No. 2, pp. 169-175, June 1992.
[5] J. S. Baek, S. H. Nam, and M. K. Lee, "A Fast Array Architecture for Block Matching Algorithm," in Proc. IEEE International Conference on Circuits and System, London, May 1994.
[6] K. M. Yang, M. T. Sun, and L. Wu, "A family of VLSI Design for the motion compensation block matching algorithm," IEEE Trans. on Circuits and Syst., Vol. 36, No. 10, pp. 1317-1325, Oct. 1989.