High-Throughput Block-Matching VLSI Architecture With Low Memory Bandwidth

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 45, NO. 4, APRIL 1998
Seung Hyun Nam and Moon Key Lee
Abstract: A full-search block-matching architecture that features
high throughput, few data input lines, and low memory bandwidth is
proposed. It reduces memory I/O requirements by maximally reusing
search data held in on-chip memory. It achieves a high throughput rate
in two ways: by continuously calculating all block distortions in a
search area from two search data input flows, without processing any
invalid block distortion, and by processing neighboring reference
blocks back to back, which removes the initialization period between
blocks. The processor for $-16/+15$ search ranges, implemented in a
total of 220k gates using 0.6 µm triple-metal CMOS technology, can
operate at a 66 MHz clock rate, and is therefore capable of encoding
H.263 (4CIF), MPEG2 (MP@ML), and other multimedia applications.
I. INTRODUCTION
The block-matching algorithm is used to compress video image data in
motion-compensated predictive coding [1]. Array mapping methods and
several full-search architectures have been presented in [2], [3]. In
addition, several dedicated hardware implementations have been
realized [4], [5]. Depending on the search strategy, several different
block-matching algorithms, such as hierarchical search [9],
hierarchical telescopic search [10], and multisurvivor search [11],
have been proposed and implemented.
Among all block-matching algorithms, the full-search block-matching
algorithm (FBMA) detects the best matching block most precisely, and
also demands the most computation. Owing to its regular formulation,
the FBMA can be realized with a pipelined systolic array to
significantly boost performance. Indeed, a number of researchers have
presented various systolic architectures for FBMA [2], [3], [6]-[8],
many of which require a rather large number of I/O pins and high
memory bandwidth, and spend some clock cycles on no useful
computation, which degrades performance. In this paper, we propose a
high-throughput pipelined systolic array for FBMA which requires
low memory bandwidth.
Manuscript received July 12, 1996; revised February 20, 1997. This paper
was recommended by Associate Editor K. K. Parhi.
The authors are with the Multimedia LSI Team, Semiconductor Division,
Daewoo Electronic Company, Seoul 100-095, Korea.
Publisher Item Identifier S 1057-7130(98)02137-5.
II. ARCHITECTURE DESIGN
A. Full-Search Block-Matching Architecture
The block diagram of the proposed architecture is shown schematically
in Fig. 1(a). It basically comprises a processor array, a parallel
adder, a shift register array (SRA), a minimum distortion detector, a
data selector, and local memory. Each processing element (PE)
calculates the absolute pixel difference between the reference block
and the search window. The shift register array is used to keep the
block distortion calculations continuous. The data selector propagates
the control signals that multiplex the proper search data from the two
inputs to each PE. During the block-matching operations, most pixels
are used several times to evaluate the distortions of neighboring
reference blocks. To reduce external memory I/O requirements, data are
reused through on-chip local memory, which gives high data utilization
with fewer accesses to the external frame memory. The processor takes
serial input for low memory bandwidth and processes in parallel for a
high throughput rate. The main procedure is as follows. The pixel
differences from the processing elements are accumulated by row to
form the partial sums of one block distortion. The parallel adder
further accumulates the partial sums into the mean absolute difference
$\mathrm{MAD}(h, v)$ of the block at the horizontal and vertical
displacement position $(h, v)$. Of all the $\mathrm{MAD}(h, v)$, the
minimum distortion detector selects the best matching block.
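As a software reference for the procedure just described, the following minimal Python sketch accumulates pixel differences by row, adds the partial sums, and keeps the minimum over all displacements. It only illustrates the arithmetic, not the hardware data path. The function names and the list-of-lists layout are assumptions made here, displacement $(h, v)$ is assumed to map to window offset $(h + p_h, v + p_v)$, and the sum is left unnormalized because dividing by $N^2$ does not change which candidate is the minimum.

```python
def block_distortion(ref, win, dh, dv):
    """Unnormalized block distortion at window offset (dh, dv)."""
    n = len(ref)
    # One partial sum per row of the reference block.
    partial_sums = [
        sum(abs(ref[r][c] - win[r + dv][c + dh]) for c in range(n))
        for r in range(n)
    ]
    # Adding the partial sums plays the role of the parallel adder.
    return sum(partial_sums)


def full_search(ref, win, p_h, p_v):
    """Return (distortion, h, v) of the best match for -p <= h, v <= p - 1."""
    best = None
    for h in range(-p_h, p_h):        # columns, left to right
        for v in range(-p_v, p_v):    # top to bottom within a column
            d = block_distortion(ref, win, h + p_h, v + p_v)
            # Minimum distortion detector: keep the first smallest value.
            if best is None or d < best[0]:
                best = (d, h, v)
    return best
```

The loop order (all $v$ for one $h$, then the next $h$) mirrors the column-by-column order in which the architecture produces its distortions.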
B. Partition of the Search Window Data
As shown in Fig. 2(a), the search window is divided into three parts
to produce successive valid operations without any dead cycle. The
first, the upper region (UR), is filled with $N$ rows from the top row
of the search window. The second, the lower region (LR), is filled
with $N$ rows from the bottom row of the search window. Finally, the
middle region (MR) is defined by the overlap area between the above
two regions. There are two data input sequences, through the USW and
LSW input lines. The data from UR and MR are put into the processor
array through USW in line scan order, 1, 2, 3, $\ldots$, $2p_h + N$
vertical lines, along the arrows shown in Fig. 2(b). In the same way,
the search data of MR and LR are serially loaded into LSW as shown in
Fig. 2(c). This input sequence enables the block distortions to be
computed column by column from left to right; each column is in turn
scanned from top to bottom, one row per cycle. Note that both lines
get $2p_v$ pixels in one column. The proposed architecture requires
the UR and MR from USW for the $\mathrm{MAD}(h, v)$ with
$-p_v \le v \le -p_v + N - 1$, and the MR and LR from LSW for those
with $-p_v + N \le v \le p_v - 1$. The same MR data are fed into both
the USW and LSW input lines. However, only one copy is actually used,
so the unused copy can be replaced with dummy data that merely keep
the data input regular. Referring to Fig. 2(d), let us explain the
main idea for calculating the continuous block distortions for a
block. First, the $2p_v$ pixels of UR1 and MR1 come into the processor
array via the USW input line, as shown by arrow $a_1$, and another
$2p_v$ pixels of MR1 and LR1 enter the processor array via LSW after a
delay of $N$ cycles, as shown by arrow $a_2$. The processor array
calculates block distortions using $N$ pixels of UR1 from USW and
$2p_v$ pixels of MR1 and LR1 from LSW. During the $2p_v$ cycles in
which MR1 and LR1 arrive via LSW, the next $2p_v$ pixels of UR2 and
MR2 are delivered into the processor array via USW. This regular input
flow makes it possible to calculate continuous block distortions for
one block. In summary, while the processor computes the block
distortions at the current horizontal displacement, the $2p_v + N$
pixels of the following column are delivered to the processor array
via the two input lines for the next horizontal displacement.
Therefore, processing for the next horizontal displacement can start
immediately, without any break. To extend the displacement distance
$p_v$, we can attach more shift registers to the processor array. If
$2p_v$ equals $N$, the SRA is not needed. On the other hand, if $2p_v$
is larger than $N$ for a large search area, we have to use
$(N - 1)(2p_v - N - 1)$ shift registers in the SRA.
1057-7130/98$10.00 © 1998 IEEE
Fig. 1. Block diagram of the proposed architecture: (a) FBMA architecture, (b) PE structure, and (c) structure of SR in (a) and of $L_U$ and $L_L$ in (b).
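The column partition can be made concrete with a short Python sketch. It assumes, as the text of this section implies, that one search window column holds $2p_v + N$ pixels ordered top to bottom, so that UR is the first $N$ pixels, LR the last $N$, and MR the $2p_v - N$ pixels in between. The function names are hypothetical, and the shift register count simply evaluates the expression quoted in the text.

```python
def split_column(column, n, p_v):
    """Split one search-window column into its USW and LSW input streams."""
    if 2 * p_v < n:
        raise ValueError("this sketch assumes 2*p_v >= N")
    if len(column) != 2 * p_v + n:
        raise ValueError("a column must hold 2*p_v + N pixels")
    usw = column[:2 * p_v]   # UR followed by MR: 2*p_v pixels
    lsw = column[n:]         # MR followed by LR: 2*p_v pixels
    return usw, lsw


def sra_register_count(n, p_v):
    """Shift registers in the SRA, using the expression given in the text."""
    if 2 * p_v < n:
        raise ValueError("case 2*p_v < N is not covered by the text")
    if 2 * p_v == n:
        return 0             # no SRA is needed when 2*p_v equals N
    return (n - 1) * (2 * p_v - n - 1)
```

For $N = 3$ and $p_v = 3$, a nine-pixel column yields two six-pixel streams that share the three MR pixels, which is why one of the two MR copies may be replaced by dummy data.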
C. Continuous Block Distortion Calculation for a Reference Block
The following simple example is given to illustrate the operation
of the proposed architecture. Note that the number of processing
elements is defined by the size of the reference block that the
application system needs, and the number of shift registers is defined
by both the vertical reference block size and the search range. Let us
assume the reference block size $N \times N = 3 \times 3$, and
$p_v = p_h = 3$ for $-3 \le h, v \le +2$. To get the valid search data
from either $L^E_{USW}$ or $L^E_{LSW}$ in the PE as shown in
Fig. 1(b), the select signal can be generated simply by cascading
$2p_v - 1$ flip-flops, which avoids the overhead of a more complex
control circuit. At the $k$th cycle after initialization,
$\mathrm{PE}(1, N)$, $\mathrm{PE}(1, N - 1)$, $\ldots$, and
$\mathrm{PE}(1, N - k + 1)$ select the LSW data in $L^E_{LSW}$, and
the other PEs select the USW data in $L^E_{USW}$. After $2p_v$ cycles
from the start, the selection returns to the same initial state. Each
$\mathrm{PE}(i, j)$ receives a select signal from
$\mathrm{PE}(i - 1, j)$, delayed by one cycle through the $L_m$
register. To preload all the registers of the processor array, we
serially preload the search window data from both the USW and LSW
input lines in 12 $[= 2p_v(N - 1)]$ cycles. At the same time, we must
serially preload the reference block data into each PE for
9 $(= N^2)$ cycles. Then we can load the initial search data and the
reference block data. After preload, at cycle
13 $[= 2p_v(N - 1) + 1]$, each $\mathrm{PE}(1, j)$, $1 \le j \le 3$,
calculates $\mathrm{PS}_{-3,-3}(1, j)$, while all $\mathrm{PE}(2, j)$
and $\mathrm{PE}(3, j)$, $1 \le j \le 3$, are idling. Here
$\mathrm{PS}_{h,v}(i, j)$ denotes the partial sums accumulated from
the first to the $i$th column of the block distortion positioned at
$(h, v)$. At cycle 14, each search datum in the processor array is
shifted into the upper $\mathrm{PE}(i, j - 1)$ or into
$\mathrm{SRA}_{i-1}$, and then each $\mathrm{PE}(1, j)$ operates for
$\mathrm{PS}_{-3,-2}(1, j)$, each $\mathrm{PE}(2, j)$ for
$\mathrm{PS}_{-3,-3}(2, j)$, and $\mathrm{PE}(3, j)$ is idling. At
cycle 15, $\mathrm{PE}(1, j)$ operates for
$\mathrm{PS}_{-3,-1}(1, j)$, $\mathrm{PE}(2, j)$ for
$\mathrm{PS}_{-3,-2}(2, j)$, and $\mathrm{PE}(3, j)$ for
$\mathrm{PS}_{-3,-3}(3, j)$. At cycle 16 $[= 2p_v(N - 1) + 4]$,
$\mathrm{PE}(1, j)$ calculates $\mathrm{PS}_{-3,0}(1, j)$,
$\mathrm{PE}(2, j)$ calculates $\mathrm{PS}_{-3,-1}(2, j)$,
$\mathrm{PE}(3, j)$ calculates $\mathrm{PS}_{-3,-2}(3, j)$, and the
parallel adder calculates $\mathrm{MAD}(-3, -3)$. At cycle 17, the
other operations continue, and the parallel adder produces
$\mathrm{MAD}(-3, -2)$. At cycle 18 $[= 2p_v(N - 1) + 2p_v]$,
$\mathrm{PE}(1, j)$ calculates $\mathrm{PS}_{-3,+2}(1, j)$,
$\mathrm{PE}(2, j)$ calculates $\mathrm{PS}_{-3,+1}(2, j)$, and
$\mathrm{PE}(3, j)$ calculates $\mathrm{PS}_{-3,0}(3, j)$. At cycle
19 $[= 2p_v(N - 1) + 2p_v + 1]$, $\mathrm{PE}(1, j)$ immediately
calculates the partial distortion $\mathrm{PS}_{-2,-3}(1, j)$, with no
latency at the switch to the next horizontal displacement. In this
way, all of the $\mathrm{MAD}(h, v)$ are calculated and put into the
minimum distortion detector, one per cycle. Finally, we can choose the
best matching block among all possible block distortions at cycle
51 $[= 2p_v(N - 1) + 4p_v p_h + N]$. In the following section, we
describe the data input flow that removes the initialization period
caused by the preload cycles, in order to achieve a high throughput
rate.
Fig. 2. Partition of search area and input flows: (a) search window division, (b) data for USW input line, (c) data for LSW input line, and (d) basic data input flow.
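The cycle numbers in this example follow a regular pattern that can be checked with a few lines of Python. The closed forms below are inferred here from the worked example (cycles 13 to 19 and the final cycle 51) rather than stated in that form by the text, and all function names are hypothetical. Displacements are assumed to be visited column by column, with $v$ running fastest.

```python
def preload_cycles(n, p_v):
    """Cycles spent preloading search data: 2*p_v*(N - 1)."""
    return 2 * p_v * (n - 1)


def displacement_index(h, v, p_h, p_v):
    """0-based position of (h, v) in the column-by-column scan order."""
    return (h + p_h) * 2 * p_v + (v + p_v)


def pe_column_cycle(i, h, v, n, p_h, p_v):
    """Cycle at which PE(i, j) works on the partial sum PS_{h,v}(i, j)."""
    return preload_cycles(n, p_v) + displacement_index(h, v, p_h, p_v) + i


def mad_cycle(h, v, n, p_h, p_v):
    """Cycle at which the parallel adder produces MAD(h, v)."""
    return preload_cycles(n, p_v) + n + 1 + displacement_index(h, v, p_h, p_v)
```

With $N = 3$ and $p_h = p_v = 3$, `mad_cycle(-3, -3, 3, 3, 3)` evaluates to 16 and `mad_cycle(2, 2, 3, 3, 3)` to 51, matching the first and last cycles quoted in the example.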
III. DATA INPUT FOR CONTINUOUS PROCESS BETWEEN BLOCKS
Depending on the maximum displacement, at the end of the processing of
one reference block, the contents of the registers in both the
processor array and the SRA will not necessarily be the data needed to
start processing the next reference block. Hence, the processor array
and the search data registers have to be initialized at the transition
between blocks. In this section, we describe a way to initialize the
processor array all at once. Preloading the initial search data for
the next block consumes $2p_v(N - 1)$ clock cycles. During this
period, the processor array does not produce any valid block
distortion. Summed over all the blocks in a frame, these preload
cycles degrade the performance of the processor considerably, so we
remove them. The main scheme is to let the processor preload all the
data for the next block during the operations on the current block. On
the other hand, this needs additional hardware, namely standing
registers that hold the initial search area data and the next block.
In the PE shown in Fig. 1(b), the standing register (SNR) is contained
in the PE as $L^S_R$ to latch the next reference block, and the
standing latches (SSW) are contained in both the PE and the registers
of the SRA as $L^S_{USW}$ and $L^S_{LSW}$ to reserve the initial
search data. We take advantage of the fact that successive search
windows overlap, and reduce the bandwidth required of the memory
system by reusing the overlap data. The initial search data are stored
in the SSW, and the next consecutive data in the local memory.
Meanwhile, the next reference block data are reserved in the SNR.
Therefore, when processing switches from one block to the next, the
reference block and the initial search data can be loaded all at once.
After the current block process completes, the SNR data are parallel
loaded into the execution register (ENR), which is designed in the PE
as $L^E_R$ to hold the current block, and at the same time the SSW
data are parallel loaded into the execution registers (ESW), which are
designed in both the PE and the SRA as $L^E_{USW}$ and $L^E_{LSW}$,
respectively. In the following, we describe the data input schemes in
detail for various values of $k$, where $2p_h = 2kN$.
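The standing/execution arrangement behaves like a double buffer: one register is filled serially while the other is in use, and the contents move across in a single parallel load at the block boundary. The Python class below is a hypothetical behavioral model of one such pair (for example SNR and ENR). The class and method names are assumptions for illustration and do not describe the actual circuit.

```python
class StandingExecutionPair:
    """Hypothetical model of one standing/execution register pair."""

    def __init__(self, size):
        self.size = size
        self.standing = []               # filled serially during the current block
        self.execution = [None] * size   # data used by the current block

    def shift_in(self, value):
        """Serially load one value into the standing register."""
        if len(self.standing) == self.size:
            raise RuntimeError("standing register already full")
        self.standing.append(value)

    def switch_block(self):
        """Parallel-load the standing data into the execution register."""
        if len(self.standing) != self.size:
            raise RuntimeError("next block not fully preloaded")
        self.execution = self.standing
        self.standing = []               # ready to receive the following block
```

Because the serial fill overlaps with computation on the current block, the switch itself costs no preload cycles in this model, which is the effect the section aims for.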
A. Data Input Flow for $k = 1/2$
Let us consider the case $k = 1/2$, which means $p_h$ is half of the
horizontal block size, as one typical case. For instance, for
$N \times N = 16 \times 16$, the search range becomes
$-8 \le h, v \le +7$. In this case, the processor array can perform
the block process without the SRA and local memory, so the pixel data
from $\mathrm{PE}(i, 1)$ are transferred directly into
$\mathrm{PE}(i - 1, N)$. Neither the SRA nor the local memory is
needed because the SSW is large enough to reserve the overlap data. As
shown in Fig. 3(a), the overlap area of the successive search windows
is R2, which is fed into the processor array from the frame memory and
saved into the SSW. During the same period, the next block B enters
the SNR. $N$ cycles after the last pixel data of R2 have been loaded,
the current block process finishes, and the next block process can be
started by loading the data of both the SSW and the SNR into their
execution registers at once. The next successive data, R3, are input
into the processor array to complete the block process for block B,
and are also saved into the SSW for the next block process. In this
regular way, the processor can perform a continuous block process
without any ineffective operation or pipeline break.
B. Data Input Flow for $k \ge 1$
Let us consider the horizontal displacement $2p_h \ge 2N$, i.e.,
$-kN \le h \le kN - 1$. For instance, for $k = 1$ and
$N \times N = 16 \times 16$, the search range becomes
$-16 \le h, v \le +15$. As shown in Fig. 3(b), the overlap search area
of two adjacent blocks is R2 and R3 for $k = 1$. R2 is stored into the
SSW and R3 into the local memory of $N \times (N + 2p_v)$ words. We
use the local memory for storing part of the overlap search area. The
local memory is partitioned into two submodules that store the data
needed by the USW and LSW input lines, respectively. At the next cycle
after the last pixel data from the R2 region are input for the block
process of reference block A, the content of the ESW is parallel
loaded into the SSW to reserve the initial search data required for
the next block process, that of block B. The next data sequence, from
R3, is then fed without any delay for the remaining block distortions,
and is stored into the local memory to be reused for the next block
process. Also during this period, the next block B is serially loaded
into the SNR. Therefore, on completing the block process for block A,
the standing data of the SNR and the SSW are parallel moved into the
ENR and the ESW, respectively, at once. This regular data sequence
continues until the last block process is completed. Similar
approaches are applicable when the dimensions of the search range
change. When the value of $2p_h$ is $2kN$ for $k \ge 2$, which means
$p_h$ is $k$ times the horizontal block size $N$, as in many typical
cases, $(2k - 1)$ local memory modules are required to reserve the
overlap search data. As for $k = 1$, the read/write operations for the
local memory modules are just periodic, so that, after the required
data from all the local memory modules have been used, the last
subblock from the frame memory must be put simultaneously into both
the processor array and the one local memory module whose subblock
data are of no use for the next block.
Fig. 3. Search windows for two adjacent blocks: (a) for $k = 1/2$ and (b) for $k = 1$. Note that the overlap search area is shaded.
IV. PERFORMANCE ANALYSIS
A. High Throughput Rate and Low Memory Bandwidth
The efficiency of the processing element can be defined as the ratio
of the total number of candidate blocks in a search window to the
number of cycles required for one block process. The proposed
architecture requires $(4kp_v + 1)N$ cycles for one block process,
except when $k$ equals 1/2, when it requires $(2p_v + 2)N$ cycles. The
number of cycles for one block process is thus almost the same as the
number of block distortions within a search area. Therefore, for a
frame size of $720 \times 480$ at 30 Hz, it can be implemented with a
clock rate of 12 MHz for vertical and horizontal search ranges of
$-8/+7$, and with a clock rate of 42 MHz for $-16/+15$. Fewer accesses
imply lower memory bandwidth, which is very desirable. The number of
external frame memory I/O requests is conventionally
$(2k + 1)N(N + 2p_v)$, but the proposed architecture needs only
$N(N + 2p_v)$. That is, the proposed architecture reduces the external
memory traffic by a factor of $(2k + 1)$ relative to the conventional
architecture. We have implemented the FBMA architecture for search
ranges of $-16/+15$ in both directions using 0.6 µm triple-metal CMOS
technology. It uses 220k gates, and chip-layout simulation has shown
an operating clock rate of 66 MHz. Therefore, it provides a feasible
solution with low pin count and low memory bandwidth, and a
single-chip configuration is sufficient to deliver full-search quality
for CIF video in H.261, 4CIF in H.263, MP@ML in MPEG2, and other
multimedia applications.
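The clock rates and memory figures quoted in this subsection can be reproduced from the stated formulas. The Python sketch below does that arithmetic. The function names are hypothetical, the frame is assumed to tile exactly into $N \times N$ blocks, and the candidate count per block is taken as $2p_h \cdot 2p_v$ with $p_h = kN$.

```python
def cycles_per_block(n, p_v, k):
    """Cycles for one block process, using the two formulas in the text."""
    if k == 0.5:
        return (2 * p_v + 2) * n
    return int((4 * k * p_v + 1) * n)


def required_clock_hz(width, height, fps, n, p_v, k):
    """Clock rate needed to process every block of every frame."""
    blocks_per_frame = (width // n) * (height // n)
    return blocks_per_frame * fps * cycles_per_block(n, p_v, k)


def pe_efficiency(n, p_v, k):
    """Candidate blocks per search window divided by cycles per block."""
    candidates = int(2 * k * n) * (2 * p_v)
    return candidates / cycles_per_block(n, p_v, k)


def frame_memory_reads(n, p_v, k, reuse):
    """External frame-memory requests, with or without on-chip reuse."""
    base = n * (n + 2 * p_v)
    return base if reuse else int((2 * k + 1) * base)
```

For $720 \times 480$ at 30 Hz with $N = 16$, this gives 11.664 MHz for the $-8/+7$ range ($k = 1/2$, $p_v = 8$) and 42.12 MHz for $-16/+15$ ($k = 1$, $p_v = 16$), consistent with the rounded 12 MHz and 42 MHz above. For $k = 1$ the request count drops from 2304 to 768.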
B. Cascading Chips for Wide Horizontal Search Scopes
If the maximum achievable pixel rate is not sufficient for the given
application, more than one chip will be needed for the implementation.
In this case, one possible solution is to give each chip one partition
of the search area and to let the chips operate in parallel. For
example, a search range of $p_h = 2N$ can be handled by cascading two
chips as depicted in Fig. 4(a). Each chip is designated to handle
$p_h = N$ $(= 16)$. The left chip handles search points of
$-32\ (= -p_h) \le h \le -1\ (= -p_h + 2N - 1)$, and the right one
handles search points of $0\ (= -p_h + 2N) \le h \le +31\ (= p_h - 1)$.
The reference block data are delivered in parallel, while the search
area data are serially delivered through the right chip and the
off-chip delay buffer into the left chip. At first, as shown in
Fig. 4(b), the two chips must be initialized by preloading the
necessary search data, i.e., the left one has to get R1 and R2, the
buffer R3, and the right one R4 and R5. Then, R3 from the buffer is
serially put into the left chip for the search range of
$-32 \le h \le -1$, and R6 comes to the right one from the external
frame memory for the search range of $0 \le h \le +31$ of block A.
From then on, the search data needed by the left chip for the next
block process can be obtained from the buffer, which has reserved the
data streamed out of the right chip. By cascading several chips in
this way, we can process search ranges that grow in proportion to the
number of cascaded chips without increasing the memory I/O
requirements for the necessary pixel rate.
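How the horizontal search points are divided among cascaded chips can be sketched as follows. This Python helper is hypothetical and assumes, as in the two-chip example, that every chip covers $2N$ consecutive horizontal search points and that the total range $-p_h \le h \le p_h - 1$ is an exact multiple of that span.

```python
def chip_ranges(p_h, n):
    """Inclusive (h_min, h_max) ranges per chip, listed left to right."""
    span = 2 * n                      # search points handled by one chip
    if (2 * p_h) % span != 0:
        raise ValueError("2*p_h must be a multiple of 2*N in this sketch")
    chips = (2 * p_h) // span
    return [(-p_h + c * span, -p_h + (c + 1) * span - 1) for c in range(chips)]
```

For $p_h = 32$ and $N = 16$ the helper returns the two ranges given above, $-32$ to $-1$ for the left chip and $0$ to $+31$ for the right chip.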
V. CONCLUSION
We have proposed a full-search block-matching architecture which
features high throughput and requires low memory bandwidth through the
reuse of search data in on-chip memory. It achieves a high throughput
rate by the continuous calculation of all block distortions in a
search area using just two search data input flows, and by continuous
processing between blocks. We have implemented the processor for
$-16/+15$ search ranges in a total of 220k gates using 0.6 µm
triple-metal CMOS technology. The operating clock has been shown to
run at up to 66 MHz. Therefore, its application scope includes
encoding for H.263 (4CIF), MPEG2 (MP@ML), and other multimedia
applications.
Fig. 4. Cascading two chips for doubling horizontal search range: (a) block diagram of cascading two chips and (b) search area for two chips.
REFERENCES
[1] J. R. Jain and A. K. Jain, "Displacement measurement and its application in interframe image coding," IEEE Trans. Commun., vol. COM-29, pp. 1799-1808, Dec. 1981.
[2] L. D. Vos and M. Stegherr, "Parameterizable VLSI architectures for the full search block matching algorithm," IEEE Trans. Circuits Syst., vol. 36, pp. 1309-1316, Oct. 1989.
[3] T. Komarek and P. Pirsch, "Array architectures for block matching algorithms," IEEE Trans. Circuits Syst., vol. 36, pp. 1301-1308, Oct. 1989.
[4] STI3220, Image Processing Databook, SGS-Thomson Microelectron., Oct. 1992.
[5] K. Ishihara et al., "A half-pel precision MPEG2 motion estimation processor with concurrent three-vector search," in ISSCC, Dig. Tech. Papers, Feb. 1995, pp. 288-289.
[6] K. M. Yang, M. T. Sun, and L. Wu, "A family of VLSI design for the motion compensation block matching algorithm," IEEE Trans. Circuits Syst., vol. 36, pp. 1317-1325, Oct. 1989.
[7] S. H. Nam and M. K. Lee, "Flexible VLSI architecture of motion estimator for video image compression," IEEE Trans. Circuits Syst. II, vol. 43, pp. 467-470, June 1996.
[8] C.-H. Hsieh and T. P. Lin, "VLSI architecture for block matching motion estimation algorithm," IEEE Trans. Circuits Syst. Video Technol., vol. 2, pp. 169-175, June 1992.
[9] A. Ohtani et al., "A motion estimation processor for MPEG2 video real time encoding at wide search range," in Proc. CICC, 1995, pp. 17.4.1-17.4.4.
[10] K. Suguri et al., "A real-time motion estimation and compensation LSI with wide-search range for MPEG2 video encoding," in ISSCC, Dig. Tech. Papers, Feb. 1996, pp. 242-243.
[11] A. Horng-Dar Lin et al., "A 14GOPS programmable motion estimator for H.26X," in ISSCC, Dig. Tech. Papers, Feb. 1996, pp. 246-247.
[12] Draft for ITU-T Recommendation H.263, Video Coding for Low Bitrate Communication.
[13] ISO/IEC 13818-2, Generic Coding of Moving Pictures and Associated Audio Information, Part 2: Video.
