IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 45, NO. 4, APRIL 1998
I. INTRODUCTION
The block-matching algorithm is used to compress video image
data in motion-compensated predictive coding [1]. Array mapping
methods and several architectures for the full-search algorithm
have been presented in [2], [3]. In addition, several
dedicated hardware implementations have been realized [4], [5].
Depending on the search strategy, several different block-matching
algorithms, such as hierarchical search [9], hierarchical
telescopic search [10], and multisurvivor search [11], have been
proposed and implemented.
Among all block-matching algorithms, the full-search block-matching
algorithm (FBMA) is the most precise at detecting the best
matching block, and it also demands the most computation. Owing to
its regular formulation, the FBMA can be realized with a pipelined
systolic array to significantly boost performance. Indeed, a number
of researchers have presented various systolic architectures for FBMA
[2], [3], [6]-[8], many of which require a rather large number of I/O
pins and high memory bandwidth, and incur a few inefficient clock
cycles that degrade performance. In this paper, we propose a
high-throughput pipelined systolic array for FBMA which requires
low memory bandwidth.
Manuscript received July 12, 1996; revised February 20, 1997. This paper
was recommended by Associate Editor K. K. Parhi.
The authors are with the Multimedia LSI Team, Semiconductor Division,
Daewoo Electronic Company, Seoul 100-095, Korea.
Publisher Item Identifier S 1057-7130(98)02137-5.
II. ARCHITECTURE DESIGN
A. Full-Search Block-Matching Architecture
The block diagram of the proposed architecture is shown schematically
in Fig. 1(a). It basically comprises a processor array, parallel adder,
shift register array (SRA), minimum distortion detector, data selector,
and local memory. Each processing element (PE) calculates the absolute
pixel difference between the reference block and the search window.
The shift register array allows block distortions to be calculated
continuously.
The data selector propagates the control signals that multiplex the
proper search data from the two inputs to each PE. During
block-matching operations, most pixels are used several times to evaluate
the distortions of neighboring reference blocks. To reduce
external memory I/O requirements, data are reused through on-chip
local memory, which gives high data utilization with fewer accesses
to the external frame memory. The processor takes serial input to
keep memory bandwidth low and processes in parallel to achieve a
high throughput rate. The main procedure is as follows. The pixel
differences from the processing elements are accumulated row by row
to form the partial sums of one block distortion. The parallel adder
then accumulates the partial sums to give the mean absolute
difference $\mathrm{MAD}(h, v)$ of the block at the horizontal
and vertical displacement $(h, v)$. The minimum distortion detector
selects the best matching block as the one with the smallest
$\mathrm{MAD}(h, v)$.
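The procedure above (row-wise partial sums, a final sum giving MAD(h, v), then a minimum search) can be sketched in software as follows. This is only a sequential reference model for checking results, not the systolic hardware. The function names, the window layout with displacement (0, 0) at offset (p_v, p_h), and the horizontal range -p_h to p_h - 1 are assumptions made for illustration.

```python
def mad(ref, win, top, left):
    """Mean absolute difference between the N x N block `ref` and the
    N x N patch of `win` whose top-left corner is at (top, left)."""
    n = len(ref)
    # One partial sum per row, mirroring the row-wise accumulation in the text.
    partial_sums = [
        sum(abs(ref[r][c] - win[top + r][left + c]) for c in range(n))
        for r in range(n)
    ]
    # The final accumulation plays the role of the parallel adder.
    return sum(partial_sums) / (n * n)


def full_search(ref, win, p_h, p_v):
    """Return (best_mad, h, v) over -p_h <= h <= p_h - 1 and
    -p_v <= v <= p_v - 1 (assumed ranges).

    `win` is assumed to hold at least 2*p_v + N - 1 rows and
    2*p_h + N - 1 columns, with displacement (0, 0) at offset (p_v, p_h).
    """
    best = None
    for h in range(-p_h, p_h):        # columns, left to right
        for v in range(-p_v, p_v):    # each column, top to bottom
            d = mad(ref, win, v + p_v, h + p_h)
            # Minimum distortion detector: keep the first smallest value.
            if best is None or d < best[0]:
                best = (d, h, v)
    return best
```

The loop order follows the column-by-column, top-to-bottom scan described for the search window; any other order would find the same minimum, apart from tie-breaking.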
B. Partition of the Search Window Data
As shown in Fig. 2(a), the search window is divided into three
parts so that successive valid operations are produced without any
dead cycle. The upper region (UR) consists of the top $N$ rows of
the search window. The lower region (LR) consists of the bottom $N$
rows. The middle region (MR) is the area between these two regions,
which both input lines carry. There are two data input sequences,
through the USW and LSW input lines. The data from UR and MR are
put into the processor array through USW in line-scan order, 1, 2, 3,
$\ldots$, $2p_h + N$ vertical lines, along the arrows shown in Fig. 2(b).
In the same way, the search data of MR and LR are serially loaded into
LSW, as shown in Fig. 2(c). This input sequence allows the block
distortions to be computed column by column from left to right;
each column is in turn scanned from top to bottom, one row per cycle.
Note that both lines get $2p_v$ pixels in one column. The proposed
architecture requires the UR and MR from USW for the
$\mathrm{MAD}(h, v)$ values with $-p_v \le v \le -p_v + N - 1$, and
the MR and LR from LSW for those with $-p_v + N \le v \le p_v - 1$.
The same MR data are fed into both the USW and LSW input lines,
although only one copy is actually used. The unused copy can be
replaced with dummy data, kept just for regular data input. Referring
to Fig. 2(d), let us explain the main idea for calculating the
continuous block distortions for a block. First, the $2p_v$ pixels of
UR1 and MR1 come into the processor array via the USW input line, as
indicated by arrow $a_1$, and another $2p_v$ pixels of MR1 and LR1
enter the processor array via LSW after a delay of $N$ cycles, as
indicated by arrow $a_2$. The processor array calculates block
distortions using the $N$ pixels of UR1 from USW and the $2p_v$ pixels
of MR1 and LR1 from LSW. During the $2p_v$ cycles in which MR1 and LR1
arrive via LSW, the next $2p_v$ pixels of UR2 and MR2 are delivered
into the processor array via USW. This regular input flow makes it
possible to calculate continuous block distortions for one block. In
summary, while the processor computes the block distortions at the
current horizontal displacement, the $2p_v + N$ pixels of the following
column are delivered to the processor array via the two input lines
for the next horizontal displacement. Therefore, processing for the
next horizontal displacement can start immediately, without any
break. To extend the displacement
Fig. 1. Block diagram of the proposed architecture: (a) FBMA architecture, (b) PE structure, and (c) structure of SR in (a) and
LU and LL in (b).
Fig. 2. Partition of search area and input flows: (a) search window division, (b) data for USW input line, (c) data for LSW input line, and (d)
basic data input flow.
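The partition of Section II-B can be checked with a small row-index model. The sketch below is one reading of the text, not the authors' implementation: it assumes a column of 2p_v + N rows, takes UR as the top N rows, LR as the bottom N rows, and MR as everything in between, and assumes p_v >= N so that MR is non-empty. All function names are hypothetical.

```python
def partition_rows(n, p_v):
    """Split the 2*p_v + n rows of one search-window column into UR, MR, LR.

    Assumes p_v >= n so that the middle region is non-empty.
    """
    total = 2 * p_v + n
    ur = range(0, n)                # top N rows
    mr = range(n, total - n)        # rows carried by both input lines
    lr = range(total - n, total)    # bottom N rows
    return ur, mr, lr


def input_line_rows(n, p_v):
    """Rows carried by each serial input line: USW = UR + MR, LSW = MR + LR."""
    ur, mr, lr = partition_rows(n, p_v)
    usw = range(ur.start, mr.stop)
    lsw = range(mr.start, lr.stop)
    return usw, lsw


def line_for_displacement(v, n, p_v):
    """Input line whose rows cover the candidate block at displacement v,
    following the two ranges of v given in the text."""
    if not -p_v <= v <= p_v - 1:
        raise ValueError("v outside the search range")
    return "USW" if v <= -p_v + n - 1 else "LSW"


def block_rows(v, n, p_v):
    """Rows of the column touched by the candidate block at displacement v."""
    top = v + p_v
    return range(top, top + n)
```

For N = 16 and p_v = 16 this gives three regions of 16 rows each, and each input line carries 2p_v = 32 pixels per column, which agrees with the count stated in the text.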
next reference block data are reserved in the SNR. Therefore, when
processing switches from one block to the next, the reference block
and the initial search data can easily be preloaded at once.
After the current block process is completed, the SNR data are loaded
in parallel into the execution register (ENR), which is designed in the
PE as $L_R^E$ to hold the current block, and at the same time the SSW
data are loaded into the execution register (ESW), which is designed
in both the PE and the SRA as $L_{USW}^E$ and $L_{LSW}^E$, respectively.
In the following, we describe in detail the data input schemes for the
various values of $k$, where $2p_h = 2kN$.
$k = 1/2$
local memory modules, the last subblock from the frame memory
must be written simultaneously into both the processor array and one
of the local memory modules, namely one holding subblock data that
are not needed for the next block.
IV. PERFORMANCE ANALYSIS
A. High Throughput Rate and Low Memory Bandwidth
Fig. 3. Search windows for two adjacent blocks: (a) for $k = 1/2$ and (b)
for $k = 1$. Note that the overlap search area is shaded.
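The amount of search data shared by two adjacent blocks, as shaded in Fig. 3, follows from simple geometry. The sketch below assumes a search window that is 2p_h + N columns wide with 2p_h = 2kN, and horizontally adjacent reference blocks offset by N columns. The function name and return convention are hypothetical.

```python
from fractions import Fraction


def window_reuse(n, k):
    """Columns shared by the search windows of two horizontally adjacent blocks.

    Assumes a window width of 2*p_h + n columns with 2*p_h = 2*k*n, and
    adjacent reference blocks offset by n columns.
    Returns (overlap_columns, new_columns, overlap_fraction).
    """
    two_p_h = int(2 * k * n)
    width = two_p_h + n
    overlap = width - n   # columns already loaded for the previous block
    new = n               # columns still to be fetched from frame memory
    return overlap, new, Fraction(overlap, width)


# Example for N = 16: half of the window is shared when k = 1/2,
# and two thirds of it when k = 1.
half = window_reuse(16, Fraction(1, 2))
full = window_reuse(16, 1)
```

Under these assumptions only N new columns per block need to come from the external frame memory, whatever the value of k, which is the motivation for keeping the overlapping columns in local memory.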
Fig. 4. Cascading two chips for doubling horizontal search range: (a) block
diagram of cascading two chips and (b) search area for two chips.
in a search area using simply two search data input flows, and by
the continuous process between blocks. We have implemented the
processor for $-16/+15$ search ranges in a total of 220k gates using
0.6-µm triple-metal CMOS technology. The operating clock has been
shown to run at up to 66 MHz. Therefore, its application scope
includes H.263 (4CIF) encoding, MPEG-2 (MP@ML) encoding, and other
multimedia applications.
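A rough cycle budget shows why a 66-MHz clock is plausible for these formats. The sketch below is a back-of-the-envelope check, not a result from the paper. It assumes 16 x 16 blocks, a -16/+15 range in both directions (32 x 32 candidates per block), one candidate distortion per clock cycle with no overhead between blocks, and nominal picture sizes of 704 x 576 at 30 frames/s for 4CIF and 720 x 480 at 30 frames/s or 720 x 576 at 25 frames/s for MP@ML.

```python
CLOCK_HZ = 66_000_000  # operating clock reported in the text


def cycles_per_second(width, height, fps, n=16, candidates=32 * 32):
    """Clock cycles needed per second if every candidate costs one cycle.

    The block size n, the candidate count, and the one-cycle-per-candidate
    rate are assumptions for this estimate, not figures from the paper.
    """
    blocks_per_frame = (width // n) * (height // n)
    return blocks_per_frame * candidates * fps


budget = {
    "H.263 4CIF, 704x576 @ 30": cycles_per_second(704, 576, 30),
    "MP@ML, 720x480 @ 30": cycles_per_second(720, 480, 30),
    "MP@ML, 720x576 @ 25": cycles_per_second(720, 576, 25),
}

for name, cycles in budget.items():
    print(f"{name}: {cycles / 1e6:.1f} Mcycles/s, fits: {cycles < CLOCK_HZ}")
```

Under these assumptions all three formats need fewer than 66 million cycles per second. Preload, I/O, and control overhead would reduce the margin.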
REFERENCES
[1] J. R. Jain and A. K. Jain, "Displacement measurement and its application
in interframe image coding," IEEE Trans. Commun., vol. COM-29, pp.
1799-1808, Dec. 1981.
[2] L. D. Vos and M. Stegherr, "Parameterizable VLSI architectures for the
full search block matching algorithm," IEEE Trans. Circuits Syst., vol.
36, pp. 1309-1316, Oct. 1989.
[3] T. Komarek and P. Pirsch, "Array architectures for block matching
algorithms," IEEE Trans. Circuits Syst., vol. 36, pp. 1301-1308, Oct.
1989.
[4] STI3220, Image Processing Databook, SGS-Thomson Microelectron.,
Oct. 1992.
[5] K. Ishihara et al., "A half-pel precision MPEG2 motion estimation
processor with concurrent three-vector search," in ISSCC, Dig. Tech.
Papers, Feb. 1995, pp. 288-289.
[6] K. M. Yang, M. T. Sun, and L. Wu, "A family of VLSI design for the
motion compensation block matching algorithm," IEEE Trans. Circuits
Syst., vol. 36, pp. 1317-1325, Oct. 1989.
[7] S. H. Nam and M. K. Lee, "Flexible VLSI architecture of motion
estimator for video image compression," IEEE Trans. Circuits Syst. II,
vol. 43, pp. 467-470, June 1996.
[8] C.-H. Hsieh and T. P. Lin, "VLSI architecture for block matching motion
estimation algorithm," IEEE Trans. Circuits Syst. Video Technol., vol. 2,
pp. 169-175, June 1992.
[9] A. Ohtani et al., "A motion estimation processor for MPEG2 video
real time encoding at wide search range," in Proc. CICC, 1995, pp.
17.4.1-17.4.4.
I. INTRODUCTION
In [1], we proposed a new technique to synthesize a real, stable, single-input, single-output ARMA($n, m$) filter as a cascade
of degree-one or degree-two real lossless lattice sections from the
reflection coefficient and the zeros of transmission (real or complex),
with a minimum number of delay elements. The technique relies on
a four-step complex Richards function extraction where two steps
are used for degree reduction, and the other two for obtaining real
degree-two sections from complex degree-one sections. The resulting
structure is terminated in a lossy constant real one-port section after
all of the dynamics are extracted through repeated lossless extractions.
In the present paper, we develop a new technique to obtain a lossy
cascade structure from a lossless one with lossy termination while
preserving its passivity and realness properties. The key idea is to
distribute the loss term, embedded in the terminating section, among
the lattice sections in such a fashion as to include a loss term locally,
according to some desirable pattern. Since the resulting sections are
not computable, i.e., admit delay-free loops, we transform them to
Manuscript received July 16, 1996; revised March 12, 1997. This paper
was recommended by Associate Editor B. A. Shenoi.
L. Sellami is with the Department of Electrical Engineering, US Naval
Academy, Annapolis, MD 21402 USA (e-mail: sellami@eng.umd.edu) and
also with the Department of Electrical Engineering, University of Maryland,
College Park, MD 20742 USA.
R. W. Newcomb is with the Department of Electrical Engineering,
University of Maryland, College Park, MD 20742 USA (e-mail: newcomb@eng.umd.edu).
Publisher Item Identifier S 1057-7130(98)02151-X.