An FPGA Implementation of Motion Estimation Algorithm for H.264/AVC

M. Kthiri, P. Kadionik, H. Lévi
IMS laboratory - ENSEIRB-MATMECA, University
Bordeaux 1, CNRS UMR 5218
351, Cours de la Libération, 33405 Talence Cedex, France
e-mail: kthiri@enseirb-matmeca.fr, kadionik@enseirb-matmeca.fr,
herve.levi@ims-bordeaux.fr

H. Loukil, A. Ben Atitallah, N. Masmoudi
University of Sfax, High Institute of Electronics and
Communication,
BP 868, 3018 Sfax, TUNISIA
e-mail: ahmed.benatitallah@isecs.rnu.tn,
Hassenloukil@gmail.com, nouri.masmoudi@enis.rnu.tn

Abstract- The H.264/AVC standard achieves much higher
coding efficiency than previous video coding standards.
Unfortunately, this comes at the cost of considerably increased
encoder complexity, mainly due to motion estimation.
Various fast algorithms have therefore been proposed to
reduce computation, but they do not consider how they can be
implemented effectively in hardware. In this paper, we propose a
hardware architecture for a fast-search block matching motion
estimation algorithm, the Line Diamond Parallel Search
(LDPS), for H.264/AVC video coding systems. The architecture
uses pipeline processing techniques and aims at minimum latency,
maximum throughput and full utilization of hardware resources.
The VHDL code has been tested and can run at up to 390 MHz
on a Xilinx Virtex-5 FPGA.
Keywords- H.264/AVC, Motion estimation, VHDL, FPGA
I. INTRODUCTION
Video coding standards such as H.26x use a Motion
Estimation (ME) algorithm to achieve high compression
efficiency. ME relies on a Block Matching Algorithm
(BMA), a popular technique for exploiting temporal
redundancy in a video sequence. However, it is computationally
complex and takes 50% to 70% of the processing time for
video encoding [1]. For BMA,
the simplest and most accurate strategy is the Full Search (FS)
algorithm [2]. This method, presented in figure 1,
exhaustively evaluates all the possible candidate motion
vectors within the search window in order to find the globally
best matched block in the reference frame.
Figure 1. Block matching algorithm (the current MB of the current frame is compared with candidate MBs inside a search window of the previous reference frame).
The matching algorithm consists in computing an error cost
function, usually the Sum of Absolute Differences (SAD)
between the MacroBlocks (MB: 16x16 pixels). If x(i, j) and
y(i, j) are the pixels of the relevant current and candidate MBs
and m and n are the coordinates of the Motion Vector (MV),
the SAD is then defined by:

$$\mathrm{SAD}(m, n) = \sum_{i=0}^{15} \sum_{j=0}^{15} \left| x(i, j) - y(i + m,\, j + n) \right|$$
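As an illustration, the SAD definition can be written as a short software reference model. This is only a sketch: the function name `sad`, the list-of-lists pixel layout and the convention that `reference` is indexed so that i + m and j + n are valid non-negative indices are assumptions made for the example, not details given in the paper.

```python
def sad(current_mb, reference, m, n, size=16):
    """Sum of Absolute Differences between a current MB and the
    candidate MB displaced by (m, n) inside `reference`.

    Assumption: `reference` is a 2-D list indexed from its top-left
    corner, so (m, n) is an offset into it rather than a signed vector.
    """
    total = 0
    for i in range(size):
        for j in range(size):
            total += abs(current_mb[i][j] - reference[i + m][j + n])
    return total


# Usage: a flat current MB against a flat reference area.
current = [[1] * 16 for _ in range(16)]
area = [[3] * 20 for _ in range(20)]
example_sad = sad(current, area, 2, 2)  # 256 pixels, each differing by 2
```

A full search would evaluate this function for every candidate (m, n) in the search window, which is where the quadratic cost discussed in the text comes from.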

This exhaustive approach achieves optimal performance in
terms of PSNR (Peak Signal to Noise Ratio) for a given
compression factor, but at a high computational
cost that grows quadratically with the search
window size [3]. In fact, the very complex computation
prevents its practical implementation in a processor for
real-time applications. It is considered the bottleneck of
the video coding system. Hence, many fast algorithms have been
proposed in the literature, such as the Three Step Search (TSS)
[4], the New Three Step Search (NTSS) [5], the Diamond
Search (DS) [6], the HEXagon-Based Search (HEXBS) [7], the
Nearest Neighbors Search (NNS) [8], the Horizontal Diamond
Search (HDS) [9], the Cross-Diamond Search (CDS) [10], the
Predictive Motion Vector Field Adaptive Search Technique
(PMVFAST) [11], the Enhanced Predictive Zonal Search
algorithm (EPZS) [12] and the Line Diamond Parallel Search
algorithm (LDPS) [13], which reduce the
computational complexity at the price of a slight performance
loss. The basic principle of these fast algorithms is to divide the
search process into a few sequential steps and to choose the
next search direction according to the current search result.
Based on previous studies [13], the LDPS algorithm
reduces the processing time more than the other algorithms
with approximately the same video quality.
In this paper, we present a hardware architecture for the LDPS
algorithm in order to obtain a better video quality with a
minimum area cost and less computing time.
The remainder of the paper is organized as follows:
Section II gives a brief overview of the block matching
strategy of the Line Diamond Parallel Search (LDPS) algorithm.
Section III describes the characteristics of the proposed
hardware architecture in detail. Our simulation results and
analysis are presented in section IV. Finally, the conclusion
is given in section V.
978-1-4244-5998-8/10/$26.00 ©2010 IEEE
II. LDPS SEARCH ALGORITHM
The LDPS search algorithm is illustrated in figure 2. The
LDPS exploits the center-biased characteristics of real-world
video sequences by using, in the initial step, the small
diamond search pattern (SDSP) shown in figure
2.a. The second, dynamic pattern refines the search along the
horizontal and vertical motion components, as illustrated in
figures 2.b and 2.c.
Figure 2. Operations of the Line Diamond Parallel Search algorithm: (a) model of the Line Diamond Parallel Search algorithm; (b) vertical line search; (c) horizontal line search. Panel labels: selection of the search direction, then linear search with its first, second and third iterations.

The search path strategy of the LDPS algorithm can be
summarized as follows. At the beginning, the SDSP is placed at
(0, 0), the center of the search window. The center of the SDSP
is called the original point. Direction selection: the
SAD values of the five candidate search blocks are compared. If
the minimum SAD is found at the center of the SDSP, the search
ends. If the minimum SAD point is located at one of the four
vertices, that point becomes the line search
point, and its position relative to the center sets the direction
of the line search. These operations are repeated until the position that
gives the minimum SAD coincides with the center of the small
diamond. Figure 3 presents the LDPS flow chart,
which describes the different steps of the LDPS algorithm.
Figure 3. The LDPS algorithm (flow chart: start; place the small diamond at the centre (0,0); calculate and compare the SADs of the small-diamond positions; if Minsad is the SAD of the centre of the small diamond, end; otherwise determine the line search, calculate and compare the SADs of the line-search positions, and test whether Minsad > previous Minsad of the small diamond).
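The search path summarized above can be sketched in software as follows. This is a hedged illustration, not the authors' implementation: the paper does not fully specify the line search, so the sketch assumes that the search keeps stepping along the selected direction while the SAD decreases and then re-centers the small diamond. The function name `ldps_search`, the `cost` callback (which returns the SAD for a candidate vector) and the `max_range` bound are hypothetical.

```python
VERTICES = ((1, 0), (-1, 0), (0, 1), (0, -1))  # four vertices of the SDSP


def ldps_search(cost, max_range=7):
    """Return (mvx, mvy, min_sad) for a SAD callback `cost(mvx, mvy)`.

    Assumptions: candidates are limited to |mvx|, |mvy| <= max_range,
    and the line search continues while the SAD strictly decreases.
    """
    cx, cy = 0, 0                      # SDSP starts at the center (0, 0)
    center_sad = cost(cx, cy)
    while True:
        # Direction selection: compare the center with the four vertices.
        best_sad, direction = center_sad, None
        for dx, dy in VERTICES:
            x, y = cx + dx, cy + dy
            if max(abs(x), abs(y)) <= max_range:
                s = cost(x, y)
                if s < best_sad:
                    best_sad, direction = s, (dx, dy)
        if direction is None:
            return cx, cy, center_sad  # minimum at the center: end
        # Line search along the selected direction (assumed stopping rule).
        dx, dy = direction
        cx, cy, center_sad = cx + dx, cy + dy, best_sad
        while max(abs(cx + dx), abs(cy + dy)) <= max_range:
            s = cost(cx + dx, cy + dy)
            if s >= center_sad:
                break
            cx, cy, center_sad = cx + dx, cy + dy, s


# Usage with a synthetic cost surface whose minimum is at (3, -2).
result = ldps_search(lambda x, y: abs(x - 3) + abs(y + 2))
```

Because the SAD at the diamond center strictly decreases on every pass, the loop always terminates on a finite search window.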
III. PROPOSED ARCHITECTURE
One of the main goals of the LDPS algorithm is to
reduce computational complexity. In this section, we describe
the proposed hardware architecture for this algorithm, designed
to obtain maximum performance with minimum area cost.
This architecture is based on embedded FPGA memories in
order to reduce the integration area. It uses a
sequential hardware architecture for computing the motion
vector. Referring to figure 4, the architecture is decomposed
into two modules. The first module loads the current MB and
the reference search area. The second module is a search
module which finds the suitable motion vector according to the
LDPS algorithm.

Figure 4. Block diagram of the system architecture for the LDPS algorithm (signals shown: CLK, Reset_n, Start_top_level and the 32-bit Data_in; the loading module delivers the 8-bit pixels Pix_cur(0) to Pix_cur(255) and Pix_ref(0) to Pix_ref(1115) to the search module, together with Done_chargement; the search module produces MVX_MIN, MVY_MIN, MIN_SAD, X_MIN and Done_LDPS).

The main idea of this architecture is to use the FPGA
memory blocks for data storage with a sequential calculation in
order to minimize the number of logic elements used in the FPGA
circuit. We now detail the loading and search modules.
A. Loading module
Figure 5 presents the loading module, which stores
the current MB and the reference search area in memory.
Figure 5. Loading module (a control unit with Start, Start_MB_cur, Start_fen_ref and Done_load drives two memories fed from the 32-bit Data_in: the reference search memory, where a 288-bit register (36 pixels) feeds a 36x31 (1116-pixel) memory with Wren, a 5-bit address and a 288-bit Data_out; and the current macroblock memory, where a 128-bit register (16 pixels) feeds a 16x16 (256-pixel) memory with Wren, a 4-bit address and a 128-bit Data_out_cur).

The current MB memory stores 16 x 128-bit
words. Each memory location contains one MB line, i.e.
a group of sixteen 8-bit pixels. This organization is used in order
to limit memory accesses. Referring to figure 6, 4 clock cycles
are needed to store one line of a MB (the 32-bit input data
bus feeds a group of 4 pixels at a time). Thus, 64
clock cycles are required for the whole MB.
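The line assembly described above can be modelled in a few lines. This is a behavioural sketch only: the function name `pack_line` is hypothetical, and placing pixel 0 in the least significant byte is an assumption about bit ordering that the text does not state.

```python
def pack_line(pixels, bus_pixels=4):
    """Assemble one line of 8-bit pixels into a single wide word,
    taking `bus_pixels` pixels per clock cycle (32-bit bus -> 4 pixels).

    Returns (line_word, cycles). Assumption: pixel k is stored in byte k.
    """
    line_word, cycles = 0, 0
    for start in range(0, len(pixels), bus_pixels):
        group = pixels[start:start + bus_pixels]
        for k, p in enumerate(group):
            # pixel (start + k) occupies byte (start + k) of the line word
            line_word |= (p & 0xFF) << (8 * (start + k))
        cycles += 1  # one bus transfer per clock cycle
    return line_word, cycles


# Usage: one 16-pixel MB line takes 4 cycles, so 16 lines take 64.
word, cycles_per_line = pack_line(list(range(16)))
cycles_per_mb = 16 * cycles_per_line
```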
Figure 6. Line loading of current MB (Pix0 to Pix3 are loaded in the first clock cycle, Pix4 to Pix7 in the second, Pix8 to Pix11 in the third and Pix12 to Pix15 in the fourth, building the 16-pixel line).
The reference search memory is a 36x31-byte memory
block that contains the search area data from the reference
frame. Each group of 36 pixels (one line of the reference search area)
is stored in a 288-bit buffer (figure 7). This
operation is repeated 31 times to hold the whole reference search area
in the memory. Since nine clock cycles are
needed to store one line of the search area, 279 clock cycles are needed to
load one reference search area. These loads are synchronized
by a control unit through the control signals Start_MB_cur and
Start_fen_ref. Furthermore, these memories are controlled by
the signal wren. If wren = '1', the memory receives data and
stores them at the appropriate addresses. If wren = '0', the
stored pixels can be read from the memory by
giving only the desired address, provided by the control
unit.
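The loading figures quoted for both memories follow directly from the bus width, and the wren behaviour can be modelled with a very small class. The sketch below is illustrative: `load_cycles` and `LineMemory` are hypothetical names, and the memory model ignores clocking and read latency.

```python
import math

BUS_BITS = 32    # width of Data_in
PIXEL_BITS = 8   # one pixel per byte


def load_cycles(line_pixels, lines):
    """Return (cycles per line, cycles for the whole area)."""
    per_line = math.ceil(line_pixels * PIXEL_BITS / BUS_BITS)
    return per_line, per_line * lines


class LineMemory:
    """Simplified line memory: wren = 1 writes, wren = 0 reads."""

    def __init__(self, depth):
        self.words = [0] * depth

    def access(self, wren, address, data_in=0):
        if wren == 1:
            self.words[address] = data_in  # store at the given address
            return None
        return self.words[address]         # read the stored line


current_mb_load = load_cycles(16, 16)   # 16 lines of 16 pixels
search_area_load = load_cycles(36, 31)  # 31 lines of 36 pixels
```

With these parameters the model gives 4 and 64 cycles for the current MB and 9 and 279 cycles for the reference search area, matching the values in the text.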
Figure 7. Line loading of reference search (P0 to P35 are loaded four pixels per clock cycle, from the first to the ninth clock cycle, building the 36-pixel line).
B. Search module
The hardware structure of this module is shown in figure 8.
Figure 8. Search module (sub-modules: control unit, extraction module, SAD module and comparator; signals shown: CLK, Reset_n, Done_chargement, Pix_cur(0) to Pix_cur(255), Pix_ref(0) to Pix_ref(1115), Pix_ref_MB(0) to Pix_ref_MB(255), SAD, MVX, MVY, X, Start_bloc_SAD, Start_bloc_comp, MVX_MIN, MVY_MIN, MIN_SAD and X_MIN).
Referring to this figure, this sequential hardware
architecture is composed of four sub-modules: the control unit,
which is responsible for synchronization between the different
blocks of the search module, the extraction module, the SAD
module and the comparator module. Together they find the
best-matching reference MB for the current MB in the defined search area.
After receiving the reference search area from the
memories, the pixels associated with each reference
MB are selected using parameters generated by the control unit. This
step is performed by the extraction module. The second step
evaluates the SAD for this reference MB: the appropriate
reference MB (for the small diamond or the line search
required by the LDPS algorithm) is fed into the SAD module.

1) Extraction module: The extraction module extracts the
reference MB associated with the appropriate position in the small
diamond or the line search. In one clock cycle, this block
selects the line of pixels associated with the reference MB, which is
presented on Data_out_ref. These lines are fed into the SAD
module in order to calculate the SAD values. Figure 9 shows
an example of reference MB extraction for a small diamond
positioned at the center (0,0).
Figure 9. Extraction of the suitable reference MB (the search area pixels are numbered Pixel0 to Pixel1115, 36 per line; around the center of search (0,0), the figure outlines the macroblock at the center and the macroblocks at the right, high, left and low positions of the small diamond).
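The extraction step can be illustrated with a small indexing model. It assumes that the 1116 reference pixels are numbered row by row with 36 pixels per line, which the pixel numbering in figure 9 suggests. The function name `extract_mb` is hypothetical, and because the paper does not give the mapping from a motion vector to a window position, the caller supplies the top-left corner directly.

```python
SEARCH_W, SEARCH_H, MB = 36, 31, 16  # search area and MB dimensions


def extract_mb(pix_ref, top, left):
    """Return the 16 lines (16 pixels each) of the reference MB whose
    top-left corner is at row `top`, column `left` of the search area.

    Assumption: pix_ref is a flat, row-major list of 36 * 31 pixels.
    """
    if not (0 <= top <= SEARCH_H - MB and 0 <= left <= SEARCH_W - MB):
        raise ValueError("macroblock falls outside the search area")
    lines = []
    for r in range(MB):
        start = (top + r) * SEARCH_W + left
        lines.append(pix_ref[start:start + MB])  # one line per clock cycle
    return lines


# Usage: label each pixel with its own index to check the addressing.
pix_ref = list(range(SEARCH_W * SEARCH_H))
top_left_mb = extract_mb(pix_ref, 0, 0)
bottom_right_mb = extract_mb(pix_ref, 15, 20)
```

Each returned line corresponds to what the text calls Data_out_ref for one clock cycle.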
2) SAD Module: The hardware architecture of the SAD
module is shown in figure 10.
Figure 10. SAD architecture (Start_SAD; the 16-byte lines Data_out_cur and Data_out_ref are split into byte pairs feeding PE0 to PE15, each made of a difference stage and an absolute-value stage; an adder and an accumulator combine the PE outputs).
For two 16-pixel lines (a line of the current MB and a line of the
reference MB given by the extraction module using parameters
generated by the control unit), we use 16 parallel units for
difference and absolute value calculation. Each of those units
produces the absolute difference of one pixel pair of the row. The
number of Processing
Elements (PE) in this array is equal to the number of pixels in
one pixel line. Each PE gives its output result in one clock
cycle, so one line is processed per cycle and a 16x16 MB in 16
clock cycles. The final
result is obtained after 17 clock cycles (16 clock cycles for the
difference and absolute value and one clock cycle for
accumulating the final SAD).
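A behavioural model of this row-by-row processing is sketched below. It is not the VHDL design: the function name `sad_module` is hypothetical, and the way the 17 cycles are counted (one per line plus one final accumulation cycle) simply mirrors the description in the text.

```python
def sad_module(cur_lines, ref_lines):
    """Process one MB line per cycle with 16 parallel PEs.

    Returns (sad, cycles). Each PE computes |cur - ref| for one pixel
    pair; the adder and accumulator sum the PE outputs.
    """
    accumulator, cycles = 0, 0
    for cur, ref in zip(cur_lines, ref_lines):
        abs_diffs = [abs(c - r) for c, r in zip(cur, ref)]  # PE0..PE15
        accumulator += sum(abs_diffs)                       # adder stage
        cycles += 1
    cycles += 1  # one extra cycle to accumulate the final SAD
    return accumulator, cycles


# Usage: two flat 16x16 blocks that differ by 3 at every pixel.
cur_mb = [[5] * 16 for _ in range(16)]
ref_mb = [[2] * 16 for _ in range(16)]
mb_sad, mb_cycles = sad_module(cur_mb, ref_mb)
```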

3) Comparator module: This module determines
the position which gives the minimum SAD value among the different
positions of the small diamond or line search. The position
details are fed back to the control unit in order to choose
the new line search. We obtain the final motion vector when
the position which gives the minimum SAD coincides with the
center of the small diamond.
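The comparator can be pictured as a running-minimum register. The sketch below is an assumption-laden illustration: the class name `Comparator` is hypothetical, the attribute names echo the MIN_SAD, MVX_MIN and MVY_MIN outputs of figure 8, and keeping the earlier candidate on a tie is a choice made for the example, since the paper does not describe tie handling.

```python
class Comparator:
    """Keeps the smallest SAD seen so far and its motion vector."""

    def __init__(self):
        self.min_sad = None
        self.mvx_min = 0
        self.mvy_min = 0

    def update(self, sad, mvx, mvy):
        """Return True if (mvx, mvy) becomes the new best position.

        Assumption: on equal SAD values the earlier candidate is kept.
        """
        if self.min_sad is None or sad < self.min_sad:
            self.min_sad, self.mvx_min, self.mvy_min = sad, mvx, mvy
            return True
        return False


# Usage: feed the SADs of successive candidate positions.
comp = Comparator()
updates = [comp.update(100, 0, 0), comp.update(120, 1, 0),
           comp.update(80, 0, -1), comp.update(80, -1, 0)]
```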
IV. EXPERIMENTAL RESULTS
The proposed architecture has been implemented in VHDL,
simulated and verified using the ModelSim 6.0 tool. The
VHDL code has been synthesized, placed and routed into the
FPGA target device. The FPGA circuit is a Xilinx Virtex-5
XC5VLX330 used with the Xilinx ISE 11.1 tool. In this
section, we discuss the experimental results for the proposed
design based on memory blocks. The correctness of the
implemented architecture has also been checked
by passing different input patterns into our architecture
and by comparing the obtained outputs with the reference
software results. Synthesis results of the sequential hardware
architecture for the LDPS algorithm are given in Table I.
TABLE I. SEQUENTIAL ARCHITECTURE RESULTS FOR LDPS ALGORITHM
Resource type         Usage    Percent of FPGA (%)
Slice LUTs            3021     1
Slice registers       700      1
Total memory (Kb)     468      4
Number of IOBs        77       6
From this table, we can see that there is enough free space
after integrating the sequential architecture to add other
video applications. Thus, the sequential architecture can be used
to optimize the silicon area. Our architecture can
operate at up to 390 MHz, which corresponds to a delay of about
2.56 ns per clock cycle.
The experimental results show that 226 clock cycles are
necessary to obtain the suitable motion vector with the
proposed architecture.
With these results, we can process up to 440 Msamples/sec
with our sequential hardware architecture. These results are
suited for processing H.264/AVC HDTV (1920x1088 @ 30 Hz)
video sequences: the throughput of this architecture exceeds
the performance required for an HDTV H.264/AVC encoder.
This architecture can therefore be used when silicon
area must be optimized. Furthermore, the main strengths of
this architecture are its flexibility, reconfigurability,
extensibility and modularity, making it easily suited to a
variety of low-level processing algorithms.
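The throughput claim can be checked with simple arithmetic. The paper does not spell out how the 440 Msamples/sec figure is derived; the sketch below assumes it counts the 256 pixels of one MB per 226-cycle search at 390 MHz, which reproduces the figure to within rounding. All names are hypothetical.

```python
F_MAX_HZ = 390e6          # maximum clock frequency reported
CYCLES_PER_MV = 226       # clock cycles to obtain one motion vector
PIXELS_PER_MB = 16 * 16   # samples covered by one motion vector


def throughput_samples_per_sec(f_hz=F_MAX_HZ, cycles=CYCLES_PER_MV):
    """Samples per second, assuming one MB is processed per search."""
    return PIXELS_PER_MB * f_hz / cycles


def hdtv_rate(width=1920, height=1088, fps=30):
    """Luma samples per second required by the HDTV format."""
    return width * height * fps


clock_period_ns = 1e9 / F_MAX_HZ            # about 2.56 ns
achieved = throughput_samples_per_sec()     # about 442 Msamples/sec
required = hdtv_rate()                      # about 62.7 Msamples/sec
margin = achieved / required                # roughly a factor of 7
```

Under this assumption the architecture has a margin of about seven over the 1920x1088 @ 30 Hz requirement, which is consistent with the statement that the throughput exceeds what an HDTV encoder needs.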
V. CONCLUSION
In this paper, we have presented a high-performance and low-cost
hardware architecture for real-time implementation of the
motion estimation algorithm used by the H.264/MPEG-4 Part
10 video coding standard. We have proposed a new hardware
architecture using memory blocks for the fast LDPS block
matching algorithm. According to the synthesis results,
silicon area can be optimized and video data
throughput increased by using FPGA memory blocks. Our proposed
sequential hardware architecture is able to reach a processing
rate of 440 Msamples/sec with minimum silicon area. It thus
provides an efficient solution for the real-time motion estimation
required in video applications with a low memory bandwidth
requirement. This architecture will be used and integrated
into our embedded system for video processing built around a
Xilinx Virtex-5 board. Finally, this work was carried out for
the RTEL4I project and funded by the French SYSTEM@TIC
ICT cluster [14].
REFERENCES
[1] Thomas W., "Study of Final Committee Draft of Joint Video Specification", ITU-T Rec. H.264, ISO/IEC 14496-10 AVC, draft 1, 2002.
[2] I. Richardson, "Full Search Motion Estimation," in Video Codec Design, pp. 99-101, 2002.
[3] A. Ben Atitallah, P. Kadionik, N. Masmoudi, H. Levi, "HW/SW FPGA Architecture for a Flexible Motion Estimation", IEEE International Conference on Electronics Circuits and Systems, pp. 30-33, Marrakech, Morocco, December 2007.
[4] Her-Ming Jong, Liang-Gee Chen and Tzi-Dar Chiueh, "Parallel architecture for 3-step hierarchical search block-matching algorithm", IEEE Trans. on Circuits and Systems for Video Technology, vol. 2, no. 4, pp. 407-416, August 1994.
[5] Li R, Zeng B, Liou M L, "A new three-step search algorithm for block motion estimation", IEEE Trans. on Circuits and Systems for Video Technology, vol. 4, no. 4, pp. 438-442, 1994.
[6] Tham Y J, Ranganath S, Ranganath M et al, "A novel unrestricted center-biased diamond search algorithm for block motion estimation", IEEE Trans. on Circuits and Systems for Video Technology, vol. 8, no. 4, pp. 369-377, 1998.
[7] C. Zhu, X. Lin, and L.P. Chau, "Hexagon-based search pattern for fast block motion estimation", IEEE Trans. on Circuits and Systems for Video Technology, vol. 12, no. 5, pp. 349-355, 2002.
[8] M. Gallant, G. Côté, F. Kossentini, "An Efficient Computation-Constrained Block-Based Motion Estimation Algorithm for Low Bit Rate Video Coding", IEEE Trans. Image Processing, vol. 8, no. 12, 1999.
[9] A. Samet, N. Souissi, W. Zouch, M. A. Ben Ayed and N. Masmoudi, "New horizontal diamond search motion estimation algorithm for H.264/AVC", Second Symposium on Communication, Control and Signal Processing, ISCCSP 2006, pp. 13-15, Marrakech, Morocco, March 2006.
[10] Cheung CH, Po LM, "A novel cross-diamond search algorithm for fast block motion estimation", IEEE Trans. on Circuits and Systems for Video Technology, vol. 12, no. 12, pp. 1168-1177, 2002.
[11] Alexis M. Tourapis, Oscar C. Au, Ming L. Liou, "Implementation of the Predictive Motion Vector Field Adaptive Search Technique (PMVFAST) algorithm in the Optimization Model 1.0", in ISO/IEC JTC1/SC29/WG11 MPEG2000/M6194, Beijing, China, 2000.
[12] Alexis Michael Tourapis, "Enhanced Predictive Zonal Search for Single and Multiple Frame Motion Estimation", Proceedings of Visual Communications and Image Processing, pp. 1069-79, 2002.
[13] Imen Werda, Haithem Chaouch, Amine Samet, Mohamed Ali Ben Ayed, Nouri Masmoudi, "Optimal DSP-Based Motion Estimation Tools Implementation for H.264/AVC Baseline Encoder", IJCSNS International Journal of Computer Science and Network Security, vol. 7, no. 5,
[14] SYSTEM@TIC ICT cluster. http://www.systematic-paris-region.org/
