ICM 2003, Dec. 9-11, Cairo, Egypt.

An Efficient Implementation of the 1D DCT using FPGA Technology


Hassan EL-Banna*, Alaa A. EL-Fattah*, Waleed Fakhr**
*Electronics Research Institute, Cairo, Egypt
**Arab Academy for Science and Technology, Cairo, Egypt

ABSTRACT

This paper describes different algorithms for, and an efficient implementation of, the one-dimensional 8-point Discrete Cosine Transform (1D DCT) on Field Programmable Gate Arrays (FPGAs). One of the main objectives is to minimize the complexity of operations as much as possible while maintaining low delay and high throughput. Distributed Arithmetic is a powerful technique that has been used for fast and efficient implementation of the 1D DCT on FPGAs.

1. INTRODUCTION

The discrete cosine transform (DCT) plays a key role in several image compression standards, including JPEG [1] for still picture compression, ITU H.261 [2] and H.263 for teleconferencing, and ISO MPEG-1 and MPEG-2 [3] for audio-visual compression and communication. Some speech enhancement techniques also use the DCT [4].

In addition, the 1D DCT is most often used to compute the 2D DCT by row-column decomposition. This exploits the fact that the 2D DCT is separable: it can be broken into two sequential 1D DCT operations, one along the row vectors and the second along the column vectors of the preceding row results. Row-column decomposition is the most common method for computing the 2D DCT, and implementations usually focus on optimizing the 1D DCT so that the row-column 2D DCT performs better when the optimized 1D DCT block is applied along rows and columns.

This paper is organized as follows. Section 2 reviews 1D DCT computation. Section 3 reviews some DCT algorithms. Section 4 explains the Distributed Arithmetic technique. The last section shows the 1D DCT architecture we implemented using Distributed Arithmetic.

2. The 1D Discrete Cosine Transform

The Discrete Cosine Transform has long been the basic transform coding method for the JPEG and MPEG standards. It helps separate the image into parts (or spectral sub-bands) of differing importance with respect to the spatial quality of the image. In that respect, it is similar to the Discrete Fourier Transform (DFT), since it transforms a signal or image from the spatial domain to the frequency domain. However, one primary advantage of the DCT over the DFT is that the former involves only real multiplications, which reduces the total number of required multiplications. Another advantage is that for most images much of the signal energy lies at low frequencies; the high-frequency coefficients are often small, small enough to be neglected with little visible distortion. The DCT does a better job than the DFT of concentrating energy into the lower-order coefficients for image data. This characteristic of the DCT, referred to as energy compaction efficiency, along with other advantages, resulted in the JPEG and MPEG standards adopting the DCT as a standard for image compression.

The N-point 1-D DCT is defined by [5]:

[Equation (1) and the definition of its terms are not recoverable from the source]

Real-time implementation of the DCT operation is highly computationally intensive. Accordingly, much effort has been directed to the development of suitable cost-effective VLSI architectures to perform it. Traditionally the focus has been on reducing the number of multiplications required. Additional design criteria have included minimizing the complexity of control logic, memory requirements, power consumption and complexity of interconnect.

3. Some DCT Algorithms

3.1 Chen et al Algorithm

The 8-point DCT can be written as a matrix transform:

Y = AX

where

$$
A = \begin{bmatrix}
d & d & d & d & d & d & d & d \\
a & c & e & g & -g & -e & -c & -a \\
b & f & -f & -b & -b & -f & f & b \\
c & -g & -a & -e & e & a & g & -c \\
d & -d & -d & d & d & -d & -d & d \\
e & -a & g & c & -c & -g & a & -e \\
f & -b & b & -f & -f & b & -b & f \\
g & -e & c & -a & a & -c & e & -g
\end{bmatrix}
$$
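As a numerical check of the matrix form in Section 3.1, the following minimal Python sketch builds A from the letters a to g and compares Y = AX with a direct cosine sum. It assumes that a to g equal c_1 to c_7 with c_n = cos(n*pi/16), and that the DCT is unnormalised with the first coefficient scaled by 1/sqrt(2) so that it matches the row of d entries. The function names are illustrative and do not come from the paper.

```python
import math

N = 8

def c(n):
    """c_n = cos(n*pi/16)."""
    return math.cos(n * math.pi / 16)

# Assumed mapping: a..g = c_1..c_7 (so d = c_4 = 1/sqrt(2)).
# The letter c is stored as cc to avoid clashing with the function c().
a, b, cc, d, e, f, g = (c(n) for n in range(1, 8))

A = [
    [d,   d,   d,   d,   d,   d,   d,   d],
    [a,   cc,  e,   g,  -g,  -e,  -cc, -a],
    [b,   f,  -f,  -b,  -b,  -f,   f,   b],
    [cc, -g,  -a,  -e,   e,   a,   g,  -cc],
    [d,  -d,  -d,   d,   d,  -d,  -d,   d],
    [e,  -a,   g,   cc, -cc, -g,   a,  -e],
    [f,  -b,   b,  -f,  -f,   b,  -b,   f],
    [g,  -e,   cc, -a,   a,  -cc,  e,  -g],
]

def dct_matrix(x):
    """Y = A X for an 8-sample input (64 multiplications)."""
    return [sum(A[k][n] * x[n] for n in range(N)) for k in range(N)]

def dct_direct(x):
    """Direct cosine sum with the same (assumed) scaling as A."""
    out = []
    for k in range(N):
        alpha = 1 / math.sqrt(2) if k == 0 else 1.0
        out.append(alpha * sum(
            x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
            for n in range(N)))
    return out
```

Both functions should agree to rounding error for any 8-sample input, which confirms that the sign pattern of A follows from the cosine symmetry.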

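The row-column decomposition described in the Introduction can be sketched in a few lines of Python: a 1D DCT is applied to every row, and then to every column of the row results. This is a hedged illustration only. The unnormalised scaling (first coefficient multiplied by 1/sqrt(2)) and the helper names `dct_1d`, `transpose` and `dct_2d` are assumptions, not the authors' implementation.

```python
import math

def dct_1d(x):
    """Unnormalised 1D DCT; the first coefficient is scaled by 1/sqrt(2)."""
    n_pts = len(x)
    out = []
    for k in range(n_pts):
        alpha = 1 / math.sqrt(2) if k == 0 else 1.0
        out.append(alpha * sum(
            x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * n_pts))
            for n in range(n_pts)))
    return out

def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

def dct_2d(block):
    """Row-column 2D DCT: transform every row, then every column."""
    row_pass = [dct_1d(row) for row in block]
    col_pass = [dct_1d(col) for col in transpose(row_pass)]
    return transpose(col_pass)
```

Because the same `dct_1d` routine is called 16 times for an 8x8 block, any saving in the 1D transform is multiplied accordingly, which is why implementations concentrate on the 1D block.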
The entries of A, and the other constants used by the algorithms in this section, are listed below (c_? marks a subscript that is not legible in the source):

    a = c_1              o = √2 (c_2 - c_6)
    b = c_2              p = √2 c_6
    c = c_3              q = √2 (c_2 + c_6)
    d = c_4              r = -c_1 + c_3 + c_5 - c_7
    e = c_5              s = c_1 + c_3 - c_5 + c_7
    f = c_6              t = c_1 + c_3 + c_5 - c_7
    g = c_7              u = c_1 + c_3 - c_5 - c_7
    h = c_? + c_?        v = c_? - c_?
    i = c_? - c_?        w = c_? + c_?
    j = (1/2) c_2        x = c_? + c_?
    k = (1/2) c_?        y = c_? - c_?
    m = (1/2) c_?

where

$$
c_n = \cos\left(\frac{n\pi}{16}\right)
$$

Due to the symmetry of the (8 x 8) multiplication matrix, it can be replaced by two (4 x 4) matrix multiplications, which can be computed in parallel, as can the sums and differences forming the vectors below:

$$
\begin{bmatrix} y_0 \\ y_2 \\ y_4 \\ y_6 \end{bmatrix} =
\begin{bmatrix}
d & d & d & d \\
b & f & -f & -b \\
d & -d & -d & d \\
f & -b & b & -f
\end{bmatrix}
\begin{bmatrix} x_0 + x_7 \\ x_1 + x_6 \\ x_2 + x_5 \\ x_3 + x_4 \end{bmatrix}
$$

$$
\begin{bmatrix} y_1 \\ y_3 \\ y_5 \\ y_7 \end{bmatrix} =
\begin{bmatrix}
a & c & e & g \\
c & -g & -a & -e \\
e & -a & g & c \\
g & -e & c & -a
\end{bmatrix}
\begin{bmatrix} x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{bmatrix}
$$

The implementations by Madisetti and Willson [6], Uramoto et al. [7], Matsui et al. [8], and Jang et al. [9] are based upon this decomposition, which requires 32 multiplications. However, Madisetti and Willson reduce the number of multiplications to 28.

The frequently referenced algorithm by Chen et al. [10] requires only 16 multiplications, with 2 multiplications on the critical path. The data-flow graph for Chen's algorithm is shown in Figure 1.

Figure 1: The algorithm by Chen et al

3.2 Lee Algorithm

The Lee algorithm [11] is based on the matrix representation. The first step is nothing more than a butterfly decomposition yielding an even and an odd part. The even part is just a 1-D DCT of order N/2, while the odd part is computed through a matrix multiplication. Figure 2 illustrates the 1-D DCT of order 8.

Figure 2: Lee Algorithm

For a 1-D DCT of order N = 8, the number of operations necessary for this algorithm is 32 multiplications and 32 additions.

3.3 Loeffler Algorithm

Based on equation (1), Loeffler [12] proposed a new class of fast 1D DCT algorithms that require only 11 multiplications and 29 additions. An algorithm of this class is shown in Figure 3.

Figure 3: Loeffler Algorithm

The stages of the algorithm, numbered 0 to 3, have to be executed serially because of data dependencies. However, computation within the first stage can be parallelized. In stage 1, the algorithm splits into two parts: one for the even coefficients, the other for the odd ones. The even part is nothing other than a 4-point DCT, again separated into even and odd parts in stage 2.

The second building block can be calculated using only 3 multiplications and 3 additions instead of 4 multiplications and 2 additions. This can be done by using the equivalence shown in the following equations:

[Equations not recoverable from the source]
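A widely used equivalence of this kind rewrites a plane rotation so that one product is shared between both outputs. The Python sketch below illustrates it under stated assumptions: the building block is taken to be the rotation y0 = a*x0 + b*x1, y1 = -b*x0 + a*x1, with a and b standing for scaled cosine and sine constants. The function names and this exact form are illustrative and are not confirmed by the source.

```python
import math

def rotate_4mul(x0, x1, a, b):
    """Plain rotation: 4 multiplications and 2 additions."""
    return a * x0 + b * x1, -b * x0 + a * x1

def rotate_3mul(x0, x1, a, b):
    """Same rotation with 3 multiplications and 3 additions.

    (b - a) and (a + b) depend only on the constants, so in hardware
    they would be precomputed and cost nothing at run time.
    """
    b_minus_a = b - a                # precomputed constant
    a_plus_b = a + b                 # precomputed constant
    shared = a * (x0 + x1)           # 1 addition, 1 multiplication
    y0 = b_minus_a * x1 + shared     # 1 multiplication, 1 addition
    y1 = shared - a_plus_b * x0      # 1 multiplication, 1 subtraction
    return y0, y1
```

Expanding the two expressions gives y0 = a*x0 + b*x1 and y1 = a*x1 - b*x0, so the saving of one multiplication costs one extra addition.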

The constant C was chosen to be equal to √2, which allows the first DCT coefficient to be evaluated without any multiplication.

Figure 4 explains the building blocks of the algorithm.

Figure 4: Algorithm operators

3.4 Liu and Chiu Algorithm

A different approach, taken by Liu and Chiu [13], is to calculate a running (or recursive) DCT in which the values of the DCT are updated with each new sample. Given a sequence of input data, a 1D DCT of the last N input values is output. Each DCT uses the previous DCT result: the next DCT is obtained by adding the difference between it and the previous DCT. The discrete sine transform (DST) is needed in the calculation of the DCT, and hence both DST and DCT outputs are available.

An implementation of this algorithm is provided by Srinivasan and Liu [14, 15].

4. Distributed Arithmetic

Distributed Arithmetic is a very commonly used technique where multiply-accumulate (MAC) operations play a predominant role, as is especially true in signal processing applications. Typically it serves to eliminate multiplications and replace them with additions, which is useful since a multiplication takes much more time than an addition.

An example of the result is a case where N multiplies followed by an N-input add are replaced by a series of N-input adds followed by a single multiply.

5. The Architecture Implemented

From the Chen et al algorithm, we find that the transform matrix A can be divided into 2 smaller matrices. By using the Distributed Arithmetic technique, these 2 matrices can be further simplified to give the following equations:

$$
y_l = \sum_{k=0}^{3} A_{k,l}\,(x_k + x_{7-k}) \quad \text{for } l \text{ even}
$$

$$
y_l = \sum_{k=0}^{3} A_{k,l}\,(x_k - x_{7-k}) \quad \text{for } l \text{ odd}
$$

The architecture of the 1D DCT is shown in Figure 5.

Figure 5: 8-point 1D DCT Architecture

The 4-product MAC can be designed using conventional arithmetic, as shown in Figure 6, or using serial distributed arithmetic, as shown in Figure 7.

Figure 6: 4-product MAC using conventional arithmetic

Figure 7: 4-product MAC using Serial Distributed Arithmetic
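The even/odd equations of Section 5 can be checked numerically. The Python sketch below compares the full 8x8 product with the version that first forms four sums and four differences and then needs only four products per output. It assumes that A_{k,l} denotes the entry of A in row l and column k, and that A uses the unnormalised scaling with 1/sqrt(2) in the first row; the function names are illustrative.

```python
import math

N = 8

# Assumed form of the transform matrix: A[l][k], row l, column k.
A = [[(1 / math.sqrt(2) if l == 0 else 1.0)
      * math.cos((2 * k + 1) * l * math.pi / 16)
      for k in range(N)] for l in range(N)]

def dct_full(x):
    """Reference: 8 products per output, 64 in total."""
    return [sum(A[l][k] * x[k] for k in range(N)) for l in range(N)]

def dct_even_odd(x):
    """Add/subtract stage followed by 4 products per output, 32 in total."""
    sums = [x[k] + x[7 - k] for k in range(4)]
    diffs = [x[k] - x[7 - k] for k in range(4)]
    y = []
    for l in range(N):
        half = sums if l % 2 == 0 else diffs
        y.append(sum(A[l][k] * half[k] for k in range(4)))
    return y
```

The two results agree because each row of A is symmetric about its centre for even l and antisymmetric for odd l, so each output reduces to one 4-product MAC.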

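To make the serial distributed-arithmetic MAC concrete, here is a minimal Python model, offered as a sketch rather than as the authors' VHDL design. It assumes two's-complement inputs, integer (fixed-point) coefficients, and a 16-entry lookup table holding every possible sum of the four coefficients. The names `build_lut` and `da_mac` and the default 10-bit word length are illustrative choices.

```python
def build_lut(coeffs):
    """Precompute all 2**4 partial sums of the four constant coefficients.

    Bit k of the table address selects whether coeffs[k] is included.
    """
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]

def da_mac(xs, lut, bits=10):
    """Bit-serial distributed-arithmetic MAC: sum of coeffs[k] * xs[k].

    One bit of every input is consumed per cycle, so a `bits`-bit word
    takes `bits` cycles. Each cycle needs only a table lookup, a shift
    and an add (a subtract for the sign bit); no multiplier is used.
    Inputs must fit in `bits`-bit two's complement.
    """
    acc = 0
    for bit in range(bits):
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> bit) & 1) << k   # gather bit `bit` of each input
        term = lut[addr] << bit             # weight by 2**bit
        if bit == bits - 1:
            acc -= term                     # sign bit has negative weight
        else:
            acc += term
    return acc
```

For example, `da_mac([100, -200, 511, -512], build_lut([3784, -1567, 4017, 799]))` yields the same value as the conventional sum of four products, while the table replaces the four multipliers.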
Clearly, using Distributed Arithmetic greatly reduces the hardware. This directly decreases the delay and the area required while increasing the throughput.

The following flow chart briefly explains the hardware implementation of the 1D DCT that we built using Distributed Arithmetic.

[Flow chart; legible blocks: 8 Input Registers (each one 8 bit), Add / subtract units, Accumulators]

We have 8 inputs, each 8 bits wide. First the inputs are registered. Then they are added and subtracted according to the matrix of Chen et al.

The bit-serial architecture is primarily used in the context of multipliers. These are architectures where a single bit of each input word is transmitted during each processing cycle. This reduces I/O; however, an n-bit word requires n processing cycles for transmission. The input word to the bit-serial architecture is 10 bits, which means we need 10 cycles. Lookup tables (ROMs) contain partial product terms that are indexed using the bit-serial input from the multiplier. An accumulator is used to add each partial product term. The VHDL code was written using FPGA Advantage and was implemented on a Xilinx Spartan-II FPGA, which uses look-up tables and should therefore suit the design well.

References

[1] G. K. Wallace, "The JPEG still picture compression standard", in Communications of the ACM, vol. 34, pp. 31-44, April 1991.

[2] "CCITT Recommendation H.261", 1990.

[3] D. L. Gall, "MPEG: a video compression standard for multimedia applications", in Communications of the ACM, vol. 34, pp. 46-58, April 1991.

[4] Woon Hau Chin and B. Farhang-Boroujeny, "Subband Adaptive Filtering with Real-valued Subband Signals for Acoustic Echo Cancellation", conf. 1996.

[5] K. R. Rao and P. Yip, "Discrete Cosine Transform: Algorithms, Advantages, Applications", Academic Press, San Diego, California, 1990.

[6] A. Madisetti and A. Willson, "A 100 MHz 2-D 8x8 DCT/IDCT processor for HDTV applications", IEEE Trans. Circuits and Systems for Video Tech., vol. 5, no. 2, April 1995.

[7] S. Uramoto, Y. Inoue, A. Takabatake, J. Takeda, Y. Yamashita, H. Terane, and M. Yoshimoto, "A 100 MHz 2D discrete cosine transform core processor", in IEEE Journal of Solid State Circuits, vol. 27, pp. 492-499, April 1992.

[8] M. Matsui, M. Hara, Y. Uetani, L. Kim, T. Nagamatsu, Y. Watanabe, K. Masuda, T. Sakurai, "A 200 MHz 13 mm2 2-D DCT macrocell using sense-amplifying pipeline flip-flop scheme", IEEE Jour. Solid State Circuits, vol. 29, no. 12, Dec. 1994.

[9] Y. Jang, J. Kao, J. Yang, and P. Huang, "A 0.8 µ 100 MHz 2D DCT core processor", in IEEE Transactions on Consumer Electronics, vol. 40, pp. 703-709, August 1994.

[10] W. Chen, C. H. Smith, and S. Fralick, "A fast computation algorithm for the discrete cosine transform", in IEEE Transactions on Communications, vol. 25, pp. 1004-1009, September 1977.

[11] Y. P. Lee et al., "A cost-effective architecture for 8x8 two-dimensional DCT/IDCT using direct method", IEEE Transactions on Circuits and Systems for Video Technology, vol. 7, no. 3, June 1997.

[12] C. Loeffler, A. Ligtenberg and G. S. Moschytz, "Practical Fast 1-D DCT algorithm with 11 multiplications", Proceedings of ICASSP, vol. 2, pp. 988-991, 1989.

[13] K. Ray-Liu and C. T. Chiu, "Unified parallel lattice structures for time-recursive discrete cosine/sine/Hartley transforms", in IEEE Transactions on Signal Processing, vol. 41, pp. 1357-1377, March 1993.

[14] V. Srinivasan and K. Ray-Liu, "VLSI design of high-speed time-recursive 2D DCT/IDCT processor for video applications", in IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, pp. 87-96, February 1996.

[15] V. Srinivasan and K. Ray-Liu, "Full custom VLSI implementation of high-speed 2-D DCT-IDCT chip", in IEEE Proceedings ICIP-94, vol. 3, pp. 606-610, November 1994.
