
A Low-Power, High-Speed DCT architecture for image compression: principle and implementation


M. Jridi and A. Alfalou, Senior Member, IEEE
Département optoélectronique, Laboratoire L@bISEN
20 rue Cuirassé Bretagne, CS 42807, 29228 Brest Cedex 2, France
e-mail: (maher.jridi@isen.fr and ayman.al-falou@isen.fr)
Abstract—We present a new design of a low-power and high-speed Discrete Cosine Transform (DCT) for image compression, to be implemented on Field Programmable Gate Arrays (FPGAs). The proposed compression method splits the image to be compressed into lines of 8 pixels and then applies our optimized 1D-DCT algorithm. The DCT optimization is based on the hardware simplification of the multipliers used to compute the DCT coefficients. By using constant multipliers based on Canonical Signed Digit (CSD) encoding, the number of adders, subtracters and registers is minimized. To further decrease the number of required arithmetic operators, a new technique based on Common Subexpression Elimination (CSE) is examined. FPGA implementations prove that the CSE implies fewer computations, less material complexity and a dynamic power saving of about 22% at a clock frequency of 110 MHz on a Spartan-3E device.
I. INTRODUCTION
FPGAs are considered an attractive solution for signal and image processing implementations [1]. This is due to two principal reasons. First, the integration of more than two million gates on a single chip makes the implementation of image processing applications possible [2], [3]. Second, the long development time of Application Specific Integrated Circuits (ASICs) and the number of CPU memory accesses of Graphics Processing Units (GPUs) make FPGAs promising for image processing applications.
On the other hand, image processing finds applications in portable systems such as mobile phones and portable communication equipment. For these battery-operated systems, it becomes essential to design low-power and high-speed image processing algorithms.
One of the most commonly used image processing algorithms is the Discrete Cosine Transform (DCT), employed in many standards such as JPEG [4] and MPEG [5]. It has been widely used in image and speech compression due to its good energy compaction [6], [7]. An optical realization of image compression based on the DCT algorithm is presented in [8]. However, for VLSI implementation, the computational requirement of the DCT is a heavy burden in real-time portable applications. Different DCT architectures have been proposed to exploit signal properties and improve the tradeoff between computation time and hardware requirements. Among these, the DCT algorithm proposed by Loeffler [9] has opened a new area in digital signal processing by reducing the number of required multiplications to 11, which is the theoretical limit for an 8-point DCT [10]. Although the theoretical limit is reached, multipliers are the least suited arithmetic operators for FPGAs. In fact, these operators need many switching events, which increases the critical path and the transition activity. Consequently, by using these operators, the power dissipation increases and the speed decreases.
In the context of DCT implementation, the multipliers are constant multipliers. An efficient way of carrying out constant multiplication is by using shifts and adders. The use of subtracters allows an even more efficient implementation. The best results are obtained by expressing the constants in Canonical Signed Digit (CSD) form [11], [12]. Hartley in [13] identifies common elements in CSD constant coefficients and shares the required resources. This last technique is named Common Subexpression Elimination (CSE). The direct use of this technique on the DCT architecture is studied in this paper, but the results are not satisfying.
The novelty presented in this paper consists in identifying multiple subexpression occurrences in intermediate signals (and not in constant coefficients as in [13]) to compute the DCT outputs. Since the calculation of multiple identical subexpressions needs to be implemented only once, the resources necessary for these operations can be shared, and the total number of required adders and subtracters can be reduced.
This paper is organized as follows: the description of previous works on DCT architectures and multiplication structures is given in section II. Section III is dedicated to the optimization of the DCT architecture based on CSD coding and subexpression sharing. Functional validation for image compression and resource savings in FPGA are illustrated in section IV.
II. DCT OPTIMIZATIONS
A. Problem definition
The N-point DCT of N input samples x(0), ..., x(N-1) is defined as:
X(n) = \sqrt{2/N}\; C(n) \sum_{k=0}^{N-1} x(k) \cos\frac{(2k+1) n \pi}{2N}   (1)

where C(0) = 1/\sqrt{2} and C(n) = 1 if n \neq 0.
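To make the definition concrete, the following short Python sketch evaluates (1) directly; the function name, the NumPy dependency and the sample pixel values are illustrative choices, not part of the proposed architecture.

import numpy as np

def dct_1d(x):
    # Direct evaluation of equation (1) for an N-point input vector x.
    N = len(x)
    X = np.zeros(N)
    for n in range(N):
        C = 1.0 / np.sqrt(2.0) if n == 0 else 1.0
        s = sum(x[k] * np.cos((2 * k + 1) * n * np.pi / (2 * N)) for k in range(N))
        X[n] = np.sqrt(2.0 / N) * C * s
    return X

# Example with one 8-pixel line, as used throughout this paper.
print(dct_1d(np.array([52.0, 55, 61, 66, 70, 61, 64, 73])))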
Traditionally, DCT optimizations have focused on reducing the number of required arithmetic operators. Many fast DCT algorithms are reported in the literature; a summary of these algorithms is presented in [14]. In [10], the authors show
Fig. 1. Loeffler architecture of the 8-point DCT algorithm (inputs x(1)-x(8), outputs X(1)-X(8); four stages built from adders, subtracters, MultAddSub and MultSqrt(2) blocks).
that the theoretical lower limit for the 8-point DCT algorithm is 11 multiplications. Since the number of multiplications of Loeffler's algorithm [9] reaches this theoretical limit, our work is based on this algorithm.
Loeffler proposed to compute the DCT outputs in four stages, as shown in figure 1. Each stage contains some arithmetic operations. The first stage is performed by 4 adders and 4 subtracters, while the second one is composed of 2 adders, 2 subtracters and 2 MultAddSub (Multiplier, Adder and Subtracter) blocks. The fourth stage uses 4 MultSqrt(2) blocks to compute the multiplication by \sqrt{2}.
Recent optimizations of the DCT algorithm have been applied to the arithmetic operators. For the multipliers, one possibility is to use the embedded multipliers of the FPGA. However, this solution suffers from some disadvantages. First, embedded multipliers are not power efficient and consume a large silicon area. Second, the obtained design is not portable to all FPGA families and to ASICs since it uses specific IP cores. To overcome this problem, we review some existing techniques to replace the multipliers.
B. ROM-based design
The authors of [15] present the design of a special-purpose VLSI processor for an 8x8 2-D DCT/IDCT chip that can be used for high-rate image and video coding. The proposed design considers the inner product of (1):

Y = \sum_{k=0}^{N-1} x(k)\, c(k)   (2)

where the c(k) are fixed coefficients of the form \cos\frac{(2k+1) n \pi}{2N} and the x(k) are input image pixels. One possible way to avoid the use of multipliers is to precompute all these products and store them in a ROM rather than computing them online. As the dynamic range of the input pixels is 2^8, the number of values stored in the ROM is equal to N·2^8, each value being encoded on 16 bits. For example, for an 8-point inner product, the ROM size is about 8·2^8·16 bits, which is equivalent to 32.768 kbits. To obtain the 8-point DCT, eight 8-point inner products are required and consequently the ROM size becomes exorbitant.
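As an illustration of the memory cost discussed above, the sketch below precomputes one coefficient set scaled by 128 (as in Table I) and replaces every multiplication of (2) by a table look-up; the variable names and the choice of the output index n = 1 are assumptions made for this example only, not the exact organization of [15].

import numpy as np

N = 8
n = 1                                           # DCT output index chosen for this example
coeffs = np.round(128 * np.cos((2 * np.arange(N) + 1) * n * np.pi / (2 * N))).astype(int)

# One 16-bit word per (coefficient, pixel value) pair:
# N * 2^8 words of 16 bits, i.e. 8 * 256 * 16 = 32768 bits for a single inner product.
rom = [[int(c) * x for x in range(256)] for c in coeffs]

def inner_product_rom(pixels):
    # Each product c(k) * x(k) becomes a ROM read; only the accumulation remains.
    return sum(rom[k][pixels[k]] for k in range(N))

print(inner_product_rom([52, 55, 61, 66, 70, 61, 64, 73]))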
C. Distributed arithmetic
The authors of [16] use a recursive DCT algorithm and their design requires less area than conventional algorithms. The design of [16] considers the same inner product as in (2) and rewrites the variable x(k) as follows:
x(k) = \sum_{b=0}^{B-1} x_b(k)\, 2^b, \qquad x_b(k) \in \{0, 1\}   (3)
where x_b(k) is the b-th bit of x(k) and B is the resolution of the input vector. Finally, the inner product can be rewritten as follows:
Y = \sum_{k=0}^{N-1} c(k) \sum_{b=0}^{B-1} x_b(k)\, 2^b   (4)
Equation (4) can be rearranged as follows:
Y = \sum_{b=0}^{B-1} \left[ \sum_{k=0}^{N-1} c(k)\, x_b(k) \right] 2^b   (5)
Equation (5) defines the distributed arithmetic computation. In fact, the bracketed term in (5) may take only 2^N possible values since x_b(k) can only be 0 or 1. Since the c(k) are fixed coefficient values, we can compute the bracketed term in (5) by storing its 2^N possible combinations in a ROM. The input data can then be used to address the ROM. For the computation of the inner product, we use an accumulator with a shift operator, as shown in figure 2. The result of the inner product is available after N clock cycles. The techniques in [16] and [17] show that an N-point DCT can be obtained by computing N N/2-point inner products instead of N N-point inner products. In all cases, for an N-point fully parallel DCT, the number of LUTs would be enormous.
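A minimal software model of this shift-and-accumulate mechanism (equation (5) and figure 2) is sketched below; the ROM addressing scheme and variable names are assumptions for illustration and do not reproduce the exact design of [16].

import numpy as np

N, B = 8, 8                                     # vector length and input resolution
n = 1                                           # DCT output index chosen for this example
c = np.round(128 * np.cos((2 * np.arange(N) + 1) * n * np.pi / (2 * N))).astype(int)

# 2^N-word ROM: entry 'addr' holds the bracketed term of (5),
# i.e. the sum of the c(k) selected by the bits of 'addr'.
rom = [sum(int(c[k]) for k in range(N) if (addr >> k) & 1) for addr in range(2 ** N)]

def inner_product_da(x):
    # Process the B bit planes of the input vector, one ROM access per plane.
    acc = 0
    for b in range(B):
        addr = sum(((x[k] >> b) & 1) << k for k in range(N))   # address built from bit plane b
        acc += rom[addr] << b                                   # weight by 2^b and accumulate
    return acc

x = [52, 55, 61, 66, 70, 61, 64, 73]
assert inner_product_da(x) == int(np.dot(c, x))                 # same result as the direct product
print(inner_product_da(x))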
Fig. 2. Distributed Arithmetic principle: a serial-to-parallel converter forms the bits x_0(k) to x_7(k), which address a 256-word ROM; the ROM output is shifted and accumulated to produce Y.
III. CSD
A. Principle
The CSD representation was first introduced by Avizienis in [11] as a signed-digit representation. This data representation was originally created to eliminate carry propagation chains in arithmetic operations. The CSD representation is a unique signed-digit representation containing the fewest possible number of nonzero bits. Consequently, for constant multipliers, the number of additions and subtractions is minimal.
For a constant coefficient c, the CSD representation is expressed as follows:
c = \sum_{i=0}^{N-1} c_i\, 2^i, \qquad c_i \in \{-1, 0, 1\}   (6)
CSD numbers have essentially two properties:
1) no two consecutive bits in a CSD number are nonzero;
2) the CSD representation of a number contains the minimum possible number of nonzero bits, hence the name canonic.
An example of the conversion between natural binary representation and CSD representation is given in Table I, where the converted values are the constants used in the 8-point DCT Loeffler algorithm. The digits 1 and -1 are represented by + and -, respectively. In line 4 of the table, 128·cos(π/16) ≈ 126 contains 6 partial products since its binary representation is 01111110. In the CSD convention, 126 is represented by +00000-0, which is 2^7 - 2^1. An extended study of CSD encoding applied to the 16-point DCT algorithm was also made: the number of partial products is 71 for the binary representation and 52 for the CSD representation (26% saving). Finally, a generalized statistical study of the average number of nonzero elements in N-bit CSD numbers is presented in [11] and proves that this number tends asymptotically to N/3 + 1/9. Hence, on average, CSD numbers contain about 33% fewer nonzero bits than two's complement numbers. Consequently, for constant multipliers, the number of partial products is reduced.
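A compact way to see how these representations are obtained is the classical non-adjacent-form recoding sketched below; this particular algorithm is a standard textbook construction shown here only to reproduce the Table I encodings, and is not taken from [11].

def to_csd(value):
    # Non-adjacent-form recoding of a non-negative integer:
    # no two consecutive digits are nonzero and the digit count is minimal.
    digits = []                        # least significant digit first
    while value:
        if value & 1:
            d = 2 - (value % 4)        # +1 if value = 1 (mod 4), -1 if value = 3 (mod 4)
            value -= d
        else:
            d = 0
        digits.append(d)
        value //= 2
    return digits

def csd_string(value):
    return ''.join({1: '+', 0: '0', -1: '-'}[d] for d in reversed(to_csd(value)))

print(csd_string(126), csd_string(181))    # +00000-0 and +0-0-0+0+, as in Table I
print(sum(d != 0 for d in to_csd(126)))    # 2 nonzero digits, hence 2 partial products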
B. New CSE technique for DCT algorithm
To enhance the use of adder and shift operators we propose to employ the Common Subexpression Elimination (CSE) technique. CSE was introduced in [13] and applied to digital filters. For FIR filters, this technique exploits bit pattern occurrences in the filter coefficients. The identification of occurrences permits the creation of subexpressions and thereby saves hardware resources. A bit pattern occurrence can be detected between two or more different filter coefficients or within the same coefficient. Since the constant coefficients all multiply a single data signal (the signal to be filtered), CSE on the filter coefficients implies less computation and less power consumption.
Contrary to FIR filters, the constant coefficients of the DCT in Table I multiply 8 different input data, since the DCT transforms 8 input points into 8 output points. From this observation, we cannot exploit the redundancy between different constants. Moreover, for bit patterns within the same constant, Table I shows that only the constant \sqrt{2} presents one common subexpression, which is +0- repeated once with an opposite sign. Consequently, we cannot use the CSE technique in the same manner as for FIR filters.
In order to benefit from the CSE technique, we adapt it to the DCT optimization. To exploit the sharing of subexpressions, we do not consider occurrences within the CSD coefficients but the interaction between these codes. Moreover, according to our compression method (detailed in the next section), we may use only 1, 2 or 3 DCT outputs among the 8; it is therefore necessary to be able to compute specific outputs separately. To emphasize the CSE effect, we take the example of X(2) (X(2) = E35 + E37). According to figure 1, we can express E35 as follows:
E35 = (E25 + E28)
    = (E18 \cos(3\pi/16) + E12 \sin(3\pi/16)) + (E14 \cos(\pi/16) - E16 \sin(\pi/16))   (7)
Using CSD encoding of Table I, (7) is equivalent to:
E35 = E18 (2^7 - 2^5 + 2^3 + 2^1) + E12 (2^6 + 2^3 - 2^0) - E16 (2^5 - 2^3 + 2^0) + E14 (2^7 - 2^1)   (8)
After rearrangement (8) is equivalent to:
E35 = 2^7 (E18 + E14) + 2^6 E12 - 2^5 (E16 + E18) + 2^3 (E12 + E16 + E18) + 2^1 (E18 - E14) - 2^0 (E12 + E16)   (9)
In the same manner, we can determine E37:
E37 = 2^7 (E12 + E16) - 2^6 E18 + 2^5 (E14 - E12) + 2^3 (E12 - E14 - E18) + 2^1 (E12 - E16) + 2^0 (E14 + E18)   (10)
Equations (9) and (10) give:
X(2) = 2^7 (\overbrace{E16 + E18}^{CS1} + E12 + E14) + 2^6 (E12 - E18)
     - 2^5 (\overbrace{E12 - E14}^{CS2} + \overbrace{E16 + E18}^{CS1})
     + 2^3 (\overbrace{E12 - E14}^{CS2} + E12 + E16)
     + 2^1 (\overbrace{E18 - E16}^{CS3} + \overbrace{E12 - E14}^{CS2})
     + 2^0 (\overbrace{E14 - E12}^{CS2} + \overbrace{E18 - E16}^{CS3})   (11)
TABLE I
8-POINT DCT FIXED COEFFICIENT REPRESENTATION

Real value    Decimal   Natural binary   Partial products   CSD         Partial products
cos(3π/16)    106       01101010         4                  +0-0+0+0    4
sin(3π/16)    71        01000111         4                  0+00+00-    3
cos(π/16)     126       01111110         6                  +00000-0    2
sin(π/16)     25        00011001         3                  00+0-00+    3
cos(6π/16)    49        00110001         3                  0+0-000+    3
sin(6π/16)    118       01110110         5                  +000-0-0    3
sqrt(2)       181       10110101         5                  +0-0-0+0+   5
Total partial products                   30                             23
TABLE II
MACRO STATISTICS OF THE X(2) CALCULATION

                          Multiplier   CSD       CSD CSE
Adders/Subtracters        11           23        16
Registers                 125          188       119
MULT18X18SIOs             4            0         0
Maximum frequency (MHz)   143.451      121.734   165.888
where CS1, CS2 and CS3 denote three common subexpressions. The identification of common subexpressions can provide important hardware and power consumption reductions. For example, CS2 appears 4 times in X(2): this subexpression is implemented only once and the resources needed to compute it are shared. An illustration of resource sharing is given in figure 3, where << n denotes a shift by n positions. Finally, it is important to notice that the non-overbraced terms in (11) are potential common subexpressions with other DCT outputs such as X(4), X(6) and X(8).
To emphasize the common element sharing, a VHDL model of the X(2) calculation was developed using three techniques: embedded multipliers, CSD encoding, and CSE applied to the CSD encoding. The results in terms of number of arithmetic operators and maximum operating frequency are given in Table II. Column 3 shows that the CSD encoding uses more adders, subtracters and registers in order to replace the 4 embedded multipliers (MULT18x18SIOs); this substitution is paid for by a loss in the maximum operating frequency. On the other hand, column 4 shows that the sharing of arithmetic operators increases the maximum operating frequency while reducing the number of required adders, subtracters and registers.
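The arithmetic behind (9)-(11) and figure 3(b) can be checked with the short sketch below; the values chosen for the stage-2 signals E12, E14, E16 and E18 are arbitrary test inputs, and the constants 106, 71, 25 and 126 are the scaled coefficients of Table I.

# Arbitrary test values for the intermediate signals of figure 1.
E12, E14, E16, E18 = 37, -12, 5, 21

# Each common subexpression is computed once and reused.
CS1 = E16 + E18
CS2 = E12 - E14
CS3 = E18 - E16

# Shift-and-add evaluation of X(2) following equation (11).
X2_shared = ((CS1 + E12 + E14) << 7) + ((E12 - E18) << 6) - ((CS2 + CS1) << 5) \
            + ((CS2 + E12 + E16) << 3) + ((CS3 + CS2) << 1) + (CS3 - CS2)

# Reference value: E35 and E37 computed with the scaled constant multipliers.
E35 = 106 * E18 + 71 * E12 - 25 * E16 + 126 * E14
E37 = 106 * E12 + 126 * E16 - 71 * E18 + 25 * E14
assert X2_shared == E35 + E37
print(X2_shared)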
IV. VALIDATION FOR IMAGE COMPRESSION
The compression technique used in this work was presented in [19] and the associated model is illustrated in figure 4. This compression was originally realized using optical components to benefit from their speed. However, the optical realization in [19] suffers from poor image quality. Moreover, optical components are often expensive and bulky. For all these reasons, we decided to implement the compression with digital components.
Fig. 3. X(2) calculation. (a): conventional method using the cos(π/16), sin(π/16), cos(3π/16) and sin(3π/16) constant multipliers; (b): shared subexpressions CS1, CS2 and CS3 using CSD encoding and shift operators.
A. Image compression principle and simulation
According to the model of figure 4, each image is parallelized into 8-point blocks. This operation can be done by a serial-to-parallel block composed of 8 flip-flops. After that, we compute two of the 8 DCT outputs (X(1) and X(2)) and obtain an acceptable compressed image quality. For the decompression process, the 8-point IDCT takes X(1) and
Fig. 4. Image compression technique model: input image, serial-to-parallel conversion, 8-point DCT, coefficient selection (CR), 8-point IDCT, parallel-to-serial conversion, compressed image.
X(2) followed by 6 zeros to obtain a sequence of 8 inputs. Consequently, the minimum compression rate is obtained when we take all the DCT outputs (from X(1) to X(8)) and the maximum is reached when we take only the first DCT output, as the low-frequency output.
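As a purely functional sketch of this processing chain (not of the hardware), the block-level path of figure 4 can be modelled with SciPy's orthonormal DCT-II, which matches equation (1): only the first coefficients are kept, and the missing ones are replaced by zeros before the inverse transform. The function names and the SciPy dependency are choices made for this illustration.

import numpy as np
from scipy.fft import dct, idct

def compress_block(block, kept):
    # 8-pixel block -> keep only the first 'kept' DCT outputs (X(1), X(2), ...).
    return dct(block, norm='ortho')[:kept]      # orthonormal DCT-II, identical to equation (1)

def decompress_block(coeffs, n=8):
    # Pad the retained coefficients with zeros, then apply the 8-point IDCT.
    X = np.zeros(n)
    X[:len(coeffs)] = coeffs
    return idct(X, norm='ortho')

block = np.array([52.0, 55, 61, 66, 70, 61, 64, 73])
reconstructed = decompress_block(compress_block(block, kept=2))   # keep X(1) and X(2) only
print(np.round(reconstructed))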
To validate the compression model, the 256x256 Lena grayscale image is used (figure 5.A). Each input block is composed of 8 pixels encoded with 8 bits per pixel. The first DCT output, X(1), is encoded without loss using 11 bits; the X(2) to X(8) outputs are encoded using 7 bits, as illustrated in [20]. The compression ratio depends on the number of DCT outputs considered and is expressed by the following equation:
CR_i = \left( 1 - \frac{11 + 7\, i}{8 \times 8} \right) \times 100   (12)
where i is an integer in [0, 7]. According to this equation, i = 0 means that we take only the first DCT output and, for i = 7, all DCT outputs are considered. Finally, the PSNR is evaluated and the results are summarized in figure 5. The PSNR decreases as the compression ratio CR_i increases. This observation is verified from figure 5.A to figure 5.E. According to these images, a compression ratio of 6.25%, equal to (1 - (11 + 7×7)/64) × 100, means that we take all the DCT outputs (8 points); the PSNR associated with this compression ratio is about 55 dB (figure 5.B). On the contrary, a compression ratio of 82.81%, equal to (1 - 11/64) × 100, means that we take only one DCT output, X(1); the associated PSNR is about 24 dB (figure 5.E).
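A quick numerical check of equation (12), under the bit-width assumptions stated above (11 bits for X(1), 7 bits for each further output, 8 x 8 bits per input block):

def compression_ratio(i):
    # Equation (12): i + 1 retained DCT outputs, 11 bits for X(1), 7 bits for each of the others.
    return (1 - (11 + 7 * i) / (8 * 8)) * 100

for i in range(8):
    print(i, round(compression_ratio(i), 2))
# i = 0 gives 82.81% (only X(1) kept); i = 7 gives 6.25% (all eight outputs kept).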
B. Design considerations
We decided to use standard VHDL, which leaves the choice of the target device (FPGA family, CPLD, ASIC) open until the end of the implementation flow. It means that the image compression model reported here is synthesizable and may be implemented on arbitrary technologies.
Moreover, some design considerations are taken into account. As in [21], we use registered adders and subtracters to achieve a high-speed implementation. The critical path is minimized by the insertion of pipeline registers and is equal to the propagation delay of one adder or subtracter. It should be outlined that the use of registered outputs comes at no extra cost on an FPGA when an unused D flip-flop is available at the output of each logic cell. For example, to realize the addition of two vectors, the bitstream of the addition adapts the structure of the FPGA to connect slices to each other. These slices are composed of two 4-input LUTs and two D flip-flops (four 6-input LUTs and four D flip-flops for recent Xilinx targets). Consequently, in the slices already used, the D flip-flops are free, and the registered operators can use them to reduce the critical path.
With these considerations, we can confirm that our models can be implemented on any digital target, but they are optimized for FPGA devices since they take the FPGA structure into account.
C. Synthesis results
In this section, the VHDL codes of the different DCT architectures are synthesized using the Xilinx ISE software for the Spartan-3E XC3S500 target. In order to illustrate the differences in hardware consumption, the FPGA synthesis results are presented in Table III. The second column gives the logic and arithmetic elements required for the implementation of the DCT based on Xilinx embedded multipliers. Column 3 substitutes the MULT18x18 embedded multipliers by 4-input LUTs: the total number of required LUTs reaches 1424. Columns 4 and 5 detail the resources required for the DCT implementation using, respectively, CSD encoding and CSD with the proposed CSE technique. Thanks to element sharing, the required resources decrease from column 4 to column 5.
Furthermore, the use of the proposed technique implies less computation and consequently a higher maximum operating frequency. This throughput easily allows the processing of more than 30 frames per second. Finally, note that an old FPGA (Spartan-3E) is used in this work; the offered throughput exceeds 300 MS/s with a Virtex-6 target.
D. Power analysis
To estimate the power consumption, a comparison of the dynamic power of the three DCT architectures is made using the Xilinx XPower tool. As shown in figure 6, for a clock frequency of 110 MHz the shared subexpressions reduce the dynamic power by 22% compared to the CSD-based DCT architecture and by 9% compared to the DCT based on embedded multipliers. Furthermore, these values compare favorably with other works, for example 11.5 mW for a 2 V design at 40 MHz in [22] or 3.94 mW for a 1.6 V design at 40 MHz in [23].
V. CONCLUSION
In this paper, a low-power and high-speed DCT design is presented. The proposed architecture uses CSD encoding to perform the constant multiplications. Furthermore, a new technique based on common subexpression sharing is employed to reduce the computation. The required silicon area and power consumption are reduced and the maximum operating frequency is increased.
Future work will focus on the use of embedded processors such as PicoBlaze to test the overall system. Furthermore, we plan to employ a programmable compression rate for real-time applications.
Fig. 5. VHDL model simulation of image compression
TABLE III
FPGA RESOURCE ESTIMATION OF THE DIFFERENT DCT DESIGNS

                          Embedded multipliers   LUT multipliers   CSD       Proposed CSD CSE
Slices                    356                    -                 581       510
Flip-flops                512                    512               454       516
4-input LUTs              404                    1424              1012      900
MULT18x18                 11                     0                 0         0
Maximum frequency (MHz)   119.748                -                 118.032   137.678
Fig. 6. Dynamic power consumption estimation per sample with 1.6 V design
REFERENCES
[1] S. Hauck and A. DeHon, Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation, ISBN 978-0-12-370522-8, Elsevier Inc., 2008.
[2] E. Monmasson and M. N. Cirstea, "FPGA Design Methodology for Industrial Control Systems - A Review," IEEE Transactions on Industrial Electronics, Vol. 54, No. 4, pp. 1824-1842, August 2007.
[3] M. W. James, "An Evaluation of the Suitability of FPGAs for Embedded Vision Systems," CVPR '05: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, pp. 131-137, 2005.
[4] ISO/IEC DIS 10918-1, Digital compression and coding of continuous-tone still images, 1992.
[5] ISO/IEC JTC1/SC2/WG11, MPEG 90/176, Coding of moving pictures and associated audio, 1990.
[6] J. F. Blinn, "What's the deal with the DCT?," IEEE Computer Graphics and Applications, pp. 78-83, July 1993.
[7] A. Alfalou, M. Elbouz, M. Jridi and A. Loussert, "A new simultaneous compression and encryption method for images suitable to optical correlation," Optics and Photonics for Counterterrorism and Crime Fighting V, edited by C. Lewis, Proc. of SPIE, Vol. 7486, 74860J-1-8, 2009.
[8] A. Alfalou and C. Brosseau, "Optical image compression and encryption methods," Adv. Opt. Photon., Vol. 1, pp. 589-636, 2009.
[9] C. Loeffler, A. Ligtenberg and G. S. Moschytz, "Practical fast 1-D DCT algorithms with 11 multiplications," IEEE ICASSP, pp. 988-991, May 1989.
[10] P. Duhamel and H. H'Mida, "New 2^n DCT algorithms suitable for VLSI implementation," IEEE ICASSP, pp. 1805-1808, November 1987.
[11] A. Avizienis, "Signed-Digit Number Representations for Fast Parallel Arithmetic," IRE Transactions on Electronic Computers, Vol. EC-10, pp. 389-400, 1961.
[12] R. Seegal, "The canonical signed digit code structure for FIR filters," IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 28, No. 5, pp. 590-592, 1980.
[13] R. I. Hartley, "Subexpression sharing in filters using canonic signed digit multipliers," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, Vol. 43, No. 10, pp. 677-688, October 1996.
[14] C. Y. Pai, W. E. Lynch and A. J. Al-Khalili, "Low-power data-dependent 8x8 DCT/IDCT for video compression," IEE Proceedings - Vision, Image and Signal Processing, Vol. 150, pp. 245-254, August 2003.
[15] D. Slawecki and W. Li, "DCT/IDCT processor design for high data rate image coding," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 2, No. 2, pp. 135-146, June 1992.
[16] S. Yu and E. E. Swartzlander Jr., "DCT implementation with distributed arithmetic," IEEE Transactions on Computers, Vol. 50, No. 9, pp. 985-991, September 2001.
[17] S. A. White, "Applications of distributed arithmetic to digital signal processing: a tutorial review," IEEE ASSP Magazine, pp. 4-19, July 1989.
[18] K. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications, Academic Press, New York, 1990.
[19] A. Alkholidi, A. Alfalou and H. Hamam, "A new approach for optical colored image compression using the JPEG standards," Signal Processing, Vol. 87, No. 4, pp. 569-583, April 2007.
[20] M. Jridi and A. Alfalou, "A VLSI implementation of a new simultaneous image compression and encryption method," to appear in IEEE Imaging Systems and Techniques.
[21] Y. Mitsuru and N. Akinori, "A high-speed FIR digital filter with CSD coefficients implemented on FPGA," ASP-DAC '01: Proceedings of the 2001 Asia and South Pacific Design Automation Conference, pp. 7-8, 2001.
[22] M. Matsui, H. Hara, K. Seta, Y. Uetani, L. S. Kim, T. Nagamatsu and T. Sakurai, "200 MHz video compression macrocells using low-swing differential logic," Proceedings of ISSCC, pp. 76-77, 1994.
[23] T. Xanthopoulos, Low power data-dependent transform and still image coding, PhD Thesis, MIT, February 1999.