Implementation of JPEG2000 Using SSE Instruction Set: ECE 734 VLSI Array Structures For Digital Signal Processing

Implementation of JPEG2000
Using SSE Instruction Set
ECE 734
VLSI Array Structures
For Digital Signal Processing
Ami Mehta
Gilles Muller Spring 2004
2
Table of Contents
Introduction..........................................................................................................................4
1 - JPEG2000 Algorithm.....................................................................................................5
1.1 - Architecture.............................................................................................................5
1.2 – 2D-DWT.................................................................................................................5
2 – Implementation..............................................................................................................7
2-1 – Non optimized version............................................................................................7
2.2 – Optimizations..........................................................................................................8
3 – Verification..................................................................................................................10
4 – Profiler.........................................................................................................................11
5 – Results..........................................................................................................................11
6 – References....................................................................................................................13
3
Introduction
JPEG2000 is a new image compression standard being developed by the Joint
Photographic Experts Group (JPEG), part of the International Organization for
Standardization (ISO). It is designed for different types of still images (bi-level, gray-
level, color, multicomponent) allowing different imaging models (client/server, real-time
transmission, image library archival, limited buffer and bandwidth resources, etc), within
a unified system.
JPEG2000 is intended to provide low bit rate operation with rate-distortion and subjective
image quality performance superior to existing standards, without sacrificing
performance at other points in the rate-distortion spectrum.
JPEG2000 addresses areas where current standards fail to produce the best quality of
performance, such as:
• Low bit rate compression performance (rates below 0.25 bpp for highly-detailed
gray-level images)
• Lossless and lossy compression in a single code stream
• Seamless quality and resolution scalability, without having to download the entire
file. The major benefit is the conservation of bandwidth
• Large images: JPEG is restricted to 64k x 64k images (without tiling). JPEG2000
will handle image sizes up to ( 232-1 )
• Single decompression architecture
• Error resilience for transmission in noisy environments, such as wireless and the
Internet
• Computer generated imagery
• Compound documents
• Region of Interest coding
• Improved compression techniques to accommodate richer content and higher
resolutions
• Metadata mechanisms for incorporating additional non-image data as part of the
file
• JPEG2000 will be able to handle up to 256 channels of information, as compared
to JPEG, which is limited to only RGB data. Thus, JPEG2000 will be capable of
describing complete alternate color models, such as CMYK, and full ICC
(International Color Consortium).
4
1 - JPEG2000 Algorithm
1.1 - Architecture
The encoding procedure is as follows:

• The source image is decomposed into components.
• The image and its components are decomposed into rectangular tiles. The tile-
component is the basic unit of the original or reconstructed image.
• The wavelet transform is applied on each tile. The tile is decomposed in different
resolution levels.
• These decomposition levels are made up of subbands of coefficients that describe the
frequency characteristics of local areas (rather than across the entire tile-component) of
the tile component.
• Markers are added in the bitstream to allow error resilience.
• The codestream has a main header at the beginning that describes the original image
and the various decomposition and coding styles that are used to locate, extract, decode
and reconstruct the image with the desired resolution, fidelity, region of interest and other
characteristics.
• The optional file format describes the meaning of the image and its components in the
context of the application.
1.2 – 2D-DWT
This process decomposes the original image into two sub-bands, usually denoted as the
coarse scale approximation (lower band) and the detail signal (higher band). This
transform can be easily extended to multiple dimensions by using separable filters, i.e. by
applying separate 1-D transforms along each dimension. In particular, we have
implemented the most common approach, commonly known as the square
decomposition. This scheme alternates between operations on rows and columns, i.e. one
stage of the 1-D DWT is applied first to the rows of the image and then to the columns.
5
This process is applied recursively to the quadrant containing the coarse scale
approximation in both directions. In this way, the data on which computations are
performed is reduced to a quarter in each step.
From a performance point of view, the main bottleneck of this transformation is caused
by the vertical filtering (the processing of image columns) or the horizontal one (the
processing of image rows), depending on whether we assume a row-major or a column-
major layout for the images.
The lifting filter computes the low frequency and high frequency coefficients in 1-D.
The following filter coefficients are used for JPEG2000 lossy filter:
6
2 – Implementation
2-1 – Non optimized version

We chose to implement the lifting transform using Daubichies 9/7 Analysis. This was
done mainly because lifting requires less working memory and fewer arithmetic
computations. The lossy filter was implemented since our focus is to make JPEG2000
faster on general purpose architectures, where some loss of resolution is tolerable and
lossy compression gives a better compression ratio.
This version was implemented only using C code. The lifting filter described above
computes high and low frequency coefficients in the same time, reusing intermediate
results. This is a good point as only one filter is needed to compute both coefficients.
The wavelet transform poses a major hurdle for the memory hierarchy, due to the
discrepancies between the memory access patterns of the two main components of the 2-
D wavelet transform: the vertical and horizontal filtering. Consequently, the improvement
in the memory hierarchy use represents the most important challenge of this algorithm
from a performance perspective.
As a result, we needed to think about memory use even if this version isn’t optimized for
SSE. The goal of this project was to optimize the algorithm implementation for speed. As
a result, we chose the strategy to store all intermediate results independently preventing
unnecessary false dependences, but thus increasing the amount of memory used.
The Mallat strategy uses an auxiliary matrix to store the results of the horizontal filtering.
In this way, as the following figure shows, the horizontal high and low frequency
components are not interleaved in memory. This method prevents memory scattering
and enable to exploit better SIMD parallelism.
The vertical filtering reads these components and writes the results into the original
matrix following the order expected by the quantization step.
For our implementation we did the down sampling before calculations, to ensure
none of our calculations were wasted and also reused calculations across
computations whenever possible.
7
2.2 – Optimizations
The analysis of this C code using VTune Analyzer showed that the major bottleneck of
our code was the memory access, and particularly a low hit rate in the L1 data cache. To
solve this problem, the following optimizations are used:
• Data alignment
Strictly speaking, data alignment is not required in our codes since the SSE instruction
set includes instructions that allow unaligned data to be copied into and out of the vector
registers. However, such operations are much slower than aligned accesses, which may
cause a significant overhead. To avoid this drawback we have employed 16-byte aligned
data in all our codes, although for the scalar versions this optimization has no significant
effect.
8
access access
Cache layout without alignment Cache layout with alignment
• Matrix juxtaposition
Input and output matrices are juxtaposed in the memory to prevent associativity conflicts.
Indeed, after juxtaposition, the 2 matrices can be loaded entirely in the L1 data cache
without having to switch between one matrix to another.
Here is a part of our code showing these optimizations:

/* Allocate memory to the 2 matrices */
/* The 2 matrices are juxtaposed and aligned on the cache row size:128
bits=16 Bytes */
inline void load_matrix(float **matrix_in, int row, int column, float
**matrix_out){
unsigned long int size;
size=2*row*column*sizeof(float);
if( (*matrix_in=(float*)malloc(size+16)) == NULL ){
printf("Not enough memory available.\n");
exit(1);
}
//align the address at the beginning of a cache line

*matrix_in=(float*)((((int)(*matrix_in))+16)-((((int)(*matrix_in))
+16)%16));
*matrix_out=(float*)((*matrix_in)+row*column);
}
Once these were done we transformed the code using SSE2. This was done to make use
of the inherent parallelism in the code. SSE does four calculations at a time using special
registers allocated for SSE. We used these 128 bit floating point registers to store four 64
bit values. Then these computations were done in parallel. We used intrinsics for this
part. The following piece of code is typical of our use of SSE intrinsics.
m1 =_mm_add_ps(*pscr2,*pscr0);
m2 = _mm_mul_ps(m0_a0, m1);
m3 = _mm_add_ps(*pscr1,m2);
// high1[i]=start1[i]+a0*(start2[i] + start0[i]);
m1, m2, m3, pscr2, m0_a0, etc are all 64 bit values, stored in groups of four as mentioned
before. The following figure shows how the 128 bit registers are used for four
calculations to take advantage of parallelism.
9
3 – Verification
We verified our code using MATLAB. We took an image and converted it to a matrix.
We used this matrix as an input to our DWT code. Then the output matrix of our code
was dumped into a file. We took this output matrix and normalized it by subtracting the
smallest number and dividing by the greatest number and displayed it is an image. The
following results were produced.
Original Image Matlab output for DWT
After row transform using our code After DWT, using our code
The results from our code match those of MATLAB which shows our code works.
10
4 – Profiler
As mentioned before, we used VTune Analyzer to profile our results. The VTune™
Performance Analyzer helps you analyze the performance of your application by locating
hotspots. Hotspots are areas in your code that take a long time to execute. Not only that it
also helps you find out what is causing these hotspots and decide what type of
improvements to make. You can also track critical function calls and monitor specific
processor events, such as cache misses, triggered by sections in your code. You can also
calculate event ratios to determine if processor events are causing the hotspots.
The VTune analyzer collects performance data on your application and system, and
displays it in graphs or tables. From this display, you can analyze the performance of
your application and determine which portions of an application are slowest and why.
To optimize the performance of your application or system, you can do one or more of
the following to find the performance bottlenecks:
• Determine how your system resources, such as memory and processor, are being
utilized to identify system-level bottlenecks.
• Measure the execution time for each module and function in your application.
• Determine how the various modules running on your system affect the
performance of each other.
• Identify the most time-consuming function calls and call sequences within your
application.
• Determine how your application is executing at the processor level to identify
microarchitecture-level performance problems.
The VTune™ Performance Analyzer can find this information by automating the process
of data collection with three types of data collectors, namely, sampling, call graph, and
counter monitor.
Thus we decided to use VTune Analyzer. With the help of profiling we were able to
identify that memory access was the biggest bottle neck for our code.
5 – Results
We conducted a set of simulation using the pre-optimized and post optimized versions of
our code. The simulations were done on Pentium Celeron processor 2.4 GHz, with a 8K
uops L1 trace cache and a 128 Kb L2 cache. We did a number of simulations, since
profiling take samples to get a good estimate we ran a set of three simulations on images
of size 256 x 256, 512 x 512 and 1024 x 1024. The results for the biggest image are
shown here.
A set of three simulations were done on a 1024 x 1024 image. The results for the
improvement in cycles per uops are given below.
11
4
As can be seen on an average there is an improvement of 44.61% was seen during
optimization. The results for 1st level cache load misses are shown below.
3.5
6000000
3
12
As can be seen we dramatically reduced 1st level cache load misses by the various
optimization techniques. The misses reduced on an average by 74%, from about a little
below 5,000,000 to a little over 1,000,000. The second level cache improvement is as
shown below.
350000
There is an improvement of 53% on an average.
Thus it can be shown that our cache optimizations and using SSE code to exploit
parallelism worked well causing an improvement on performance and cache access.
6 – References
[1] Taubman, David and Marcellin, Michael, JPEG 2000, Image compression
Fundamentals, Standards and practice.
300000
[2] Michael D. Adams, The JPEG-2000 Still Image Compression Standard.
[3] C. Tenllado, D. Chaver, L. Piñuel, M. Prieto and F. Tirado, Vectorization of the 2D

Wavelet Lifting Transform using SIMD extensions.
[4] D. Chaver, C. Tenllado, L. Piñuel, M. Prieto and F. Tirado, 2-D Wavelet Transform
Enhancement on General-Purpose Microprocessors: Memory Hierarchy and SIMD
Parallelism Exploitation.
[5] IA 32 Intel Software Developers Manual, Volume 2: Instruction Reference.
13

Implementation of JPEG2000 Using SSE Instruction Set: ECE 734 VLSI Array Structures For Digital Signal Processing

Încărcat de

Informații document

Descriere originală:

Titlu original

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

Implementation of JPEG2000 Using SSE Instruction Set: ECE 734 VLSI Array Structures For Digital Signal Processing

Încărcat de

Drepturi de autor:

Formate disponibile

Implementation of JPEG2000

Using SSE Instruction Set

The encoding procedure is as follows:

2-1 – Non optimized version

Cache layout without alignment Cache layout with alignment

Here is a part of our code showing these optimizations:

//align the address at the beginning of a cache line

Original Image Matlab output for DWT

[3] C. Tenllado, D. Chaver, L. Piñuel, M. Prieto and F. Tirado, Vectorization of the 2D

[5] IA 32 Intel Software Developers Manual, Volume 2: Instruction Reference.

S-ar putea să vă placă și