
A Methodology and Tooling Enabling Application Specific
Processor Design
A. Hoffmann, F. Fiedler, A. Nohl
CoWare, Inc.
Dennewartstrasse 25-27
52068 Aachen, Germany
{frank, achim, andreas}@coware.com
Surender Parupalli
CoWare India Private Ltd.
Noida Special Economic Zone
Noida, 201305 U.P. India
surender.parupalli@coware.com
ABSTRACT
This paper presents a highly efficient processor design methodology based on the LISA 2.0 language. Typically, the architecture design phase is dominated by iterative processor model refinement based on the results of hardware and software simulation and profiling. Traditionally, large teams of hardware and software experts are therefore required to design new programmable architectures. The proposed design flow reduces the design time and enables even non-experts in processor design to overcome the typical design challenges.
The presented design methodology is based on a workbench that automates the generation of all required software tools and furthermore closes the gap between high-level modeling and hardware implementation via automatic generation of a Register Transfer Level (RTL) model for the target processor.
A case study demonstrates the design approach, discussing the application specific instruction-set processor (ASIP) design for a Fast Fourier Transformation (FFT) algorithm. Several processor types such as SIMD and VLIW with various characteristics have been explored to find an optimal processor implementation for this algorithm.
Programming Languages: LISA 2.0
General Terms: Design, Languages
Keywords: ASIP, SIMD, VLIW
1. INTRODUCTION
Most of today's SoC designs involve one or more embedded processor cores that execute the software components of a system. While a large number of mixed hardware/software SoC designs are still based on standard off-the-shelf RISC or DSP cores, there is a clear trend towards specialization of processors for the intended applications, so as to achieve an optimum balance between computational efficiency, re-use opportunities, cost, and flexibility [3]. Within such designs, often intended for mobile applications, the importance of power consumption is growing. Additionally, the lifetime of devices is decreasing; thus, the efficiency of the architecture design is becoming more and more important. For this reason the design of ASIPs has received high attention in academia and industry.
In contrast to the design of traditional general purpose processors, the ASIP design flow is driven by a set of target applications, frequently specified in C/C++, SPW or Matlab. The design generally starts with a successive architecture exploration process that involves stepwise refinement of the architecture and timing abstraction levels based on the simulation and profiling results during architecture development. The abstraction levels range from un-timed, high-level-language instruction-set accurate (ISA) models down to cycle-accurate RTL HDL synthesis models.
Due to the model refinement capabilities of an instruction set language (ISL), the designer can abstract from architecture details and concentrate on the essentials in early design phases, which is a big advantage over the traditional, purely HDL-based processor design flow.
Tensilica's customizable Xtensa processor [4] is a popular example of this type of design flow. While the Xtensa approach utilizes a partially predefined RISC core, this paper focuses on an architecture description language which has a higher degree of freedom in modeling the ASIP's instruction set and micro-architecture. The architecture design framework presented in this paper is based on the hierarchical processor modeling language LISA 2.0 [2].
The LISA 2.0 language has been developed by the Institute for Signal Processing Systems (ISS) and was licensed to CoWare Inc. [1]. CoWare has commercialized the LISA 2.0 based tool-suite under the name LISATek. LISA 2.0 allows the designer to create a custom processor model on various abstraction levels. In every design phase the designer can generate a complete set of software development tools (C-compiler, assembler, linker) and an instruction-set simulator to verify and profile the current state of the architecture implementation. These hardware and software profiling capabilities are essential to the stepwise adaptation of the processor to the application's needs. Handwriting these tools after each architecture refinement would not allow an iterative exploration process, since manual creation is a lengthy and error-prone process. Besides, hardware implementation capabilities are provided via automatic generation of synthesizable RTL code, taking hardware properties such as chip area, clock, etc. into account.
The rest of the paper is organized as follows: Section 2 introduces the role of ASIPs in today's SoCs. Section 3 analyzes the work done in academia and industry in the field of ASIP design. Section 4 outlines the LISA 2.0 language principles and the design flow. Section 5 discusses the design space exploration for an FFT algorithm using the Architecture Description Language (ADL) LISA 2.0. Section 6 elaborates the results of the different ASIP implementations which have been carried out. Section 7 concludes the paper.
2. PROCESSOR LANDSCAPE
Figure 1 shows a classification of ASICs vs. programmable solutions. ASICs have a high energy efficiency, measured in MOPS/mW; this means they offer very high performance while keeping power consumption and area small. The drawback is that they have no flexibility to adapt to changes. The other extreme are RISC or DSP microprocessors, which are programmable and thus offer the required flexibility, but cannot cope with the requirements of modern portable devices with respect to power consumption. It is widely accepted to have one controller, e.g. ARM/MIPS, and one DSP in such systems. However, with more
Proceedings of the 18th International Conference on VLSI Design held jointly with 4th International Conference on Embedded Systems Design (VLSID05)
1063-9667/05 $20.00 2005 IEEE
functions running on the device, it is hardly acceptable to spend more RISC and DSP cores on tasks like digital video, multimedia and wireless communication.
The answer is a compromise between an ASIC and a programmable solution. These architectures will be very close to an ASIC with respect to energy efficiency and size while being programmable. At first glance this is a paradox. It is made possible by processors which are very application specific. They only provide limited programmability, which is just enough to adapt to changes, e.g. in standards. They often feature complex instructions which can, in some cases, execute things like FFT calculations in one instruction. This is the key to the energy efficiency.
[Figure 1 shows a spectrum: fixed-hardware ASICs and fixed data paths (high performance, low power, low area) at one end; extensible processors and domain-specific engines (e.g. SIMD for digital video, VLIW, NPU for network switches, MCU for wireless communication, memory controllers, multimedia) in the middle; programmable DSPs and RISC microprocessors (time to market, reuse, flexibility) at the other end, spanning the control and data domains.]
Figure 1: ASIC vs. Programmable Solutions
Due to the high degree of specialization of these processors, there will be separate processors for different applications like digital video, wireless communication, multimedia, etc. One major benefit of ASIPs is the re-use of proven and stable programmable IP.
3. RELATED WORK
The intention of abstract machine description languages is to abstract from the implementation details of a processor architecture and to describe them at a higher level of abstraction than the Register Transfer level. The different languages can be categorized into those focusing on the architecture, on the instruction set, or a combination of both.
The languages oriented towards the architecture are close to the RT level and thus not suitable for a fast and efficient design space exploration phase. For this reason, projects such as MIMOLA [5] are not discussed further here. Languages strongly oriented towards the instruction set include nML [7] and ISDL [6]. Both languages allow describing simple processor pipelines, but processors with more complex execution schemes and instruction-level parallelism, like the Texas Instruments TMS320C6x, cannot be described, even at the instruction-set level, because of the numerous combinations of instructions. Hierarchical memories (caches) and buses cannot be modeled with nML. The hardware implementation step based on this language may be done via the HDL generator GO from the company Target Compiler Technologies [14]. The structure of the HDL model is derived from the instruction encoding; thus, the developer can only implicitly affect the sensible grouping of functionality into dedicated hardware units. This is a big disadvantage with respect to resource sharing capabilities in the processor implementation. ISDL allows the generation of a complete tool-suite consisting of HLL compiler, assembler, linker, and simulator. Even the possibility of generating synthesizable HDL code is reported, but no results on the efficiency of the generated tools or the generated HDL code are given. The project Sim-HS [8] is also based on the nML description language and generates synthesizable Verilog models from Sim-nML models. The result here is a non-pipelined architecture with a fixed base structure of the generated hardware.
In the following, approaches based on a combined instruction-set/architecture description are discussed. The EXPRESSION language [9] allows cycle-accurate processor description based on a mixed behavioral/structural approach. Currently there is no information on whether the implementation step can be done based on this language. FlexWare [10] is not suitable for fast design exploration, since it is closer to the RT level than to the level of ADLs. PEAS-III [11] and ASIP-Meister [12] work with a set of predefined components, which limits the resulting flexibility in modeling arbitrary processor architectures. On the other hand, these projects are able to fulfill tight constraints regarding the resulting hardware implementation.
In addition to ADLs, design systems have to be discussed here. The Xtensa [4] environment from Tensilica allows the user to choose elements from a predefined set of hardware components which can be adapted to the user's requirements. For this reason the design space exploration can be performed efficiently, but the designer does not have the flexibility of modeling arbitrary ASIPs. The PICO (program in, chip out) [13] system from HP Labs is based on a configurable architecture, including nonprogrammable accelerators and cache subsystems. Due to the usage of predefined components, this approach also offers limited modeling flexibility.
None of the introduced approaches provides the designer with efficient design exploration and implementation capabilities coupled with the required flexibility for the development of arbitrary ASIPs. The LISA 2.0 based design flow can cope with all of these issues. It allows efficient design space exploration on a high abstraction level and software development tool generation, while also providing low power hardware implementation capabilities. Additionally, easy system integration of simulation models is supported; moreover, coupling to a third party test environment is part of the introduced design methodology.
4. LISATEK DESIGN METHODOLOGY
The idea of the LISATek design flow is to define a programmable platform tailored to a specific application domain. This puts a heavy burden on the ASIP designer, who has to compose a capable platform for the target application from a huge design space. The goal of the LISA 2.0 based processor design flow is to guide the designer from the algorithmic specification of the application down to the implementation of the micro-architecture. In every phase of the processor design the designer maintains an abstract model of the target architecture written in the LISA 2.0 language.
The LISA 2.0 language aims at the formalized description of programmable architectures, their peripherals and interfaces. LISA 2.0 is not a completely new language; it is an extension to C. The hardware behavior as well as processor resources like registers are modeled in pure C, whereas LISA 2.0 adds, on top of the C language, capabilities to describe an instruction set with its binary encoding and assembly syntax. Also, LISA 2.0 allows expressing timing in processors; an example is a pipelined architecture where instruction execution is spread over multiple cycles. LISA 2.0 is very easy to learn, so that a couple of days is sufficient to become familiar with this language.
From such a model, a working software development tool set supporting the evaluation tasks of the current development phase can be generated automatically. The first stage of the design process is concerned with the examination of the application to be mapped onto the processor architecture. Critical portions of the application need to be identified that will later require parallelization and specific hardware acceleration. For this reason the design space exploration starts with the definition of the processor's instruction set.
The two main ASIP development phases of the LISA 2.0 based design flow are shown in figure 2. On the left hand side the architecture exploration phase with the software development tool generation is visualized; on the right hand side the implementation phase, which starts with the automatic creation of an RTL model of the ASIP.
[Figure 2 shows the two iterative phases: in exploration, software development tools are generated automatically from the high-level LISA model and evaluation results drive its refinement; in implementation, an RTL ASIP model is generated automatically and evaluation results likewise feed back into refinement.]
Figure 2: ASIP Exploration and Implementation Phases
As a starting point for model creation, CoWare LISATek provides a library of sample models which contains processors already tailored to specific applications. These processors efficiently implement algorithms like turbo decoding and the FFT. Also, sample models for different architecture categories are available, covering DSPs, micro controllers with specific features like SIMD (single instruction, multiple data), which is popular in the multimedia domain, as well as the increasingly popular VLIW architectures, which comprise massively parallel functionality. It is important to distinguish these sample models from configurable template models where only some parameters may be changed. Taking such models as a basis has the major advantage of immediate compiler support for the architecture due to the existence of an instruction set. This makes C- and instruction-level profiling of the application possible from the very beginning of the architecture development. The simulator derived from this model constitutes a virtual machine executing the application directly. The profiling capabilities of the simulator are used to generate execution statistics of the application code. Once the profiling information is gathered, critical portions which require parallelization are identified. Based on the profiling results the instruction set is adapted until the application profiling meets the given criteria.
At this point in the design phase the designer has to consider different aspects of the micro-architecture.
The major exploration and optimization point is the pipeline. The designer has to decide how many pipeline stages are required with respect to control flow instructions and the efficient implementation of hardware loops and interrupts. For the performance of the architecture it is important to avoid data hazards during program execution. For this reason, data bypassing may be implemented when specifying the pipeline of the processor. This mechanism serves already-calculated results to earlier pipeline stages; its intention is to bypass data storage in registers or memories. An optimal pipeline in an ASIP requires a memory subsystem that supplies memory data fast enough, because otherwise the pipeline has to be stalled while waiting for the memory data. The memory hierarchy directly contributes to the performance of the memory subsystem; thus, the developer has to consider it carefully. Caches with varying parameters are widely used to enhance the performance of the memory subsystem. Here the cache parameters, i.e. the cache size and the cache read and write policies, must be determined with respect to the target application. Additionally, the designer has to evaluate the role of a memory management unit (MMU) and has to check the performance of the utilized bus to ensure the optimal configuration of the utilized memories. A very powerful capability of the LISA 2.0 language, besides its ability to model arbitrarily complex processors, is a special template library with memory modules which can be easily parameterized from within the LISA 2.0 model. Using these library elements, caches, MMUs and buses can be easily modeled.
When assigning different parts of the instruction execution to the already defined pipeline stages, the developer must take care of resource sharing and the length of the critical path in the emerging architecture. This is important for an efficient hardware implementation of the ASIP. Here the required chip area and the gate count are commonly used constraints which directly relate to the power consumption of the resulting hardware. The critical path is important, since its length limits the clock speed, which is another important criterion when designing an ASIP.
When the architecture meets the design criteria and efficiently implements the application, hardware implementation is done in a final architecture development step. Synthesizable HDL RTL code (currently VHDL and Verilog) for the control and data path of the processor can be derived from the abstract processor model automatically. This includes the entire hardware model structure, such as the pipeline and the pipeline controller, including complex interlocking mechanisms, forwarding, etc., to steer the architecture's behavior, and an implementation of the data path which is directly derived from the behavioral specification in the LISA 2.0 model. Having the RTL generation capabilities in the processor exploration loop allows easy exploration of the trade-off area vs. timing (clock speed) vs. flexibility. Based on the simulation and synthesis results of the hardware, the abstract LISA 2.0 model might be modified to meet power, area and frequency constraints. Since the RTL model and the ISS are derived from a single processor model, they are automatically consistent. It is obvious that deriving both software tools and hardware implementation model from the same architecture specification in LISA 2.0 has significant advantages: only one model needs to be maintained, even if changes to the micro-architecture or the behavior in the hardware model must be realized.
Once the processor design is finished, a set of production quality software development tools is generated from the LISA 2.0 model. These software tools (C-compiler, assembler, linker) compete well in terms of functionality and feature richness with state-of-the-art tools from Green Hills, ARM, etc. The generated C-compiler is an optimizing compiler capable of generating code close to handwritten assembly code. In addition to these generated software development tools, a macro assembler and an archiver are provided with the LISATek product family.
In order to integrate into system simulation environments (SoC) and gather realistic stimuli from the system, LISATek generates processor simulators which couple directly with the following tools: CoWare ConvergenSC, Cadence Incisive, Mentor Graphics Seamless CVE, the OSCI reference simulator, as well as any C-based environment. The generated simulators automatically interface with popular buses like AMBA AHB and can easily be extended to work with proprietary buses. The generated instruction-set simulators (ISS) support system simulation on different levels of abstraction, from cycle-accurate to untimed system simulations. Utilizing the patent-pending Just-In-Time Cache-Compiled (JIT-CC) simulation technology, LISATek simulators run at a very high speed.
Moreover, LISATek tools support multi-processor debugging in such a system. Here, the designer can debug one or more processors with a single graphical debugger.
The verification of the ISS vs. the RTL model can be performed using the IBM Genesys [17] test-generation tool. Genesys is a test generator which has been developed exclusively for validating processors. It works based on a test plan and automatically generates test programs which are run directly on an ISS. Finally, the test program is provided together with the expected result values in the processor.
A major benefit of the LISATek approach is that the designer has to be neither a software nor a hardware expert. Thus a single person can cover a broad spectrum of development tasks which cannot be covered by the traditional ASIP design approach.
5. FFT ASIP
To demonstrate the strength of the presented design flow and how a highly specialized processor can be designed with the LISATek technology, a case study has been carried out. Different ASIP types with different instruction sets have been explored to find an optimal processor implementation for the image processing domain. This means the processor is highly optimized for compression algorithms and image transformation.
5.1 Design Space Exploration
In order to find the processor that optimally fits the application, it is mandatory to explore the huge design space (see section 4). The required instruction set has to be identified based on high-level application profiling. The decisions about the micro-architecture must be made, and the critical path must be as short as the implementation constraints define. As the ASIP must also be prepared for later inclusion in a SoC, the interaction with the peripherals must be considered when designing this new processor.
Due to the limited scope of this paper, this case study concentrates on the discussion of some very essential high-level design steps. The effects of VLIW (parallel instructions), SIMD (single instruction, multiple data) and special purpose instructions on the ASIP performance and RTL are explored here. The starting point for the design is one of the sample models (figure 3) that come with the LISATek tools.
[Figure 3 shows a three-stage (FE, DC, EX) pipeline with a register file, an ALU, a 32bit multiplier, branch logic (branch address BPC, program counter FPC, branch flag BSET), and a load & store unit connected to the instruction and data buses.]
Figure 3: Architecture Exploration Starting Point
Taking this model as a basis makes sense for a very quick design start. This tiny model already has a minimal instruction set that makes it compiler-programmable. Additionally, the micro-architecture is very simple and lets the designer easily adjust or tailor the architecture to his requirements.
The chosen architecture is a 32bit processor with a three-stage pipeline, branch logic, a simple ALU, a multiplier, and a load-store unit to communicate with the data bus. This processor is later tailored to the execution of an FFT, and functionality which is not needed is discarded. A 32bit architecture has been chosen as the starting point since the FFT requires three 8bit values for calculation; 32 bit is therefore the minimum to encode these values within one instruction word.
The first step in the exploration phase is to perform a high-level profiling of the FFT algorithm. The C-code profiling and instruction-level (assembly) profiling results show that the hot spot of the target application is the FFT kernel computation.
Section   Calls    Total Steps   Total Steps %
FFT         6144     35610624        53.82
IFFT        6144     34292736        51.83
kernel    196808     58982400        89.15
Table 1: C Profiling Results
The C profiling results (see table 1) clearly indicate that most of the time is spent in the FFT and inverse FFT transformation, which both make use of the kernel functionality. For this reason the focus is set on the efficient implementation of this kernel computation. The FFT algorithm kernel data flow is shown in figure 4. The FFT kernel processes two complex samples indexed with i and j. Each sample has a real and an imaginary part. For the computation the FFT kernel furthermore needs a sine and a cosine factor for the complex multiplication. The outputs of the FFT kernel are the two transformed complex samples.
[Figure 4 shows the kernel data flow: the sample parts fr[j] = Re{f(jT)} and fi[j] = Im{f(jT)} are multiplied and normalized with the registered factors wr = cos(jT) and wi = sin(jT) in four multipliers; add/sub stages form tr and ti, which are then combined with Re{f(iT)} and Im{f(iT)} in two further add/sub stages to produce the 4 outputs.]
Figure 4: FFT Kernel Computation Scheme
The computation itself consists of two phases: four normalized multiplications are followed by multiple sequential add and subtract instructions. This means that from input to output a sample goes through a multiplier and two adder stages. A pure software implementation on an un-optimized processor would require sequential code (see figure 5) for the FFT execution; this execution costs hundreds of cycles.
For the above mentioned reasons such an approach would lead to a cycle- and code-size-expensive software implementation. A better solution is a special purpose FFT hardware unit which is part of the targeted ASIP. Analyzing the data dependencies of the FFT, it turns out that each color of each sample can be processed in parallel. A fixed data path for this FFT unit would result in a faster and smaller implementation than the software solution (less memory for program code).
5.2 SIMD FFT ASIP
The introduced FFT unit is able to apply the same transforma-
tions on all 3 colors (RGB) of all 4 samples (real and imaginary
part) in parallel (12 channel SIMD unit 3 Colors * 2 Samples *2
Parts)
[Figure 5 contrasts the two schemes. Software kernel (simplified): decompose RGB (mask and shift), then run the FFT software kernel (four multiplies plus adds and subs) sequentially for red, green and blue, then compose RGB (mask and shift). Parallel hardware kernel: unpack RGB (extract R & G & B), perform all 12 multiplications in parallel, two parallel adder stages, pack RGB.]
Figure 5: Software Kernel Scheme vs. HW Kernel Scheme
Figure 5 contrasts the FFT computation flow of a pure software solution, where each color value is transformed one after another in a block, with the special FFT unit, in which all colors (RGB) are handled simultaneously, dramatically reducing the number of cycles needed for the calculation. Due to this parallelism, it is very clear that the ASIP solution is much faster in this case.
[Figure 6 shows the SIMD instruction: the packed sample word (Re{f(nT)}, Im{f(nT)}) and the registered cos(nT)/sin(nT) factors are unpacked into 8bit slices, which are multiplied and normalized in three parallel 8bit multipliers.]
Figure 6: Packed Multiply
The parallel FFT hardware kernel computes the 12 multiplications in parallel, followed by two adder stages. Before and after the transformation the 8bit RGB values are extracted (UNPACK) from and re-assigned (PACK) to the 32bit register in fixed logic.
Figure 6 shows a scheme of the packed multiplication performed in the SIMD unit of the ASIP. All three colors RGB (multiple data) are handled within one multiplication; thus, three parallel 8bit multiplications are performed in a single step. The broken carry path between the 8bit slices allows the separate calculation.
The critical path goes through an 8bit multiplier and two adder stages, which is even less (roughly 1/4) than the critical path of the standard 32bit multiplier used 12 times in sequence in the software implementation. Hence the complete FFT kernel can be processed in a single cycle instead of hundreds.
The fixed PACK/UNPACK data path is very popular in image processing machines. It is comparable to the Intel multimedia instruction-set extensions (MMX) [16]. It makes the expensive shifting and masking operations in software unnecessary. Figure 7 shows the structure of the resulting SIMD ASIP which has been developed.
[Figure 7 shows the starting-point architecture (FE and DC stages, register file, ALU, 32bit multiplier, branch logic with BPC, FPC and BSET, load & store unit on the instruction and data buses) extended in the execute stage by FFT special registers and a SIMD FFT unit built from parallel add/sub slices.]
Figure 7: SIMD ASIP
Compared with the unoptimized 32bit architecture (the starting point of the processor design), this processor has additional special FFT registers and a SIMD FFT unit located in the execute stage. This unit comprises the functionality introduced above. At this point the impact of the number of pipeline stages, the memory structure and all the other criteria mentioned above on the efficiency of the ASIP could be examined, but the full design space exploration capabilities provided by the LISATek ASIP design methodology are beyond the scope of this paper.
During the exploration phase two further ASIPs have been developed to implement an effective FFT algorithm: a VLIW architecture and an image processing ASIP with special instructions well suited to the given problem. The hardware and software profiling results which drove all architecture design decisions cannot be covered in this paper; thus, only a brief overview of the specialties of these processors is given. Section 6 discusses the results of the different ASIP implementations.
5.3 Alternative ASIPs for FFT
The developed VLIW ASIP has the big advantage of highly parallel application execution, made possible by multiple slots running simultaneously. In comparison with the SIMD approach, this processor is more generally applicable to a broad spectrum of target applications, since it has no special instructions for FFT computation. The drawback is a larger resulting chip area in the implementation phase. The base structure of the VLIW ASIP is the same as that of the processor which was the starting point of our exploration; only the logic of the execute stage exists twice. This makes it possible to perform two load/store, ALU and multiply operations in parallel.
The third ASIP which has been created is an image processing ASIP with special instructions for the different phases of the FFT kernel computation. Special units in the execute stage of this ASIP implement Pack and Unpack instructions. Packing composes one 32-bit register value from three 8-bit values; unpacking is the opposite: the first, second and third 8-bit values are extracted from a given 32-bit register value. Additionally, this processor has a special unit for a dual 32-bit multiply-and-accumulate operation, which makes it very easy to compute the FFT with this processor.
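The Pack/Unpack semantics described above can be sketched in C. The bit positions of the three 8-bit fields are an assumption, since the paper does not specify the register layout:

```c
#include <stdint.h>

/* Pack: compose one 32-bit register value from three 8-bit values.
 * Unpack: extract the first, second and third 8-bit values again.
 * (Field positions within the word are assumed, not specified.) */
static uint32_t pack3(uint8_t v0, uint8_t v1, uint8_t v2) {
    return ((uint32_t)v0 << 16) | ((uint32_t)v1 << 8) | (uint32_t)v2;
}

static void unpack3(uint32_t w, uint8_t *v0, uint8_t *v1, uint8_t *v2) {
    *v0 = (uint8_t)(w >> 16);
    *v1 = (uint8_t)(w >> 8);
    *v2 = (uint8_t)w;
}
```

On the ASIP, each of these operations is a single instruction; the C version needs several shifts and masks, which illustrates the cycle-count advantage of the dedicated units.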
6. RESULTS
Three very different processor types have been developed to solve the problem of FFT computation in hardware while retaining the flexibility of programming these architectures.
This section discusses the overall results with respect to the following topics: cycle counts, compiler support, re-usability and hardware implementation results. Figure 8 illustrates these categories for every processor analyzed here. In the leftmost column of the table, the results for the unoptimized 32-bit architecture (the starting point of the architecture exploration) are given. These results are not discussed further; they are only shown to give a hint of how an unoptimized processor performs for a special application like the FFT.
Proceedings of the 18th International Conference on VLSI Design held jointly with 4th International Conference on Embedded Systems Design (VLSID'05)
1063-9667/05 $20.00 © 2005 IEEE
Results                          32-bit         SIMD FFT        VLIW            Image Proc.
                                 (unoptimized)  ASIP            ASIP            ASIP
Cycles
  Total App / FFT Kernel**       294 / 70.4M    14 / 11.4M      146 / 37.1M     182 / 47.3M
Compiler Support
  General                        YES            YES             YES             YES
  Specialization                 Not required   YES,            YES,            YES,
                                                via intrinsic   automatically   automatically
Re-usability
  Specialization                 Not required   FFT only        all             all RGB
                                                                                instructions
First Iteration of RTL
  Gates: Total / Registerfile*   33k / 21k      30k / 22k       94k / 82k       53k / 26k
  Clock @ 0.13u, Sample Core     166 MHz        192 MHz         236 MHz         127 MHz

*  Registerfile: 32 x 32-bit registers, can easily be reduced to 16 x 32-bit
** 16 samples
Figure 8: Architecture Exploration Results
The required cycle count for the FFT calculation (total application) ranges from 11.4 (14) million cycles for the SIMD ASIP to 47.3 (182) million cycles for the image processing ASIP with the special instructions for the FFT. The efficiency of the SIMD implementation stems from the fact that only this processor is able to perform the core FFT computation in a single cycle, which is tremendously fast.
For all processors except the SIMD ASIP, full compiler support is available; thus, programming them is very comfortable. The SIMD processor requires inline assembly (an intrinsic) for the special SIMD instruction: instructions with more than one output (the SIMD instruction has three) cannot be supported by a compiler. For all other instructions of the SIMD processor, compiler support is available.
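A common way to expose such a multiple-output instruction to C code is to wrap the intrinsic in a function that returns the extra results through pointers. The sketch below only models this calling convention; the function body is a placeholder computation standing in for the actual inline assembly, and all names are hypothetical:

```c
/* Hypothetical wrapper for a three-output SIMD instruction: a C
 * expression can yield only one value, so the second and third
 * outputs are passed back through pointers. The arithmetic below is
 * a placeholder, not the real SIMD FFT instruction's semantics. */
static int simd_fft_step(int in0, int in1, int *out1, int *out2) {
    *out1 = in0 + in1;   /* second output (placeholder) */
    *out2 = in0 - in1;   /* third output (placeholder) */
    return in0 * in1;    /* first output, returned normally */
}
```

This wrapper style lets the rest of the application remain plain C, confining the non-compilable construct to one small function.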
A big benefit of programmable architectures is re-usability. The VLIW architecture is not specialized for this problem, so it is re-usable for all applications. The image processing ASIP contains special instructions which can be used for all RGB operations, but the SIMD approach is so specialized that this processor is only intended to run the FFT application. This result demonstrates the trade-off between re-usability and efficiency.
To get an idea of the eventual hardware implementation, a first iteration of RTL code generation has been carried out for all of these processors. The gate count ranges from 30k gates for the SIMD ASIP to 94k gates for the VLIW ASIP. While the maximum clock speed is 236 MHz for the VLIW processor, the lowest is 127 MHz for the image processing ASIP.
Three different architectures have been examined for the given algorithm, including the design space exploration introduced above. The entire design flow for all of these processors has been performed, from the functional description of the problem down to the hardware implementation, within two man-weeks. This time also includes the creation of architecture simulators and production-quality software development tools. This extremely short development time demonstrates how effective the LISATek design methodology is.
7. CONCLUSION
This paper introduced a highly efficient methodology to design ASIPs. The presented design approach is based on the LISA 2.0 language and enables even non-experts to cover all processor development tasks, from the abstract functional specification of the target architecture down to the hardware implementation. Using LISA 2.0 allows architecture exploration and implementation to be performed at a very high abstraction level.
It has been highlighted within the scope of this paper that the importance of ASIPs in today's industry is growing enormously. Due to extremely short product lifetimes, and thus short development cycles for new systems, the efficiency of the architecture design tools directly contributes to the success of companies. The LISA 2.0 based architecture development approach can cope with these requirements, which was demonstrated by a case study focused on ASIP design for the FFT algorithm.
In this study, three different types of ASIPs have been explored for the FFT computation. It has been possible to provide, for each of the processors, an architecture simulator, software development tools (C compiler, macro-assembler, assembler, linker, archiver) and an RTL hardware model within two man-weeks.
The use of this methodology by major IP vendors and the chip industry has shown that the overall performance gain in ASIP design is at least 50 percent [15]. This number does not include the amount of time which would be necessary to do architecture exploration using the traditional processor design approach.
8. REFERENCES
[1] CoWare, Inc. http://www.coware.com.
[2] A. Hoffmann, H. Meyr and R. Leupers. Architecture Exploration for Embedded Processors with LISA. Kluwer Academic Publishers, 2002.
[3] M.J. Bass, C.M. Christensen et al. The Future of the Microprocessor Business. IEEE Spectrum, April 2000.
[4] R. Gonzalez. Xtensa: A Configurable and Extensible Processor. IEEE Micro, Mar. 2000.
[5] R. Leupers and P. Marwedel. Retargetable Code Generation Based on Structural Processor Descriptions. In Design Automation for Embedded Systems. Kluwer Academic Publishers, 1998.
[6] G. Hadjiyiannis, S. Hanono et al. ISDL: An Instruction-Set Description Language for Retargetability. In Proc. of the DAC, Jun. 1997.
[7] A. Fauth, J. Van Praet and M. Freericks. Describing Instruction Set Processors Using nML. In Proc. of the European Design and Test Conference (ED&TC), March 1995.
[8] V. Rajesh and R. Moona. Processor Modeling for Hardware Software Codesign. In Int. Conf. on VLSI Design, Jan. 1999.
[9] A. Halambi, P. Grun, V. Ganesh, A. Khare, N. Dutt and A. Nicolau. EXPRESSION: A Language for Architecture Exploration through Compiler/Simulator Retargetability. In Proc. of the Conference on Design, Automation & Test in Europe (DATE), Mar. 1999.
[10] P. Paulin, C. Liem, T.C. May and S. Sutarwala. FlexWare: A Flexible Firmware Development Environment for Embedded Systems. In P. Marwedel and G. Goossens, editors, Code Generation for Embedded Processors. Kluwer Academic Publishers, 1995.
[11] A. Kitajima, M. Itoh, J. Sato, A. Shiomi, Y. Takeuchi and M. Imai. Effectiveness of the ASIP Design System PEAS-III in Design of Pipelined Processors. In Proc. of the Asia South Pacific Design Automation Conference (ASP-DAC), Jan. 2001.
[12] ASIP Meister. http://www.eda-meister.org.
[13] V. Kathail, S. Aditya, R. Schreiber, B.R. Rau, D. Cronquist and M. Sivaraman. Automatically Designing Custom Computers. IEEE Computer, 35(9):39-47, Sept. 2002.
[14] Target Compiler Technologies. http://www.retarget.com.
[15] M. Steinert, O. Schliebusch and O. Zerres. Design Flow for Processor Development using SystemC. In Proc. of the SNUG Europe, 2003.
[16] Intel. http://www.intel.com.
[17] IBM. http://www.ibm.com.