Floating Point FPGA

Floating-Point FPGA (FPFPGA)
Architecture and Modeling

(A paper review)
Jason Luu
ECE
University of Toronto
Oct 27, 2009
Motivation
• Goal: Build faster, cheaper, lower-power FPGAs
• How? Fixed-Functionality (hard) blocks!
▫ FPGA reconfigurability comes at the price of
area, delay, and power
▫ Some reconfigurability is unnecessary; remove it
for savings
What to Make Hard?
• What hard blocks to use?
▫ If not used, block is wasted
▫ Industry suggests including memories and
multipliers
▫ Paper suggests adding floating-point units (FPU)
• Given a hard block, how fractured should it be?
▫ E.g., Stratix III FPGA multipliers can be
configured as a set of four 18x18 multipliers or
one 36x36 multiplier
▫ How fractured should the FPU be?
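To illustrate why one 36x36 multiplier and four 18x18 multipliers can share the same hardware, the sketch below splits each 36-bit operand into 18-bit halves and rebuilds the full product from four 18x18 partial products. It is plain unsigned arithmetic in Python; the function name is hypothetical and nothing here models the actual Stratix III DSP block.

```python
MASK18 = (1 << 18) - 1  # selects the low 18 bits of an operand


def mul36_from_18x18(a: int, b: int) -> int:
    """Unsigned 36x36 multiply built from four 18x18 partial products."""
    if not (0 <= a < (1 << 36) and 0 <= b < (1 << 36)):
        raise ValueError("operands must be unsigned 36-bit values")
    a_hi, a_lo = a >> 18, a & MASK18
    b_hi, b_lo = b >> 18, b & MASK18
    # Each of these four products fits one 18x18 multiplier.
    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi
    # Shift the partial products into place and add them.
    return (hh << 36) + ((lh + hl) << 18) + ll
```

In "fractured" mode the four small multipliers are used independently; in "fused" mode the shift-and-add step combines them into the wide product.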
Introducing FPFPGA
• Contains soft and hard blocks
▫ Soft blocks are composed of standard LUTs, FFs
▫ Hard blocks are FPUs called Coarse-grained
units (CGU)
• CGU characteristics:
▫ Floating-point (FP) adds and multiplies only
▫ Bus-based LUT operations using “wordblock”
▫ Dedicated output registers
▫ Accessible to soft blocks and vice-versa
Architecture of FPFPGA
[Figure: FPFPGA architecture with FGU and CGU blocks]
CGU parameters
• # of each type of FP block
• Bus Width
• Number of Input Buses
• Number of Output Buses
• Number of Feedback Paths
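The parameter list above can be captured as a small record. This is only a sketch: the class and field names are hypothetical, and the bus width of 32 in the usage note is an assumption based on the 32-bit single-precision benchmarks, not a value stated on the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CGUParams:
    """Hypothetical container for the CGU architecture parameters."""
    fp_adders: int
    fp_multipliers: int
    wordblocks: int
    bus_width: int
    input_buses: int
    output_buses: int
    feedback_paths: int

    def total_blocks(self) -> int:
        # Number of functional blocks inside one CGU.
        return self.fp_adders + self.fp_multipliers + self.wordblocks
```

For example, the configuration used later in the experiments would be `CGUParams(fp_adders=2, fp_multipliers=2, wordblocks=5, bus_width=32, input_buses=4, output_buses=3, feedback_paths=3)`, with `bus_width=32` being the assumed value.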
Modeling Methodology
• Need to measure how “good” FPFPGA is
• Use empirical measurement method
[Figure: FPFPGA + benchmark circuit → commercial CAD flow → measure quality of results]
• Very nice! However, commercial tools are
unaware of FPFPGA, so the authors
introduce “VEB” as a solution
Virtual Embedded Block (VEB) Flow
• Manually map benchmark circuit into
▫ CGU
▫ Soft logic
• Put VEB representing CGU into commercial CAD
tool
• Compile
• Gather area and timing measurements
VEB
• Create standard cell ASIC CGU and get
area/timing numbers
• Implement area and timing of ASIC CGU using
soft logic of commercial FPGA (different
functionality, similar silicon timing, area, and
pin demand)
• Assumes all internal paths == critical path to
simplify timing of soft logic implementation
VEB Details
• Model delay with carry-chains
• Model area with shift registers
• Use LUT inputs and outputs for pin demand
• Note: Area and delay models use independent
resources
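A rough sketch of how a VEB might be sized from ASIC numbers, following the two bullets above: delay is matched with a carry chain and area is padded with shift registers, each sized independently. The function, its parameters, and the simplification that the carry chain's own area is ignored are all assumptions for illustration; the paper's actual procedure may differ.

```python
import math


def veb_resources(asic_delay_ns: float,
                  asic_area_cells: int,
                  carry_stage_delay_ns: float,
                  cells_per_shift_register: int = 1) -> tuple[int, int]:
    """Return (carry_stages, shift_registers) needed to mimic an ASIC block.

    carry_stages    -- carry-chain stages whose total delay covers the ASIC delay
    shift_registers -- shift registers whose total area covers the ASIC area
    """
    if carry_stage_delay_ns <= 0 or cells_per_shift_register <= 0:
        raise ValueError("per-element delay and area must be positive")
    # Delay model: round up so the VEB is never faster than the ASIC block.
    carry_stages = math.ceil(asic_delay_ns / carry_stage_delay_ns)
    # Area model: sized separately from the delay model.
    shift_registers = math.ceil(asic_area_cells / cells_per_shift_register)
    return carry_stages, shift_registers
```

For instance, `veb_resources(3.1, 122, 0.25)` asks for enough 0.25 ns carry stages to cover 3.1 ns and enough single-cell shift registers to cover 122 cells; the delay figures here are made up.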
VEB Placement Challenge
• Hard block locations are fixed on an FPGA
• Commercial tools can’t do that for a VEB since it’s
just a group of clustered soft logic constrained
to be placed at particular relative distances
from each other
• Solution:
▫ Let commercial tools place VEB anywhere
▫ Then manually place VEB to fixed locations
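The second step is done by hand in the paper. Purely as an illustration of the idea, the sketch below moves each freely placed VEB to the nearest unused fixed site, using Manhattan distance. The function name, the greedy strategy, and the coordinate format are assumptions, not the authors' method.

```python
def snap_to_sites(placed: dict[str, tuple[int, int]],
                  sites: list[tuple[int, int]]) -> dict[str, tuple[int, int]]:
    """Greedily assign each VEB to the closest still-free fixed site.

    placed -- VEB name -> (x, y) chosen by the unconstrained CAD placement
    sites  -- (x, y) locations where hard blocks would sit on the real device
    """
    if len(sites) < len(placed):
        raise ValueError("not enough fixed sites for all VEBs")
    free = list(sites)
    result = {}
    for name in sorted(placed):  # deterministic processing order
        x, y = placed[name]
        best = min(free, key=lambda s: abs(s[0] - x) + abs(s[1] - y))
        free.remove(best)  # each site holds at most one VEB
        result[name] = best
    return result
```

A greedy pass like this does not minimize total displacement; it only shows the "place freely, then legalize to fixed locations" pattern.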
VEB Quality
• 11% delay error when modeling an embedded
multiplier (non-FP, so it can be compared with
an existing multiplier)
• Area is accurate (no number given)
• Important repeatability hint: Must determine
timing post-bitstream because of significant
false paths (most CGUs do not use the longest
path and this is detected post-bitstream)
Benchmarks
• 32-bit single-precision floating-point
• 8 benchmarks
▫ 5 Core computation blocks
▫ 1 application
▫ 2 synthetic
Experimental Settings
• Xilinx Virtex-II: XC2V3000-6-FF1152
• 16 CGUs each implemented as a VEB
▫ Each CGU takes up 122 Logic Cells
• 2 FP multipliers, 2 FP adders, 5 wordblocks
▫ In the order: W M A W W M A W W
• 4 input buses
• 3 output buses
• 3 feedback registers
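As a quick consistency check on the block ordering listed above, the snippet below counts the letters in the order string, reading W as wordblock, M as FP multiplier, and A as FP adder. The helper name is hypothetical.

```python
from collections import Counter


def block_counts(order: str) -> dict[str, int]:
    """Count how many times each block type appears in a CGU order string."""
    return dict(Counter(order.split()))


CGU_ORDER = "W M A W W M A W W"
counts = block_counts(CGU_ORDER)
# Matches the stated mix: 5 wordblocks, 2 FP multipliers, 2 FP adders.
```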
Results
• Average area reduced by 25x
• Average delay reduced by
▫ 3.6x for single precision
▫ 4.3x for double precision
• Results are comparable to Kuon FPGA vs ASIC
measurements
• Critical path of all circuits is in FPU
Reason for Good Results
• Removed reconfiguration bits (area reduction)
• Efficient directional routing
• Embedded FP operators
Contributions
• Exploration of FPGA architectures with
embedded floating-point cores
• VEB methodology to leverage commercial tools
to explore new embedded hard blocks even
when commercial tools are unaware of those
new hard blocks
Weaknesses
• Significant amounts of speculation
▫ The paper claims scope for material that
belongs in future work
• Especially weak was the paper’s analysis of an
FPFPGA compiler, which is outside the paper’s
scope and should be listed as such

My 2 Cents
• Primary advantage of FPFPGA vs GPU in the
floating-point high computation domain is low
latency
• Several applications demand very low latency
and very high computational power
▫ Plant monitoring of high-speed reactions
▫ Financial automatic buy-sell algorithms
• Secondary advantage is energy consumed to
perform the same computations.
My 2 Cents
• Comparison unfair
▫ Most FPGA designers would convert floating-
point to fixed point and not leave it as floating-
point
  ▪ Double-precision FP add requires 701 slices
  ▪ Fixed-point add: 64 LUTs == 16 slices
• That the critical path is in the FPU suggests the
benchmark circuits are unusually geared toward
FPU cores; the authors admit this
