
Floating-Point FPGA (FPFPGA)

Architecture and Modeling


(A paper review)

Jason Luu
ECE
University of Toronto
Oct 27, 2009
Motivation
• Goal: Build faster, cheaper, lower-power FPGAs
• How? Fixed-Functionality (hard) blocks!
▫ FPGA reconfigurability comes at the price of
area, delay, and power
▫ Some reconfigurability is unnecessary; remove it
for savings
What to Make Hard?
• What hard blocks to use?
▫ If not used, block is wasted
▫ Industry suggests including memories and
multipliers
▫ Paper suggests adding floating-point units (FPU)
• Given a hard block, how fractured should it be?
▫ E.g., Stratix III FPGA multipliers can be
configured as four 18x18 multipliers or
one 36x36 multiplier
▫ How fractured should the FPU be?
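To make the idea of a fractured multiplier concrete, here is a minimal sketch of how one 36x36 product can be assembled from four 18x18 partial products. It is an unsigned, software-only illustration; the function name `mul36_from_18x18` is hypothetical, and this is not a description of how the Stratix III DSP block is wired internally.

```python
MASK18 = (1 << 18) - 1  # low 18 bits


def mul36_from_18x18(a: int, b: int) -> int:
    """Multiply two unsigned 36-bit values using four 18x18 products."""
    if not (0 <= a < (1 << 36) and 0 <= b < (1 << 36)):
        raise ValueError("operands must be unsigned 36-bit values")

    # Split each operand into an upper and a lower 18-bit half.
    a_hi, a_lo = a >> 18, a & MASK18
    b_hi, b_lo = b >> 18, b & MASK18

    # Four 18x18 partial products, one per small multiplier.
    lo_lo = a_lo * b_lo
    lo_hi = a_lo * b_hi
    hi_lo = a_hi * b_lo
    hi_hi = a_hi * b_hi

    # Shift each partial product to its weight and sum.
    return (hi_hi << 36) + ((lo_hi + hi_lo) << 18) + lo_lo


a, b = 0xABCDE1234, 0x123456789
product = mul36_from_18x18(a, b)
```

The same four multipliers can instead serve four independent 18x18 products, which is the trade-off the fracturing question is about.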
Introducing FPFPGA
• Contains soft and hard blocks
▫ Soft blocks are composed of standard LUTs, FFs
▫ Hard blocks are FPUs called Coarse-grained
units (CGU)
• CGU characteristics:
▫ Floating-point (FP) adds and multiplies only
▫ Bus-based LUT operations using “wordblock”
▫ Dedicated output registers
▫ Accessible to soft blocks and vice-versa
Architecture of FPFPGA
[Figure: FPFPGA architecture, showing fine-grained units (FGU) and coarse-grained units (CGU)]
CGU parameters
• # of each type of FP block
• Bus Width
• Number of Input Buses
• Number of Output Buses
• Number of Feedback Paths
Modeling Methodology
• Need to measure how “good” FPFPGA is
• Use empirical measurement method

[Figure: FPFPGA + Benchmark Circuit → Commercial CAD Flow → Measure Quality of Results]

• Problem: commercial tools are unaware of FPFPGA, so the
authors introduce “VEB” as a solution
Virtual Embedded Block (VEB) Flow
• Manually map benchmark circuit into
▫ CGU
▫ Soft logic
• Put VEB representing CGU into commercial CAD
tool
• Compile
• Gather area and timing measurements
VEB
• Create standard cell ASIC CGU and get
area/timing numbers
• Implement area and timing of ASIC CGU using
soft logic of commercial FPGA (different
functionality, similar silicon timing, area, and
pin demand)
• Assumes all internal paths == critical path to
simplify timing of soft logic implementation
VEB Details
• Model delay with carry-chains
• Model area with shift registers
• Use LUT inputs and outputs for pin demand
• Note: Area and delay models use independent
resources
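The sizing step described above can be sketched as follows. This is only an illustration under stated assumptions: the per-stage delay and per-register area are hypothetical inputs, the numbers in the usage lines are made up, and `size_veb` is not a function from the paper or from any CAD tool.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class VebSize:
    carry_stages: int      # carry-chain length that models CGU delay
    shift_registers: int   # shift-register count that models CGU area
    input_pins: int        # LUT inputs that model CGU input pin demand
    output_pins: int       # LUT outputs that model CGU output pin demand


def size_veb(cgu_delay_ns: float, cgu_area: float,
             n_inputs: int, n_outputs: int,
             carry_stage_delay_ns: float,
             shift_register_area: float) -> VebSize:
    """Pick soft-logic resource counts that mimic an ASIC CGU.

    Delay and area use independent resources, so each count is
    derived on its own. Rounding up is an assumption of this sketch.
    """
    stages = math.ceil(cgu_delay_ns / carry_stage_delay_ns)
    registers = math.ceil(cgu_area / shift_register_area)
    return VebSize(stages, registers, n_inputs, n_outputs)


# Hypothetical numbers, for illustration only.
veb = size_veb(cgu_delay_ns=5.0, cgu_area=1000.0,
               n_inputs=128, n_outputs=96,
               carry_stage_delay_ns=0.5, shift_register_area=300.0)
```

Because the carry chain and the shift registers do not share resources, changing the delay target leaves the area model untouched, and vice versa.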
VEB Placement Challenge
• Hard block locations are fixed on an FPGA
• Commercial tools can’t do that for a VEB, since it’s
just a group of clustered soft logic constrained
to be placed at fixed relative distances
from each other
• Solution:
▫ Let commercial tools place VEB anywhere
▫ Then manually place VEB to fixed locations
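The two-step solution above is done by hand in the paper. As a rough sketch of what the second step amounts to, the snippet below snaps each tool-placed VEB to the nearest unused fixed site. The greedy nearest-site rule, the Manhattan distance, and all names and coordinates are assumptions made for illustration, not the authors' procedure.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]


def snap_to_fixed_sites(placed: Dict[str, Point],
                        sites: List[Point]) -> Dict[str, Point]:
    """Greedily move each VEB to the closest free fixed site."""
    if len(sites) < len(placed):
        raise ValueError("not enough fixed sites for all VEBs")

    free = list(sites)
    result: Dict[str, Point] = {}
    # Sorted by name so the assignment is deterministic.
    for name, (x, y) in sorted(placed.items()):
        best = min(free, key=lambda s: abs(s[0] - x) + abs(s[1] - y))
        free.remove(best)
        result[name] = best
    return result


# Hypothetical coordinates: where the tool put each VEB, and the
# fixed hard-block sites of the modeled FPFPGA.
tool_placement = {"veb0": (1, 1), "veb1": (9, 8)}
fixed_sites = [(0, 0), (10, 10), (0, 10)]
final_placement = snap_to_fixed_sites(tool_placement, fixed_sites)
```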
VEB Quality
• 11% delay error when modeling an embedded
multiplier (non-FP, so it can be compared with an
existing hard multiplier)
• Area is accurate (no number given)
• Important repeatability hint: Must determine
timing post-bitstream because of significant
false paths (most CGUs do not use the longest
path and this is detected post-bitstream)
Benchmarks
• 32-bit single-precision floating-point
• 8 benchmarks
▫ 5 Core computation blocks
▫ 1 application
▫ 2 synthetic
Experimental Settings
• Xilinx Virtex 2: XC2V3000-6-FF1152
• 16 CGUs each implemented as a VEB
▫ Each CGU takes up 122 Logic Cells
• 2 FP multipliers, 2 FP adders, 5 wordblocks
▫ In the order: W M A W W M A W W
• 4 input buses
• 3 output buses
• 3 feedback registers
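The settings above can be captured in a small record that checks the stated block order against the stated block counts. The class and field names are hypothetical; the values are the ones listed on this slide. Bus width is left out because the slide does not give it.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CguConfig:
    layout: str              # block order: W = wordblock, M = FP multiplier, A = FP adder
    input_buses: int
    output_buses: int
    feedback_registers: int
    logic_cells_per_cgu: int
    num_cgus: int

    def block_counts(self) -> Counter:
        """Count how many blocks of each type the layout contains."""
        return Counter(self.layout.split())

    def total_veb_logic_cells(self) -> int:
        """Logic cells consumed by all VEBs together."""
        return self.logic_cells_per_cgu * self.num_cgus


reviewed = CguConfig(layout="W M A W W M A W W",
                     input_buses=4, output_buses=3,
                     feedback_registers=3,
                     logic_cells_per_cgu=122, num_cgus=16)
counts = reviewed.block_counts()
total_cells = reviewed.total_veb_logic_cells()
```

The layout string yields 2 multipliers, 2 adders, and 5 wordblocks, matching the counts on the slide, and the 16 VEBs occupy 16 × 122 = 1952 logic cells in total.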
Results
• Average area reduced by 25x
• Average delay reduced by
▫ 3.6x for single precision
▫ 4.3x for double precision
• Results are comparable to Kuon FPGA vs ASIC
measurements
• Critical path of all circuits is in FPU
Reason for Good Results
• Removed reconfiguration bits (area reduction)
• Efficient directional routing
• Embedded FP operators
Contributions
• Exploration of FPGA architectures with
embedded floating-point cores
• VEB methodology to leverage commercial tools
to explore new embedded hard blocks even
when commercial tools are unaware of those
new hard blocks

Weaknesses
• Significant amounts of speculation
▫ The paper tries to claim scope for material that
belongs in future work
• Especially weak was the paper’s analysis of an
FPFPGA compiler, which is outside its scope and
should be listed as such

My 2 Cents
• Primary advantage of FPFPGA vs GPU in the
compute-heavy floating-point domain is low
latency
• Several applications demand very low latency
and very high computational power
▫ Plant monitoring of high-speed reactions
▫ Financial automatic buy-sell algorithms
• Secondary advantage is energy consumed to
perform the same computations.
My 2 Cents
• Comparison unfair
▫ Most FPGA designers would convert floating-
point to fixed point and not leave it as floating-
point
▪ Double-precision FP add requires 701 slices
▪ Fixed-point add: 64 LUTs == 16 slices
• The critical path being in the FPU suggests that the
benchmark circuits are unusually geared toward FPU
cores, which the authors admit
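To put the slice figures quoted above side by side, a quick calculation of the soft-logic cost ratio follows. It uses only the two numbers from this slide; the variable names are illustrative.

```python
# Soft-logic cost of one adder, taken from the figures on this slide.
fp_double_add_slices = 701   # double-precision floating-point add
fixed_point_add_slices = 16  # fixed-point add (64 LUTs)

# How many fixed-point adders fit in the area of one FP adder.
area_ratio = fp_double_add_slices / fixed_point_add_slices
print(f"FP add costs about {area_ratio:.1f}x the slices of a fixed-point add")
```

A designer who converts to fixed point therefore starts from a much smaller soft-logic baseline than the floating-point one the paper compares against.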
