Documente Academic
Documente Profesional
Documente Cultură
FPGAs
Christober Rayappan - May 10, 2005
A new type of speed control called Adaptive Cruise Control (ACC) (Figure 1) is being used on some
new model vehicles. These systems allow you to set a following distanceor time intervalbetween
your vehicle and the vehicle ahead and lets you set the maximum speed. When you come up to a
slower moving vehicle in front of you, or if a vehicle changes into your lane, the system automatically
closes the throttle and/or applies brake pressure to slow the vehicle and maintain the set following
distance. All of the ACC systems available today are built around sensors that detect the vehicle
ahead through the use of radar.
The frequency shift between the original sent signal and the received signal gives the distance and
the relative velocity between the two vehicles. Capturing and processing this information in real
time requires the use of math intensive digital signal processing (DSP) algorithms. Software
processing cannot meet these performance requirements, and it often takes several conventional
DSP processors to perform such high-speed tasks. The speed critical process (signal processing
block) can be designed using FPGAs.
Velocity Analysis
Another important part of the ACC system is to determine the velocity of the objects. This is done in
the frequency domain of the input signal. In order to detect the frequency shift between the original
sent signal and the received signal, the DC offset is taken into account.
This is obtained by subtracting the mean of the signal from the signal itself. The FFT of both the sent
and received signal is taken to see the spectrum of the two. The shift of the spectrum from the one
to the other signal shows the Doppler shift. The peak of the received signal is subtracted from the
peak of the sent signal to obtain the Doppler shift.
The velocity is calculated by (Equation 1). Note that because sampling of the signal is done at a
given sampling rate, this equation needs to be modified a little by dividing the right hand side by the
sampling frequency as well.
(1)
Solving this equation for the velocity (v), you can identify the velocity of the moving object.
From the above discussion it is clear that the FFT processing is performed to display the spectrum of
the Doppler return, which translates directly into velocity, after considerations for geometry. The
character of the spectrum (noisy, smooth, wide band, narrow band) is an indication of the vehicle.
The processing required by the FFTs depends on the rates selected by the operator. The 25 KHz
Doppler signal can be rate converted up to give finer frequency (velocity) resolution, and can be
performed lapped (overlapping sliding buffers) to improve low velocity sensitivity of the system. All
of these effects translate into an equivalent number of 1K complex FFTs per second. As the number
of FFTs per second increase, the computational demand on the processor increases. At the same
time, the demands of the processor for data increase as well.
Since FFT needs lots of computational power and efficiency than the other signal processing blocks
(Hilbert transform, Matched filter), this paper concentrates on developing the FFT core for ACC
systems.
● Adders
● Multipliers
● Two's complement circuit
● Twiddle factor ROM
This paper will cover the design of the multiplier and adder with saturation in detail. Following this
design entry, layout and simulation strategies are described. The reader is assumed to be familiar
with various FFT algorithms and their classification .
(2)
(3)
where a and b are the complex inputs to the butterfly, A and B are the complex outputs and
(4)
is the complex twiddle factor. The number of adders and multipliers in the design depends on the
input (complex or real).
The Radix-2 butterfly requires six adders and four multipliers with some two's complement circuit to
represent the negative numbers. The design block is given in Figure 5.
The carry look ahead adder is used for the design of adders.
(5)
Where Cn-1 and Cn-2 are the carry values obtained when the most significant magnitude digit an-2
and bn-2 and the sign digits an-1 and bn-1 are added. In order to identify the sign of the overflow,
two other functions can be defined
(6)
There is an implied binary point between the most significant bit and sign bit. The above equations
can be implemented with multiplexer and some control circuitry, as shown in Figure 6a.
Figure 6: Circuitry of overflow detection and saturation
Let A=a7a6a5a4a3a2a1a0 B=b7b6b5b4b3b2b1b0 be the two 8-bit two's complement integers. The
Baugh-Wooley algorithm of multiplication is given by figure 6b.
To obtain the final product, these partial products are added column wise.
OS Tree Structure
The structure of the OS tree is determined by the number of words that need to be added. As shown
in Figure 7, the width and height of the tree increases when the number of input (words) is
increased. The tree height, nOS is given by
(7)
When the input bits are high the tree height of a Wallace tree is less than that of an OS tree. For
most common DSP applications, the word length is not particularly longfor instance, 12 to 24 bits.
In these cases, there is no direct advantage to using Wallace trees . The Wallace tree height is only
one level lower compared to the OS tree .
Some basic building blocks can be identified in the trees (Figure 7): body, root, connector, and
branch. These building blocks can be used to form the general structure of the OS tree. The general
structure can be easily derived. Assume that we have a tree of height nOS = j+1. A body of height j-1,
a connector and a branch can construct the body of height j. The connector is constructed by two
CSAs. The three outputs from the connector appear as inputs to the root, which is constructed by a
single CSA. The branch height j-2 is formed by j-2 CSAs on top of each other with proper
interconnections. Using these properties, an OS tree of arbitrary tree height can be constructed in
an iterative way.
Figure 7: Structure of an OS Tree (7-9 inputs)
In order to keep the word length constant as required in many applications, only the n most
significant bits are retained out of 2n bits in fractional multiplication .
The method used to retain 'n' most significant bits is either truncation or rounding. In truncation, the
n least significant bits are truncated to retain n most significant bits. In rounding, the n least
significant bits are truncated and for the remaining result after truncation a value 0.00...1 is added
so that the final value is the value nearest to the original value. Thus for fractional multiplication,
the least 'n' significant bits are retained. If the truncated number is a positive fraction, then add
0.0...1 to the retained value if truncated bits does not contain all zeros. If the truncated number is a
negative fraction, then no correction factor is added and just 'n' most significant bits are related as
the final value. The hardware circuit for the above concept can be implemented using multiplexer
and some logic circuits (Figure 9).
After place and route and back annotation, the maximum frequency of operation was limited to 50
MHz and resources utilized were 1% of the total available. For implementing a butterfly block of
radix-2 FFT the resources utilized are 6% of the total available slices. The simulation results are
shown in Figure 11 and Figure 12.
Figure 12: Simulation result of an 8-bit by 8-bit multiplier (saturated)
After place and route and back annotation, the number of slices consumed by the Radix-2 FFT is
2402 out of 12288, which is 19% of the total capacity. The maximum frequency of operation was
limited to 18.485 MHz. The frequency operation is reduced due to more hardware, which is required
to built the FFT. However if the hardware is optimized, frequency of operation can be increased.
Conclusion
This paper involved the design, simulation and implementation of the Radix-2 FFT. The basic
building blocks for an FFT butterfly, the adders and multipliers, are designed first. The carry look
ahead adder and overturned stairs multiplier are selected for the design purpose. The algorithm
used for partial product generation is Baugh-Wooley algorithm. VHDL was used for simulation and
synthesis .
Implementing a butterfly block of the Radix-2 FFT uses 6% of the total available slices in the Xilinx
XCV-1000-5bg 560 FPGA (with a gate capacity of 1000K). The whole FFT block consumes 19% of the
total capacity. The maximum frequency of operation was limited to 18.485 MHz. The power
consumed by the butterfly unit is 540mw.
This butterfly design remains the same for any length of the Radix-2 FFT. The average error due to
the quantization of input to 8 bits and due to the 8-bit length of the multiplier and adder is 35%
compared to the Matlab output (32 bit precision). Therefore, it is necessary to increase the input bit
size and the length of the multiplier and the adder, but the design remains the same. The matched
filter which has multipliers and adders can be designed easily in FPGAs, since the basic building
blocks are constructed. This increases the speed of operation, which plays a vital role in adaptive
cruise control.