A 32 BIT MAC Unit Design Using Vedic Mul

FFT Core for Adaptive Cruise Control Using
FPGAs
Christober Rayappan - May 10, 2005
A new type of speed control called Adaptive Cruise Control (ACC) (Figure 1) is being used on some
new model vehicles. These systems allow you to set a following distanceor time intervalbetween
your vehicle and the vehicle ahead and lets you set the maximum speed. When you come up to a
slower moving vehicle in front of you, or if a vehicle changes into your lane, the system automatically
closes the throttle and/or applies brake pressure to slow the vehicle and maintain the set following
distance. All of the ACC systems available today are built around sensors that detect the vehicle
ahead through the use of radar.
The frequency shift between the original sent signal and the received signal gives the distance and
the relative velocity between the two vehicles. Capturing and processing this information in real
time requires the use of math intensive digital signal processing (DSP) algorithms. Software
processing cannot meet these performance requirements, and it often takes several conventional
DSP processors to perform such high-speed tasks. The speed critical process (signal processing
block) can be designed using FPGAs.
Signal Processing Block

To meet the performance requirements, the signal processing block is implemented using FPGAs.
Radar works by bouncing electromagnetic energy off the target, recording the echo and making
some useful observation from the data. Hilbert transform is used for the phase shift of the incoming
signal. Decimating the filter input by four involves discarding three out of four samples. The
matched filter can be decomposed into a sum of four sub-filters, each of which filters a particular
phase of the input signal. Bit parallel structure can be used for implementing matched filters in
FPGAs (Figure 2).
Velocity Analysis
Another important part of the ACC system is to determine the velocity of the objects. This is done in
the frequency domain of the input signal. In order to detect the frequency shift between the original
sent signal and the received signal, the DC offset is taken into account.
This is obtained by subtracting the mean of the signal from the signal itself. The FFT of both the sent
and received signal is taken to see the spectrum of the two. The shift of the spectrum from the one
to the other signal shows the Doppler shift. The peak of the received signal is subtracted from the
peak of the sent signal to obtain the Doppler shift.
The velocity is calculated by (Equation 1). Note that because sampling of the signal is done at a
given sampling rate, this equation needs to be modified a little by dividing the right hand side by the
sampling frequency as well.
(1)
Solving this equation for the velocity (v), you can identify the velocity of the moving object.
From the above discussion it is clear that the FFT processing is performed to display the spectrum of
the Doppler return, which translates directly into velocity, after considerations for geometry. The
character of the spectrum (noisy, smooth, wide band, narrow band) is an indication of the vehicle.
The processing required by the FFTs depends on the rates selected by the operator. The 25 KHz
Doppler signal can be rate converted up to give finer frequency (velocity) resolution, and can be
performed lapped (overlapping sliding buffers) to improve low velocity sensitivity of the system. All
of these effects translate into an equivalent number of 1K complex FFTs per second. As the number
of FFTs per second increase, the computational demand on the processor increases. At the same
time, the demands of the processor for data increase as well.
The DSP Power of an FPGA

The following characteristics describe the DSP power of FPGAs:
● Possible to efficiently and cost effectively implement these DSP functions

● Increases integration possibilities and potentially reduces the number of discrete components
● Multi-billions MACs per second.
● Tremendous parallel processing capability
● High memory bandwidth
● High I/O bandwidth and flexible interfaces
● Supports 20-High speed signal and memory interface standards (for example, LVD, SSTL).
Since FFT needs lots of computational power and efficiency than the other signal processing blocks
(Hilbert transform, Matched filter), this paper concentrates on developing the FFT core for ACC
systems.
Internal Architecture of an FFT Processor

Figure 3 shows the architecture of the FFT processor. The design is based on the Radix-2
decimation in frequency algorithm and employs in-place computation to optimize memory usage. To
operate, first the processor data must be loaded into the internal RAM. The processor is then
instructed to compute the FFT, overwriting the input data in the RAM with the results. On
completion of the FFT, the results can be read out from the RAM via the output data port.
Detailed Design Description
The architecture of the FFT butterfly is depicted in Figure 5. From this it can be seen that the
design contains the following components:
● Adders
● Multipliers
● Two's complement circuit
● Twiddle factor ROM
This paper will cover the design of the multiplier and adder with saturation in detail. Following this
design entry, layout and simulation strategies are described. The reader is assumed to be familiar
with various FFT algorithms and their classification .
Design of the Butterfly

This function implements the Radix-2 decimation in frequency butterfly described by the following
equations:
(2)
(3)
where a and b are the complex inputs to the butterfly, A and B are the complex outputs and
(4)
is the complex twiddle factor. The number of adders and multipliers in the design depends on the
input (complex or real).
Figure 4 shows decimation in frequency butterfly.
Assumptions made before the design process:
● Input is a two's complement number

● Input is a sign extended 8-bit number
● 8-bit adders and multipliers are used
● Input is a fractional number
● Fixed point representation is followed.
The Radix-2 butterfly requires six adders and four multipliers with some two's complement circuit to
represent the negative numbers. The design block is given in Figure 5.
The carry look ahead adder is used for the design of adders.
Overflow Detection and Saturation in Fractional Addition

Let A=an-1,an-2,...a1a0 B=bn-1,...b1b0 be two n-bit integers. The resultant sum S=A+B is also an n-bit
integer. Overflow occurs when the result S cannot be accommodated in an n-bit result. Thus, when
two number are added, the detection of overflow is given by the expression
(5)
Where Cn-1 and Cn-2 are the carry values obtained when the most significant magnitude digit an-2
and bn-2 and the sign digits an-1 and bn-1 are added. In order to identify the sign of the overflow,
two other functions can be defined
(6)
The saturated values for overflow detection are as follows:
● If D=1 and D+ = 1 then saturated value S=sn-1,sn-2..s1s0 is S=011...11

● If D=1 and D- = 1 then saturated value S=sn-1,sn-2..s1s0 is S=10...00
● If D=0 then S=sn-1,sn-2..s1s0
There is an implied binary point between the most significant bit and sign bit. The above equations
can be implemented with multiplexer and some control circuitry, as shown in Figure 6a.
Figure 6: Circuitry of overflow detection and saturation
Design of Digital Multipliers

Multipliers are one of the most important components in digital signal processing applications. Most
of the digital signal processors provide built-in hardware multipliers for high-speed operations.
These multiplier units should be designed for high speed, low complexity in hardware
implementation and have regular structure for ease in implementing in the hardware. There are
various multiplier algorithms for implementing the same. The most popular multiplier algorithms
widely used are Baugh-Wooley and Booth's algorithms. The Baugh-Wooley algorithm is used for
partial product generation.
The Baugh-Wooley Algorithm

Charles Baugh and Bruce Wooley gave an algorithm for 2's complement multiplication . The two's
complement multiplication is converted to an equivalent parallel addition, in which each partial
product bit is the AND of a multiplier bit and multiplicand bit, and signs of the entire partial product
bit are positive. With this algorithm, all the subtractor in the hardware can be replaced by
equivalent adders.
Let A=a7a6a5a4a3a2a1a0 B=b7b6b5b4b3b2b1b0 be the two 8-bit two's complement integers. The
Baugh-Wooley algorithm of multiplication is given by figure 6b.
To obtain the final product, these partial products are added column wise.
Partial Product Addition

The majority of hardware for multipliers comes from the addition of partial products to get the final
product output. This requires adders for adding the partial products. There are various algorithms
for adding partial products in a multiplier, such as the Wallace tree and "overturned stairs" adder
trees.
"Overturned Stairs" Adder Trees

Multioperand adders are widely needed for designing multipliers, FFT, digital filters, and inner
product computers. Wallace trees are known for their optimal computation time, when adding
multiple operands to two numbers using carry save adders (CSA). Wallace trees are very difficult to
implement because of their complex wiring. The design of a Wallace tree is often too expensive to
justify the speed obtained. The Overturned Stairs tree introduced by Zhi-Jian (Alex) Mou is
implemented for multiplier design . An important feature of OS trees is their recursive structure,
which leads to a regular and compact design.
OS Tree Structure
The structure of the OS tree is determined by the number of words that need to be added. As shown
in Figure 7, the width and height of the tree increases when the number of input (words) is
increased. The tree height, nOS is given by
(7)
where NOS is the input words to the OS tree.
The width of the OS tree is given by NOS/3
When the input bits are high the tree height of a Wallace tree is less than that of an OS tree. For
most common DSP applications, the word length is not particularly longfor instance, 12 to 24 bits.
In these cases, there is no direct advantage to using Wallace trees . The Wallace tree height is only
one level lower compared to the OS tree .
Some basic building blocks can be identified in the trees (Figure 7): body, root, connector, and
branch. These building blocks can be used to form the general structure of the OS tree. The general
structure can be easily derived. Assume that we have a tree of height nOS = j+1. A body of height j-1,
a connector and a branch can construct the body of height j. The connector is constructed by two
CSAs. The three outputs from the connector appear as inputs to the root, which is constructed by a
single CSA. The branch height j-2 is formed by j-2 CSAs on top of each other with proper
interconnections. Using these properties, an OS tree of arbitrary tree height can be constructed in
an iterative way.
Figure 7: Structure of an OS Tree (7-9 inputs)
An 8-bit OS tree, which is designed for implementing FFT, is shown in Figure 8.
Underflow Detection/Overflow or Saturation in Fractional Multiplication

When two n-bit integers A&B are multiplied, the result is a 2n bit number. The final product value is
always greater then or equal to the greatest number among A&B. In the final result, when the only n
least significant number is retained, an overflow may occur. However, in two n-bit fractional
numbers, the result is still a 2n bit number, but the magnitude of the resultant product is lower than
the lowest number among the A&B. Hence there exists a possibility of underflow in fractional
multiplication. Overflow in fractional multiplication occurs when the multiplication -1X-1 is
performed.
In order to keep the word length constant as required in many applications, only the n most
significant bits are retained out of 2n bits in fractional multiplication .
The method used to retain 'n' most significant bits is either truncation or rounding. In truncation, the
n least significant bits are truncated to retain n most significant bits. In rounding, the n least
significant bits are truncated and for the remaining result after truncation a value 0.00...1 is added
so that the final value is the value nearest to the original value. Thus for fractional multiplication,
the least 'n' significant bits are retained. If the truncated number is a positive fraction, then add
0.0...1 to the retained value if truncated bits does not contain all zeros. If the truncated number is a
negative fraction, then no correction factor is added and just 'n' most significant bits are related as
the final value. The hardware circuit for the above concept can be implemented using multiplexer
and some logic circuits (Figure 9).
Implementation of an 8-Bit By 8-Bit Adder

The 8-bit by 8-bit Adder was implemented in the Xilinx XCV-1000-5bg 560 FPGA. This FPGA has an
1000K-gate capacity. After place and route and back annotation, the resources utilized were 1% of
the total available. The simulation result is shown in Figure 10.
Figure 10: Simulation result of an 8-bit by 8-bit adder (saturated)
Implementation of an 8-bit by 8-bit Multiplier

The 8-bit by 8-bit multiplier 2's complement Baugh-Wooley multiplier was implemented in the Xilinx
XCV-1000-5bg 560 FPGA. This FPGA has an 1000K-gate capacity.
Figure 11: Simulation result of an 8-bit by 8-bit multiplier (without saturation)
After place and route and back annotation, the maximum frequency of operation was limited to 50
MHz and resources utilized were 1% of the total available. For implementing a butterfly block of
radix-2 FFT the resources utilized are 6% of the total available slices. The simulation results are
shown in Figure 11 and Figure 12.
Figure 12: Simulation result of an 8-bit by 8-bit multiplier (saturated)
Implementation of a Radix-2 FFT

The 8-point Radix-2 FFT was implemented in the Xilinx XCV-1000-5bg 560 FPGA. The gate capacity
of the FPGA is 1000k.
After place and route and back annotation, the number of slices consumed by the Radix-2 FFT is
2402 out of 12288, which is 19% of the total capacity. The maximum frequency of operation was
limited to 18.485 MHz. The frequency operation is reduced due to more hardware, which is required
to built the FFT. However if the hardware is optimized, frequency of operation can be increased.
Figure 13: FPGA floorplan design of the Radix-2 FFT
Conclusion
This paper involved the design, simulation and implementation of the Radix-2 FFT. The basic
building blocks for an FFT butterfly, the adders and multipliers, are designed first. The carry look
ahead adder and overturned stairs multiplier are selected for the design purpose. The algorithm
used for partial product generation is Baugh-Wooley algorithm. VHDL was used for simulation and
synthesis .
Implementing a butterfly block of the Radix-2 FFT uses 6% of the total available slices in the Xilinx
XCV-1000-5bg 560 FPGA (with a gate capacity of 1000K). The whole FFT block consumes 19% of the
total capacity. The maximum frequency of operation was limited to 18.485 MHz. The power
consumed by the butterfly unit is 540mw.
This butterfly design remains the same for any length of the Radix-2 FFT. The average error due to
the quantization of input to 8 bits and due to the 8-bit length of the multiplier and adder is 35%
compared to the Matlab output (32 bit precision). Therefore, it is necessary to increase the input bit
size and the length of the multiplier and the adder, but the design remains the same. The matched
filter which has multipliers and adders can be designed easily in FPGAs, since the basic building
blocks are constructed. This increases the speed of operation, which plays a vital role in adaptive
cruise control.
About the Author
Christober Rayappan is working as a technical consultant at Infineon Technologies. His

areas of intrest include finite word length effects in signal processing, spectral analysis, and coding
theory. He holds a Masters degree in communication engineering from the Vellore Institute of
Technology. He is currently involved in the development of DSP libraries for Infineon 16-bit
microcontrollers. His email address is Rayappan.External@infineon.com.

A 32 BIT MAC Unit Design Using Vedic Mul

Încărcat de

Informații document

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

A 32 BIT MAC Unit Design Using Vedic Mul

Încărcat de

Drepturi de autor:

Formate disponibile

FFT Core for Adaptive Cruise Control Using

Signal Processing Block

The DSP Power of an FPGA

● Possible to efficiently and cost effectively implement these DSP functions

Internal Architecture of an FFT Processor

Design of the Butterfly

Figure 4 shows decimation in frequency butterfly.

Assumptions made before the design process:

● Input is a two's complement number

Overflow Detection and Saturation in Fractional Addition

The saturated values for overflow detection are as follows:

● If D=1 and D+ = 1 then saturated value S=sn-1,sn-2..s1s0 is S=011...11

Design of Digital Multipliers

The Baugh-Wooley Algorithm

Partial Product Addition

"Overturned Stairs" Adder Trees

where NOS is the input words to the OS tree.

The width of the OS tree is given by NOS/3

An 8-bit OS tree, which is designed for implementing FFT, is shown in Figure 8.

Underflow Detection/Overflow or Saturation in Fractional Multiplication

Implementation of an 8-Bit By 8-Bit Adder

Implementation of an 8-bit by 8-bit Multiplier

Figure 11: Simulation result of an 8-bit by 8-bit multiplier (without saturation)

Implementation of a Radix-2 FFT

Figure 13: FPGA floorplan design of the Radix-2 FFT

About the Author

Christober Rayappan is working as a technical consultant at Infineon Technologies. His

S-ar putea să vă placă și