High Performance Multipliers in Quicklogic Fpgas

High Performance Multipliers in QuickLogic FPGAs
Introduction
Performing a hardware multiply is necessary in any system that contains Digital Signal Processing (DSP) functionality such as filtering, modulation, or video processing. Often there is an off-the-shelf component that the engineer can use to solve his problem. However, in some designs, the expense of a dedicated DSP chip is not justified and so the use of a Field Programmable Gate Array (FPGA) is more viable. Another reason for using a programmable device to perform hardware multiplication is if the system parameters are non-standard and would require expensive circuitry outside the DSP, or running over-powered DSP chips for a large and small operand combinations (a 12- by 4-bit multiplier, for example). Using an FPGA in cases like these gives complete flexibility, high performance, and lower cost over an off-the-shelf solution. This application note describes various techniques available to the designer for performing high-speed signed multiply operations in QuickLogic FPGAs. A brief discussion on the basics of digital multiplication will be introduced, followed by discussions on how to implement the signed multiplier to meet your design requirements. Example implementations will be shown, from the slower, purely combinatorial version to the fully pipe-lined version that runs at 200MHz. The example designs shown in this application note are done in schematic as well as in Verilog and VHDL. These are generic designs that can be easily modified to suit various design needs. These are not hard macros. Therefore they can be modified by users. All timing information presented in this application note is derived from worst case simulations performed within QuickWorks, QuickLogics integrated Windowsbased design software.
Twos Complement
Before starting with signed multiplication, a quick review of the 2s complement system of signed number representation for a binary number would be helpful in understanding the derivation of the algorithm. Basically, in the 2s complement system, the left most bit (MSB) indicates the sign of the number, with 0 being positive, 1 being negative. To obtain a negative number, simply subtract the corresponding positive number from 2 n, where n is the number of bits of the original number. For example, to obtain the 4 bit signed number -4, take 24(b10000), and subtract from it the corresponding positive number, 4(b0100), the result is (b1100), -4 in the 2s complement representation. Alternatively, it can be obtained by subtracting 2(n-1) from the corresponding positive number to obtain the negative number. Using the same example, to obtain the number -4, take the number 4(b0100) and subtract 2(4-1)(b1000)from it. The result is (b1100). The multiplier algorithm discussed in this application note takes advantage of the latter method. One important note on using 2s complement number representation is the need for sign extension. That is, to obtain a negative number using a greater number of bits, simply repeat the sign bit to the left until the desired number of bits are filled. For example, to extend the number -4 (b1100) from 4 bits to 8 bits, the resultant sign extended number would be (b11111100).
Binary Multiply Basics

Similar to the familiar long hand decimal multiplication, binary multiplication involves the addition of shifted versions of the multiplicand based on the value and position of each of the multiplier bits. As a matter of fact, its much simpler to perform binary multiplication than decimal multiplication. The value of each digit of a binary number can only be 0 or 1, thus, depending on the value of the multiplier bit, the partial products can only be a copy of the multiplicand, or 0. In digital logic, this is simply an AND function. Figure 1 shows the multiplication of two unsigned 4 bit numbers as an example.
a3 X b3
a2 b2
a1 b1
a0 b0
a3b0 a2b0 a1b0 a0b0 a3b1 a2b1 a1b1 a0b1 a3b2 a2b2 a1b2 a0b2 a3b3 a2b3 a1b3 a0b3 a0b0 a1b0+a0b1 a2b0+a1b1+a0b2+carry a3b0+a2b1+a1b2+a0b3+carry a3b1+a3b2+a1b3+carry a3b2+a2b3+carry a3b3+carry carry Product P, p0 p1 p2 p3 p4 p5 p6 p7
Figure 1. Basic Binary Multiplication
Baugh-Wooley Algorithm
The Baugh-Wooley algorithm for the unsigned binary multiplication is based on the concept shown in Figure 1. The algorithm specifies that all possible AND terms are created first, and then sent through an array of half-adders and full-adders with the carry-outs chained to the next most significant bit at each level of addition. For signed multiplication (by utilizing the properties of the twos complement system) the Baugh-Wooley algorithm can implement signed multiplication in almost the same way as the unsigned multiplication shown above. The algorithm for the signed multiplication is listed in Figure 2.
P=
a[m-1]b[n-1]2(m+n-2)
i = n-2
NOT(a[m-1]b[i])2(i+m-1)
i=0
+ +
j = m-2
NOT(a[j]b[n-1])2(j+n-1)
j=0
i = m-2, j = n-2
a[i]b[j]2(i+j)
i = 0, j = 0
2(n-1) + 2(m-1) - 2(m+n-1)
Figure 2. Baugh-Wooley algorithm for signed multiplication Although the equation in Figure 2 looks complicated, the hardware implementation is actually very similar to the unsigned implementation. The difference is that except for the AND term involving the MSB of both the multiplicand and multiplier, all other AND terms involving the MSBs of the multiplicand or the multiplier are inverted before being feed into the adder array. The last two addition terms in the equation are easy to implement, involving only the addition of a 1 to the appropriate adder stage. The last term in the equation is a subtraction, which can be implemented with a XOR gate in the last stage. The architecture block diagram in Figure 3 illustrates the flow for the multiplication of two 4-bit signed numbers using the equation in Figure 2 (m= n=4). The mathematics involved in deriving the equation in Figure 2 are left as an exercise for the reader.
a0b3 a0b2 a0b1 a0b0
HA
a1b3
a1b2
HA
a1b1
HA
a1b0
FA
a2b3
a2b2
FA
a2b1
FA
a2b0
FA
a3b3
a3b2
FA
a3b1
FA
a3b0
HA
HA
FA
HA
HA
HA
p7
p6
p5
p4
p3
p2
p1
p0
Figure 3: Architectural block diagram of the Baugh-Wooley multiply algorithm.
Implementing an 8-bit signed Baugh-Wooley multiplier

To implement an 8-bit signed Baugh-Wooley multiplier, set m=n=8 for the equation in Figure 2. Solve the equation to obtain the AND terms and the arrangement of the adder array. The array approach of the algorithm allows clean hardware implementation that clearly shows the data flow. The schematic in Figure 4 shows the combinatorial implementation of the 8-bit multiplier. Since QuickLogics pASIC2 and pASIC 3 families of FPGAs can implement a full adder in a single logic cell, the combinatorial 8-bit signed multiplier takes up less than 110 logic cells, less than
20% of the logic resources available in the QL2009 device targeted. This implementation of the multiplier can run at 25MHz.
A[7] B[0] A[6] B[1] A[6] B[0] A[5] B[1] A[5] B[0] A[4] B[1] A[4] B[0] A[3] B[1] A[3] B[0] A[2] B[1] A[2] B[0] A[1] B[1] A[1] B[0] A[0] B[1] B[0] Cin AND2i0 A[0]
Cin
Cin
Cin
Cin
Cin
Cin
Cout A[7] B[1] A[6] B[2] A[5] B[2]
SUM Cout A[4] B[2]
SUM
Cout A[3] B[2]
SUM Cout A[2] B[2]
SUM
Cout A[1] B[2]
SUM Cout A[0] B[2]
SUM
Cout C[0]
SUM
Cin
Cin
Cin
Cin
Cin
Cin
Cin
Cout A[7] B[2] A[6] B[3] A[5] B[3]
SUM
Cout A[4] B[3]
SUM Cout A[3] B[3]
SUM
Cout A[2] B[3]
SUM Cout A[1] B[3]
SUM
Cout A[0] B[3]
SUM Cout C[1]
SUM
Cin
Cin
Cin
Cin
Cin
Cin
Cin
Cout A[7] B[3] A[6] B[4] A[5] B[4]
SUM
Cout A[4] B[4]
SUM
Cout A[3] B[4]
SUM Cout A[2] B[4]
SUM
Cout A[1] B[4]
SUM Cout A[0] B[4]
SUM
Cout C[2]
SUM
Cin
Cin
Cin
Cin
Cin
Cin
Cin
Cout A[7] B[4] A[6] B[5] A[5] B[5]
SUM
Cout A[4] B[5]
SUM
Cout A[3] B[5]
SUM
Cout A[2] B[5]
SUM Cout A[1] B[5]
SUM
Cout A[0] B[5]
SUM Cout C[3]
SUM
Cin
Cin
Cin
Cin
Cin
Cin
Cin
Cout A[7] B[5] A[6] B[6] A[5] B[6]
SUM
Cout A[4] B[6]
SUM
Cout A[3] B[6]
SUM
Cout A[2] B[6]
SUM
Cout A[1] B[6]
SUM Cout A[0] B[6]
SUM
Cout C[4]
SUM
Cin
Cin
Cin
Cin
Cin
Cin
Cin
Cout A[7] B[6] A[6] B[7] A[5] B[7]
SUM
Cout A[4] B[7]
SUM
Cout A[3] B[7]
SUM
Cout A[2] B[7]
SUM
Cout A[1] B[7]
SUM
Cout A[0] B[7]
SUM Cout C[5]
SUM
Cin
Cin
Cin
Cin
Cin
Cin
Cin
A[7] B[7]
Cout Cin
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout Cin
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
P[15]
P[11]
P[10]
P[14]
P[13]
P[12]
P[9]
P[8]
P[7]
P[6]
P[5]
P[4]
P[3]
P[2]
Figure 4: Combinatorial 8-bit Signed Multiplier
In addition to the schematic multiplier illustrated in Figure 4, the synthesis-friendly logic cells in the pASIC2 and pASIC3 families of FPGAs are also well suited for high performance Verilog and VHDL implementations. These HDL versions of the multiplier have comparable performance to the schematic version, while offering the ease of use of HDLs, with powerful features such as scaleable width for design reusability. To verify the performance of the multiplier, the Path Analyzer static timing analysis tool included with QuickWorks and QuickTools were used to obtain the flip-flop-to-flip-flop delay (in order to get the true performance of the multiplier, the inputs and outputs were registered). The Silos III Verilog simulator included in QuickWorks was used to perform both pre-layout functional simulation and post-layout timing simulation. A self-checking Verilog test fixture is used to randomly generate up to 400 input vectors, and automatically notify the user when a failure occurs. A VHDL version of the test fixture is also included for users who wish to use their own VHDL simulator to perform the simulation.
P[1]
P[0]
The results for different versions of the combinatorial 8-bit signed multiplier are listed in the Comparison of Methods Presented section later in the application note.
Speeding Up the Multiplier

Looking at the schematic in Figure 4, it is obvious that the performance of the multiplier can be increased by reducing the carry delay through the half-adder array. Upon closer inspection, one will see that the half-adder array is nothing more than a ripple adder. Since the half-adder array consists of 8 stages, think of it as an 8 bit ripple adder. By replacing the ripple adder with, for example, a carry-select adder, several nanoseconds can be saved. However, although several fast adder options exists, the increase in performance by replacing the ripple adder is limited to no more than 15% due to the delays through the AND/ADD array. To gain higher performance, the multiplier needs to be pipelined.
Pipelining for Performance

Throughput is often a greater concern than one-cycle response in DSP designs. In this case, the engineer may opt to tolerate some latency in the multiply operation in return for a faster maximum clock rate. This is accomplished in a multiplier by breaking the carry-chains up and inserting flip-flops at strategic locations. Care must be taken to ensure that all inputs to an adder are created from signals at the same stage in the pipeline (i.e., the registers must effect a clean break across the matrix of adders). Once the maximum acceptable latency is determined, the placement of the registers needs to be carefully considered. The delay should be evenly distributed across the pipeline, so each stage has the same delay as the other stages. The path delays can be analyzed with the Path Analyzer in QuickWorks or QuickTools. The locations to insert the pipeline registers can then be determined by the distribution of the path delays. The example in Figure 5 shows the 8-bit signed multiplier in Figure 4 with a single pipeline stage inserted.
A[7] B[0]
A[6] B[1]
A[6] B[0]
A[5] B[1]
A[5] B[0]
A[4] B[1]
A[4] B[0]
A[3] B[1]
A[3] B[0]
A[2] B[1]
A[2] B[0]
A[1] B[1]
A[1] B[0]
A[0] B[1]
B[0]
Cin
Cin
Cin
Cin
Cin
Cin
Cin
A[0] AND2i0
Cout A[7] B[1] A[6] B[2] A[5] B[2]
SUM Cout A[4] B[2]
SUM
Cout A[3] B[2]
SUM Cout A[2] B[2]
SUM
Cout A[1] B[2]
SUM Cout A[0] B[2]
SUM
Cout C[0]
SUM
Cin
Cin
Cin
Cin
Cin
Cin
Cin
Cout A[7] B[2] A[6] B[3] A[5] B[3]
SUM
Cout A[4] B[3]
SUM Cout A[3] B[3]
SUM
Cout A[2] B[3]
SUM Cout A[1] B[3]
SUM
Cout A[0] B[3]
SUM Cout C[1]
SUM
Cin
Cin
Cin
Cin
Cin
Cin
Cin
Cout A[7] B[3] A[6] B[4] A[5] B[4]
SUM
Cout A[4] B[4]
SUM
Cout A[3] B[4]
SUM Cout A[2] B[4]
SUM
Cout A[1] B[4]
SUM Cout A[0] B[4]
SUM
Cout C[2]
SUM
Cin
Cin
Cin
Cin
Cin
Cin
Cin
Cout A[7] B[4] A[6] B[5] A[5] B[5]
SUM
Cout A[4] B[5]
SUM
Cout A[3] B[5]
SUM
Cout A[2] B[5]
SUM Cout A[1] B[5]
SUM
Cout A[0] B[5]
SUM Cout C[3]
SUM
Cin
Cin
Cin
Cin
Cin
Cin
Cin
Cout A[7] B[5] A[6] B[6] A[5] B[6]
SUM
Cout A[4] B[6]
SUM
Cout A[3] B[6]
SUM
Cout A[2] B[6]
SUM
Cout A[1] B[6]
SUM Cout A[0] B[6]
SUM
Cout C[4]
SUM
Cin
Cin
Cin
Cin
Cin
Cin
Cin
Cout CLK
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM Cout
SUM
DA[7] DB[6]
DA[6] DB[7]
DA[5] DB[7]
DA[4] DB[7]
DA[3] DB[7]
DA[2] DB[7]
DA[1] DB[7]
Cin
Cin
Cin
Cin
Cin
Cin
DA[0] DB[7]
Cin
DA[7] DB[7]
Cout Cin
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout Cin
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
Cout
SUM
P[5]
P[15]
P[11]
P[10]
P[9]
P[4]
P[1]
P[14]
P[13]
P[12]
P[8]
P[7]
P[6]
P[3]
Figure 5: Single Stage Pipelined 8-Bit Signed Multiplier With just a single stage of pipeline inserted, the multiplier now runs in excess of 35MHz. The Verilog and VHDL test fixtures for the pipelined version of the multiplier are very similar to the test fixtures for the combinatorial version, except the addition of pipeline registers to the expected result that are used to check against the output.
P[2]
P[0]
Fully Pipelined 200MHz Multiplier

If the latency is acceptable, the multiplier can be fully pipelined by adding registers between every adder . This would give a latency of 2n-1 clock cycles, where n is the number of bits of the inputs. The schematic in Figure 6 shows a fully pipe-lined version of the multiplier that can run up to 200MHz. Notice that in order to control the fan out to no more than 4, a fan out register array is used. This design utilizes 390 logic cells, taking up approximately 75% of the QL2007 device targeted. In order to obtain the 200MHz performance, strict control of the design is needed, therefore only the schematic version of the design can run at 200MHz. The HDL versions of the fully pipelined design run at slower clock rate.
X[0] QD QD QD QD QD QD Y[0] QD QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
Xe1 QD
Xd1 QD QD
Xc1 QD
Xb1 QD
Xa1
X[2] QD
Yg1 QD
Yf1
Ye1 QD
Yd1 QD QD
Yc1
Yb1 QD
Ya1 QD
Xe2 QD
Xd2 QD QD
Xc2 QD
Xb2 QD
Xa2 QD
X[3] QD
Yg2 QD
Yf2
Ye2 QD
Yd2 QD QD
Yc2
Yb2 QD
Ya2 QD
Xe3 QD
Xd3 QD QD
Xc3 QD
Xb3 QD
Xa3 QD QD
X[4] QD
Yg3 QD
Yf3
Ye3 QD
Yd3 QD QD
Yc3
Yb3 QD
Ya3 QD
Xe4 QD
Xd4 QD QD
Xc4 QD
Xb4 QD
Xa4 QD QD QD
X[5] QD
Yg4 QD
Yf4
Ye4 QD
Yd4 QD QD
Yc4
Yb4 QD
Ya4 QD
Xe5 QD
Xd5 QD QD
Xc5 QD
Xb5 QD
Xa5 QD QD QD QD
X[6] QD
Yg5 QD
Yf5
Ye5 QD
Yd5 QD QD
Yc5
Yb5 QD
Ya5 QD
Xe6 QD
Xd6 QD QD
Xc6 QD
Xb6 QD
Xa6 QD QD QD QD QD
X[7] QD
Yg6 QD QD
Yf6
Ye6 QD
Yd6 QD QD
Yc6
Yb6 QD
Ya6 QD
Xe7
Xd7
Xc7
Xb7
Xa7
Yh7
Yg7
Yf7
Ye7
Yd7
Yc7
Yb7
Ya7 Xa0 Ya0 clk P[15:0]
QD
QD
QD
QD
QD
QD
QD
200 MHz 8x8 Multiplier Signed, Pipelined
QD
QD
Xd0
Xc0
Xb0
QD
Xa0
X[1]
Yg0
Yf0
Ye0
Yd0
Yc0
Yb0
Ya0
Y[1]
Y[2]
Y[3]
Y[4]
Y[5]
Y[6]
Y[7]
Xd1 Ya6 C S C S
Xc1 Ya5 C S
Xc1 Ya4 C S
Xb1 Ya3 C S
Xb1 Ya2 C S
Xa1 Ya1 C S
Xa1 Ya0
AND2i0
Xd0 Ya7
Xd0 Ya6
Xb0 Ya3
Xb0 Ya2
Xa0 Ya1
Xc0 Ya5
Xc0 Ya4
Xe1 Yb7 C S
Xd2 Yb6 C S
Xc2 Yb5 C S
Xc2 Yb4 C S
Xb2 Yb3 C S
Xb2 Yb2 C S
Xa2 Yb1 C S
Xa2 Yb0
QD
Xe2 Yc7 C S
Xd3 Yc6 C S
Xc3 Yc5 C S
Xc3 Yc4 C S
Xb3 Yc3 C S
Xb3 Yc2 C S
Xa3 Yc1 C S
Xa3 Yc0
QD
QD
Xe3 Yd7 C S
Xd4 Yd6 C S
Xc4 Yd5 C S
Xc4 Yd4 C S
Xb4 Yd3 C S
Xb4 Yd2 C S
Xa4 Yd1 C S
Xa4 Yd0
QD
QD
QD
Xe4 Ye7 C S
Xd5 Ye6 C S
Xc5 Ye5 C S
Xc5 Ye4 C S
Xb5 Ye3 C S
Xb5 Ye2 C S
Xa5 Ye1 C S
Xa5 Ye0
QD
QD
QD
QD
Xe5 Yf7 C S
Xd6 Yf6 C S
Xc6 Yf5 C S
Xc6 Yf4 C S
Xb6 Yf3 C S
Xb6 Yf2 C S
Xa6 Yf1 C S
Xa6 Yf0
QD
QD
QD
QD
QD
Xe6 Yg7 C S
Xd7 Yg6 C S
Xc7 Yg5 C S
Xc7 Yg4 C S
Xb7 Yg3 C S
Xb7 Yg2 C S
Xa7 Yg1 C S
Xa7 Yg0
QD
QD
QD
QD
QD
QD
Xe7 Yh7 C S C S C S C S C S C S C S
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
C S
C S
C S
C S
C S
QD
C S
QD
QD
QD
QD
QD
QD
QD
QD
C S
C S
C S
C S
QD
C S
QD
QD
QD
QD
QD
QD
QD
QD
QD
C S
C S
C S
QD
C S
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
C S
C S
QD
C S
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
C S
QD
C S
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
C S
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
P[15]
P[14]
P[13]
P[12]
P[11]
P[10]
P[9]
P[8]
P[7]
P[6]
P[5]
P[4]
P[3]
P[2]
P[1]
Figure 6: 200MHz Fully Pipelined 8-bit Signed Multiplier
P[0]
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
QD
clk
X[7:0]
Y[7:0]
Comparison of Methods Presented

The table below shows the results of worst-case simulations (Vcc = 4.75V, Ta = 70C) of the QL2009-2PQ208C. All simulations were performed in QuickWorks Silos III Verilog simulator with the test fixtures included with the design files. All designs are based on the Baugh-Wooley adder tree algorithm. As with any type of design, the choice of algorithm usually involves a trade-off of some kind: either increased latency vs. reduced throughput or size vs. speed. It is up to the designer to chose the best implementation that meets the design requirement. Multiplier Area(Logic Cells) Speed Latency Combinatorial 8-bit Signed (Schematic) 110 25MHz Combinatorial 8-bit Signed (HDL) 112 21.7MHz 1-Stage Pipelined 8-bit Signed(Schematic) 110 35MHz 2 clock cycles 1-Stage Pipelined 8-bit Signed (HDL) 110 34MHz 2 clock cycles Fully Pipelined 8-bit Signed(Schematic) 390 200MHz 16 clock cycles Fully Pipelined 8-bit Signed(HDL) N/A N/A Table 1: Speed and size comparison of the methods presented in this paper
Written By: John Birkner - V.P. CAE, QuickLogic Jim Jian - Customer Engineering, QuickLogic Kevin Smith - Field Applications Engineer, QuickLogic
Further recommended reading 1. J. McCanney and J. McWhirter, Completely Iterative, Pipelined Multiple Array Suitable for VLSI, IRE Proceedings, Vol. 129. No. 2, April, 1982 R. Lyon, 2s Complement Pipelined Multipliers, IEEE Transactions on Communications, April, 1976. Paul J. Song and Giovanni De Micheli Circuit and Architecture Trade-offs for High-Speed Multiplication, IEEE Journal of Solid-State Circuits, Vol. 26, No. 9, September 1991.
2. 3.
QuickWorks is a registered trademark of QuickLogic corporation. Pentium is a registered trademark of Intel corporation. Windows 95 is a registered trademark of Microsoft corporation. Synplify-Lite is a registered trademark of Synplicity, Inc.
Appendix A: Deriving the Algorithm

For two's compliment numbers A and B defined as: A = SUM(a[i]2**i for i=0 to m-2) -a[m-1]2**(m-1) B = SUM(b[j]2**j for j=0 to n-2) -b[n-1]2**(n-1) the product, P, of A and B is derived as follows: P = = + AB a[m-1]b[n-1]2**(m+n-2) SUM(a[i]2**i for i=0 to m-2) b[n-1]2**(n-1) SUM(b[j]2**j for j=0 to n-2) a[m-1]2**(m-1) SUM(a[i]b[j]2**(i+j) for i=0 to m-2,for j=0 to n-2)
Recognizing that addition is more convenient than subtraction, and utilizing the equality: SUM(c[k]2**k for k=0 to n-1) = (2**n)-1-SUM(not(c[k]2**k) for k=0 to n-1) then the equation for P can be converted to the following form: P = + a[m-1]b[n-1]2**(m+n-2) 2**(n-1)(2**(m-1)-1-SUM(not(a[i]b[n-1])2**i for i=0 to m-2)) 2**(m-1)(2**(n-1)-1-SUM(not(b[j]a[m-1])2**j for j=0 to n-2)) SUM(a[i]b[j]2**(i+j) for i=0 to m-2,for j=0 to n-2)
Simplifying this equation yields: P = + + + + + a[m-1]b[n-1]2**(m+n-2) SUM(not(a[i]b[n-1])2**(i+n-1) for i=0 to m-2) SUM(not(b[j]a[m-1])2**(j+m-1) for j=0 to n-2) SUM(a[i]b[j]2**(i+j) for i=0 to m-2,for j=0 to n-2) 2**(n-1) 2**(m-1) 2**(m+n-1)

High Performance Multipliers in Quicklogic Fpgas

Încărcat de

Informații document

Titlu original

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

High Performance Multipliers in Quicklogic Fpgas

Încărcat de

Drepturi de autor:

Formate disponibile

High Performance Multipliers in QuickLogic FPGAs

Binary Multiply Basics

Figure 1. Basic Binary Multiplication

2(n-1) + 2(m-1) - 2(m+n-1)

Figure 3: Architectural block diagram of the Baugh-Wooley multiply algorithm.

Implementing an 8-bit signed Baugh-Wooley multiplier

Cout A[7] B[1] A[6] B[2] A[5] B[2]

SUM Cout A[4] B[2]

Cout A[3] B[2]

SUM Cout A[2] B[2]

Cout A[1] B[2]

SUM Cout A[0] B[2]

Cout A[7] B[2] A[6] B[3] A[5] B[3]

Cout A[4] B[3]

SUM Cout A[3] B[3]

Cout A[2] B[3]

SUM Cout A[1] B[3]

Cout A[0] B[3]

SUM Cout C[1]

Cout A[7] B[3] A[6] B[4] A[5] B[4]

Cout A[4] B[4]

Cout A[3] B[4]

SUM Cout A[2] B[4]

Cout A[1] B[4]

SUM Cout A[0] B[4]

Cout A[7] B[4] A[6] B[5] A[5] B[5]

Cout A[4] B[5]

Cout A[3] B[5]

Cout A[2] B[5]

SUM Cout A[1] B[5]

Cout A[0] B[5]

SUM Cout C[3]

Cout A[7] B[5] A[6] B[6] A[5] B[6]

Cout A[4] B[6]

Cout A[3] B[6]

Cout A[2] B[6]

Cout A[1] B[6]

SUM Cout A[0] B[6]

Cout A[7] B[6] A[6] B[7] A[5] B[7]

Cout A[4] B[7]

Cout A[3] B[7]

Cout A[2] B[7]

Cout A[1] B[7]

Cout A[0] B[7]

SUM Cout C[5]

Figure 4: Combinatorial 8-bit Signed Multiplier

Speeding Up the Multiplier

Pipelining for Performance

Cout A[7] B[1] A[6] B[2] A[5] B[2]

SUM Cout A[4] B[2]

Cout A[3] B[2]

SUM Cout A[2] B[2]

Cout A[1] B[2]

SUM Cout A[0] B[2]

Cout A[7] B[2] A[6] B[3] A[5] B[3]

Cout A[4] B[3]

SUM Cout A[3] B[3]

Cout A[2] B[3]

SUM Cout A[1] B[3]

Cout A[0] B[3]

SUM Cout C[1]

Cout A[7] B[3] A[6] B[4] A[5] B[4]

Cout A[4] B[4]