
1512 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 46, NO. 12, DECEMBER 1999

Transactions Briefs
Performing Arithmetic Functions with the Chinese
Abacus Approach

Franco Maloberti and Chen Gang

Abstract—This brief introduces a novel (or ancient) technique for high-speed arithmetic. The proposed method is based on the still-used Chinese abacus. We show that suitable electronic circuits, based on pass-transistor and domino logic, can achieve the same functions as the Chinese abacus. Simulations with a 0.35-µm CMOS technology show that both a pipeline 8-bit adder and an 8 × 8 multiplier can run at a speed as high as 1 GHz.

Index Terms—Chinese abacus, digital arithmetic, multiplier.

Fig. 1. Chinese-abacus coding represents (a) a decimal number and (b) an octal number.

I. INTRODUCTION
The Chinese abacus is a very popular and efficient tool for performing arithmetic functions. It was used for centuries in many parts of the world (mainly in China), and it is still in use in shops and small commercial enterprises. The main feature of the Chinese abacus is its speed of use: a well-trained operator is often capable of competing with electronic pocket calculators. The time required to input data manually is comparable to that of the electronic approach, and the generation of the result in the Chinese abacus is so straightforward that the total computation time is very short.
The above observation stimulated us to analyze the basic reason for this speed and, possibly, to transfer the same features to an electronic circuit. This paper shows that the use of the Chinese-abacus approach leads to promising results when using, for example, a 0.35-µm CMOS process. The speed of an 8-bit pipeline full adder is as high as 1.3 GHz, and a parallel 8 × 8 bit multiplier can run at 980 MHz. Moreover, the compactness of the physical layout leads to a relatively small area for the circuits.

Manuscript received September 13, 1999. The work of C. Gang was supported in part by the Italian Foreign Ministry. This paper was recommended by Associate Editor W. Liu.
F. Maloberti is with the Department of Electronics, University of Pavia, Via Ferrata 1, 27100 Pavia, Italy.
C. Gang is with the Department of Electronics, Vocational and Technical College, Hunan Normal University, Yuelushan, 410081 Changsha, China.
Publisher Item Identifier S 1057-7130(99)09933-4.
1057–7130/99$10.00 © 1999 IEEE

II. OPERATION PRINCIPLE
The Chinese abacus is made of a set of unit elements, each representing one decade of a decimal number. Each element is made up of five beads having a unity weight and two beads having a weight of 5. The configuration shown in Fig. 1(a) represents the number seven.

The coding rule is thermometric; thus, in order to represent a number lower than five, the same number of beads will be raised in the main part of the unit. For numbers higher than five, one bead with weight 5 will be lowered. In this way, a basic element is able to represent a decimal number in the range from 0 to 15. The key feature of the Chinese abacus is the use of two beads with weight 5. This allows the operator to minimize the transmission of rests (carries). Moreover, the use of the thermometric code permits a fast implementation of elementary arithmetic functions such as addition and subtraction.

The number representation used in the Chinese abacus refers to the decimal numeric system. As we are mostly interested in the case of binary-based coding, it is more convenient to use a basic element made up of four unity-weight beads and two beads having a weight of four units [Fig. 1(b)]. In practice, we use a base of $2^2 = 4$, and the basic element is able to represent numbers in the range from 0 to 12. The configuration shown in Fig. 1(b) represents the number five.

As in the other cases considered, the given coding is able to represent numbers exceeding the full scale by half of the base of the numeric system. Having this over-range room is the key to the operation of the method.

III. CHINESE-BEAD BASIC BLOCKS
In order to design circuits based on the Chinese-abacus approach, it is necessary to achieve, with electronic circuits, some basic functions.

Fig. 2. B/T converter (four unity-weight inputs).

The first of them is the binary-to-Chinese-bead conversion. We attain it in two steps: a binary-to-thermometric (B/T) conversion and a thermometric-to-abacus (T/A) coding. Fig. 2 shows the basic block for the B/T conversion, where we have four unity-weight inputs. Similar circuits with binary-weight inputs can be designed. The solution in Fig. 2 is based on the pass-transistor approach [1] and contains n-channel transistors. The control is given by the inputs $x_0$, $x_1$, $x_2$, $x_3$ and their complements $\bar{x}_0$, $\bar{x}_1$, $\bar{x}_2$, $\bar{x}_3$. The output is a thermometric representation in which each node is either at logic 0 or in high impedance.
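The bead coding and the B/T step described above can be sketched behaviourally as follows. This is a minimal Python model, not the pass-transistor circuit of Fig. 2; the function names are hypothetical, and the rule of preferring upper beads (so that each value has one canonical code) is an assumption made for illustration.

```python
def thermometric(value, width):
    """Thermometric code of `value`: the first `value` positions are 1."""
    if not 0 <= value <= width:
        raise ValueError(f"{value} does not fit in {width} positions")
    return [1 if i < value else 0 for i in range(width)]


def binary_to_thermometric(inputs):
    """B/T conversion for unity-weight inputs: the output counts the ones."""
    return thermometric(sum(inputs), len(inputs))


def abacus_encode(value, lower_beads=4, upper_beads=2, upper_weight=4):
    """Return (lower, upper) bead codes for one abacus element.

    Assumption: upper beads are used whenever possible, so every value
    gets a single canonical code even though the abacus allows several.
    """
    full_range = lower_beads + upper_beads * upper_weight
    if not 0 <= value <= full_range:
        raise ValueError(f"{value} is outside 0..{full_range}")
    upper = min(value // upper_weight, upper_beads)
    lower = value - upper * upper_weight
    return thermometric(lower, lower_beads), thermometric(upper, upper_beads)


def abacus_decode(lower, upper, upper_weight=4):
    """Value of an element: unity-weight beads plus weighted upper beads."""
    return sum(lower) + upper_weight * sum(upper)
```

With the default binary-based element, `abacus_encode(5)` gives one upper bead and one lower bead, `([1, 0, 0, 0], [1, 0])`; with `lower_beads=5, upper_weight=5` the same function models the decimal element, and the value 7 becomes one upper bead plus two lower beads.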
Fig. 3. SU basic block (six inputs).

Fig. 4. Thermometric to abacus converter.

The status of the output nodes when they are in the high-impedance condition can be set to one by a complementary block made up of p-channel transistors. However, this circuit solution would require additional silicon area and would lead to a nonminimum parasitic capacitance, which reduces the achievable operation speed. An alternative method is to use the pre-charge approach: the output nodes are pre-charged to logic 1 during a pre-charge phase, and the data output is valid during the complementary phase.

In the use of the Chinese abacus, a possible rest (carry) coming from one unit is accounted for by raising one lower bead in the next unit. The simple circuit shown in Fig. 3 achieves this function. It shifts up (SU) the input by one position and provides an extra “one” on the output $d_0$. Note that the representation used so far is just a thermometric one. In order to have a representation in the abacus form, a further block is necessary: the T/A converter shown in Fig. 4 (seven inputs). The logic input $d_3$ is used to switch either the inputs $d_0$, $d_1$, and $d_2$ or the inputs $d_4$, $d_5$, and $d_6$ toward the lower beads $e_0$, $e_1$, and $e_2$. The input $d_3$ itself constitutes the value of the upper bead, $f_0$.

Finally, Fig. 5 shows another basic block, which converts a Chinese-bead coding back to a conventional binary coding (A/B). It is a simple logic that requires one pass transistor.

Fig. 5. Abacus to binary.

The blocks discussed above use a total of 6 bits (four lower beads and two upper beads), whereas a 6-bit binary representation can represent numbers between 0 and 63. This is an intrinsic cost of the approach used.

The speed of the basic blocks described above has been simulated using a digital 0.35-µm CMOS technology. For the circuit in Fig. 2, the delay between clock edge and data output is as low as 0.34 ns. Therefore, we can expect an excellent speed of operation in the overall architecture.

IV. THE CIRCUIT OF THE SUM OPERATION

The basic blocks discussed in the previous section are used here to achieve an N-bit full adder (we will assume N = 8, but the method can be extended to any value of N). Since the advantage of the Chinese abacus lies mainly in the number representation used, we will exploit the Chinese-abacus representation of numbers by a specific sum operation procedure. The required operation is

$$G = A + B \tag{1}$$

$$A = a_N 2^N + a_{N-1} 2^{N-1} + \cdots + a_1 2^1 + a_0 2^0 \tag{2}$$

$$B = b_N 2^N + b_{N-1} 2^{N-1} + \cdots + b_1 2^1 + b_0 2^0 \tag{3}$$

$$G = g_{N+1} 2^{N+1} + g_N 2^N + \cdots + g_1 2^1 + g_0 2^0. \tag{4}$$

The sum can result from the following partial sums:

$$G_{10} = A_{10} + B_{10}; \quad A_{10} = a_1 2^1 + a_0 2^0, \quad B_{10} = b_1 2^1 + b_0 2^0 \tag{5}$$

$$G_{32} = A_{32} + B_{32}; \quad A_{32} = a_3 2^1 + a_2 2^0, \quad B_{32} = b_3 2^1 + b_2 2^0 \tag{6}$$

$$G_{54} = A_{54} + B_{54}; \quad A_{54} = a_5 2^1 + a_4 2^0, \quad B_{54} = b_5 2^1 + b_4 2^0 \tag{7}$$

$$G_{76} = A_{76} + B_{76}; \quad A_{76} = a_7 2^1 + a_6 2^0, \quad B_{76} = b_7 2^1 + b_6 2^0. \tag{8}$$

Of course, the maximum value of the partial sums $G_{ij}$ is six, the binary representation requiring three bits. The sum G is then given by

$$G = G_{10} + G_{32} 2^2 + G_{54} 2^4 + G_{76} 2^6. \tag{9}$$

The partial sums $G_{10}$, $G_{32}$, $G_{54}$, and $G_{76}$ are derived with a B/T circuit. We have two inputs with weight 1 and two inputs with weight 2; the schematic with pre-charge transistors is shown in Fig. 6, which also contains a symbol representing the entire block. The six thermometric outputs of the B/T block are processed as shown in Fig. 7. We have four processing lines, each of them receiving the carry $d_{(3+6k)}$ from the lower line. Each line is the cascade of the B/T block, an SU block (that accommodates the carry of the lower line), and the conversion from thermometric into binary achieved with the T/A and A/B blocks. The pair of bits at the output of each line and the carry of the upper T/A give, as stated by (9), the binary representation of the sum G.

The architecture in Fig. 7 depicts a parallel implementation of the adder. However, we can achieve the result with a pipeline architecture
as well. The groups of bits ($a_0$-$a_1$-$b_0$-$b_1$) and ($a_4$-$a_5$-$b_4$-$b_5$) must be processed during one clock phase (with pre-charge during the complementary one), the other groups during the reciprocal phase. The thermometric representations at the output of each B/T block must be stored for half a clock cycle and transferred to a shift-up module, which operates during the complementary clock phase of the corresponding B/T block.

Fig. 6. B/T converter (two unity-weight inputs and two inputs with weight 2).

Fig. 7. Architecture for the (parallel) 8-bit adder.

V. THE MULTIPLIER

The bead-code representation of numbers permits an effective implementation of the multiplication of two N-bit digital numbers. Below, we discuss a possible circuit solution for N = 8. The required operation is

$$P = A \cdot B \tag{10}$$

where A and B have been defined in (2) and (3). It is well known that the digital representation of P results from the sum of the binary elements in (11), where the elements of the same column have equal binary weight, which increases by a factor 2 moving from right to left. Of course, the term $a_0 b_0$ represents the LSB. The conventional approach to calculate (11) is to use a “shift-and-add” serial technique or, for fast applications, to implement (11) in hardware in a parallel or a pipeline fashion.

$$
\begin{array}{*{15}{c}}
 & & & & & & & a_7 b_0 & a_6 b_0 & a_5 b_0 & a_4 b_0 & a_3 b_0 & a_2 b_0 & a_1 b_0 & a_0 b_0 \\
 & & & & & & a_7 b_1 & a_6 b_1 & a_5 b_1 & a_4 b_1 & a_3 b_1 & a_2 b_1 & a_1 b_1 & a_0 b_1 \\
 & & & & & a_7 b_2 & a_6 b_2 & a_5 b_2 & a_4 b_2 & a_3 b_2 & a_2 b_2 & a_1 b_2 & a_0 b_2 \\
 & & & & a_7 b_3 & a_6 b_3 & a_5 b_3 & a_4 b_3 & a_3 b_3 & a_2 b_3 & a_1 b_3 & a_0 b_3 \\
 & & & a_7 b_4 & a_6 b_4 & a_5 b_4 & a_4 b_4 & a_3 b_4 & a_2 b_4 & a_1 b_4 & a_0 b_4 \\
 & & a_7 b_5 & a_6 b_5 & a_5 b_5 & a_4 b_5 & a_3 b_5 & a_2 b_5 & a_1 b_5 & a_0 b_5 \\
 & a_7 b_6 & a_6 b_6 & a_5 b_6 & a_4 b_6 & a_3 b_6 & a_2 b_6 & a_1 b_6 & a_0 b_6 \\
a_7 b_7 & a_6 b_7 & a_5 b_7 & a_4 b_7 & a_3 b_7 & a_2 b_7 & a_1 b_7 & a_0 b_7
\end{array} \tag{11}
$$

For our purposes, a convenient way to calculate P is to express (11) in the form

$$
\begin{array}{*{8}{c}}
 & & & a_7 b_1 & P_{7,0} & P_{5,0} & P_{3,0} & P_{1,0} \\
 & & a_7 b_3 & P_{7,2} & P_{5,2} & P_{3,2} & P_{1,2} \\
 & a_7 b_5 & P_{7,4} & P_{5,4} & P_{3,4} & P_{1,4} \\
a_7 b_7 & P_{7,6} & P_{5,6} & P_{3,6} & P_{1,6}
\end{array} \tag{12}
$$

Again, the elements in the same column have identical weight; moreover, the weight of the columns increases by a factor 4 when moving from right to left. The generic partial sum $P_{i,j}$ represents the expression

$$P_{i,j} = 2 \cdot (a_i b_j + a_{i-1} b_{j+1}) + a_{i-1} b_j + a_{i-2} b_{j+1} \tag{13}$$

where i = 1, 3, 5, 7 and j = 0, 2, 4, 6; moreover, for i = 1, the last term in (13) must be set equal to zero.

We achieve a thermometric representation of the partial sums $P_{i,j}$ with simple logic (for achieving the necessary “and” operations) and a schematic similar to the one in Fig. 6. The subsequent use of a T/A block permits representing the result in the abacus format.
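The regrouping of the partial-product array into the sums of (12) and (13), and the further grouping into K and H terms given in (14)–(18), can be checked numerically. The sketch below is purely arithmetic (it models no circuit), all function names are hypothetical, and it assumes that the first term of (15) reads $4 a_l b_m$ and that any term with a negative bit index is zero.

```python
def bits(value, n=8):
    """Little-endian bit list: bits(v)[k] is the coefficient of 2**k."""
    return [(value >> k) & 1 for k in range(n)]


def partial_sum(a, b, i, j):
    """P_{i,j} of (13); bit indices outside the operand contribute zero."""
    def bit(x, k):
        return x[k] if 0 <= k < len(x) else 0
    return (2 * (bit(a, i) * bit(b, j) + bit(a, i - 1) * bit(b, j + 1))
            + bit(a, i - 1) * bit(b, j) + bit(a, i - 2) * bit(b, j + 1))


def product_from_12(a_val, b_val):
    """Sum array (12): P_{i,j} weighs 2**(i+j-1), a_7*b_{j+1} weighs 2**(8+j)."""
    a, b = bits(a_val), bits(b_val)
    total = 0
    for j in (0, 2, 4, 6):
        total += (a[7] * b[j + 1]) << (8 + j)
        for i in (1, 3, 5, 7):
            total += partial_sum(a, b, i, j) << (i + j - 1)
    return total


def product_from_18(a_val, b_val):
    """Evaluate P = 2**8 * Q1 + Q0 using the K and H terms of (15)-(18)."""
    a, b = bits(a_val), bits(b_val)

    def p(i, j):
        return partial_sum(a, b, i, j)

    def k(l, m):
        # Assumed reading of (15): 4*a_l*b_m + a_l*b_{m-2} + P_{l,m-1}
        return 4 * a[l] * b[m] + a[l] * b[m - 2] + p(l, m - 1)

    def h(i, j):
        return 4 * (p(i, j) + p(i - 2, j + 2)) + p(i - 2, j) + p(i - 4, j + 2)

    q1 = 16 * k(7, 7) + k(7, 3) + h(7, 4)
    q0 = 16 * (h(7, 0) + h(3, 4)) + h(3, 0)
    return (q1 << 8) + q0
```

For any pair of 8-bit operands, both functions should return the ordinary product, for example `product_from_12(255, 255) == 255 * 255`. Each $P_{i,j}$ stays in the range 0 to 6, which is why a block like the one in Fig. 6 suffices to produce it.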
Using the same principle followed to compute (11), we can group the terms in (12) as follows:

$$
\begin{array}{cccc}
 & K_{7,3} & H_{7,0} & H_{3,0} \\
K_{7,7} & H_{7,4} & H_{3,4}
\end{array} \tag{14}
$$

where the weight of each column increases by a factor 16 moving from right to left. Moreover,

$$K_{l,m} = 4 a_l b_m + a_l b_{m-2} + P_{l,m-1} \tag{15}$$

$$H_{i,j} = 4 (P_{i,j} + P_{i-2,j+2}) + P_{i-2,j} + P_{i-4,j+2} \tag{16}$$

where any term with a negative index is taken as zero, and finally, we can represent (14) as

$$P = 2^8 Q_1 + Q_0 \tag{17}$$

$$Q_1 = 16 K_{7,7} + K_{7,3} + H_{7,4}, \qquad Q_0 = 16 (H_{7,0} + H_{3,4}) + H_{3,0}. \tag{18}$$

The approach proposed here is similar to the well-known Wallace tree [2] and Dadda [3], [4] implementations. The basic idea is to achieve the multiplication result with a hierarchical operand reduction. However, the method proposed here utilizes an abacus representation of numbers with a 0–7 range instead of a simpler binary coding. This feature leads to a further reduction of the need for carry transfer and to a lower number of hierarchical levels. Moreover, specific architectures can be studied in order to reduce the critical path. Nevertheless, the proposed method requires using the variety of basic blocks discussed in Section III. This is only a partial limitation: the basic blocks can be achieved with a regular layout and a well-structured floor plan.

The calculation of the $K_{l,m}$ and $H_{i,j}$ terms represented by (15) and (16) involves the addition of terms with different weights, some of them having an abacus format. It is possible to achieve the result by a proper use of abacus blocks. Unity-weight bits, like the lower beads of the partial sums $P_{i,j}$ or the outputs of the logic “and,” are added by B/T or SU blocks. The results are then transformed into the abacus format by T/A converters. Similar to the architecture in Fig. 7, we can design parallel computation lines with a minimum (or pipelined) carry path. The strategy used was to perform the required operations with a hierarchical approach: the various terms are successively grouped in groups of three or four terms, and the results are calculated with architectures made of basic blocks.

Pipeline implementations are also possible: the technique, of course, requires partitioning the architecture into various stages. Each stage provides the input to a “hold block” used as the interface to the next pipeline stage.

VI. SIMULATION RESULTS AND IMPLEMENTATION ISSUES

Using the methodology discussed in the previous section, we have designed an 8-bit parallel adder, an 8-bit pipeline adder, and an 8 × 8 pipeline multiplier. The circuits have been simulated with SPICE using a 0.35-µm CMOS process. Parasitic capacitances extracted from the layout of the basic blocks and an estimation of the interconnection capacitances have been accounted for. The achieved results are summarized in Table I. We can observe that, for the pipeline implementations, the pre-charge phase and the I/O delay due to the transfer-gate operation are less than 0.38 and 0.51 ns for the 8-bit adder and the 8 × 8 multiplier, respectively. Therefore, taking one clock period as two such phases, the maximum possible clock frequency is, in the nominal case, 1/(2 × 0.38 ns) ≈ 1.3 GHz and 1/(2 × 0.51 ns) ≈ 980 MHz, respectively.

TABLE I
FEATURES OF ABACUS ARITHMETIC CIRCUITS

The total number of transistors required by the simulated circuits is limited; it ranges from 296 to 3699. These figures are quite acceptable for the implemented functions. Moreover, a custom layout permits obtaining a good compactness. The 33 transistors required to achieve a B/T function can be accommodated within a 16 × 19 µm space, leading to an area per single transistor as small as 9.5 µm². Assuming that the overhead for block interconnections is 100% of the basic block area, we can estimate that the entire 8 × 8 pipeline multiplier can be accommodated in 0.07 mm². The above estimation is rough; nevertheless, the achieved result gives us an idea of the possible chip area of the proposed solution.

VII. CONCLUSION

This brief presented a technique for performing arithmetic functions that mimics the Chinese abacus. The key feature of the method is the use of a different data representation. Using abacus basic blocks, it was possible to achieve fast CMOS adders/multipliers operating at a clock frequency higher than traditional counterparts. The circuit implementation requires a small chip area. Nevertheless, it is difficult to compare our solution with traditional architectures; the chip area critically depends on the design rules of the specific technology used.

ACKNOWLEDGMENT

The authors would like to thank G. Torelli and S. Cirimelli for numerous helpful discussions.

REFERENCES

[1] R. Zimmerman and W. Fichtner, “Low-power logic styles: CMOS versus pass transistor logic,” IEEE J. Solid-State Circuits, vol. 32, pp. 1079–1090, July 1997.
[2] M. Hanawa, K. Kaneko, et al., “A 4.3 ns 0.3 µm CMOS 54 × 54 multiplier using precharged pass-transistor logic,” in Proc. ISSCC’96, pp. 364–365.
[3] S. Naffziger, “A sub-nanosecond 0.5 µm 64 b adder design,” in Proc. ISSCC’96, pp. 362–363.
[4] A. Inoue, R. Ohe, et al., “A 4.1 ns compact 54 × 54 multiplier utilizing sign select Booth encoders,” in Proc. ISSCC’97, pp. 416–417.
