
Designing Dynamic Carry Skip Adders: Analysis and Comparison

Article  in  Circuits Systems and Signal Processing · April 2014


DOI: 10.1007/s00034-013-9688-y

Circuits Syst Signal Process (2014) 33:1019–1034
DOI 10.1007/s00034-013-9688-y

Designing Dynamic Carry Skip Adders: Analysis and Comparison

Raffaele De Rose · Marco Lanuzza · Fabio Frustaci · Sohan Purohit

Received: 30 November 2012 / Revised: 26 September 2013 / Published online: 31 October 2013
© Springer Science+Business Media New York 2013

Abstract Addition is an important operation that significantly impacts the performance of almost every data processing system. Because of this, addition algorithms and their circuit implementations have received consistent research attention over the years. One of the most popular implementations for long adders is the carry skip adder. In this paper, we present a design space exploration for a variety of carry skip adder implementations. More specifically, the paper focuses on the implementation of these adders using traditional as well as novel dynamic circuit design styles. 8-, 16-, 32- and 64-bit adders were implemented using traditional domino, footless domino, and data driven dynamic logic (D3L) in the STMicroelectronics 45 nm 1 V CMOS process. In order to further exploit the advantages of the domino and D3L approaches, a new hybrid methodology combining both strategies was implemented and is presented in this work. The adders were analyzed for energy-delay trade-offs at different process corners. They were also examined for their sensitivity to process and supply voltage variations. Comparative simulation results reveal that the full D3L adder ensures a better energy-delay product over all process corners (up to 34 % and 25 % lower than the domino and hybrid implementations, respectively, at the typical corner), while showing similar process and supply voltage variability compared to the other considered carry skip adder configurations.

R. De Rose · M. Lanuzza (corresponding author) · F. Frustaci
Department of Informatics, Modeling, Electronics and System Engineering, University of Calabria,
Via P. Bucci 42C, 87036 Arcavacata Di Rende, CS, Italy
M. Lanuzza e-mail: lanuzza@deis.unical.it
R. De Rose e-mail: rderose@deis.unical.it
F. Frustaci e-mail: frustaci@deis.unical.it

S. Purohit
Intel Corporation, Austin, TX 78746, USA
e-mail: sohan.s.purhoit@intel.com

Keywords Carry skip adders · Dynamic circuits

1 Introduction

Addition is an essential operation in any digital system and can significantly impact the performance of the overall system [20]. It also forms the backbone of other arithmetic circuits such as multipliers, dividers, and comparators. High-speed adders can therefore be considered core elements of modern digital signal processors (DSPs) and multimedia processors and, consequently, they critically influence the performance-power profiles of these systems.
Since carry propagation is a major speed-limiting factor, the design of fast carry chains has always attracted great interest from researchers working on high-performance arithmetic circuits and systems [8, 22]. Among several possible addition schemes, Manchester Carry Chain (MCC)-based circuits have become very popular due to their simplicity and efficiency [22]. In the last few years, several efficient MCC-based addition circuits have been proposed in the literature [2, 5, 12]. In particular, the one introduced in [2] allows low-complexity, area-efficient, and high-performance adder implementations.
While speed is crucial, the advent of mobile computing has put considerable emphasis on reducing the power consumption of digital systems. DSP and multimedia processing systems form part of these mobile computing solutions and are therefore subject to stringent power constraints. Consequently, arithmetic processing systems and, hence, addition circuits need to reduce their power consumption. As demonstrated in [24], the chosen logic design style, together with the adopted transistor sizing criterion, can significantly affect the energy dissipation. High-speed adders are usually designed in dynamic domino logic [22]. As is well known, the correct functionality of a dynamic domino circuit depends on the appropriate design of the system clock distribution tree. However, due to its high switching activity, the clock distribution tree contributes significantly to the total power budget of the system, sometimes accounting for up to roughly 40 % of the total system power [10]. In order to limit the power consumption while still retaining the speed advantage of traditional domino circuits, the Data Driven Dynamic Logic (D3L) was proposed [18]. This logic design style allows circuits to operate in the precharge-evaluate fashion of conventional dynamic circuits without the need for a clock signal to generate the precharge and evaluation sequence. Instead, these circuits use the input signal vectors to generate the precharge and evaluation patterns. D3L circuits therefore retain the speed advantages of conventional dynamic circuits while avoiding the extra power consumption associated with the clock tree. However, in dynamic circuits with long precharge propagation paths, the energy advantages of D3L are typically obtained at the expense of a non-negligible speed penalty [6, 7, 14, 16].
In this work, an extensive analysis of the impact of different design styles on the design of a fast carry skip adder is presented. This adder topology has been chosen as a case study since it is one of the most popular addition implementation strategies in applications where a balance between speed and energy consumption is critical [13]. Compared to the faster Carry-Look-Ahead (CLA) approach, the carry skip adder has been shown to achieve competitive performance with considerably lower energy dissipation [13]. In this work, detailed evaluations of four different transistor level designs of an n-bit carry skip adder (where n = 8, 16, 32, 64) are reported. All circuits exploit the carry-skip chain (CSC) proposed in [2] to speed up the carry propagation and, consequently, improve the overall adder performance. The four designs were implemented using standard domino, footless domino, D3L, and dynamic hybrid (standard domino + D3L) logic design styles, respectively, and were laid out in the STMicroelectronics 45 nm 1 V CMOS technology. In particular, this paper expands the work reported in [4] by characterizing the considered adder structures post-layout. As additional analysis, post-layout comparative characterizations were performed considering different process corners and the effects of random process variability and power supply variations.
The rest of the paper is organized as follows. Section 2 presents an overview of the carry-skip adder architecture considered for this study. Section 3 presents the transistor level designs of the four adders, followed by detailed post-layout simulation results and analysis in Sect. 4. Finally, the main results of the work are summarized in Sect. 5.

2 The Carry-Skip Adder Architecture

As illustrated in Fig. 1, the generic n-bit carry skip adder uses four basic logic blocks: the carry Propagate Block (PB), the carry Generate Block (GB), the carry propagation block (which contains the Skip Logic) and the Sum Block (SB). The PBs and the GBs calculate the ith carry propagate (P_i = A_i ⊕ B_i) and carry generate (G_i = A_i · B_i) signals, respectively. The carry propagation block is based on the MCC circuit, which uses the ith carry propagate and generate signals to generate the output carry bits (C_{i+1} = G_i + P_i · C_i). Finally, the SBs produce the final sum bits (S_i = P_i ⊕ C_i).

Fig. 1 n-Bit carry skip adder using cascaded 4-bit blocks

Fig. 2 4-Bit Manchester carry chain in standard domino

The use of MCCs allows very simple and efficient carry propagation. As depicted in Fig. 1, the number of carry signals produced by a single MCC block is typically limited to four [17]. The motivation behind this can be explained with reference to Fig. 2, which shows the basic architecture of a 4-bit MCC dynamic standard domino implementation. As shown in Fig. 2, by limiting the size of the smallest basic block to 4 bits, the maximum height of the NMOS transistor pull-down stack is reduced to six transistors, thereby limiting the body-effect induced rise in transistor threshold voltages [17]. This approach also limits the propagation delay through the pass transistors, which is a quadratic function of the number of bits in the block [22]. The critical path delay of a single 4-bit MCC block also includes the delay of the intermediate buffers inserted between two consecutive MCC blocks. Furthermore, the first stage of the MCC is redundant since it is only used to generate the complement of the input carry signal. The improved MCC circuits proposed in [5] and in [12] speed up the carry propagation, but they still exhibit the area and delay overhead due to the intermediate buffers and redundant input stages of the basic 4-bit MCC block. The solution proposed in [2], called carry-skip chain (CSC), eliminates the redundant stages and intermediate buffers, thus resulting in an area-efficient high-performance MCC circuit implementation. The CSC also incorporates efficient carry-skip speed-up logic [2]. In this work, the CSC approach has been adopted to design fast n-bit carry-skip adders.
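The logic equations of this section can be checked with a small bit-level model. The sketch below is only a functional illustration, not the authors' circuit: it assumes the propagate signal is the XOR of the operand bits (consistent with the XOR-based PB and with S_i = P_i ⊕ C_i), uses the block skip signal defined in Sect. 3.1 (the AND of the four propagate bits), and the function name `carry_skip_add` is hypothetical.

```python
def carry_skip_add(a, b, n=16, cin=0, block=4):
    """Functional model of an n-bit carry skip adder built from 4-bit blocks.

    Returns (sum, carry_out). Timing is not modeled; the skip signal is only
    checked for logical consistency with the rippled carry.
    """
    assert n % block == 0, "adder width must be a multiple of the block size"
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]  # P_i = A_i xor B_i
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]  # G_i = A_i and B_i

    total = 0
    carry = cin
    for start in range(0, n, block):
        block_cin = carry
        for i in range(start, start + block):
            total |= (p[i] ^ carry) << i       # S_i = P_i xor C_i
            carry = g[i] | (p[i] & carry)      # C_{i+1} = G_i + P_i * C_i
        sk = all(p[start:start + block])       # skip signal of this block
        if sk:
            # When every bit propagates, the block carry-out equals its
            # carry-in, which is what the skip path exploits for speed.
            assert carry == block_cin
    return total, carry
```

For example, `carry_skip_add(0x0FFF, 0x0001, n=16)` ripples a carry out of the first block and skips it through the next two blocks.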

3 Transistor Level Carry Skip Adder Designs

For this study, four different transistor level carry skip adder designs were imple-
mented by exploiting standard domino, footless domino, D3L and dynamic hybrid
logic design styles. This section details the different transistor level implementations.

3.1 Standard Domino Carry Skip Adder

Figure 3(a–b) shows the standard domino implementations of the PB and GB basic sub-circuits, respectively. In both circuits, the evaluating NMOS transistors were equally sized in order to make the corresponding pull-down network equivalent to a single 0.12 µm wide NMOS device. Since all stages in a domino circuit precharge simultaneously, and since there is a single device in the precharge path, PMOS devices with a channel width of 0.16 µm were used in the precharge pull-up network. This approach reduces the capacitive loading presented to the clock tree, thus lowering the power consumption of the clock distribution network. As shown in Fig. 3(c), the SBs used for computing the final sum bits were implemented using static 2-input XOR gates. In the XOR gate, pull-down and pull-up networks (PDNs and PUNs, respectively) were both sized to be equivalent to the corresponding pull-down and pull-up devices of the output inverter.

Fig. 3 (a) 2-Input standard domino XOR; (b) 2-input standard domino AND; (c) 2-input static XOR

Fig. 4 Implementation of (a) basic 4-bit carry-skip chain (CSC) block and (b) carry-skip signal generator using standard domino logic (the highlighted NMOS transistors are within the critical evaluation path)

The 4-bit CSC-based MCC, implemented in standard CMOS dynamic domino logic, is shown in Fig. 4(a). The output node exploits two carry-skip pull-down transistors controlled by the skip signal and the input carry, respectively. As illustrated in Fig. 4(b), the skip signal (sk) is generated by the logical AND of all carry propagate signals in the block (i.e., sk = P_i · P_{i+1} · P_{i+2} · P_{i+3}). Note that the carry-skip pull-down not only speeds up the generation of the final carry, but also restores the signal strength at this node. This eliminates the need for intermediate buffers between CSC blocks. Moreover, the NMOS transistors in the PDNs of both circuits were sized using a progressive transistor sizing approach with a tapering factor of 1.5 [17].

3.2 Footless Domino Carry Skip Adder

As shown in Fig. 5(a), the speed of the standard domino carry skip adder can be improved by implementing the 4-bit CSC block in footless domino logic. In this case, only a single NMOS pull-down transistor is used at each node of the circuit. Similarly, the AND gate used to calculate the skip signal can be realized in footless domino logic (Fig. 5b). Note that NMOS and PMOS transistors were sized as in the standard domino design.

Fig. 5 Implementation of (a) basic 4-bit carry-skip chain (CSC) block and (b) carry-skip signal generator using footless domino logic

In terms of transistor count, a footless domino 4-bit CSC block saves five transistors compared to the standard domino design. This is a significant reduction in the transistor count of the carry chain, which results in a more speed-, energy- and area-efficient implementation. More importantly, the use of footless domino logic in the CSC blocks reduces the system clock load capacitance, leading to lower dynamic power consumption in the clock distribution system.
Despite its speed advantages, footless domino logic has a severe drawback: removing the footer transistor may result in high static power dissipation due to short-circuit paths during the precharge phase [9]. In order to avoid this effect, a delay element (e.g. a transmission gate) should be inserted in the system clock distribution tree to delay the clock signal between two cascaded blocks [9].

3.3 Carry Skip Adder Implementation Using Data Driven Dynamic Logic

The D3L design methodology allows designers to minimize or even eliminate the clock distribution network required by conventional dynamic circuits, thus leading to significantly lower energy consumption [18]. Instead of the traditional clocked precharge, D3L uses a combination of input signals to achieve the alternating precharge and evaluation phases [18]. This helps to retain the speed advantage traditionally associated with dynamic circuits, without the extra cost of clock-related power consumption and clock tree design.
Figure 6(a–b) illustrates the D3L implementation of the generic PB and GB sub-circuits, where the clocked precharge transistors are replaced by PMOS precharge transistors driven by the input signals. Note that the clock signal used in domino dynamic logic to coordinate the gate operations is eliminated at the expense of higher capacitance on the input lines. The transistor level schematic of the 4-bit D3L CSC block is shown in Fig. 7(a–b). In this case, the precharge phase is driven by the propagate and generate signals. It is worth noting that the removal of the clocked NMOS transistor in each pull-down path reduces the evaluation path delay of the generic block, as in the footless domino version. One impact of the D3L approach from a timing perspective is that the precharge phase is no longer simultaneous for all the blocks. Consequently, there exists a precharge propagation path through cascaded blocks. From a design perspective, this implies that all the precharge networks in D3L circuits have to be properly sized to prevent the precharge propagation from becoming critical for the system [15]. However, in the designed D3L carry skip adder, the precharge path is much shorter than the evaluation path, thus allowing all the PMOS precharge transistors to be sized with a channel width of 0.16 µm without incurring any additional delay penalties.

Fig. 6 Implementation of (a) 2-input XOR gate and (b) 2-input AND gate using D3L

Fig. 7 Implementation of (a) basic 4-bit carry-skip chain (CSC) block and (b) carry-skip signal generator using D3L

Fig. 8 (a) Simulation setup, (b) clock distribution tree

This minimum-width sizing of the precharge transistors reduces the loading on the intermediate signals used for achieving precharge and thus reduces the overall power consumption of the circuit. As for the previous implementations, NMOS evaluating transistors were sized with a progressive sizing methodology. In terms of transistor count, a single 4-bit D3L CSC block uses four more transistors than the 4-bit footless domino CSC block.

3.4 Carry Skip Adder Implementation Using Hybrid Dynamic Design

In this design, a hybrid approach combining standard domino and D3L circuits was adopted. Specifically, the circuits generating the propagate and generate signals were designed in standard domino logic (Fig. 3(a–b)), whereas the CSC blocks were implemented in the D3L design style (Fig. 7(a–b)). In this way, the input capacitance of the circuit is reduced with respect to the full D3L implementation, while still retaining the power advantages of the D3L carry propagation chain.

4 Results

The designs discussed above were laid out in the commercial STMicroelectronics 45 nm 1 V CMOS technology. Figure 8(a) shows the simulation setup used in this work to evaluate the compared circuits. Input buffers were placed between ideal voltage sources and data/clock inputs in order to provide realistic input signals. The characterization was performed by loading each output signal with a 0.8 fF capacitance, which corresponds to the input capacitance of a D-type flip-flop in the referred technology.
In order to correctly distribute the clock signal to the dynamic gates used in the circuits, a two-level clock buffer tree was designed. It is depicted in Fig. 8(b), where C_LGB, C_LPG and C_LCS are the clock load capacitances due to the clocked transistors within the PB, the GB and the CSC sub-circuits, respectively. The logical effort method [21] was used for sizing the inverter chains in the clock buffer network. In particular, Eq. (1) was applied, where C_L represents the clock load capacitance of the critical inverter chain and C_gl denotes the gate capacitance of the inverter at the lth stage of the clock buffer, with l = 1, ..., 4:

\[ \frac{C_{g2}}{C_{g1}} = \frac{3C_{g3}}{C_{g2}} = \frac{C_{g4}}{C_{g3}} = \frac{C_L}{C_{g4}} \qquad (1) \]
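As an illustration of how Eq. (1) fixes the buffer sizes, the sketch below solves the four equal ratios for a given first-stage gate capacitance and clock load. It assumes the factor 3 in the second ratio is a branching factor exactly as printed in Eq. (1); the function name and the numeric values in the usage note are hypothetical and are not taken from the paper.

```python
def size_clock_buffer(cg1, cl, branching=3.0):
    """Solve Cg2/Cg1 = branching*Cg3/Cg2 = Cg4/Cg3 = CL/Cg4 = f.

    cg1: gate capacitance of the first inverter (any consistent unit)
    cl:  clock load capacitance of the critical inverter chain (same unit)
    Returns the common stage ratio f and the tuple (Cg1, Cg2, Cg3, Cg4).
    """
    # Multiplying the four ratios gives f**4 = branching * CL / Cg1.
    f = (branching * cl / cg1) ** 0.25
    cg2 = f * cg1
    cg3 = f * cg2 / branching   # the second stage drives `branching` copies
    cg4 = f * cg3
    return f, (cg1, cg2, cg3, cg4)
```

With the hypothetical values `cg1 = 1.0` and `cl = 27.0` (arbitrary units), every ratio in Eq. (1) comes out close to 3.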

Table 1 Delay comparison of various adder implementations (all delays in ps)

                   8 bit                                     16 bit
Implementation     tbuff-data  tin-sum  tin-cout  tpre       tbuff-data  tin-sum  tin-cout  tpre
Standard domino    14.9        278.6    138.8     85.3       14.9        377.5    249.1     89.7
Footless domino    14.9        273.4    138       101.9      14.9        371.8    249.4     106.4
Hybrid             14.9        278.7    141.3     147        14.9        376.4    253.1     150.1
D3L                15.6        278.7    140.8     116.3      15.6        377.1    252.2     116.3

                   32 bit                                    64 bit
Implementation     tbuff-data  tin-sum  tin-cout  tpre       tbuff-data  tin-sum  tin-cout  tpre
Standard domino    14.9        599.7    472.7     95.2       14.9        1038.3   912.8     101.9
Footless domino    14.9        595.8    473.1     111.3      14.9        1037.1   915.9     118
Hybrid             14.9        599.2    476.4     154        14.9        1039.9   917.8     158.8
D3L                15.6        601.8    477.5     116.3      15.6        1046.2   922.9     116.3

It is worth mentioning that the standard domino circuit required the most complex clock distribution tree, whereas the clock load due to the CSC sub-circuits is reduced for the footless domino circuit and completely eliminated for the hybrid adder. As previously mentioned, the full D3L circuit does not require a clock distribution tree.
All four designs were analyzed for delay, energy and energy-delay product (EDP). The analysis was repeated for adder widths of 8, 16, 32 and 64 bits in order to investigate how the performance of the various design styles depends on adder width. The Cadence Spectre simulator was employed to evaluate the speed performance, whereas the average energy dissipation was measured using Synopsys NanoSim. Comparative post-layout delay and energy results, obtained for the typical NMOS, typical PMOS process corner at 27 °C, are reported in Tables 1 and 2, respectively.
Table 1 shows that the D3L circuit implementation exhibits a slightly slower evaluation due to the increased loading capacitance on intermediate data signals in the PB, GB and CSC sub-circuits. Additionally, its precharge delay (116.3 ps at every width) is higher than that of the standard domino implementation in all cases and higher than that of the footless domino implementation up to 32 bits. Furthermore, due to the higher capacitances on the input lines, the D3L circuit also shows approximately 4.7 % higher data input buffer delay compared to the other carry skip adders. This indicates that the D3L circuit presents a larger load to the circuits driving it.
In order to correctly evaluate the energy consumption of the different implementations, the energy dissipation of the data input buffers (Ebuff-data), the clock distribution network (Eclk) and the carry skip adder (ECSA) have been measured separately. Comparative energy results, shown in Table 2, demonstrate that, although the D3L implementation exhibits a slightly higher data input buffer energy consumption due to the higher capacitances on its input lines, it always shows the lowest total energy dissipation owing to the complete removal of the clock distribution network. This confirms the full D3L approach as the implementation strategy of choice when designing high-speed circuits for energy-constrained environments.

Table 2 Energy comparison of various adder implementations (all energies in fJ)

                   8 bit                                 16 bit
Implementation     Ebuff-data  Eclk    ECSA    ETOT      Ebuff-data  Eclk    ECSA    ETOT
Standard domino    38.6        58.9    113.2   210.6     77.1        98.0    227.6   402.6
Footless domino    38.6        43.4    110.4   192.3     77.1        72.6    223.1   372.8
Hybrid             38.6        33.6    113.8   185.9     77.1        56.1    228.6   361.7
D3L                38.7        –       100.4   139.0     77.4        –       203.6   281.0

                   32 bit                                64 bit
Implementation     Ebuff-data  Eclk    ECSA    ETOT      Ebuff-data  Eclk    ECSA    ETOT
Standard domino    154.2       173.1   462.9   790.2     308.4       322.7   957.7   1588.8
Footless domino    154.2       126.9   454.5   735.6     308.4       234.5   942.4   1485.3
Hybrid             154.2       98.5    462.8   715.5     308.4       181.4   959.2   1449.0
D3L                154.7       –       416.2   570.9     309.4       –       860.4   1169.8

Fig. 9 Energy-delay-product curves for the different adder implementations

The Energy-Delay Product (EDP) value, calculated as the product of the worst-case DATA-INPUT → SUM-OUTPUT delay and the total dissipated energy, gives a quantitative measure of the speed-energy trade-off. It is therefore a particularly useful quality metric when designing circuits that must balance high speed against low power. The EDP values for the different circuit implementations are plotted in Fig. 9.
Due to its better energy results, the D3L adder always exhibits the lowest EDP values, thus achieving the best speed-energy trade-off.
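The EDP definition above can be reproduced directly from the typical-corner numbers in Tables 1 and 2. The following sketch is only a convenience calculation, assuming the input-to-sum delay (tin-sum) is the worst-case delay used for the EDP; the dictionary layout and the function names are hypothetical.

```python
# Typical-corner (tin-sum delay in ps, total energy ETOT in fJ), copied from
# Tables 1 and 2.
TT_RESULTS = {
    8:  {"Standard domino": (278.6, 210.6), "Footless domino": (273.4, 192.3),
         "Hybrid": (278.7, 185.9), "D3L": (278.7, 139.0)},
    16: {"Standard domino": (377.5, 402.6), "Footless domino": (371.8, 372.8),
         "Hybrid": (376.4, 361.7), "D3L": (377.1, 281.0)},
    32: {"Standard domino": (599.7, 790.2), "Footless domino": (595.8, 735.6),
         "Hybrid": (599.2, 715.5), "D3L": (601.8, 570.9)},
    64: {"Standard domino": (1038.3, 1588.8), "Footless domino": (1037.1, 1485.3),
         "Hybrid": (1039.9, 1449.0), "D3L": (1046.2, 1169.8)},
}


def edp(delay_ps, energy_fj):
    """Energy-delay product in J*s from a delay in ps and an energy in fJ."""
    return (delay_ps * 1e-12) * (energy_fj * 1e-15)


def edp_saving(width, reference, design="D3L"):
    """Fractional EDP reduction of `design` relative to `reference`."""
    return 1.0 - edp(*TT_RESULTS[width][design]) / edp(*TT_RESULTS[width][reference])


if __name__ == "__main__":
    for width, designs in TT_RESULTS.items():
        for name, (delay, energy) in designs.items():
            print(f"{width:2d} bit  {name:16s} EDP = {edp(delay, energy):.3e} J*s")
```

For the 8-bit adder, `edp_saving(8, "Standard domino")` and `edp_saving(8, "Hybrid")` come out at roughly 0.34 and 0.25, matching the savings quoted in the abstract.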

4.1 Corner Analysis

Post-layout corner simulation results are summarized in Table 3. The TT corner involves typical NMOS and PMOS transistors. The FF corner is related to fast NMOS and PMOS transistors, whereas the SS corner considers slow NMOS and PMOS devices.

Table 3 Corner simulation results for the adder implementations at varying bit widths (delay in ps, energy in fJ, EDP in 10^-23 J·s)

8 bit
                   TT                        FF                        SS
Implementation     Delay   Energy  EDP       Delay   Energy  EDP       Delay   Energy  EDP
Standard domino    278.6   210.6   5.87      246.4   292.9   7.22      388.0   187.3   7.27
Footless domino    273.4   192.3   5.26      237.0   271.6   6.44      373.9   167.5   6.26
Hybrid             278.7   185.9   5.18      241.9   264.2   6.39      381.8   162.8   6.22
D3L                278.7   139.0   3.87      237.3   192.2   4.56      373.3   122.7   4.58

16 bit
                   TT                        FF                        SS
Implementation     Delay   Energy  EDP       Delay   Energy  EDP       Delay   Energy  EDP
Standard domino    377.5   402.6   15.20     322.1   568.5   18.31     507.7   355.7   18.06
Footless domino    371.8   372.8   13.86     315.5   533.1   16.82     498.9   321.9   16.06
Hybrid             376.4   361.7   13.62     318.8   520.4   16.59     505.8   314.6   15.91
D3L                377.1   281.0   10.60     315.6   390.4   12.32     498.7   247.0   12.32

32 bit
                   TT                        FF                        SS
Implementation     Delay   Energy  EDP       Delay   Energy  EDP       Delay   Energy  EDP
Standard domino    599.7   790.2   47.39     497.6   1121.8  55.81     789.9   693.6   54.79
Footless domino    595.8   735.6   43.83     490.4   1058.4  51.90     780.4   632.7   49.38
Hybrid             599.2   715.5   42.87     494.4   1037.5  51.30     788.4   621.6   49.00
D3L                601.8   570.9   34.35     493.1   792.8   39.09     784.3   500.4   39.25

64 bit
                   TT                        FF                        SS
Implementation     Delay   Energy  EDP       Delay   Energy  EDP       Delay   Energy  EDP
Standard domino    1038.3  1588.8  164.97    842.4   2267.0  190.97    1339.8  1393.6  186.71
Footless domino    1037.1  1485.3  154.04    835.1   2143.2  178.97    1334.1  1277.3  170.40
Hybrid             1039.9  1449.0  150.69    842.7   2102.8  177.21    1349.0  1257.4  169.63
D3L                1046.2  1169.8  122.39    843.0   1622.5  136.77    1348.2  1027.2  138.48
The obtained results show that, at every process corner, the D3L implementation achieves significantly lower energy dissipation than the other implementations. Moreover, the full D3L implementation provides 34 %, 30 %, 28 % and 26 % improvement in EDP over the 8-, 16-, 32- and 64-bit standard domino implementations, respectively, in the typical corner. The D3L circuit always offers the lowest energy-delay product over all process corners.

Table 4 Comparison of delay variability of the different adder implementations at varying bit widths (μ and σ in ps, σ/μ in %)

                   8 bit                16 bit               32 bit               64 bit
Implementation     μ      σ     σ/μ     μ      σ     σ/μ     μ      σ     σ/μ     μ       σ     σ/μ
Standard domino    278.6  25.0  9.0     377.5  34.5  9.1     599.7  51.4  8.6     1038.3  90.1  8.7
Footless domino    273.4  26.7  9.8     371.8  37.2  10.0    595.8  53.0  8.9     1037.1  90.3  8.7
Hybrid             278.7  26.3  9.4     376.4  37.6  10.0    599.2  56.2  9.4     1039.9  96.4  9.3
D3L                278.7  27.0  9.7     377.1  35.7  9.5     601.8  57.4  9.5     1046.2  94.1  9.0

4.2 Delay Variability Analysis

The effects of random process variations on the delay variability of all the considered circuit implementations were evaluated through 1000-sample Monte Carlo simulations performed in the Cadence environment. The results of the delay variability analysis are shown in Table 4, which reports the mean (μ), standard deviation (σ) and relative variation (σ/μ) of the delay. The standard domino circuit always presents the lowest delay variability (i.e., the lowest σ and σ/μ values) for all the evaluated adder sizes. The footless domino circuit exhibits the highest delay variability for adder sizes of 8 and 16 bits, while for 32 and 64 bits the D3L circuit and the hybrid circuit show the highest σ/μ values, respectively. Overall, all the adder implementations show similar delay variability (around 8–10 %) in the presence of mismatch variations. The delay variability therefore appears to be relatively independent of the circuit implementation of the carry skip adder.
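The three statistics reported in Table 4 can be obtained from a set of Monte Carlo delay samples as sketched below. This is a generic post-processing sketch, not the authors' flow: the function name is hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def delay_variability(delays_ps):
    """Return (mean, standard deviation, sigma/mu in percent) of delay samples.

    delays_ps: iterable of delays in ps, one per Monte Carlo run.
    Assumption: sample standard deviation (n - 1 in the denominator).
    """
    samples = list(delays_ps)
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    return mu, sigma, 100.0 * sigma / mu
```

As a consistency check on Table 4, the 8-bit standard domino entry gives 100 · 25.0 / 278.6, which is about 9.0 %.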

4.3 Power Supply Variability Analysis

As is well known, varying the value of VDD is an effective strategy to enhance circuit performance (by increasing VDD) or, alternatively, to reduce power dissipation (by decreasing VDD). However, the power supply in a digital circuit can also experience an unwanted variation ΔVDD from its nominal value due to noise-related effects [11, 23]. Unlike a deliberate VDD modification, which has a predetermined and desired impact on circuit performance, an unwanted VDD fluctuation can easily cause a random variation of circuit performance. In practical designs, the variation ΔVDD can be kept down by proper sizing of the supply distribution rails and by the use of decoupling capacitors, which is primarily a design effort at the full-chip level. However, as a general design practice, individual VLSI circuits are usually designed to tolerate a 5 %–10 % supply voltage variation [3]. Since the ratio ΔVDD/VDD is small, the impact of the power supply uncertainty on adder delay can be evaluated by the delay sensitivity with respect to VDD [1]:

\[ S^{\tau}_{V_{DD}} = \lim_{\Delta V_{DD} \to 0} \frac{\Delta\tau / \tau}{\Delta V_{DD} / V_{DD}} = \frac{V_{DD}}{\tau} \cdot \frac{d\tau}{dV_{DD}} \qquad (2) \]

The above figure of merit can be derived theoretically by modeling the carry-skip chain of the analyzed adders. Let us consider the example of the standard domino implementation of Fig. 4(a). The critical path of the chain is the series of highlighted NMOS devices through which the carry-out node can be discharged. For modeling purposes, this path can be considered as a single equivalent NMOS transistor. According to the alpha-power law proposed in [19], the delay τ of the CSC is inversely proportional to the saturation current IDS of the equivalent NMOS, which can be expressed as

\[ I_{DS} = K \cdot (v_{gs} - v_{th})^{\alpha} \qquad (3) \]

where K is a technology-dependent constant, proportional to the transistor aspect ratio; vgs is the gate-to-source voltage; vth is the transistor threshold voltage; and α is a technology-dependent coefficient ranging from 1 (deep sub-micrometer transistors) to 2 (long-channel transistors). The transient response of the output voltage v0 can be formulated as follows:

\[ v_0(t) = V_{DD} - \frac{1}{C} \int_0^t K \cdot (v_{gs} - v_{th})^{\alpha} \, dt \qquad (4) \]

In order to express the delay τ introduced by the equivalent NMOS device, the latter can be replaced with its equivalent resistance, whose mean value can be calculated as

\[ R_{equ} = \frac{1}{-V_{DD}/2} \int_{V_{DD}}^{V_{DD}/2} \frac{v_0}{K \cdot (v_{in} - v_{th})^{\alpha}} \, dv_0 = \frac{3}{4} \cdot \frac{V_{DD}}{K \cdot \left(\frac{V_{DD}}{2} - v_{th}\right)^{\alpha}} \qquad (5) \]

where it has been assumed that vin = VDD/2 (the CSC delay τ is defined as the difference between the time when v0 crosses the value VDD/2 and the time when vin reaches the same value). Therefore, the simple circuit of Fig. 4 can be treated as an RC network, whose delay is given by [13]

\[ \tau = \log 2 \cdot R_{equ} \cdot C = 0.52 \cdot \frac{V_{DD}}{K \cdot \left(\frac{V_{DD}}{2} - v_{th}\right)^{\alpha}} \cdot C \qquad (6) \]

By substituting Eq. (6) in Eq. (2), and after some simplification, the following expression of the delay sensitivity with respect to supply variations is obtained:

\[ S^{\tau}_{V_{DD}} = 1 - \frac{0.5 \cdot \alpha \cdot \frac{V_{DD}}{v_{th}}}{0.5 \cdot \frac{V_{DD}}{v_{th}} - 1} \qquad (7) \]
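The simplification leading from Eq. (6) to Eq. (7) can be written out explicitly. The steps below are a sketch that treats K, C, vth and α as independent of VDD, which is the assumption implicit in Eq. (6):

```latex
\begin{align*}
\ln \tau &= \ln\!\left(\frac{0.52\,C}{K}\right) + \ln V_{DD}
            - \alpha \ln\!\left(\frac{V_{DD}}{2} - v_{th}\right) \\[4pt]
\frac{1}{\tau}\frac{d\tau}{dV_{DD}} &= \frac{1}{V_{DD}}
            - \frac{\alpha/2}{\frac{V_{DD}}{2} - v_{th}} \\[4pt]
S^{\tau}_{V_{DD}} = \frac{V_{DD}}{\tau}\frac{d\tau}{dV_{DD}}
         &= 1 - \frac{0.5\,\alpha\,V_{DD}}{0.5\,V_{DD} - v_{th}}
          = 1 - \frac{0.5\,\alpha\,\frac{V_{DD}}{v_{th}}}{0.5\,\frac{V_{DD}}{v_{th}} - 1}
\end{align*}
```

The last equality divides numerator and denominator by vth. Since 0.5·VDD > vth in normal operation and α ≥ 1, the fraction exceeds 1 and the sensitivity is negative, i.e. the delay grows as VDD drops.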
As explained in [1], the value of α, which is typical of the adopted technology, can be found using the simulator: first, the values IDS1 and IDS2 of the NMOS saturation current have to be obtained for vgs equal to the VDD,max allowed by the technology and for vgs equal to (2/3)VDD,max, respectively; afterwards, the value of α can be calculated from Eq. (3), as follows:

\[ \alpha = \frac{\log\left(I_{DS1}/I_{DS2}\right)}{\log\left(\dfrac{V_{DD,max} - v_{th}}{\frac{2V_{DD,max}}{3} - v_{th}}\right)} \cong 1.24 \qquad (8) \]
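Equations (6)–(8) lend themselves to a quick numerical cross-check. The sketch below compares the closed-form sensitivity of Eq. (7) with a finite-difference derivative of the delay model of Eq. (6), and applies Eq. (8) to two current values. The function names are hypothetical, and the threshold voltage of 0.3 V and the 1.1 V maximum supply used in the usage note are illustrative assumptions, since the paper does not state them; only α = 1.24 comes from the text.

```python
import math


def csc_delay(vdd, vth, alpha, k=1.0, c=1.0):
    """Eq. (6): tau = 0.52 * VDD * C / (K * (VDD/2 - vth)**alpha)."""
    return 0.52 * vdd * c / (k * (vdd / 2.0 - vth) ** alpha)


def sensitivity_closed_form(vdd, vth, alpha):
    """Eq. (7): delay sensitivity with respect to VDD."""
    x = 0.5 * vdd / vth
    return 1.0 - alpha * x / (x - 1.0)


def sensitivity_numeric(vdd, vth, alpha, h=1e-6):
    """Eq. (2) evaluated with a central finite difference on Eq. (6)."""
    dtau = csc_delay(vdd + h, vth, alpha) - csc_delay(vdd - h, vth, alpha)
    return vdd / csc_delay(vdd, vth, alpha) * dtau / (2.0 * h)


def extract_alpha(ids1, ids2, vdd_max, vth):
    """Eq. (8): alpha from the saturation currents at VDD,max and (2/3)VDD,max."""
    ratio = (vdd_max - vth) / (2.0 * vdd_max / 3.0 - vth)
    return math.log(ids1 / ids2) / math.log(ratio)


if __name__ == "__main__":
    VTH = 0.3     # assumed threshold voltage in volts (not given in the paper)
    ALPHA = 1.24  # value reported in Eq. (8)
    for vdd in (0.8, 0.9, 1.0, 1.1, 1.2):
        print(vdd, sensitivity_closed_form(vdd, VTH, ALPHA),
              sensitivity_numeric(vdd, VTH, ALPHA))
```

Because K and C cancel in the sensitivity, their default value of 1.0 does not affect the comparison. Feeding `extract_alpha` with currents generated from Eq. (3) for a known α returns that α, which is a simple way to sanity-check a simulator-based extraction script.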

Fig. 10 Power supply sensitivity of the various carry skip adder implementations

Fig. 11 Delay under VDD variations for the 64-bit carry skip adder (delay values are normalized to their nominal values, i.e. VDD = 1 V)

It is worth noting that Eq. (7) gives the delay sensitivity of a single CSC,
whereas the critical path of the whole carry skip adder consists of a single GB,
a chain of series-connected CSCs and a single SB sub-circuit. However, the above
model is general for any dynamic gate and can also be applied to the GB and SB
sub-circuits. The path delay is the sum of the stage delays, so its sensitivity is
the delay-weighted average of the stage sensitivities; since each stage follows
Eq. (7), the same Eq. (7) expresses the delay sensitivity with respect to VDD of the
whole adder. For the same reason, Eq. (7) is valid not only for the standard domino
implementation of the adder, but also for all the other analyzed implementations.
Simulation results (dotted lines), reported in Fig. 10, confirm this: the delay
sensitivity with respect to VDD is practically the same for all the adder
implementations. Moreover, the proposed model is in very good agreement with
simulation results, with a maximum error of 9 %, which occurs at VDD = 0.8 V.
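The claim that a chain of stages, each obeying Eq. (6), has the same overall sensitivity as a single stage can be checked numerically. The sketch below builds a hypothetical critical path (one GB, eight CSCs, one SB) whose stage delays differ only by a constant factor standing in for their different K and C values, and compares a finite-difference sensitivity of the total delay with Eq. (7). The scale factors, vth and the number of CSCs are illustrative assumptions, not values from the paper.

```python
def stage_delay(vdd, scale, vth, alpha):
    """Eq. (6) up to a stage-specific constant (scale stands in for 0.52 * C / K)."""
    return scale * vdd / (vdd / 2 - vth) ** alpha

def path_delay(vdd, scales, vth, alpha):
    """Critical-path delay as the sum of the individual stage delays."""
    return sum(stage_delay(vdd, s, vth, alpha) for s in scales)

def sensitivity_fd(delay_fn, vdd, rel_step=1e-6):
    """Normalized derivative (d tau / d VDD) * (VDD / tau) by central differences."""
    h = vdd * rel_step
    slope = (delay_fn(vdd + h) - delay_fn(vdd - h)) / (2 * h)
    return slope * vdd / delay_fn(vdd)

def sensitivity_eq7(vdd, vth, alpha):
    """Closed-form sensitivity of Eq. (7)."""
    x = 0.5 * vdd / vth
    return 1 - alpha * x / (x - 1)

# Hypothetical path: GB, eight CSCs, SB, with arbitrary relative stage weights
SCALES = [3.0] + [1.0] * 8 + [2.0]
VDD, VTH, ALPHA = 1.0, 0.25, 1.24

s_path = sensitivity_fd(lambda v: path_delay(v, SCALES, VTH, ALPHA), VDD)
s_model = sensitivity_eq7(VDD, VTH, ALPHA)
print(f"path sensitivity (finite difference): {s_path:.4f}")
print(f"Eq. (7):                              {s_model:.4f}")
```

The agreement relies on all stages sharing the same α and vth; if these differed between sub-circuits, the path sensitivity would become a delay-weighted average of different stage values.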
For the sake of completeness, in Fig. 11, the adders’ delay is plotted for the 64-bit
configuration and for different values of the power supply. Delay values have been
normalized to their nominal values (i.e., VDD = 1 V). As predicted by the above
sensitivity analysis, the different adder implementations have roughly the same delay
percentage variation with varying VDD. In particular, higher delay variations can be
observed for lower VDD values. As a matter of fact, for VDD = 0.8 V (−20 %), the delay
increases by about 45 %. On the contrary, for VDD = 1.2 V (+20 %), the delay reduction
is smaller (around −20 %). Figure 12 depicts the variation of the dissipated energy with
varying VDD. Once again, energy values have been normalized to their nominal values
(i.e., VDD = 1 V). It is worth noting that the hybrid implementation is the most
stable design with varying VDD. Indeed, for VDD = 1.2 V (+20 %), the hybrid design
shows an energy dissipation increase of +46 %, whereas the standard and footless
domino implementations undergo a larger energy overhead (+57 %).

Fig. 12 Energy under VDD variations for the 64-bit carry skip adder (energy values are
normalized to their nominal values, i.e. VDD = 1 V)
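The asymmetry of the delay variation around the nominal supply can be illustrated with the analytical model alone. The sketch below evaluates Eq. (6) normalized to VDD = 1 V (so that K and C cancel) and, for comparison, a first-order C·VDD² estimate of the switching energy. The threshold voltage is a hypothetical value, and the energy expression is a textbook first-order assumption rather than the model used in the paper, so the printed figures are indicative only and are not meant to reproduce Figs. 11 and 12.

```python
def normalized_delay(vdd, vth, alpha, vdd_nom=1.0):
    """Eq. (6) evaluated at vdd, divided by its value at vdd_nom (K and C cancel)."""
    def shape(v):
        return v / (v / 2 - vth) ** alpha
    return shape(vdd) / shape(vdd_nom)

def normalized_switching_energy(vdd, vdd_nom=1.0):
    """First-order C * VDD^2 scaling; ignores leakage and short-circuit energy."""
    return (vdd / vdd_nom) ** 2

VTH_ASSUMED = 0.25  # hypothetical threshold voltage in volts, not from the paper
ALPHA = 1.24        # value reported in Eq. (8)

for vdd in (0.8, 1.0, 1.2):
    d = normalized_delay(vdd, VTH_ASSUMED, ALPHA)
    e = normalized_switching_energy(vdd)
    print(f"VDD = {vdd:.1f} V: delay x{d:.2f}, switching energy x{e:.2f}")
```

Under pure C·VDD² scaling, a +20 % supply step raises the switching energy by 44 %, close to the +46 % reported for the hybrid design; the larger increase reported for the domino implementations is not captured by this first-order term.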

5 Conclusion

This paper has presented four different implementations of carry-skip adders using
different dynamic circuit design styles. For each implementation, adders of lengths
varying from 8 bits to 64 bits have been investigated for energy, delay and EDP over all
process corners, as well as for robustness against random process and power supply
variations. Moreover, a thorough study of the power supply sensitivity of these adders
has also been presented. Comparative simulation results reveal that the full D3L adder
features the best energy-delay trade-off among all the considered implementations at
the different process corners, while showing a roughly similar sensitivity to random
process and power supply variations with respect to the other designed adder circuits.

References

1. M. Alioto, G. Palumbo, Impact of Supply Voltage Variations on Full Adder Delay: Analysis and
Comparison. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 14(12), 1322–1335 (2006)
2. A.A. Amin, Area-efficient high-speed carry chain. Electron. Lett. 43(23), 1258–1260 (2007)
3. A. Chandrakasan, W. Bowhill, F. Fox, Design of High Performance Microprocessor Circuits (IEEE
Press, New York, 2001)
4. R. De Rose, M. Lanuzza, F. Frustaci, Design and Evaluation of High-Speed Energy-Aware Carry Skip
Adders, in Proc. of IEEE 22nd International Conference on Microelectronics (2010), pp. 124–127
5. H. Eriksson, P. Larsson-Edefors, A. Alvandpour, A 2.8 ns 30 mW/MHz area-efficient 32-b Manchester
carry-bypass adder, in Proc. of ISCAS 2001 (2001), pp. 84–87
6. F. Frustaci, M. Lanuzza, P. Zicari, S. Perri, P. Corsonello, Low-Power Split-Path Data-Driven Dynamic
Logic (SPD3L). IET Circuits Devices Syst. 3(6), 303–312 (2009)
7. F. Frustaci, M. Lanuzza, P. Zicari, S. Perri, P. Corsonello, Designing High Speed Adders in Power-
Constrained Environments. IEEE Trans. Circuits Syst. II 56(2), 172–176 (2009)
8. S. Hauck, M. Hosler, T.W. Fry, High-performance carry chains for FPGA’s. IEEE Trans. Very Large
Scale Integr. (VLSI) Syst. 8(2), 138–147 (2000)
9. P. Hofstee et al., 1 GHz single-issue 64b PowerPC processor, in Proc. of IEEE Int. Solid-State Circuits
Conf. (2000), pp. 92–93
10. H. Kawaguchi, T. Sakurai, A reduced clock-swing flip-flop (RCSFF) for 63 % power reduction. IEEE
J. Solid-State Circuits 33(5), 807–811 (1998)
11. M. Lanuzza, R. De Rose, F. Frustaci, S. Perri, P. Corsonello, Comparative analysis of yield optimized
pulsed flip-flops. Microelectron. Reliab. 52, 1679–1689 (2012)
12. J.H. Lou, J.B. Kuo, A 1.5 V bootstrapped pass-transistor-based Manchester carry chain circuit suitable
for implementing low-voltage carry look-ahead adders. IEEE Trans. Circuits Syst. I, Fundam. Theory
Appl. 45(11), 1191–1194 (1998)
13. B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs (Oxford University Press,
London, 2000)
14. S. Purohit, M. Lanuzza, S. Perri, P. Corsonello, M. Margala, Design and evaluation of an energy-delay-
area efficient datapath for coarse-grain reconfigurable computing systems. J. Low Power Electron.
5(3), 326–338 (2009)
15. S. Purohit, M. Lanuzza, M. Margala, New Performance/Power/Area Efficient Reliable Full Adder
Design, in Proc. of the ACM Great Lakes Symposium on VLSI, GLSVLSI (2009), pp. 493–498
16. S. Purohit, M. Lanuzza, M. Margala, Design Space Exploration of Split-Path Data Driven Dynamic
Full Adder. J. Low Power Electron. 6(4), 469–481 (2010)
17. J.M. Rabaey, A. Chandrakasan, B. Nikolic, Digital Integrated Circuits (Prentice-Hall, New York, 2002)
18. R. Rafati, S.M. Fakhraie, K.C. Smith, Lower-Power Data-Driven Dynamic Logic (D3L), in Proc. of
IEEE International Symposium on Circuits and Systems, ISCAS 2000 (2000), pp. 752–755
19. T. Sakurai, A.R. Newton, Alpha-power law MOSFET model and its applications to CMOS inverter
delay and other formulas. IEEE J. Solid-State Circuits 25, 584–594 (1990)
20. R. Shalem, E. John, L.K. John, A novel low power energy recovery full adder cell, in Proc. of the 9th
Great Lakes Symposium on VLSI (1999), pp. 380–383
21. I. Sutherland, R. Sproull, D. Harris, Logical Effort (Morgan Kaufmann, San Mateo, 1999)
22. N. Weste, K. Eshraghian, Principles of CMOS VLSI Design (Addison-Wesley, Reading, 1993)
23. S.S. Yoon, S.R. Yoon, S.W. Kim, C. Kim, Charge-Sharing-Problem Reduced Split-Path Domino
Logic, in Proc. of VLSI Design (2004), pp. 201–205
24. R. Zlatanovici, B. Nikolic, Power-Performance Optimization for Custom Digital Circuits, in Proc. of
PATMOS (2005), pp. 404–414
