
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 13, NO. 4, APRIL 2005

Self-Reset Logic for Fast Arithmetic Applications


Miguel E. Litvin and Samiha Mourad, Senior Member, IEEE
Abstract: A new family of self-reset logic (SRL) cells is presented in this paper. The single-ended basic structure proposed realizes an incomplete logic family, since it is incapable of inverting logic. Thus, a dual-rail SRL (DRSRL) implementation is also proposed. These cells maintain small delay variations for all input combinations, once minimum timing requirements on inputs are satisfied, and produce output pulses of fairly constant width for varying fanout, leaving enough headroom in the design to accommodate process, supply voltage, and temperature variations. These properties simplify the implementation of data-path and control circuits where the logic depth does not affect the stage output pulse width, eliminating the need for pulse-width controlling circuits required in previous works on SRL. In SRL, power is consumed only if new data are pumped through the logic. The clock grid is limited to the registers that launch and receive the signal path. The clocking overhead is thus reduced, compared with other dynamic designs, and it is especially suitable for wave pipelining. Case study examples and simulated characterization data are included to show the design methodology.

Index Terms: Asynchronous circuits, dual-rail logic, dynamic logic, pulsed logic, self-reset logic (SRL), self-timed circuits.

The remainder of this paper is organized as follows. Section II introduces general concepts on SRL. Section III describes the new cells. Section IV introduces timing parameters and basic timing analysis. Section V presents characterization data for a set of gates implemented in 0.18-µm CMOS technology. Section VI describes an application implemented with this technology, an n-bit Carry Propagate Adder. Section VII discusses a comparison with a Domino Logic implementation of the same design presented in the previous section. The final section presents the conclusion and proposed future work.

II. SELF-RESET LOGIC

Self-resetting logic blocks have been reported in the context of RAM designs with very short cycle times [4]-[7]. A self-resetting logic block can be described as a subset of pulse mode circuits. It is also referred to as postcharge logic because, when it receives an input pulse, it may discharge an internal node, called the summing node, and propagate the signal forward, later recharging the summing node. A mechanism generally triggered by the leading edge of the output data signal generates a reset signal to restore charge to the input stage, resetting it to its original state, and thus making it ready to receive the next pulse. Additionally, this resetting signal will reset the output to its inactive state, producing the trailing edge of the output pulse, which will eventually deactivate the reset signal. The reset signal may come from the previous stage [8], it can be locally generated as a result of the leading edge of the local output [6], or it can be fed back from a later stage. In all previously reported cases, the width of the output pulse varies as data traverse successive stages, and additional circuits have been proposed to narrow or elongate such pulses. This represents an added challenge to using SRL in circuits requiring logic depth, as in a data path or a wave-pipelining implementation.
In a pipeline structure built with these logic blocks, there is no separate distribution of a clock. Each stage communicates with the next by sending a data pulse to the following stage in the pipeline. Instead of propagating a single transition at a time through the circuit, a fully formed pulse is propagated. Each block in the pipeline is designed to respond quickly to the leading edge of the pulse it receives. This increases the operating speed while simultaneously relaxing the matching constraints on convergent paths. A generic view of a self-reset logic gate is shown in Fig. 1 [9]. This figure introduces the basic concepts involved in SRL cells. There is a subblock where the logic function performed by the gate is implemented. Here it is represented by the FN block, which receives input data pulses. The output of the gate provides a pulse if the logic function becomes TRUE. The reset signal is implemented as two separate pulses: RL (active Low) and RH (active High). RL is used to reset the input stage, while

I. INTRODUCTION

IN TODAY'S fast processing environment, the use of dynamic circuits is becoming increasingly popular [1], and a clock distribution grid is necessary to serve not only registers, but also each individual dynamic gate. Self-reset logic (SRL) provides a design solution where the clocking overhead is minimized, and shrinking the clock grid reduces both area and power consumption. In an SRL implementation, power is consumed only when new data are received. In between sets of data, some power consumption occurs due to leakage on postcharged nodes. Since SRL works with pulses, extra effort must be devoted to maintaining signal integrity, which implies constraints on the characteristics of the pulses. Previous works in SRL have resorted to special circuits to elongate or shorten pulses at different stages, adding extra delay and further complicating the design process. We propose a new family of SRL gates, which maintain a fairly constant output pulse width under a wide range of process, voltage, and temperature (PVT) conditions. The proposed gates disable their inputs once new data have been captured, re-enabling the inputs only after a fully formed pulse has been generated at the output. The basic SRL gate does not generate inverted outputs, a property referred to as monotonicity [2], [3]. To overcome this limitation, we turn to dual-rail logic, hence the name we assign to this family of gates, Dual-Rail Self-Reset Logic with Input Disable (DRSRL-ID).

Manuscript received July 2, 2003; revised August 13, 2004. The authors are with the Department of Electrical Engineering, Santa Clara University, Santa Clara, CA 95053 USA (e-mail: smourad@scu.edu). Digital Object Identifier 10.1109/TVLSI.2004.842921

1063-8210/$20.00 © 2005 IEEE

LITVIN AND MOURAD: SELF-RESET LOGIC FOR FAST ARITHMETIC APPLICATIONS


Fig. 1. Generic SRL cell and timing diagram.

RH is used to reset the output stage after the output data has been propagated.

The basic operation of an SRL gate is as follows. The gate is initially in its standby state, where power consumption is minimal (or zero). All inputs and output(s) are stable, i.e., at logic level Low. Upon receiving input data, some switching occurs, and, if the evaluated function becomes TRUE, an output pulse is generated. Formally, the leading edges of inputs propagate through the gate, generating a fast leading edge of the output pulse. After the output pulse has reached a defined width, and provided that the inputs become inactive, the gate will be reset, going back to its standby state.

Depending on how the Reset signals are generated and used, one can distinguish three groups or families of SRL gates: 1) one family where the reset signals are generated and used locally; 2) a second family where the reset signals come from other cells; and 3) a third family that has elements in common with the previous two, where each cell receives data inputs and a reset signal and propagates forward an output signal and a reset pulse [8]. In the remainder of this paper, we will restrict the discussion to SRL gates of the first family, where the Reset signal is generated and used locally.

III. NEW SRL FAMILIES

A. Single-Rail SRL Gates With Input Disable (SRSRL-ID)

There are difficulties associated with the reset signal depending on the inputs becoming inactive and the output pulse being generated [6], [10]. In those implementations, the internal reset cannot be activated until after the input pulses have vanished, thus affecting the cycle time, elongating the reset state, and affecting the output pulse width. As shown in [7], one can resort to additional circuitry to chop the output pulse. To overcome those difficulties, we add an input disable capability. Fig. 2 shows an implementation of a single-rail SRL gate with input disable (SRL-ID). The circuit is a simple buffer and the

Fig. 2. Basic SRL gate with input disable and timing diagram.

functional block FN consists of transistor NM1. To provide the input disable function, an extra nMOS transistor is added in series with the FN block. This transistor is identified as Nme in the figure. While the cell is in Standby, its output is Low and the summing node outn is maintained High, thus the signal rst1 is High (inactive reset) and the extra nMOS device Nme is ON. The cell is ready to sample the input when such input becomes active. Since the cell shown is a simple buffer, the input signal must become High to provide a path to ground in order to discharge the internal summing node outn. As outn starts to discharge, the nMOS device (Nma) connected to node rst1n becomes less conductive, and the pMOS device above it (Pma) begins to turn ON. Shortly thereafter, rst1 will switch to Low, effectively disabling input readout. The discharging outn node causes the inverter output to rise, generating the leading edge of the output pulse. This signal is fed back through an inverter, which will turn on the pMOS device in charge of pulling up the summing node, back to its standby situation. As outn recharges, its voltage will switch the output inverter made up of transistors Pmout and Nmout, and the output will finally go Low. This last event will eventually deactivate the reset signal, enabling input readout again. Transistors PMk1 and PMk2 are small pull-up devices. They have been added to compensate for leakage and to maintain the charge at the summing node outn during the time the circuit does not receive pulses. The output pulse width w is a function of the timing parameters of the output stage and the feedback that controls the reset signal, but it is fairly independent of the implementation of FN,


Fig. 3. Dual-rail SRL buffer gate with input disable.

Fig. 4. Dual-rail SRL OR/NOR gate with input disable.

the functional block. A family of gates sharing very close values of w has been developed.

The implementation of these gates in noninverting (monotonic) logic has its limitation, since it does not provide a complete logic family. It could be argued that an additional inverter would produce the inverted signal, but then we would have some outputs at a constant High during the standby state. We seek a configuration where all inactive outputs will be at the same logic level (i.e., LOW). At the same time, we impose the restriction that an output will be active only dynamically, that is, for the duration of the output pulse w. Since our objective is to use SRL in wave-pipelining circuits, if an inversion is needed, the inverted signal has to propagate from the previous register through the random logic block. Alternatives to solve the monotonicity problem include: 1) changing the polarity of the active pulse at every other stage or 2) using a dual-rail logic implementation. We have chosen the dual-rail implementation since it provides a safe environment, which is independent of the physical location of a particular gate in the succession of gates in the combinational block.

B. Dual-Rail SRL Gates With Input Disable (DRSRL-ID)

The cell shown in Fig. 3 is a dual-rail implementation of a buffer/inverter gate. The logical functions FN and FNb (inside the highlighted box) are implemented by the nMOS transistors receiving signals a and an, respectively. Like the basic cell shown in the previous case, this one also features the input disable function, but in this case the transistor Nme, enabling the input readout, has been placed close to the ground rail. Its function can also be labeled evaluation enable because a path to ground from the summing nodes SUMn or SUM can only exist when it is ON. Placing the evaluation transistor near ground in an nMOS tree function implementation also has the advantage that the evaluation of both functions FN and FNb is enabled/disabled with just one device.
These two functions are mutually exclusive. In the case shown, if FN becomes true, then net SUMn will be discharged and, as the voltage at this net goes to 0, a rising edge will occur at the output labeled buf. If instead input an receives a pulse, making FNb active, then net SUM will discharge, generating a pulse at output inv. If neither input a nor input an receives

Fig. 5. Dual-rail SRL XOR/XNOR gate with input disable.

a pulse, both summing nets SUM and SUMn will remain HIGH and no pulse will be generated at the outputs; both nets buf and inv will remain LOW. The mechanism of self-reset in this circuit is similar to the one described for the single-rail gate of Fig. 2, but it has been somewhat simplified in that the same signal used to initiate the post-charge of nets SUM and SUMn is used to control the evaluation transistor.

Fig. 4 depicts the implementation of an OR/NOR gate. Since these gates generate dual outputs, the same circuit can be used to perform the AND/NAND or OR/NOR functions. In agreement with De Morgan's law, an OR gate with inverted input signals behaves as a NAND on the noninverted signals. Nevertheless, it should be noted that the logic functionality refers to operations on pulses at inputs, and, if no pulses are present at the inputs, the outputs will remain at logic LOW state.

Fig. 5 shows an XOR/XNOR gate. In this case, the use of shared elements between the FN and FNb blocks minimizes the number of devices needed. This approach is especially useful when implementing more complex gates. Additionally, in this case, we show the implementation of the self-reset without using the extra inverters in the feedback loop,


IV. TIMING PARAMETERS FOR DRSRL-ID

We use the waveforms in Fig. 7 to illustrate the definitions of all cell timing parameters. In this figure, D and Dn represent the input signal and its inverse. Relevant time points have been defined for the case where the output Y becomes active. An equivalent set of time points applies for the case when output Yn generates a pulse. The diagram has been simplified by showing the case of a buffer/inverter gate, but the meaning of the timing parameters would be the same for any gate of the family.

A. Parameter Definitions

All timing intervals are measured between points where the signal crosses 1/2 VDD in its excursion.

Input pulse width.
Capture time: elapsed time from the latest input arrival that validates FN to the High-to-Low transition of the summing node SUMn.
Width of the negative pulse at the summing node SUMn.
Delay from the leading edge of the summing node SUMn pulse to the falling edge of the reset signal pulse_rst.
Time elapsed from pulse_rst becoming active (Low) to the SUMn rising edge.
Width of the reset pulse (negative pulse of signal pulse_rst).
Data delay forward (t_df): the time from the leading edge of the input data transition that validates FN to the leading edge of the pulse at the output.
Delay from the leading (rising) edge of output Y to the leading (falling) edge of the reset signal pulse_rst (time from activation of the output node to the start of the reset or post-charge state of the gate).
Delay from the leading edge of the summing node SUMn to the leading edge of output Y.
Width of the output pulse (w).
Time elapsed from pulse_rst becoming active (Low) to the falling edge of output Y.
Recovery time: time elapsed from the trailing edge of the output pulse to the trailing edge of the reset pulse.

Once the input readout is disabled, further input changes are ignored and we say that the gate enters a blind state where the output pulse depends on the gate's internal timing, not on the width of the incoming pulses. The output pulse width w depends on the feedback section that generates the reset signal and is independent of the complexity of the FN/FNb blocks. A complete family of these gates has been generated.

As defined above, a signal must be present at the gate input for at least the capture time before it has any effect. In the case of an n-input AND gate, all input pulses must overlap for at least the capture time for the gate to recognize the all-ones input

(1)

Fig. 6. Simulation waveforms showing proper logic behavior of basic DRSRL-ID gates.

similar to [11] and [12]. In the present case, as we actually use the internal reset signal pulse_rst to disable input readout as well as to control the postcharge of the summing nodes, we can safely adjust the width of the resetting pulse without being affected by the switching activity at the gate inputs. Implementing the NOR gate that senses the outputs in skewed logic (weaker) and at the same time reducing the size of the pMOS devices PM1 and PM2 increases the width of the resetting pulse and, consequently, the width of the output pulse w, without affecting the forward propagation of data (t_df). These changes reduce the total area of the gate. The output inverter on each of the dual outputs favors the Low-to-High switching, thus generating output pulses with short rise time.

The waveforms in Fig. 6 show the result of SPICE simulation of a set of these gates, implemented in a 0.18-µm CMOS process, running at a 2.5-GHz data rate, with V. The waveforms are presented in stacked mode. The first four waveforms from the top correspond to input signals da and db and their inverses dan and dbn, followed by the outputs of a buffer/inverter, buf1_out and buf1_inv, followed by the direct and inverse outputs of the basic two-input gates (AND/NAND, OR/NOR, and XOR/XNOR). The proper logic behavior of a buffer/inv and the two-input basic DRSRL-ID gates is shown.

The remainder of this paper is devoted to the theory of operation and application of these gates. Any reference to self-reset logic in the following applies to DRSRL-ID. We want to emphasize, nevertheless, that for an application where inversions are not needed, such as a simple decoder, one can use a single-rail implementation, realizing some area savings, especially in wiring. To analyze the dynamic behavior of these gates, a few timing parameters are introduced in the following section.
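The logical behavior of the dual-rail gates of Figs. 3-5 can be sketched at the pulse level as follows. This is a minimal behavioral model, not a circuit description: each signal is represented as a pair of rails, where a 1 means that a pulse is present on that rail and (0, 0) means that no data are present. The names Rail, encode, swap, dr_buf, dr_or, dr_and, and dr_xor are hypothetical and are introduced only for illustration.

```python
from typing import Tuple

# A dual-rail signal: (true_rail, complement_rail).
# 1 = a pulse is present on that rail, (0, 0) = standby, no data.
Rail = Tuple[int, int]

NO_DATA: Rail = (0, 0)


def encode(bit: int) -> Rail:
    """Encode a logic value as a pulse on exactly one of the two rails."""
    return (1, 0) if bit else (0, 1)


def swap(x: Rail) -> Rail:
    """Exchange the rails, which is how inversion is obtained in dual-rail logic."""
    return (x[1], x[0])


def dr_buf(a: Rail) -> Rail:
    """Buffer/inverter: buf pulses with the true rail, inv with the complement rail."""
    return a


def dr_or(a: Rail, b: Rail) -> Rail:
    """OR/NOR: the OR rail pulses if either true rail pulses,
    the NOR rail pulses only if both complement rails pulse."""
    return (a[0] | b[0], a[1] & b[1])


def dr_and(a: Rail, b: Rail) -> Rail:
    """AND/NAND obtained from the OR/NOR structure by swapping rails (De Morgan)."""
    return swap(dr_or(swap(a), swap(b)))


def dr_xor(a: Rail, b: Rail) -> Rail:
    """XOR/XNOR: inputs differ -> XOR rail pulses, inputs agree -> XNOR rail pulses."""
    t = (a[0] & b[1]) | (a[1] & b[0])
    f = (a[0] & b[0]) | (a[1] & b[1])
    return (t, f)
```

In this model, a gate that receives no pulses produces no pulses, which mirrors the requirement that all inactive outputs stay at logic LOW.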


Fig. 7. Timing parameters.

Fig. 8. Conventional pipelining circuit using edge-triggered FFs.

However, we want to see the output pulse being formed, hence

(2)

In a combinational block implemented with these gates, signals arriving at a particular stage may have different arrival times as measured from primary inputs of the circuit. If a certain combination of inputs validates the logic function of the gate, while the input pulses satisfy the above constraint on the input pulse width, then a fully formed pulse will be generated at the output. The output pulse width w is a function of the timing characteristics of the current stage (e.g., transistor sizes and internal feedback loop) and independent of the width of input pulses, because the stage will ignore the noninitial inputs until after a fully formed pulse has been generated at the output. There is some influence of the stage loading on the characteristics of the output pulse, but once fired the output pulse is independent of any variations at the inputs. This behavior is a self-retiming of valid signals at the current stage and will greatly simplify the design methodology for fast designs.
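The self-retiming behavior described above can be illustrated with a small behavioral sketch. It assumes a simplified gate characterized by only three numbers: a capture time, a delay forward t_df, and an output pulse width w. The class GateTiming, the function output_pulse, and the numeric values used in the usage note are hypothetical and do not come from the characterization data.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class GateTiming:
    t_capture: float  # minimum time the validating input must be present (ps)
    t_df: float       # delay forward, input leading edge to output leading edge (ps)
    w: float          # output pulse width set by the internal feedback loop (ps)


def output_pulse(t_in: float, w_in: float, g: GateTiming) -> Optional[Tuple[float, float]]:
    """Return (start, end) of the output pulse for an input pulse that starts
    at t_in and lasts w_in, or None if the input is too short to be captured.

    Once the input is captured the gate is blind to its inputs, so the output
    width depends only on the gate (g.w), not on w_in.
    """
    if w_in < g.t_capture:
        return None
    start = t_in + g.t_df
    return (start, start + g.w)
```

For example, with the assumed values GateTiming(t_capture=40, t_df=60, w=200), input pulses of 100 ps and 300 ps both yield an output pulse from 60 ps to 260 ps, while a 30-ps input yields no output.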

We define the cycle time as the time elapsed from the moment an input combination that validates the gate function is received until the gate recovers from the reset condition and is therefore ready to evaluate inputs again. According to the diagrams in Fig. 6, the cycle time can be expressed in terms of the previously defined parameters in either of the following forms:

(3) (4)

B. Basic Timing Analysis

In a conventional pipeline, there is a succession of combinational stages separated by storage elements, i.e., edge-triggered flip-flops (FFs). In such an arrangement, there will be at most one set of related data traveling at a time through each combinational block (a single data wave). The basic architecture is shown in Fig. 8. The block receives an n-bit word, which is sampled at the receiving register. To increase throughput, a task is divided into subtasks, implemented in combinational logic blocks. Initially, let us assume


that the combinational blocks are implemented with conventional static CMOS gates. At each clock cycle, the intermediate results are stored in a register, and new data are pumped into the pipeline. After as many clock cycles as there are subtasks, the results corresponding to the data sampled that many cycles before reach the output register. After that, at each clock cycle, new data are presented at the output. Since each subtask must complete within one clock cycle, the subtask that takes longest to evaluate defines the minimum clock cycle. To evaluate this, let us consider the longest delay among all combinational blocks. One also needs to consider the clocking overhead resulting from the use of registers and the uncontrollable clock skew.

The three basic timing parameters associated with registers are the clock-to-output delay, the setup time, and the hold time. Inequality (5) gives the lower bound on the clock period. When considering two successive sets of data, the current set must be held for at least the hold time plus the clock skew, as indicated on the left-hand side of inequality (6). The right-hand side of this inequality represents the time of the earliest arrival bits of the following data set, which arrives one clock period later, going through the shortest path. Inequality (6) represents the constraint at the output register, which is necessary to avoid interference between successive data sets, and (7) is derived from it. Note that the left-hand side of (7) represents the shortest combinational path among all stages in the pipeline. In general, this condition is easily met in conventional pipelining, and the constraint on the output register prevails

(5) (6) (7)

The question now is what would be different if the combinational blocks are implemented using DRSRL-ID.
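Before turning to that question, the conventional constraints can be sketched numerically. The check below uses the common textbook form of the setup and hold constraints for an edge-triggered pipeline, which is an assumption here and is not claimed to be the exact form of inequalities (5)-(7). All function names, parameter names, and numeric values are hypothetical.

```python
def min_clock_period(t_cq: float, d_max: float, t_su: float, t_skew: float) -> float:
    """Textbook setup bound: the clock period must cover the register
    clock-to-output delay, the longest combinational delay, the setup time
    of the receiving register, and the clock skew."""
    return t_cq + d_max + t_su + t_skew


def hold_ok(t_cq: float, d_min: float, t_hold: float, t_skew: float) -> bool:
    """Textbook hold check: the earliest new data, launched through the
    shortest path, must not arrive before the hold time plus the skew
    has elapsed at the receiving register."""
    return t_cq + d_min >= t_hold + t_skew


if __name__ == "__main__":
    # Illustrative values in ps (assumed, not taken from the paper).
    period = min_clock_period(t_cq=80, d_max=600, t_su=50, t_skew=30)
    print("minimum clock period (ps):", period)
    print("hold constraint met:", hold_ok(t_cq=80, d_min=40, t_hold=60, t_skew=30))
```

The first function bounds the clock period from below through the longest path, while the second shows why the shortest path also has to be controlled.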
To evaluate the longest block delay, we need to consider the longest time it takes to transmit a signal from the input of each combinational block (output of the input register) to the output of the combinational block (data input of the corresponding output register). We have shown that all gates of our DRSRL-ID library have a fast delay forward t_df and that all of them share very close values of this parameter. However, for this calculation, let us consider the worst-case t_df and the signal that, within the block, traverses the largest number of gates. Then

(8)

Here, the gate count is the maximum number of gates a signal goes through among all combinational blocks in the pipe. As in the static CMOS case, wiring delays have to be added here, even though that was not explicitly shown.

The constraint at the output register requires that the combinational output remains constant for enough time to be properly sampled by the output register. In the static CMOS case, the assumption was that the input register, maintaining its values in between clock active edges, would provide enough storage time, and thus the constraint on the longest delay. It is also worth noting that the shortest path delay needs to be controlled, as stated by (7). So there

Fig. 9. Output pulse width w for different loading conditions for the sample set of DRSRL-ID gates.

is a reasonable upper bound on the maximum time difference between early and late arrival signals. This contention is also present in an implementation with DRSRL-ID, and the fact is that there is an extra built-in allowance here, since

(9)

The maximum time difference between early and late arrival signals is thus bounded. So, even though path equalization is only mandatory for Wave Pipelining and not for the conventional case, it is a recommended way to go. In doing a rough equalization, we guarantee proper operation by construction. As per the constraint at the output register, the output pulse width must satisfy

(10)

Additionally, the minimum clock period is bounded by the gate's cycle time, since only after one cycle time from the last valid input can a DRSRL-ID gate sample new data

(11)

Section V shows results that confirm the timing concepts described in this section.

V. EXPERIMENTS AND RESULTS

A. Circuit Characterization

A library of DRSRL-ID cells has been defined and simulated under various loading conditions and temperature and power supply values, and, to account for fabrication process variations, different transistor models were used. The different parameters defined in the previous section were measured. Figs. 9-11 show simulation results for a subset of these gates consisting of a buffer/inv and the following two-input gates: AND/NAND, OR/NOR, and XOR/XNOR. These gates were implemented in a 0.18-µm CMOS process, and simulations were performed at 2.5 GHz and under nominal values for transistor models (process), supply voltage (1.2 V), and temperature (27 °C).

The results in Fig. 9 show the variation of the width of the output pulse w with the load for each of the eight gates. The inputs for the gates under test come from similar gates, which in turn had received their inputs from signal generators. In this


Fig. 10. Variation of the output pulse width w for different PVT conditions for the sample set of DRSRL-ID gates.

Fig. 11. Variation of the delay forward t_df for different PVT conditions for the sample set of DRSRL-ID gates.

way, we provided a more realistic excitation, with input signals with measurable rise and fall times and rounded edges. It can be observed that w increases with loading, but it does not vary significantly for low values of loading. All gates in the sample set show very close values of w. At eight buffer loads, where the weight of the loading is more significant, the width of the output pulse becomes essentially the same for all gate types.

Using a Design of Experiments (DOE) approach, we studied the sensitivity of the designs with respect to PVT conditions. For this factorial DOE, we considered two values for power supply (1.0 and 1.4 V), two types of transistors (fast and slow, total variation ), and two levels of temperature (high and low, 105 °C and 0 °C). To this set of PVT conditions, we added two typical cases which account for typical nMOS/pMOS transistors, one using nominal values of voltage and temperature and the other using the corresponding low values. For all gates under test, a four-buffer load was used.

The results of the DOE indicate, as illustrated in Figs. 10 and 11, how the output pulse width w and the delay forward t_df depend on PVT conditions. In these figures, a four-letter acronym is used to identify each PVT corner. The first two letters refer to the transistor models, pMOS and nMOS, identifying them as slow, fast, or typical. The next two letters refer to supply

voltage and temperature, identifying them as high, low, or nominal. It can be observed that w and t_df show little variation for a wide spread of PVT conditions, except for the extreme cases: 1) fast transistors at high VDD and low temperature or 2) slow transistors at low VDD and high temperature.

It is also important to note that w is larger than t_df. This is by design, since we want the gates to propagate data signals forward as fast as possible (small t_df value), while the output pulse is made to last long enough to guarantee proper data sampling by the next stage in the logic path, accounting for differences in arrival times of input signals, differences in individual t_df values, and skew of the clock at launching and receiving registers at both ends of a combinational block implemented with these gates. The gates shown have not been optimized for t_df, but had we equalized t_df values, this would have also resulted in closer values of w. Nevertheless, a large relative delta in t_df translates into a much smaller relative variation in w.

B. Logic Depth

The purpose of this experiment was to verify that the output pulse width w is maintained at a constant value that is independent of the number of stages the signal traverses. A long chain of buffers connected as shown in Fig. 12 was used. All buffers


Fig. 12. Buffer chain.

are identical in size, but the load at different stages varied from one to four other buffers. For this simulation, typical transistor models were used for the same 0.18-µm CMOS process, with power supply V and temperature C. The frequency of the input waveform was 2.5 GHz and, under these conditions, the gates have a nominal value for w (in ps). The input pulse width to the first stage was varied in the interval

(12)

In inequality (12), the left-hand side guarantees that the first gate will enter the blind state (i.e., will stop reading the input) before the input pulse vanishes. The right-hand side of the inequality is the minimum cycle time required for the gate to process the input pulse, to generate a fully formed output pulse, and to recover. By recovering, we mean that the internal reset of the gate will become inactive again, enabling the input readout.

After the first gate in the series captures the first pulse, a second pulse could start while the gate is still in the blind state. For this second pulse to propagate through the following stages as a pulse of constant width w, the second input pulse must last long enough after the gate comes out of the blind state produced by the first pulse. This guarantees that the gate has sampled the second pulse, the leading edge of the second output pulse has occurred, and the internal reset has been activated, disabling input readout.

If the input pulse width exceeds the upper bound in (12), then an input pulse generates more than one output pulse. In the limit, if the input to the first gate is maintained constant at Logic Level High, its output will generate pulses of width w each cycle time. These pulses will propagate through the logic while maintaining w. The period of this repetitive event will be equal to the Buffer_cycle_time. This retriggering makes the buffer behave like a monostable circuit. A retriggering condition could occur with any of the gates implemented in this technology as long as either Fn or Fnb is TRUE with the inputs held High.
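The retriggering behavior can be sketched with a simplified event model. The model assumes that a gate fires whenever its input has been High for at least the capture time while the gate is not blind, and that the gate becomes ready again one cycle time after each capture. This is an illustrative assumption and not the exact condition of inequality (12). The function name and all numeric values are hypothetical.

```python
from typing import List


def output_pulse_starts(w_in: int, t_capture: int, t_df: int, t_cycle: int) -> List[int]:
    """Return the start times (ps) of the output pulses produced by a single
    input pulse that goes High at t = 0 and stays High for w_in.

    Simplified model: the gate samples its input at t = 0 and again each time
    it leaves the blind state, i.e., every t_cycle. It fires if the input is
    still High for at least t_capture from that sampling point.
    """
    starts = []
    t = 0  # time at which the gate is ready to read its input
    while t + t_capture <= w_in:
        starts.append(t + t_df)  # leading edge of the output pulse
        t += t_cycle             # gate is blind until it recovers
    return starts
```

With the assumed values t_capture = 40, t_df = 60, and t_cycle = 400, a 100-ps input produces one output pulse starting at 60 ps, while a 1000-ps input produces pulses starting at 60, 460, and 860 ps, one per cycle time, as in the monostable-like behavior described above.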
The same retriggering effect would take place on domino gates if, instead of pulses at the inputs, a combination of inputs that validates the output were maintained constant across successive clock cycles. The domino gate will respond with a new output pulse per clock cycle. The train of output pulses in DRSRL-ID gates will occur according to their own cycle time. Simulation results are shown in Fig. 13. The input to the first stage is an ideal pulse with fast rise and fall times and perfectly flat top at V, and at off state at VSS. The lower part

of the figure shows the response of the first two stages, out0 and out1, to successive pulses. It is evident that if there is no input pulse applied, then the output remains low at its nominal value. Also, the pulse width w is maintained constant. Maximum variation measured at these nodes: less than 2 ps (see the comment on simulation resolution below). The top part of the figure shows, in an overlay mode, the following signals: input, out0 (output of the first stage), some signals down the path (out4, out8, out16, out24, and out30), and the output of the last stage, out31. It can also be observed that the shape of the traveling pulses is fairly constant, with very little overshoot and undershoot. Only out31, the last stage, shows slower rise and fall times. This is because the lumped capacitive load on this stage represents a load greater than 12 buffer loads. For example, the difference between w at stage 8 and stage 30 is only about 2 ps, which amounts to about 1% of the ideal value of w. The SPICE simulation time step was 10 ps, thus most of the variation reported gets assimilated within the margin of error of the simulation. We can thus conclude that w is maintained across the successive stages, varying less than 5% from the nominal value.

VI. APPLICATION: A 16-B PARALLEL ADDER

We use the Carry Look Ahead structure proposed by [13], with a slight modification to control the fanout and the loading at critical points. The global organization is shown in Fig. 14, where only the combinational block that performs the addition is depicted. This is a Carry Propagate Adder (CPA). This block by itself is essentially asynchronous. In order to make it work in a synchronous environment, an input register and an output register are added. The combinational block is composed of three stages: a propagate/generate block (PG generator) shown on top, followed by a parallel carry generator, and a sum generator.
The inputs A and B to the circuit are provided through dynamic FFs [14], which convert static input signals into the pulses SRL uses. The maximum clock frequency depends on the cycle time of the gates, while controlling the maximum difference in arrival times of signals at any given stage. The first stage forms the propagate term and the generate term for each bit according to (13) and (14).


Fig. 13. Buffer chain simulation waveforms.

Fig. 14. CPA block diagram.

Fig. 15. CPA-PG generator pg cell.

The basic cell used in the PG generator is shown in Fig. 15. The resulting propagate and generate bits are the inputs to the next stage, the parallel carry generator. The delay through all paths is, by design, approximately the same. Carries are computed from the recursive equations (15)-(17).

Sums are computed from (18) and (19).
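The three stages described above (propagate/generate, carry, and sum) can be sketched as a bit-level Python model. This is only a minimal illustration: it assumes the standard textbook definitions p_i = a_i XOR b_i, g_i = a_i AND b_i, and c_i = g_i OR (p_i AND c_(i-1)) with a zero carry-in, which may differ in detail from (13)-(19). The name `cla_add` is hypothetical, and the carries are evaluated sequentially here, whereas the hardware generates them in parallel.

```python
def cla_add(a: int, b: int, width: int = 16) -> int:
    """Bit-level model of a carry-propagate adder (assumed textbook form)."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask

    # Stage 1 (PG generator): per-bit propagate and generate terms.
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]

    # Stage 2 (carry generator): c[i] is the carry out of bit i.
    # A carry-in of 0 is assumed; the recurrence is unrolled in a loop here.
    c = []
    carry = 0
    for i in range(width):
        carry = g[i] | (p[i] & carry)
        c.append(carry)

    # Stage 3 (sum generator): XOR at the middle bits; the least and most
    # significant result bits are plain copies (buffers) of p[0] and c[width-1].
    s = [p[0]] + [p[i] ^ c[i - 1] for i in range(1, width)] + [c[width - 1]]

    # Pack the width+1 result bits back into an integer.
    return sum(bit << i for i, bit in enumerate(s))
```

With these assumptions the result has width + 1 bits, and the buffer-only end bits mirror the description of the final stage, where XOR gates are used at all bits except the least and most significant ones.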

The structure of the Carry Generator block is shown in Fig. 16. This block uses two types of cells: the pg-cg cell detailed in Fig. 17 and a delay cell made up of two buffers in series. While some signals go through a pg-cg cell, others go through delay elements to equalize path delays as closely as possible. Since the fanout of the pg-cg cells varies widely between the


Fig. 16. Detail of the carry generator block.

Fig. 17. CPA. Typical pg-cg cell used in the carry generate block.

lower and upper bits, an element is repeated to divide down the total loading of that stage. The final stage of the adder uses XOR gates at all bits but the least and the most significant bits, where buffers are used. The outputs of the final stage are captured by the output register, which is implemented by a set of edge-triggered FFs. The output register must sample the final stage of the adder while the result pulses are available. For this wave-pipelining application, all paths have been equalized by using rough padding, that is, adding buffers to the shorter paths to get the same number of stages in all cases [15]. As in any other wave-pipelining implementation, it is important to make the delays through the longest and shortest paths through the logic as close as possible, since the maximum operating frequency is a function of that difference. This is done by tightly controlling the difference in arrival times of all signals connected to a certain stage. This implies not only the delay equalization of the different elements of logic at a given stage, but also the delay equalization

of the interconnects between successive stages and control of the total loading of a given intermediate driver. A careful layout plan is important, but in the present design there is a certain tolerance for differences in arrival times. This holds as long as one can guarantee that all valid input pulses at a given stage will overlap long enough to generate the output pulse within the time frame of valid inputs for the next stage. Wave pipelining is especially suitable for designs that show a high degree of parallelism and regularity. If the combinational functional block is to behave as a conventional pipe stage, the maximum operating frequency is just a function of the longest path. Results of SPICE simulation of the adder, implemented in a 0.18-µm CMOS process and running at a 2.5-GHz data rate, are shown in Fig. 18. All of our netlists include long wires based on layout estimates and were later postprocessed with a well-proven parasitics estimator, which also takes into account physical device implementation (i.e., building a large transistor as a set of parallel smaller transistors).
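The rough-padding step and the timing relations discussed above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the authors' methodology: the minimum period of a wave-pipelined block is taken as the difference between the longest and shortest path delays plus a lumped overhead term, while a conventional pipe stage is bounded by the longest path plus the same overhead. All function names, the path names, and the `overhead_ps` parameter are hypothetical.

```python
def rough_padding(stage_counts):
    """Buffers to add per path so all paths have the same number of stages.

    stage_counts maps a (hypothetical) path name to its number of logic stages.
    """
    target = max(stage_counts.values())
    return {name: target - n for name, n in stage_counts.items()}


def min_period_wave_ps(d_max_ps, d_min_ps, overhead_ps=0.0):
    """Simplified wave-pipelining bound: set by the spread of path delays."""
    return (d_max_ps - d_min_ps) + overhead_ps


def min_period_conventional_ps(d_max_ps, overhead_ps=0.0):
    """Conventional pipe stage: set by the longest path alone."""
    return d_max_ps + overhead_ps


def period_ps(data_rate_ghz):
    """Clock period in picoseconds for a data rate given in GHz."""
    return 1000.0 / data_rate_ghz


if __name__ == "__main__":
    # Hypothetical stage counts, chosen only to illustrate the padding step.
    print(rough_padding({"carry_path": 7, "sum_bit0_path": 3}))
    # A 2.5-GHz data rate corresponds to a 400-ps period.
    print(period_ps(2.5))
```

In this simplified view, equalizing the paths shrinks the delay spread, which is what allows the data rate to exceed the limit set by the longest path in a conventional stage.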


Fig. 18. Simulation waveforms in the adder.

TABLE I
SIMULATION DATA AT TIME MARKS IN FIG. 18

In Fig. 18, a set of signals is shown in stacked mode. The signals (from top to bottom) are da[3:0] and db[3:0] (input register outputs corresponding to the lower bits of operands a and b). These pulses are the input to the combinational block built with DRSRL-ID cells. These signals are followed by some bits of the result, r[4:0], which are the inputs to the output register. At the bottom, signals qr[4:0] (the output register's output) and the output register clock signal CLKout are shown. It can be observed in Fig. 18 that, as the pulse waves advance through the stages of the adder, the timing difference among signals at a given stage is minimal, so they form a coherent data wave. The latency through the combinational block is approximately 400 ps, which is close to the clock period, but the latency of the combinational block is independent of the clock period. Overlaying the SPICE output, we have marked successive time points at the rising edge of the output register clock CLKout. Table I shows the data inputs da[3:0] and db[3:0] to the combinational block and the output register results qr[4:0], in decimal. It can be seen in Fig. 18 and in Table I that the results displayed at the output register at a given time mark correspond to data received at an earlier time mark. The FFs used in the implementation of the output register also serve as interface elements with the next circuit. If the circuitry following the adder is also implemented in dynamic dual-rail logic, then these FFs must be of the dynamic type, like the ones used in the input register [14]. If the next stage is implemented with static logic, then these FFs must be of the edge-triggered static type. While the last combinational stage of the adder shown provides dual outputs, only one set of these outputs is fed to the output register.

VII. COMPARISON WITH DOMINO LOGIC

In this section, we compare DRSRL-ID with one of the most popular dynamic alternatives: domino. This comparison could be done at gate level or by comparing the performance of a given application implemented in DRSRL and in domino. In both cases, it can be shown that DRSRL can achieve the same speed or better, with some power reduction and less silicon area. Since our intention is to make a fair comparison that looks into the architectural differences, and not just a comparison to a generic domino implementation, we have created a domino family mirroring our DRSRL family of gates by transforming each DRSRL-ID gate into a dual-rail domino gate. We simply


Fig. 19. AND/NAND gates. (a) DRSRL-ID. (b) DRD.

replaced the reset-controlled post_charge and input enable/disable mechanisms by a clock-controlled mechanism, which defines when input data are sampled and when the domino precharge takes place. In this way, both gate families have the same transistor sizes for data input and output stages. The nMOS device used to enable/disable data sampling in our DRSRL gates (Nme in Fig. 3) is connected to CLK in these dual-rail domino (DRD) gates. In our DRSRL-ID gates, the internal reset circuit is made of a NOR gate, which controls the input enable/disable nMOS device and the pMOS devices used to replenish charge to the summing nodes. In the DRD counterpart, the exact same pMOS and nMOS devices are controlled by the clock (CLK), as shown in Fig. 19 for an AND/NAND gate in the two implementations. Also, the pMOS devices used to restore charge at internal summing nodes have the same size in both cases. This is somewhat unfair to the SRL case, as was explained in Section III and shown in Fig. 5, since these pMOS devices can be made much smaller in the DRSRL case. In the domino case, the clock signal is used to enable the possible discharge of the summing nodes and to restore charge. High-activity nets could be discharged and recharged every cycle, so a symmetric clock must restore charge in half the period. In the DRSRL-ID case, the forward delay is made as short as possible, and we guarantee that forward pulses reach the next stage by using a longer output pulse width (providing the necessary pulse overlap for signals having different routing). In effect, the output pulse width can be a large portion of the cycle time, so the width of the resetting pulse is elongated accordingly. Thus, approximately the same energy is restored over a longer time, which requires less power. Our DRSRL-ID gates exhibit a very short forward delay, with the output stage favoring the low-to-high transition, since we care about transmitting a fast rising-edge output pulse.
Popular domino implementations usually do not have their output stage sized according to this criterion. Also, there is great variety in the sizes of the transistors connected to the clock input relative to the sizes of the data input devices, making the nMOS

device controlled by CLK usually larger than any of the nMOS devices in the nMOS input tree. Also, pMOS devices controlled by CLK are made larger than here. (Such an implementation will render a domino gate that consumes more power.) We have implemented the same adder described above, but this time with DRD gates, and compared performance. The differences are in the combinational block, since we have used the exact same input and output registers in both designs. The additional item in the DRD case is the clock distribution network. To get a maximum data rate in the DRD case, we carefully designed the clock distribution tree so that the clock arrives before the data at every stage of the combinational block. In this way, we guarantee that the DRD gates will be able to sample arriving data without the extra delay of having to wait for the enabling clock edge. Only after this optimization were we able to get the domino adder to work at the same data rate as the DRSRL-ID counterpart. Thus, we can safely state that DRSRL-ID circuits are able to work at the same speed as a domino counterpart. In this comparison, we made the slew rate of the clock signal at each gate in the DRD case as close as possible to the slew rate of the reset signal in the DRSRL-ID case, so that the power involved in switching at the gate level is fairly comparable. The main difference comes from the power devoted to the clock grid, as can be seen in Fig. 20, which shows the total instantaneous supply current, its average value, and the average power consumption for the combinational portion in both adder implementations, for the same data input patterns. The figure shows that the DRSRL-ID case consumes about 15% less power, and even though in this comparison the consumption in the DRSRL-ID case has been somewhat exaggerated, the extra power used by the clock distribution network in the domino case is clearly visible.
In the general case, the power consumption difference strongly depends on the clock distribution network used in domino. For a small block, such a power difference will not weigh as heavily against the domino case, but it will be more visible for a larger block.
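The quantities plotted in Fig. 20 are related in a simple way, which the following Python sketch illustrates: average power is the supply voltage times the average supply current, and the relative saving compares two such averages. The sketch assumes uniformly sampled current waveforms and a constant supply voltage; the function names and all sample values are hypothetical and are not taken from the simulations reported here.

```python
def average(samples):
    """Arithmetic mean of uniformly spaced samples."""
    return sum(samples) / len(samples)


def average_power_mw(i_supply_ma, vdd_v):
    """Average power in mW from supply-current samples (mA) at constant VDD (V)."""
    return vdd_v * average(i_supply_ma)


def relative_saving(p_reference, p_new):
    """Fraction of power saved by the new design relative to the reference."""
    return (p_reference - p_new) / p_reference


if __name__ == "__main__":
    # Hypothetical current samples, for illustration only.
    vdd = 1.0
    i_domino_ma = [40.0, 60.0, 50.0, 50.0]
    i_srl_ma = [30.0, 55.0, 45.0, 40.0]
    p_domino = average_power_mw(i_domino_ma, vdd)
    p_srl = average_power_mw(i_srl_ma, vdd)
    print(p_domino, p_srl, relative_saving(p_domino, p_srl))
```

Because the clock network in domino draws current regardless of the data, its contribution appears in the average current even when the gates themselves switch equally often in both designs.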


Fig. 20. Total instantaneous supply current: average value and power consumption. Simulation of the adder in DRSRL-ID and DRD.

It is important to note that power consumption in single-rail implementations (SRL-ID) is directly data-dependent. If the input pulse pattern present at an SRL-ID gate does not validate the logic function, no pulse is generated at the gate output. In the dual-rail case (DRSRL-ID), as long as new data are brought to a gate, a pulse will be generated at either the direct output or the inverse output. The same considerations apply to single- and dual-rail domino. The difference is in the clock distribution scheme, which consumes power no matter what data, if any, are presented at the domino gate inputs. Also, the internal reset generation in the DRSRL-ID/SRSRL-ID gates is made up of small devices, and the corresponding sizes affect the output pulse width and the length of the internal reset. As mentioned above, the gate families we have implemented in SRSRL-ID and DRSRL-ID share a common output pulse width, and this characteristic, provided that we control the total loading at every point, simplifies the methodology, since we just drop a gate where it is necessary,

without having to do the extra work of satisfying strict timing conditions for the clock signal to arrive at a given place. The main benefits of using DRSRL-ID are design simplification and area reduction.

VIII. CONCLUSION AND FUTURE WORK

After reviewing the basic concepts of self-reset logic (SRL), a new family of these gates was introduced, DRSRL-ID. The goal was to obtain a family of gates that could simplify the implementation of fast processing circuits of any logical depth, while avoiding some of the restrictions and complications due to pulses being elongated or shortened as signals traverse the logic stages. The principles of operation were described and a set of timing parameters was defined. Characterization data were presented that displayed the sensitivity of the design to process, voltage, and temperature variations and loading conditions. An experiment on a long chain of buffer cells demonstrated that the


width of the signal pulses remains essentially constant and independent of the logic depth of the circuit, as long as loading is maintained at reasonable levels. As proof of concept, a 16-b parallel adder was implemented in a 0.18-µm CMOS process and simulated for operation at a 2.5-GHz data rate. The designs shown here provide a practical proof of the feasibility of using the proposed technique in many applications where fast processing and low power consumption are needed. The use of SRL yields an elegant implementation with savings in power and area with respect to a comparable CMOS dynamic implementation, by avoiding the clock distribution required with other dynamic gates. For comparison purposes, the same adder was implemented in DRSRL-ID and in DRD logic, and the results were reported in Section VII. The use of DRSRL-ID has additional advantages. The combination provides a fairly constant pulse width, which avoids pulse-width adjusting structures. It also provides additional tolerance in the design to accommodate the difference in arrival times of signals at any stage. While such tolerance is built into the structure of the gate family, it comes at the price of adding to the total cycle time and affects the minimum clock period used to pump new data into the circuit. Our goal in devising this gate family was to apply DRSRL logic in wave-pipelined designs. The use of wave pipelining provides savings in area and timing, since all intermediate storage elements are removed from the circuit. There are also savings from the point of view of timing overhead. The reduction in area, in addition to the simplified equalization mechanism due to the built-in tolerance, makes this approach suitable for many fast-processing designs. Our design provides a good solution that is suitable for applications in datapath and control logic. The extension of this technique to larger adders is natural.
We are also working on other structures that can make use of the technique, and comparing the results to current circuit implementations. The basic DRSRL-ID is suitable for structures with feedback, which is an area we will investigate further.

ACKNOWLEDGMENT

The authors would like to thank Dr. F. Klass of P.A. Semi for proposing the research area, Dr. S. Dutta of Philips Semiconductors for his valuable comments, and T. Egan of Teradyne for his thorough review of earlier drafts.

REFERENCES
[1] K. Bernstein and K. Carrig et al., High Speed CMOS Design Styles. Norwell, MA: Kluwer, 1999.
[2] T. Thorp and G. Yee et al., "Domino logic synthesis using complex static gates," in Int. Conf. CAD Dig. Tech. Papers, 1998, pp. 242–247.
[3] N. Starodoubtsev and S. Bystrov et al., "Toward synthesis of monotonic asynchronous circuits from signal transition graphs," in Proc. Int. Conf. App. Concurrency Syst. Design, 2001, pp. 179–188.
[4] T. Chappell, B. Chappell, and S. Schuster et al., "A 2 ns cycle, 3.8 ns access 512-Kb CMOS ECL SRAM with a fully pipelined architecture," IEEE J. Solid-State Circuits, vol. 26, no. 11, pp. 1577–1585, Nov. 1991.
[5] R. Heald and J. Holst, "A 6 ns cycle, 256-Kb cache memory and memory management unit," IEEE J. Solid-State Circuits, vol. 28, no. 11, pp. 1078–1083, Nov. 1993.
[6] R. Heald and K. Shin et al., "64-KByte sum-addressed-memory cache with 1.6-ns cycle and 2.6-ns latency," IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1682–1689, Nov. 1998.
[7] W. Hwang, R. V. Joshi, and W. H. Henkels, "A 500-MHz, 32-word × 64-bit, eight-port self-resetting CMOS register file," IEEE J. Solid-State Circuits, vol. 34, no. 1, pp. 56–67, Jan. 1999.
[8] D. Wendell, "Reset logic circuit and method," U.S. Patent 5 430 399, Jul. 4, 1995.
[9] V. Narayanan, B. Chappell, and B. Fleischer, "Static timing analysis for self resetting circuits," in Proc. ICCAD, 1996, pp. 119–126.
[10] B. Hauck and B. Huss, "Asynchronous wave pipelines for high throughput datapaths," in Proc. IEEE Int. Conf. Circuits Syst., vol. 1, 1998, pp. 283–286.
[11] G. Jung, V. Sundarajan, and G. Sobelman, "A robust self-resetting CMOS 32-bit parallel adder," in Proc. IEEE ISCAS, vol. 1, 2002, pp. 473–476.
[12] G. Jung and J. Kong et al., "High-speed add-compare-select units using locally self-resetting CMOS," in Proc. IEEE ISCAS, vol. 1, 2002, pp. 889–892.
[13] W. Liu, T. Gray, D. Fan, W. J. Farlow, T. A. Hughes, and R. K. Cavin, "A 250-MHz wave pipelined adder in 2-µm CMOS," IEEE J. Solid-State Circuits, vol. 29, no. 9, pp. 1117–1128, Sep. 1994.
[14] E. F. Klass, C. Amir, A. Das, K. Aingaran, C. Truong, R. Wang, A. Mehta, R. Heald, and G. Yee, "A new family of semi dynamic and dynamic flip-flops with embedded logic for high-performance processors," IEEE J. Solid-State Circuits, vol. 34, no. 5, pp. 712–716, May 1999.
[15] E. F. Klass, "Wave pipelining: Theoretical and practical issues in CMOS," Ph.D. dissertation, Stanford Univ., Stanford, CA, 1994.

Miguel E. Litvin received the Engineer in Electronics degree from the National Technological University, Mendoza, Argentina, in 1978 and the M.Sc. degree in electrical engineering from Santa Clara University, Santa Clara, CA, in 1994, where he is currently working toward the Ph.D. degree in electrical engineering. From 1978 to 1989, he worked at INVAP-SE, Rio Negro, Argentina, in instrumentation and control electronics, including a two-year assignment in California, for the design of an embedded processor. He joined Cadence Design Systems, San Jose, CA, in 1989, where he worked in CAD, and at Analog Devices, San Jose, CA, from 1993 to 1997, later joining Sun Microsystems, Sunnyvale, CA, working in processor design. His research interests include structures for fast digital processing, wave pipelining, self-reset logic, pulsed circuits, and asynchronous systems.

Samiha Mourad is currently the William and Janice Terry Professor and the chair of the Electrical Engineering Department at Santa Clara University, Santa Clara, CA. She spent last fall as a Visiting Professor with the Nara Institute of Science and Technology (NAIST). Prior to joining Santa Clara University, she was with the Center of Reliable Computing, Stanford University, Stanford, CA, where she conducted research in digital testing. She also worked at Bendix Test Systems in New Jersey and Fordham University, New York.
