DEPT. OF ELECTRICAL & COMPUTER ENGINEERING
DIRE DAWA INSTITUTE OF TECHNOLOGY
DIRE DAWA UNIVERSITY

UNIT – 2

 ASIC LIBRARY DESIGN
   o DELAY ESTIMATION
      RC DELAY MODELS
      ELMORE DELAY MODEL
      LINEAR DELAY MODEL
      LOGICAL EFFORT
      PARASITIC DELAY
   o LOGICAL EFFORT & TRANSISTOR SIZING
      DELAY IN LOGIC GATE
      DELAY IN MULTISTAGE N/WS
      CHOOSING THE BEST NO. OF STAGES
 PROGRAMMABLE ASIC (P-ASIC) LOGIC CELLS
   o ACTEL ACT
   o XILINX LCA
   o ALTERA FLEX
 P-ASIC I/O CELLS
 P-ASIC INTERCONNECT
 P-ASIC DESIGN SOFTWARE
 REFERENCES

ASIC LIBRARY DESIGN


Once we have decided to use an ASIC design style—using predefined and pre-
characterized cells from a library—we need to design or buy a Cell library. Even though it is
not necessary, knowledge of ASIC library design makes it easier to use library cells
effectively.

DELAY ESTIMATION
The two most common metrics for a good chip are speed and power. The most
obvious way to characterize a circuit is through simulation. Unfortunately, simulations only
inform us how a particular circuit behaves, not how to change the circuit to make it better.
There are far too many degrees of freedom in chip design to explore each promising path
through simulation (although some may try). Moreover, if we don’t know approximately
what the result of the simulation should be, we are unlikely to catch the inevitable bugs in
our simulation model. Mediocre engineers rely entirely on computer tools, but outstanding
engineers develop their physical intuition to rapidly predict the behavior of circuits.

DEFINITIONS
We begin with a few definitions illustrated in Figure 2.1:

 Propagation delay time, tpd = maximum time from the input crossing 50% to the
output crossing 50%
 Contamination delay time, tcd = minimum time from the input crossing 50% to the
output crossing 50%
 Rise time, tr = time for a waveform to rise from 20% to 80% of its steady-state value
 Fall time, tf = time for a waveform to fall from 80% to 20% of its steady-state value
 Edge rate, trf = (tr + tf )/2

Figure 2.1 Propagation delay and rise/fall times.


Intuitively, we know that when an input changes, the output will retain its old value
for at least the contamination delay and take on its new value in at most the propagation
delay. We sometimes differentiate between the delays for the output rising, tpdr/tcdr, and
the output falling, tpdf/tcdf. Rise/fall times are also sometimes called slopes or edge rates.
Propagation and contamination delay times are also called max-time and min-time,
respectively. The gate that charges or discharges a node is called the driver and the gates
and wires being driven are called the load. Propagation delay is usually the most relevant
value of interest, and is often simply called delay.

Timing Optimization
In most designs there will be many logic paths that do not require any conscious
effort when it comes to speed. These paths are already fast enough for the timing goals of
the system. However, there will be a number of critical paths that limit the operating speed
of the system and require attention to timing details. The critical paths can be affected at
four main levels:

 The architectural/micro-architectural level


 The logic level
 The circuit level
 The layout level

The most leverage is achieved with a good micro-architecture. This requires a broad
knowledge of both the algorithms that implement the function and the technology being
targeted, such as how many gate delays fit in a clock cycle, how quickly addition occurs,
how fast memories are accessed, and how long signals take to propagate along a wire.
Trade-offs at the micro-architectural level include the number of pipeline stages, the
number of execution units (parallelism), and the size of memories.

The next level of timing optimization comes at the logic level. Trade-offs include types
of functional blocks (e.g., ripple carry vs. look ahead adders), the number of stages of gates
in the clock cycle, and the fan-in and fan-out of the gates. The transformation from function
to gates and registers can be done by experience, by experimentation, or, most often, by
logic synthesis. Remember, however, that no amount of skillful logic design can overcome a
poor micro-architecture.

Once the logic has been selected, the delay can be tuned at the circuit level by choosing
transistor sizes or using other styles of CMOS logic. Finally, delay is dependent on the
layout. The floor plan (either manually or automatically generated) is of great importance
because it determines the wire lengths that can dominate delay. Good cell layouts can also
reduce parasitic capacitance.


Many RTL designers never venture below the micro-architectural level. A common
design practice is to write RTL code, synthesize it (allowing the synthesizer to do the timing
optimizations at the logic, circuit, and placement levels) and check if the results are fast
enough. If they are not, the designer recodes the RTL with more parallelism or pipelining,
or changes the algorithm and repeats until the timing constraints are satisfied. Timing
analyzers are used to check timing closure, i.e., whether the circuit meets all of the timing
constraints. Without an understanding of the lower levels of abstraction where the
synthesizer is working, a designer may have a difficult time achieving timing closure on a
challenging system.

RC Delay Model
RC delay models approximate the nonlinear transistor I-V and C-V characteristics
with an average resistance and capacitance over the switching range of the gate. This
approximation works remarkably well for delay estimation despite its obvious limitations
in predicting detailed analog behavior.

Effective Resistance

The RC delay model treats a transistor as a switch in series with a resistor. The
effective resistance is the ratio of Vds to Ids averaged across the switching interval of
interest.

A unit nMOS transistor is defined to have effective resistance R. The size of the unit
transistor is arbitrary but conventionally refers to a transistor with minimum length and
minimum contacted diffusion width (i.e., 4/2 λ). Alternatively, it may refer to the width of
the nMOS transistor in a minimum-sized inverter in a standard cell library. An nMOS
transistor of k times unit width has resistance R/k because it delivers k times as much
current. A unit pMOS transistor has greater resistance, generally in the range of 2R–3R,
because of its lower mobility. Throughout this course we will use 2R for examples to keep
arithmetic simple. R is typically on the order of 10 kΩ for a unit transistor.

According to the long-channel model, current decreases linearly with channel length
and hence resistance is proportional to L. Moreover, the resistance of two transistors in
series is the sum of the resistances of each transistor.

Gate & Diffusion Capacitance


Each transistor also has gate and diffusion capacitance. We define C to be the gate
capacitance of a unit transistor of either flavor. A transistor of k times unit width has
capacitance kC. Diffusion capacitance depends on the size of the source/drain region. We
assume the contacted source or drain of a unit transistor to also have capacitance of about
C. Wider transistors have proportionally greater diffusion capacitance. Increasing channel
length increases gate capacitance proportionally but does not affect diffusion capacitance.

Although capacitances have nonlinear voltage dependence, we use a single average
value. We roughly estimate C for a minimum length transistor to be 1 fF/µm of width. In a
65 nm process with a unit transistor being 0.1 µm wide, C is thus about 0.1 fF.

Equivalent RC Circuit

Figure 2.2 shows equivalent RC circuit models for nMOS and pMOS transistors of
width k with contacted diffusion on both source and drain. The pMOS transistor has
approximately twice the resistance of the nMOS transistor because holes have lower
mobility than electrons. The pMOS capacitors are shown with VDD as their second terminal
because the n-well is usually tied high. However, the behavior of the capacitor from a delay
perspective is independent of the second terminal voltage so long as it is constant. Hence,
we sometimes draw the second terminal as ground for convenience.

Figure 2.2 Equivalent circuits for transistors.

The equivalent circuits for logic gates are assembled from the individual transistors.
Figure 2.3 shows the equivalent circuit for a fanout-of-1 inverter with negligible wire
capacitance. The unit inverters of Figure 2.3(a) are composed from an nMOS transistor of
unit size and a pMOS transistor of twice unit width to achieve equal rise and fall resistance.
Figure 2.3(b) gives an equivalent circuit, showing the first inverter driving the second
inverter’s gate. If the input A rises, the nMOS transistor will be ON and the pMOS OFF.
Figure 2.3(c) illustrates this case with the switches removed. The capacitors shorted
between two constant supplies are also removed because they are not charged or
discharged. The total capacitance on the output Y is 6C.


Figure 2.3 Equivalent circuit for an inverter.

Example 2.1: Sketch a 3-input NAND gate with transistor widths chosen to achieve
effective rise and fall resistance equal to that of a unit inverter (R). Annotate the gate with
its gate and diffusion capacitances. Then sketch equivalent circuits for the falling output
transition and for the worst-case rising output transition.

SOLUTION:

Figure 2.4 Equivalent circuit for a 3-input NAND gate


Figure 2.4(a) shows such a gate. The three nMOS transistors are in series so the
resistance is three times that of a single transistor. Therefore, each must be three times unit
width to compensate. In other words, each transistor has resistance R/3 and the series
combination has resistance R. The two pMOS transistors are in parallel. In the worst case
(with one of the inputs low), only one of the pMOS transistors is ON. Therefore, each must
be twice unit width to have resistance R.

Figure 2.4(b) shows the capacitances. Each input presents five units of gate
capacitance to whatever circuit drives that input. Notice that the capacitors on source
diffusions attached to the rails have both terminals shorted together so they are irrelevant
to circuit operation. Figure 2.4(c) redraws the gate with these capacitances deleted and the
remaining capacitances lumped to ground.

Figure 2.4(d) shows the equivalent circuit for the falling output transition. The
output pulls down through the three series nMOS transistors. Figure 2.4(e) shows the
equivalent circuit for the rising output transition. In the worst case, the upper two inputs
are 1 and the bottom one falls to 0. The output pulls up through a single pMOS transistor.
The upper two nMOS transistors are still on, so the diffusion capacitance between the
series nMOS transistors must also be discharged.

Transient Response

Now, consider applying the RC model to estimate the step response of the first-order
system shown in Figure 2.5. This system is a good model of an inverter sized for equal rise
and fall delays.

Figure 2.5 First Order RC System

The system has a transfer function

H(s) = 1 / (1 + sτ)

and a step response

Vout(t) = VDD (1 − e^(−t/τ))

where τ = RC. The propagation delay is the time at which Vout reaches VDD/2, that is,
tpd = τ ln 2, as shown in Figure 2.6.


Figure 2.6 First Order Step Response

The factor of ln 2 = 0.69 is cumbersome. The effective resistance R is an empirical
parameter anyway, so it is preferable to incorporate the factor of ln 2 to define a new
effective resistance R′ = R ln 2. Now the propagation delay is simply R′C. For the sake of
convenience, we usually drop the prime symbols and just write

tpd = RC

where the effective resistance R is chosen to give the correct delay.
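As a quick numerical check of this RC model (a sketch added here, not part of the original
notes; the R, C, and VDD values are arbitrary placeholders), the Python snippet below finds
the 50% crossing of the first-order step response by bisection and compares it with the
analytic result tpd = RC ln 2:

```python
import math

R = 10e3      # effective resistance in ohms (order of 10 kOhm for a unit transistor)
C = 0.1e-15   # load capacitance in farads (about 0.1 fF for a unit transistor)
VDD = 1.0     # supply voltage, arbitrary normalization

def vout(t):
    """First-order RC step response: Vout(t) = VDD * (1 - exp(-t / RC))."""
    return VDD * (1.0 - math.exp(-t / (R * C)))

# Bisection for the time at which Vout crosses VDD/2.
lo, hi = 0.0, 10.0 * R * C
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if vout(mid) < 0.5 * VDD:
        lo = mid
    else:
        hi = mid

print(f"numeric 50% crossing: {hi:.4e} s")
print(f"RC ln 2             : {R * C * math.log(2):.4e} s")
```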

Elmore Delay

In general, most circuits of interest can be represented as an RC tree, i.e., an RC
circuit with no loops. The root of the tree is the voltage source and the leaves are the
capacitors at the ends of the branches. The Elmore delay model [Elmore48] estimates the
delay from a source switching to one of the leaf nodes changing as the sum over each node i
of the capacitance Ci on the node, multiplied by the effective resistance Ris on the shared
path from the source to the node and the leaf. Application of the Elmore delay model is best
illustrated through examples.

Example 2.2: Compute the Elmore delay for Vout in the 2nd order RC system from Figure
2.7.

Figure 2.7 Second Order RC System


SOLUTION:

The circuit has a source and two nodes. At node n1, the capacitance is C1 and the resistance
to the source is R1. At node Vout, the capacitance is C2 and the resistance to the source is (R1
+ R2). Hence, the Elmore delay is tpd = R1C1 + (R1 + R2)C2. Note that the effective resistances
should account for the factor of ln 2.
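The Elmore sum is easy to mechanize. The helper below is a sketch (not from the original
notes) for the special case of an RC ladder such as Figure 2.7, where resistances[i] is the
resistance between node i−1 and node i and capacitances[i] is the capacitance on node i; it
reproduces the tpd = R1C1 + (R1 + R2)C2 result of Example 2.2 for unit values:

```python
def elmore_ladder(resistances, capacitances):
    """Elmore delay to the last node of an RC ladder:
    sum over nodes i of C[i] * (R[1] + ... + R[i])."""
    delay = 0.0
    shared = 0.0
    for r, c in zip(resistances, capacitances):
        shared += r          # resistance shared with the path to the output
        delay += shared * c
    return delay

# Example 2.2 with R1 = R2 = 1 and C1 = C2 = 1: R1*C1 + (R1 + R2)*C2 = 3
print(elmore_ladder([1.0, 1.0], [1.0, 1.0]))
```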

Example 2.3: Estimate tpd for a unit inverter driving m identical unit inverters.

SOLUTION:

Figure 2.8 Equivalent Circuit for Inverter

Figure 2.8 shows an equivalent circuit for the falling transition. Each load inverter presents
3C units of gate capacitance, for a total of 3mC. The output node also sees a capacitance of
3C from the drain diffusions of the driving inverter. This capacitance is called parasitic
because it is an undesired side-effect of the need to make the drain large enough to contact.
The parasitic capacitance is independent of the load that the inverter is driving. Hence, the
total capacitance is (3 + 3m)C. The resistance is R, so the Elmore delay is tpd = (3 + 3m)RC.
The equivalent circuit for the rising transition gives the same results.

It is often helpful to express delay in a process-independent form so that circuits can
be compared based on topology rather than speed of the manufacturing process. Moreover,
with a process-independent measure for delay, knowledge of circuit speeds gained while
working in one process can be carried over to a new process. Observe that the delay of an
ideal fanout-of-1 inverter with no parasitic capacitance is τ = 3RC [Sutherland99]. We
denote the normalized delay d relative to this inverter delay:

d = tpd / τ

Hence, the delay of a fanout-of-h inverter can be written in normalized form as d = h + 1,
assuming that diffusion capacitance approximately equals gate capacitance. An FO4
inverter has a delay of 5τ. If diffusion capacitance were slightly higher or lower, the FO4
delay would change by only a small amount. Thus, circuit delay measured in FO4 delays is
nearly constant from one process to another.


The delay consists of two components. The parasitic delay is the time for a gate to
drive its own internal diffusion capacitance. Boosting the width of the transistors decreases
the resistance but increases the capacitance so the parasitic delay is ideally independent of
the gate size. The effort delay depends on the ratio h of external load capacitance to input
capacitance and thus changes with transistor widths. It also depends on the complexity of
the gate. The capacitance ratio is called the fan-out or electrical effort and the term
indicating gate complexity is called the logical effort. For example, an inverter has a delay of
d = h + 1, so the parasitic delay is 1 and the logical effort is also 1. The NAND3 has a worst
case delay of d = (5/3)h + 5. Thus, it has a parasitic delay of 5 and a logical effort of 5/3.

Linear Delay Model

The RC delay model showed that delay is a linear function of the fan-out of a gate. Based on
this observation, designers further simplify delay analysis by characterizing a gate by the
slope and y-intercept of this function. In general, the normalized delay of a gate can be
expressed in units of τ as

d = f + p

p is the parasitic delay inherent to the gate when no load is attached. f is the effort delay or
stage effort that depends on the complexity and fan-out of the gate:

f = g · h
Eq (2.1)

The complexity is represented by the logical effort, g [Sutherland99]. An inverter is


defined to have a logical effort of 1. More complex gates have greater logical efforts,
indicating that they take longer to drive a given fan-out. For example, the logical effort of
the 3-input NAND gate is 5/3. A gate driving h identical copies of itself is said to have a fan-
out or electrical effort of h. If the load does not contain identical copies of the gate, the
electrical effort can be computed as

h = Cout / Cin
Eq (2.2)

where Cout is the capacitance of the external load being driven and Cin is the input
capacitance of the gate.
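A hypothetical helper (not part of the original notes) makes the linear delay model concrete;
the g and p values in the example calls are the inverter and 3-input NAND numbers quoted
in the text:

```python
def gate_delay(g, p, c_out, c_in):
    """Normalized gate delay d = g*h + p (in units of tau), with h = Cout/Cin."""
    h = c_out / c_in     # electrical effort (fan-out), EQ (2.2)
    return g * h + p     # effort delay g*h (EQ (2.1)) plus parasitic delay p

# Fanout-of-4 inverter: g = 1, p = 1, h = 4  ->  d = 5
print(gate_delay(g=1.0, p=1.0, c_out=4.0, c_in=1.0))

# 3-input NAND (g = 5/3, p = 3) driving two copies of itself (h = 10/5 = 2)
print(gate_delay(g=5/3, p=3.0, c_out=10.0, c_in=5.0))
```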

Figure 2.9 plots normalized delay vs. electrical effort for an idealized inverter and 3-
input NAND gate. The y-intercepts indicate the parasitic delay, i.e., the delay when the gate
drives no load. The slope of the lines is the logical effort. The inverter has a slope of 1 by
definition. The NAND has a slope of 5/3. The remainder of this section explores how to
estimate the logical effort and parasitic delay and how to use the linear delay model.

Figure 2.9 Normalized Delay v/s Fan-out

LOGICAL EFFORT & TRANSISTOR SIZING


Logical effort of a gate is defined as the ratio of the input capacitance of the gate to
the input capacitance of an inverter that can deliver the same output current. Equivalently,
logical effort indicates how much worse a gate is at producing output current as compared
to an inverter, given that each input of the gate may only present as much input capacitance
as the inverter.

Logical effort can be measured in simulation from delay vs. fan-out plots as the ratio
of the slope of the delay of the gate to the slope of the delay of an inverter. Alternatively, it
can be estimated by sketching gates. Figure 2.10 shows inverter, 3-input NAND, and 3-
input NOR gates with transistor widths chosen to achieve unit resistance, assuming pMOS
transistors have twice the resistance of nMOS transistors. The inverter presents three units
of input capacitance. The NAND presents five units of capacitance on each input, so the
logical effort is 5/3. Similarly, the NOR presents seven units of capacitance, so the logical
effort is 7/3. This matches our expectation that NANDs are better than NORs because NORs
have slow pMOS transistors in series.
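The counting argument above is simple enough to script. The sketch below (not from the
original notes) assumes the usual unit-resistance sizing with pMOS transistors twice as
resistive as nMOS, so a gate's logical effort is its per-input capacitance divided by the
inverter's three units:

```python
def logical_effort(nmos_width, pmos_width):
    """Logical effort per input = (input capacitance) / (inverter's 1 + 2 = 3 units)."""
    return (nmos_width + pmos_width) / 3.0

print(logical_effort(1, 2))   # inverter       -> 1.0
print(logical_effort(3, 2))   # 3-input NAND   -> 5/3 (series nMOS tripled in width)
print(logical_effort(1, 6))   # 3-input NOR    -> 7/3 (series pMOS tripled in width)
```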

Table 2.1 lists the logical effort of common gates. The effort tends to increase with
the number of inputs. NAND gates are better than NOR gates because the series transistors
are nMOS rather than pMOS. Exclusive-OR gates are particularly costly and have different
logical efforts for different inputs.


Figure 2.10 Logic gates Sized for Unit Resistances

Table 2.1 Logical effort of common gates

Parasitic Delay
The parasitic delay of a gate is the delay of the gate when it drives zero load. It can be
estimated with RC delay models. A crude method good for hand calculations is to count
only diffusion capacitance on the output node. For example, consider the gates in Figure
2.10, assuming each transistor on the output node has its own drain diffusion contact.
Transistor widths were chosen to give a resistance of R in each gate. The inverter has three
units of diffusion capacitance on the output, so the parasitic delay is 3RC = τ. In other
words, the normalized parasitic delay is 1. In general, we will call the normalized parasitic
delay pinv . pinv is the ratio of diffusion capacitance to gate capacitance in a particular
process. It is usually close to 1 and will be considered to be 1 in many examples for
simplicity. The 3-input NAND and NOR each have 9 units of diffusion capacitance on the
output, so the parasitic delay is three times as great (3pinv, or simply 3). Table 2.2
estimates the parasitic delay of common gates. Increasing transistor sizes reduces
resistance but increases capacitance correspondingly, so parasitic delay is, on first order,
independent of gate size. However, wider transistors can be folded and often see less than
linear increases in internal wiring parasitic capacitance, so in practice, larger gates tend to
have slightly lower parasitic delay.


Table 2.2 Parasitic delay of common gates

Delay in Logic Gate

Consider two examples of applying the linear delay model to logic gates.

Example 2.4: Use the linear delay model to estimate the delay of the fanout-of-4 (FO4)
inverter. Assume the inverter is constructed in a 65 nm process with τ = 3 ps.

SOLUTION:

The logical effort of the inverter is g = 1, by definition. The electrical effort is 4 because the
load is four gates of equal size. The parasitic delay of an inverter is pinv = 1. The total delay
is d = gh + p = 1 × 4 + 1 = 5 in normalized terms, or tpd = 15 ps in absolute terms.

Often path delays are expressed in terms of FO4 inverter delays. While not all designers are
familiar with the notation, most experienced designers do know the delay of a fanout-of-4
inverter in the process in which they are working. τ can be estimated as 0.2 FO4 inverter
delays. Even if the ratio of diffusion capacitance to gate capacitance changes so pinv = 0.8 or
1.2 rather than 1, the FO4 inverter delay only varies from 4.8τ to 5.2τ. Hence, the delay of a
gate-dominated logic block expressed in terms of FO4 inverters remains relatively constant
from one process to another even if the diffusion capacitance does not.

As a rough rule of thumb, the FO4 delay for a process (in picoseconds) is 1/3 to 1/2
of the drawn channel length (in nanometers). For example, a 65 nm process with a 50 nm
channel length may have an FO4 delay of 16–25 ps. Delay is highly sensitive to process,
voltage, and temperature variations. The FO4 delay is usually quoted assuming typical
process parameters and worst-case environment (low power supply voltage and high
temperature).

Example 2.5: A ring oscillator is constructed from an odd number of inverters, as shown in
Figure 2.11. Estimate the frequency of an N-stage ring oscillator.


Figure 2.11 Ring Oscillator

SOLUTION:

The logical effort of the inverter is g = 1, by definition. The electrical effort of each inverter
is also 1 because it drives a single identical load. The parasitic delay is also 1. The delay of
each stage is d = gh + p = 1 × 1 + 1 = 2, i.e., 2τ. An N-stage ring oscillator has a period of 2N
stage delays because a value must propagate twice around the ring to regain the original
polarity. Therefore, the period is T = 2N × 2τ = 4Nτ. The frequency is the reciprocal of the
period, 1/(4Nτ).

A 31-stage ring oscillator in a 65 nm process has a frequency of 1/(4 × 31 × 3 ps) = 2.7 GHz.
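The arithmetic of this example is reproduced by the short sketch below (not part of the
original notes):

```python
N = 31          # number of ring oscillator stages
tau = 3e-12     # seconds, the 65 nm value used in the text
frequency = 1.0 / (4 * N * tau)
print(f"{frequency / 1e9:.2f} GHz")   # about 2.7 GHz
```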

Note that ring oscillators are often used as process monitors to judge if a particular chip is
faster or slower than nominally expected. One of the inverters should be replaced with a
NAND gate to turn the ring off when not in use. The output can be routed to an external
pad, possibly through a test multiplexer. The oscillation frequency should be low enough
(e.g., 100 MHz) that the path to the outside world does not attenuate the signal too badly.

Logical Effort of Paths

Designers often need to choose the fastest circuit topology and gate sizes for a particular
logic function and to estimate the delay of the design. As has been stated, simulation or
timing analysis are poor tools for this task because they only determine how fast a
particular implementation will operate, not whether the implementation can be modified
for better results and if so, what to change. Inexperienced designers often end up in the
“simulate and tweak” loop involving minor changes and many fruitless simulations. The
method of Logical Effort [Sutherland99] provides a simple method “on the back of an
envelope” to choose the best topology and number of stages of logic for a function. Based
on the linear delay model, it allows the designer to quickly estimate the best number of
stages for a path, the minimum possible delay for the given topology, and the gate sizes that
achieve this delay.

Delay in Multistage Logic Networks

Figure 2.12 shows the logical and electrical efforts of each stage in a multistage path as a
function of the sizes of each stage. The path of interest (the only path in this case) is
marked with the dashed blue line. Observe that logical effort is independent of size, while
electrical effort depends on sizes. This section develops some metrics for the path as a
whole that are independent of sizing decisions.

Figure 2.12 Multistage Logic Network

The path logical effort G can be expressed as the product of the logical efforts of each stage
along the path:

G = ∏ gi

The path electrical effort H can be given as the ratio of the output capacitance the path must
drive divided by the input capacitance presented by the path. This is more convenient than
defining path electrical effort as the product of stage electrical efforts because we do not
know the individual stage electrical efforts until gate sizes are selected.

H = Cout(path) / Cin(path)

The path effort F is the product of the stage efforts of each stage. Recall that the stage effort
of a single stage is f = gh. Can we by analogy state F = GH for a path?

F = ∏ fi = ∏ gi hi

In paths that branch, F ≠ GH. This is illustrated in Figure 2.13, a circuit with a two-way
branch. Consider a path from the primary input to one of the outputs. The path logical
effort is G = 1 × 1 = 1. The path electrical effort is H = 90/5 = 18. Thus, GH = 18. But F = f1 f2
= g1h1g2h2 = 1 × 6 × 1 × 6 = 36. In other words, F = 2GH in this path on account of the two-
way branch.

Figure 2.13 Circuit with two way branch


We must introduce a new kind of effort to account for branching between stages of a path.
This branching effort b is the ratio of the total capacitance seen by a stage to the
capacitance on the path; in Figure 2.13 it is (15 + 15)/15 = 2.

b = (Con-path + Coff-path) / Con-path

The path branching effort B is the product of the branching efforts between stages:

B = ∏ bi

Now we can define the path effort F as the product of the logical, electrical, and branching
efforts of the path. Note that the product of the electrical efforts of the stages is actually BH,
not just H:

F = GBH

We can now compute the delay of a multistage network. The path delay D is the sum of the
delays of each stage. It can also be written as the sum of the path effort delay DF and path
parasitic delay P:

D = Σ di = DF + P

The product of the stage efforts is F, independent of gate sizes. The path effort delay is the
sum of the stage efforts. The sum of a set of numbers whose product is constant is
minimized by choosing all the numbers to be equal. In other words, the path delay is
minimized when each stage bears the same effort. If a path has N stages and each bears the
same effort, that effort must be

f̂ = gi hi = F^(1/N)

Thus, the minimum possible delay of an N-stage path with path effort F and path parasitic
delay P is

D = N F^(1/N) + P


This is a key result of Logical Effort. It shows that the minimum delay of the path can be
estimated knowing only the number of stages, path effort, and parasitic delays without the
need to assign transistor sizes. This is superior to simulation, in which delay depends on
sizes and you never achieve certainty that the sizes selected are those that offer minimum
delay.

It is also straightforward to select gate sizes to achieve this least delay. Combining EQs
(2.1) and (2.2) gives us the capacitance transformation formula to find the best input
capacitance for a gate given the output capacitance it drives.

Cin = (g × Cout) / f̂
Eq (2.3)

Starting with the load at the end of the path, work backward applying the capacitance
transformation to determine the size of each stage. Check the arithmetic by verifying that
the size of the initial stage matches the specification.
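The whole recipe (path effort, best stage effort, minimum delay, and the backward
capacitance transformation) fits in a few lines of Python. The sketch below is not from the
original notes; size_path and its arguments are hypothetical names, with branch[i] giving
the branching effort at the output of stage i. Running it on the circuit of Example 2.6
(worked next) reproduces F = 125, f̂ = 5, D = 22, and input capacitances 8, 10, 15:

```python
from math import prod   # Python 3.8+

def size_path(g, p, branch, c_in, c_out):
    """Minimum-delay sizing of an N-stage path using Logical Effort."""
    N = len(g)
    F = prod(g) * (prod(branch) if branch else 1.0) * (c_out / c_in)   # F = G*B*H
    f_hat = F ** (1.0 / N)                 # best stage effort
    D = N * f_hat + sum(p)                 # minimum path delay (units of tau)

    sizes = [0.0] * N                      # on-path input capacitance of each stage
    load = c_out                           # load driven by the stage being sized
    for i in reversed(range(N)):
        sizes[i] = g[i] * load / f_hat     # capacitance transformation, EQ (2.3)
        if i > 0:
            load = branch[i - 1] * sizes[i]
    return F, f_hat, D, sizes

# Example 2.6: NAND2 -> NAND3 -> NOR2 with branches of 3 and 2, Cin = 8, Cout = 45.
print(size_path(g=[4/3, 5/3, 5/3], p=[2, 3, 2], branch=[3, 2], c_in=8, c_out=45))
```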

Example 2.6: Estimate the minimum delay of the path from A to B in Figure 2.14 and
choose transistor sizes to achieve this delay. The initial NAND2 gate may present a load of 8
λ of transistor width on the input and the output load is equivalent to 45 λ of transistor
width.

Figure 2.14 Example Path

SOLUTION:

The path logical effort is G = (4/3) × (5/3) × (5/3) = 100/27. The path electrical effort is H =
45/8. The path branching effort is B = 3 × 2 = 6. The path effort is F = GBH = 125. As there
are three stages, the best stage effort is f̂ = ∛125 = 5. The path parasitic delay is P = 2 + 3
+ 2 = 7. Hence, the minimum path delay is D = 3 × 5 + 7 = 22 in units of τ, or 4.4 FO4
inverter delays. The gate sizes are computed with the capacitance transformation from EQ
(2.3) working backward along the path: y = 45 × (5/3)/5 = 15. x = (15 + 15) × (5/3)/5 = 10.
We verify that the initial 2-input NAND gate has the specified size of (10 + 10 + 10) ×
(4/3)/5 = 8.


Figure 2.15 Example path annotated with transistor sizes

The transistor sizes in Figure 2.15 are chosen to give the desired amount of input
capacitance while achieving equal rise and fall delays. For example, a 2-input NOR gate
should have a 4:1 P/N ratio. If the total input capacitance is 15, the pMOS width must be 12
and the nMOS width must be 3 to achieve that ratio.

We can also check that our delay was achieved. The NAND2 gate delay is d1 = g1h1 + p1 =
(4/3) × (10 + 10 + 10)/8 + 2 = 7. The NAND3 gate delay is d2 = g2h2 + p2 = (5/3) × (15 +
15)/10 + 3 = 8. The NOR2 gate delay is d3 = g3h3 + p3 = (5/3) × 45/15 + 2 = 7. Hence, the
path delay is 22, as predicted.

Recall that delay is expressed in units of τ. In a 65 nm process with τ = 3 ps, the delay is
66 ps. Alternatively, a fanout-of-4 inverter delay is 5τ, so the path delay is 4.4 FO4s.

Many inexperienced designers know that wider transistors offer more current and thus try
to make circuits faster by using bigger gates. Increasing the size of any of the gates except
the first one only makes the circuit slower. For example, increasing the size of the NAND3
makes the NAND3 faster but makes the NAND2 slower, resulting in a net speed loss.
Increasing the size of the initial NAND2 gate does speed up the circuit under consideration.
However, it presents a larger load on the path that computes input A, making that path
slower. Hence, it is crucial to have a specification of not only the load the path must drive
but also the maximum input capacitance the path may present.

Choosing the Best No. of Stages

Given a specific circuit topology, we now know how to estimate delay and choose
gate sizes. However, there are many different topologies that implement a particular logic
function. Logical Effort tells us that NANDs are better than NORs and that gates with few
inputs are better than gates with many. In this section, we will also use Logical Effort to
predict the best number of stages to use.

Logic designers sometimes estimate delay by counting the number of stages of logic,
assuming each stage has a constant “gate delay.” This is potentially misleading because it
implies that the fastest circuits are those that use the fewest stages of logic. Of course, the
gate delay actually depends on the electrical effort, so sometimes using fewer stages results
in more delay. The following example illustrates this point.

Example 2.7: A control unit generates a signal from a unit-sized inverter. The signal must
drive unit-sized loads in each bitslice of a 64-bit datapath. The designer can add inverters
to buffer the signal to drive the large load. Assuming polarity of the signal does not matter,
what is the best number of inverters to add and what delay can be achieved?

SOLUTION:

Figure 2.16 Comparison of different number of stages of Buffers

Figure 2.16 shows the cases of adding 0, 1, 2, or 3 inverters. The path electrical effort is H =
64. The path logical effort is G = 1, independent of the number of inverters. Thus, the path
effort is F = 64. The inverter sizes are chosen to achieve equal stage effort. The total delay is

D = N × 64^(1/N) + N

The 3-stage design is fastest and far superior to a single stage. If an even number of
inversions were required, the two- or four-stage designs are promising. The four-stage
design is slightly faster, but the two-stage design requires significantly less area and power.
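A brief enumeration (added here as a sketch, not part of the original notes) shows where
these numbers come from: with path effort F = 64 and one unit of parasitic delay per
inverter, the delay of an N-stage buffer chain is D = N × 64^(1/N) + N:

```python
for stages in (1, 2, 3, 4):
    delay = stages * 64 ** (1.0 / stages) + stages
    print(f"{stages} stage(s): D = {delay:.1f}")
# -> 65.0, 18.0, 15.0, 15.3 : the 3-stage design is fastest
```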


Figure 2.17 Logic Block with additional inverters

In general, you can always add inverters to the end of a path without changing its function
(save possibly for polarity). Let us compute how many should be added for least delay. The
logic block shown in Figure 2.17 has n1 stages and a path effort of F. Consider adding N – n1
inverters to the end to bring the path to N stages. The extra inverters do not change the
path logical effort but do add parasitic delay. The delay of the new path is

D = N F^(1/N) + Σ pi + (N − n1) pinv

where the sum of parasitic delays Σ pi is taken over the original n1 stages.

Differentiating with respect to N and setting to 0 allows us to solve for the best number of
stages, which we will call N̂. The result can be expressed more compactly by defining
ρ = F^(1/N̂) to be the best stage effort:

∂D/∂N = −F^(1/N) ln(F^(1/N)) + F^(1/N) + pinv = 0

→ pinv + ρ(1 − ln ρ) = 0

Eq (2.4)

EQ (2.4) has no closed-form solution. Neglecting parasitics (i.e., assuming pinv = 0),
we find the classic result that the best stage effort is ρ = 2.71828 (e) [Mead80]. In practice,
the parasitic delays mean each inverter is somewhat more costly to add. As a result, it is
better to use fewer stages, or equivalently a higher stage effort than e. Solving numerically,
when pinv = 1, we find ρ = 3.59.
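EQ (2.4) is easy to solve numerically. The bisection sketch below (not from the original
notes; best_stage_effort is a hypothetical helper) recovers ρ = e for pinv = 0 and ρ ≈ 3.59
for pinv = 1:

```python
import math

def best_stage_effort(p_inv):
    """Solve p_inv + rho*(1 - ln(rho)) = 0 for rho (EQ (2.4)) by bisection."""
    def f(rho):
        return p_inv + rho * (1.0 - math.log(rho))
    lo, hi = math.e, 100.0        # f(e) = p_inv >= 0 and f decreases for rho > e
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(best_stage_effort(0.0))   # -> 2.718... (e, the classic zero-parasitic result)
print(best_stage_effort(1.0))   # -> about 3.59
```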

A path achieves least delay by using N̂ = log F / log ρ stages. It is important to understand
not only the best stage effort and number of stages but also the sensitivity to using a
different number of stages. Figure 2.18 plots the delay increase using a particular number
of stages against the total number of stages, for pinv = 1. The x-axis plots the ratio of the
actual number of stages to the ideal number. The y-axis plots the ratio of the actual delay to
the best achievable. The curve is flat around the optimum. The delay is within 15% of the
best achievable if the number of stages is within 2/3 to 3/2 times the theoretical best
number (i.e., the stage effort is in the range of 2.4 to 6).

Using a stage effort of 4 is a convenient choice and simplifies mentally choosing the
best number of stages. This effort gives delays within 2% of minimum for pinv in the range
of 0.7 to 2.5. This further explains why a fanout-of-4 inverter has a “representative” logic
gate delay.

Figure 2.18 Sensitivity of delay to number of stages


PROGRAMMABLE ASIC (P-ASIC) LOGIC CELLS


All programmable ASICs or FPGAs contain a basic logic cell replicated in a regular array
across the chip (analogous to a base cell in an MGA). There are the following three different
types of basic logic cells: (1) multiplexer based, (2) look-up table based, and (3)
programmable array logic. The choice among these depends on the programming
technology.

ACTEL ACT
The basic logic cells in the Actel ACT family of FPGAs are called Logic Modules . The ACT 1
family uses just one type of Logic Module and the ACT 2 and ACT 3 FPGA families both use
two different types of Logic Module.

ACT 1 Logic Module

The functional behavior of the Actel ACT 1 Logic Module is shown in Figure 2.19 (b). Figure
2.19 (c) represents a possible circuit-level implementation. We can build a logic function
using an Actel Logic Module by connecting logic signals to some or all of the Logic Module
inputs, and by connecting any remaining Logic Module inputs to VDD or GND. As an
example, Figure 2.19 (d) shows the connections to implement the function

Figure 2.19 The Actel ACT architecture. (a) Organization of the basic logic cells. (b) The
ACT 1 Logic Module. (c) An implementation using pass transistors (without any buffering).
(d) An example logic macro.


F = A · B + B′ · C + D

How did we know what connections to make? To understand how the Actel Logic Module
works, we take a detour via multiplexer logic and some theory.

Shannon’s Expansion Theorem

In logic design we often have to deal with functions of many variables. We need a method
to break down these large functions into smaller pieces. Using the Shannon expansion
theorem, we can expand a Boolean logic function F in terms of (or with respect to) a
Boolean variable A,

F = A · F(A = '1') + A′ · F(A = '0')
Eq (2.5)

where F(A = '1') represents the function F evaluated with A set equal to '1'. For example, we
can expand the following function F with respect to (we shall use the abbreviation wrt) A:

F = A′ · B + A · B · C′ + A′ · B′ · C
  = A · (B · C′) + A′ · (B + B′ · C)
Eq (2.6)

We have split F into two smaller functions. We call F(A = '1') = B · C′ the cofactor of F wrt A
in Eq. 2.6. We shall sometimes write the cofactor of F wrt A as FA (the cofactor of F wrt A′ is
FA′). We may expand a function wrt any of its variables. For example, if we expand F wrt B
instead of A,

F = A′ · B + A · B · C′ + A′ · B′ · C
  = B · (A′ + A · C′) + B′ · (A′ · C)
Eq (2.7)

We can continue to expand a function as many times as it has variables until we reach the
canonical form (a unique representation for any Boolean function that uses only
minterms. A minterm is a product term that contains all the variables of F—such as A · B' ·
C). Expanding Eq. 2.7 again, this time wrt C, gives

F = C · (A′ · B + A′ · B′) + C′ · (A · B + A′ · B)
Eq (2.8)
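The expansion can be checked mechanically. The sketch below (not from the original notes;
the function names are hypothetical) represents a Boolean function as a Python callable
over a dictionary of variable values and verifies that the Shannon expansion wrt A
reproduces F for all eight input combinations of the example above:

```python
from itertools import product

def cofactor(f, var, value):
    """Return the cofactor of f with the given variable fixed to 0 or 1."""
    return lambda env: f({**env, var: value})

def shannon(f, var, env):
    """Evaluate f(env) via F = A*F(A=1) + A'*F(A=0) expanded wrt 'var'."""
    a = env[var]
    return (a & cofactor(f, var, 1)(env)) | ((1 - a) & cofactor(f, var, 0)(env))

# The example function: F = A'.B + A.B.C' + A'.B'.C
def F(e):
    return ((1 - e['A']) & e['B']) | (e['A'] & e['B'] & (1 - e['C'])) \
           | ((1 - e['A']) & (1 - e['B']) & e['C'])

for a, b, c in product((0, 1), repeat=3):
    env = {'A': a, 'B': b, 'C': c}
    assert shannon(F, 'A', env) == F(env)
print("Shannon expansion wrt A matches F on all 8 input combinations")
```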

As another example, we will use the Shannon expansion theorem to implement the
following function using the ACT 1 Logic Module:
F = (A · B) + (B′ · C) + D
Eq (2.9)


First we expand F wrt B:

F = B · (A + D) + B′ · (C + D)
  = B · F2 + B′ · F1

Eq (2.10)

Equation 2.10 describes a 2:1 MUX, with B selecting between two inputs: F(B = '1') and
F(B = '0'). In fact Eq. 2.10 also describes the output of the ACT 1 Logic Module in Figure 2.19.
Now we need to split up F1 and F2 in Eq. 2.10. Suppose we expand F2 = FB wrt A, and F1 =
FB′ wrt C:

F2 = A + D = (A · 1) + (A′ · D)
Eq (2.11)
F1 = C + D = (C · 1) + (C′ · D)
Eq (2.12)

From Eqs. 2.10 to 2.12 we see that we may implement F by arranging for A, B, C to appear
on the select lines and '1' and D to be the data inputs of the MUXes in the ACT 1 Logic
Module. This is the implementation shown in Figure 2.19 (d), with connections: A0 = D, A1
= '1', B0 = D, B1 = '1', SA = C, SB = A, S0 = '0', and S1 = B.

Now that we know that we can implement Boolean functions using MUXes, how do
we know which functions we can implement and how to implement them?

Multiplexer Logic as Function Generator

Figure 2.20 illustrates the 16 different ways to arrange ‘1’s on a Karnaugh map
corresponding to the 16 logic functions, F (A, B), of two variables. Two of these functions
are not very interesting (F = '0', and F = '1'). Of the 16 functions, Table 2.3 shows the 10
that we can implement using just one 2:1 MUX. Of these 10 functions, the following six are
useful:

 INV. The MUX acts as an inverter for one input only.


 BUF. The MUX just passes one of the MUX inputs directly to the output.
 AND. A two-input AND.
 OR. A two-input OR.
 AND1-1. A two-input AND gate with inverted input, equivalent to a NOR-11.
 NOR1-1. A two-input NOR gate with inverted input, equivalent to an AND-11.


Figure 2.20 The logic Functions of Two Variables

No.  Function       F =        Canonical form               Minterms     M1: A0  A1  SA
1    '0'            '0'        '0'                          none             0   0   0
2    NOR1-1(A, B)   (A + B')'  A'·B                         1                B   0   A
3    NOT(A)         A'         A'·B' + A'·B                 0, 1             1   0   A
4    AND1-1(A, B)   A·B'       A·B'                         2                A   0   B
5    NOT(B)         B'         A'·B' + A·B'                 0, 2             1   0   B
6    BUF(B)         B          A'·B + A·B                   1, 3             0   B   1
7    AND(A, B)      A·B        A·B                          3                0   B   A
8    BUF(A)         A          A·B' + A·B                   2, 3             0   A   1
9    OR(A, B)       A + B      A'·B + A·B' + A·B            1, 2, 3          B   1   A
10   '1'            '1'        A'·B' + A'·B + A·B' + A·B    0, 1, 2, 3       1   1   1

Table 2.3 Boolean functions using a 2:1 MUX (the M1 columns give the Logic Module inputs
A0, A1, and SA, with MUX(A0, A1, SA) = A0·SA′ + A1·SA)

Figure 2.21 (a) shows how we might view a 2:1 MUX as a function wheel, a three-
input black box that can generate any one of the six functions of two input variables: BUF,
INV, AND-11, AND1-1, OR, AND. We can write the output of a function wheel as

F1 = WHEEL1(A, B)
Eq (2.13)

where we define the wheel function as follows:

WHEEL1(A, B) = MUX(A0, A1, SA)


Eq (2.14)

The MUX function is not unique; we shall define it as

MUX (A0, A1, SA) = A0 · SA′ + A1 · SA


Eq (2.15)

The inputs (A0, A1, SA) are described using the notation

A0, A1, SA = {A, B, ′0′, ′1′}


Eq (2.16)

to mean that each of the inputs (A0, A1, and SA) may be any of the values: A, B, '0', or '1'.
We chose the name of the wheel function because it is rather like a dial that you set to your
choice of function. Figure 2.21 (b) shows that the ACT 1 Logic Module is a function
generator built from two function wheels, a 2:1 MUX, and a two-input OR gate.

Figure 2.21 The ACT 1 Logic Module as a Boolean function generator. (a) A 2:1 MUX
viewed as a function wheel. (b) The ACT 1 Logic Module viewed as two function wheels, an
OR gate, and a 2:1 MUX.

We can describe the ACT 1 Logic Module in terms of two WHEEL functions:

F = MUX [WHEEL1, WHEEL2, OR (S0, S1)]


Eq (2.17)
Now, for example, to implement a two-input NAND gate, F = NAND (A, B) = (A · B)',
using an ACT 1 Logic Module we first express F as the output of a 2:1 MUX. To split up F we
expand it wrt A (or wrt B; since F is symmetric in A and B):

F = A · (B′) + A′ · (′1′)
Eq (2.18)
Thus to make a two-input NAND gate we assign WHEEL1 to implement INV (B), and
WHEEL2 to implement '1'. We must also set the select input to the MUX connecting
WHEEL1 and WHEEL2, S0 + S1 = A—we can do this with S0 = A, S1 = '0'.
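A behavioral model of the module makes this easy to verify. The sketch below (not from
the original notes; act1_module is a hypothetical model following Figure 2.19(b) and
EQs (2.15)–(2.17)) checks that configuring one function wheel as '1' (used when A = 0) and
the other as INV(B) (used when A = 1), with the output select driven by A, yields F = (A·B)′:

```python
def mux(a0, a1, sa):
    """MUX(A0, A1, SA) = A0.SA' + A1.SA, as in EQ (2.15)."""
    return a0 if sa == 0 else a1

def act1_module(a0, a1, sa, b0, b1, sb, s0, s1):
    """Two function-wheel MUXes feeding an output MUX selected by OR(S0, S1)."""
    return mux(mux(a0, a1, sa), mux(b0, b1, sb), s0 | s1)

for A in (0, 1):
    for B in (0, 1):
        f = act1_module(a0=1, a1=1, sa=0,    # this wheel generates the constant '1'
                        b0=1, b1=0, sb=B,    # this wheel generates INV(B)
                        s0=A, s1=0)          # select = S0 + S1 = A
        assert f == 1 - (A & B)              # F = NAND(A, B) = (A.B)'
print("ACT 1 Logic Module configuration implements NAND(A, B)")
```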

Before we get too carried away, we need to realize that we do not have to worry
about how to use Logic Modules to construct combinational logic functions—this has
already been done for us. For example, if we need a two-input NAND gate, we just use a
NAND gate symbol and software takes care of connecting the inputs in the right way to the
Logic Module.

How did Actel design its Logic Modules? One of Actel’s engineers wrote a program that
calculates how many functions of two, three, and four variables a given circuit would
provide. The engineers tested many different circuits and chose the best one: a small,
logically efficient circuit that implemented many functions. For example, the ACT 1 Logic
Module can implement all two-input functions, most functions with three inputs, and many
with four inputs.

Apart from being able to implement a wide variety of combinational logic functions,
the ACT 1 module can implement sequential logic cells in a flexible and efficient manner.
For example, you can use one ACT 1 Logic Module for a transparent latch or two Logic
Modules for a flip-flop. The use of latches rather than flip-flops does require a shift to a
two-phase clocking scheme using two nonoverlapping clocks and two clock trees. Two-
phase synchronous design using latches is efficient and fast, but handling the timing
complexities of two clocks requires changes to synthesis and simulation software that have
not occurred. This means that most people still use flip-flops in their designs, and these
require two Logic Modules.

ACT 2 & ACT 3 Logic Modules

Using two ACT 1 Logic Modules for a flip-flop also requires added interconnect and
associated parasitic capacitance to connect the two Logic Modules. To produce an efficient
two-module flip-flop macro we could use extra antifuses in the Logic Module to cut down
on the parasitic connections. However, the extra antifuses would have an adverse impact
on the performance of the Logic Module in other macros. The alternative is to use a
separate flip-flop module, reducing flexibility and increasing layout complexity. In the ACT
1 family Actel chose to use just one type of Logic Module. The ACT 2 and ACT 3
architectures use two different types of Logic Modules, and one of them does include the
equivalent of a D flip-flop.

Figure 2.22 shows the ACT 2 and ACT 3 Logic Modules. The ACT 2 C Module is
similar to the ACT 1 Logic Module but is capable of implementing five-input logic functions.
Actel calls its C-module a combinatorial module even though the module implements
combinational logic. John Wakerly blames MMI for the introduction of the term
combinatorial [Wakerly, 1994, p. 404].

The use of MUXes in the Actel Logic Modules (and in other places) can cause confusion in
using and creating logic macros. For the Actel library, setting S = '0' selects input A of a two-
input MUX. For other libraries setting S = '1' selects input A. This can lead to some very
hard to find errors when moving schematics between libraries. Similar problems arise in
flip-flops and latches with MUX inputs. A safer way to label the inputs of a two-input MUX is

VLSI DESIGN, ECEg 4201 27 ROOHA RAZMID AHAMED


DEPT. OF ELECTRICAL & COMPUTER ENGINEERING
DIRE DAWA INSTITUTE OF TECHNOLOGY
DIRE DAWA UNIVERSITY

with '0' and '1', corresponding to the input selected when the select input is '1' or '0'. This
notation can be extended to bigger MUXes, but in Figure 2.22 , does the input combination
S0 = '1' and S1 = '0' select input D10 or input D01? These problems are not caused by Actel,
but by failure to use the IEEE standard symbols in this area.

The S-Module (sequential module) contains the same combinational function
capability as the C-Module together with a sequential element that can be configured as a
flip-flop. Figure 2.22 (d) shows the sequential element implementation in the ACT 2 and
ACT 3 architectures.

Figure 2.22 The Actel ACT 2 and ACT 3 Logic Modules. (a) The C-Module for combinational
logic. (b) The ACT 2 S-Module. (c) The ACT 3 S-Module. (d) The equivalent circuit (without
buffering) of the SE (sequential element). (e) The sequential element configured as a
positive-edge–triggered D flip-flop. (Source: Actel.)

XILINX LCA
Xilinx LCA (a trademark, denoting logic cell array) basic logic cells, configurable
logic blocks or CLBs , are bigger and more complex than the Actel or QuickLogic cells. The
Xilinx LCA basic logic cell is an example of a coarse-grain architecture . The Xilinx CLBs
contain both combinational logic and flip-flops.


XC 3000 CLB

The XC3000 CLB, shown in Figure 2.23 , has five logic inputs (A–E), a common clock
input (K), an asynchronous direct-reset input (RD), and an enable (EC). Using
programmable MUXes connected to the SRAM programming cells, you can independently
connect each of the two CLB outputs (X and Y) to the output of the flip-flops (QX and QY) or
to the output of the combinational logic (F and G).

Figure 2.23 The Xilinx XC3000 CLB (configurable logic block). (Source: Xilinx.)

A 32-bit look-up table ( LUT ), stored in 32 bits of SRAM, provides the ability to
implement combinational logic. Suppose you need to implement the function F = A · B · C ·
D · E (a five-input AND). You set the contents of LUT cell number 31 (with address '11111')
in the 32-bit SRAM to a '1'; all the other SRAM cells are set to '0'. When you apply the input
variables as an address to the 32-bit SRAM, only when ABCDE = '11111' will the output F
be a '1'. This means that the CLB propagation delay is fixed, equal to the LUT access time,
and independent of the logic function you implement.
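The look-up-table idea is captured by the short sketch below (not from the original notes;
make_lut5 and lut5 are hypothetical helpers). The five inputs simply address 32 bits of
configuration memory, which is why the delay does not depend on the function
programmed:

```python
def make_lut5(func):
    """Precompute the 32-bit truth table of func(a, b, c, d, e)."""
    return [func((i >> 4) & 1, (i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1)
            for i in range(32)]

def lut5(table, a, b, c, d, e):
    """Read the LUT: the inputs form the SRAM address."""
    return table[(a << 4) | (b << 3) | (c << 2) | (d << 1) | e]

# F = A.B.C.D.E: only cell 31 (address '11111') holds a '1'.
table = make_lut5(lambda a, b, c, d, e: a & b & c & d & e)
print(table.count(1))                 # -> 1
print(lut5(table, 1, 1, 1, 1, 1))     # -> 1
print(lut5(table, 1, 1, 1, 1, 0))     # -> 0
```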

There are seven inputs for the combinational logic in the XC3000 CLB: the five CLB
inputs (A–E), and the flip-flop outputs (QX and QY). There are two outputs from the LUT (F
and G). Since a 32-bit LUT requires only five variables to form a unique address (32 = 2⁵),
there are several ways to use the LUT:
 You can use five of the seven possible inputs (A–E, QX, QY) with the entire 32-bit
LUT. The CLB outputs (F and G) are then identical.

 You can split the 32-bit LUT in half to implement two functions of four variables
each. You can choose four input variables from the seven inputs (A–E, QX, QY). You
have to choose two of the inputs from the five CLB inputs (A–E); then one function
output connects to F and the other output connects to G.
 You can split the 32-bit LUT in half, using one of the seven input variables as a select
input to a 2:1 MUX that switches between F and G. This allows you to implement
some functions of six and seven variables.

XC 4000 Logic Block

Figure 2.24 shows the CLB used in the XC4000 series of Xilinx FPGAs. This is a fairly
complicated basic logic cell containing two four-input LUTs that feed a three-input LUT. The
XC4000 CLB also has special fast carry logic hardwired between CLBs. MUX control logic
maps four control inputs (C1–C4) into the four inputs: LUT input H1, direct in (DIN), enable
clock (EC), and a set / reset control (S/R) for the flip-flops. The control inputs (C1–C4) can
also be used to control the use of the F' and G' LUTs as 32 bits of SRAM.

Figure 2.24 The Xilinx XC4000 family CLB (configurable logic block). ( Source: Xilinx.)

XC 5200 Logic Block

Figure 2.25 shows the basic logic cell, a Logic Cell or LC, used in the XC5200 family of Xilinx
LCA FPGAs. The LC is similar to the CLBs in the XC2000/3000/4000 CLBs, but simpler.
Xilinx retained the term CLB in the XC5200 to mean a group of four LCs (LC0–LC3).


The XC5200 LC contains a four-input LUT, a flip-flop, and MUXes to handle signal
switching. The arithmetic carry logic is separate from the LUTs. A limited capability to
cascade functions is provided (using the MUX labeled F5_MUX in logic cells LC0 and LC2 in
Figure 2.25 ) to gang two LCs in parallel to provide the equivalent of a five-input LUT.

Figure 2.25 The Xilinx XC5200 family LC (Logic Cell) and CLB (configurable logic block).
(Source: Xilinx.)

ALTERA MAX
Suppose we have a simple two-level logic circuit that implements a sum of products
as shown in Figure 2.26 (a). We may redraw any two-level circuit using a regular structure
( Figure 2.26 b): a vector of buffers, followed by a vector of AND gates (which construct the
product terms) that feed OR gates (which form the sums of the product terms). We can
simplify this representation still further ( Figure 2.26 c), by drawing the input lines to a
multiple-input AND gate as if they were one horizontal wire, which we call a product-term
line . A structure such as Figure 2.26 (c) is called programmable array logic , first
introduced by Monolithic Memories as the PAL series of devices.

Because the arrangement of Figure 2.26 (c) is very similar to a ROM, we sometimes
call a horizontal product-term line, which would be the bit output from a ROM, the bit line .
The vertical input line is the word line. Figure 2.26 (d) and (e) show how to build the
programmable-AND array (or product-term array) from EPROM transistors. The
horizontal product term lines connect to the vertical input lines using the EPROM
transistors as pull-downs at each possible connection. Applying a '1' to the gate of an
unprogrammed EPROM transistor pulls the product-term line low to a '0'. A programmed
n-channel transistor has a threshold voltage higher than VDD and is therefore always off.
Thus a programmed transistor has no effect on the product-term line.

Figure 2.26 Logic arrays. (a) Two-level logic. (b) Organized sum of products. (c) A
programmable-AND plane. (d) EPROM logic array. (e) Wired logic.

Notice that connecting the n-channel EPROM transistors to a pull-up resistor as
shown in Figure 2.26 (e) produces a wired-logic function—the output is high only if all of
the outputs are high, resulting in a wired-AND function of the outputs. The product-term
line is low when any of the inputs are high. Thus, to convert the wired-logic array into a
programmable-AND array, we need to invert the sense of the inputs. We often conveniently
omit these details when we draw the schematics of logic arrays, usually implemented as
NOR–NOR arrays (so we need to invert the outputs as well). They are not minor details
when you implement the layout, however.

Figure 2.27 shows how a programmable-AND array can be combined with other
logic into a macrocell that contains a flip-flop. For example, the widely used 22V10 PLD,
also called a registered PAL, essentially contains 10 of the macrocells shown in Figure 2.27.

The part number, 22V10, denotes that there are 22 inputs (44 vertical input lines for both
true and complement forms of the inputs) to the programmable AND array and 10
macrocells. The PLD or registered PAL shown in Figure 2.27 has a 2i × jk programmable-
AND array.
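The programmable-AND / fixed-OR structure can be modelled in a few lines. The sketch
below (not from the original notes; the data layout is a hypothetical choice, with each
product term listing the literals it is programmed to include) evaluates a small sum of
products the way a PAL plane would:

```python
def and_plane(product_terms, inputs):
    """Evaluate each programmed product term; a literal is (input_index, complemented)."""
    rows = []
    for term in product_terms:
        value = 1
        for index, complemented in term:
            bit = inputs[index]
            value &= (1 - bit) if complemented else bit
        rows.append(value)
    return rows

def or_plane(rows):
    """Fixed OR of the product-term lines feeding one macrocell."""
    result = 0
    for row in rows:
        result |= row
    return result

# F = A.B' + A'.C programmed as two product terms over inputs (A, B, C).
terms = [[(0, False), (1, True)],    # A . B'
         [(0, True), (2, False)]]    # A'. C
for A in (0, 1):
    for B in (0, 1):
        for C in (0, 1):
            f = or_plane(and_plane(terms, [A, B, C]))
            assert f == ((A & (1 - B)) | ((1 - A) & C))
print("AND/OR plane output matches the programmed sum of products")
```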

Figure 2.27 A registered PAL with i inputs, j product terms, and k macrocells.

Logic Expanders

The basic logic cell for the Altera MAX architecture, a macrocell, is a descendant of
the PAL. Using the logic expander , shown in Figure 2.28 to generate extra logic terms, it is
possible to implement functions that require more product terms than are available in a
simple PAL macrocell. As an example, consider the following function:

F = A′ · C · D + B′ · C · D + A · B + B · C′
Eq (2.19)

This function has four product terms and thus we cannot implement F using a
macrocell that has only a three-wide OR array (such as the one shown in Figure 2.28 ). If we
rewrite F as a “sum of (products of products)” like this:

F = (A′ + B′) · C · D + (A + C′) · B


= (A · B)′ · (C · D) + (A′ · C)′ · B
Eq (2.20)
we can use logic expanders to form the expander terms (A · B)' and (A' · C)' (see
Figure 2.28 ). We can even share these extra product terms with other macrocells if we
need to. We call the extra logic gates that form these shareable product terms a shared
logic expander , or just shared expander.
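The rewrite above is just De Morgan's theorem applied twice; the short Python check below (our own sketch, exhaustive over all 16 input combinations) confirms that the expander form of Eq. 2.20 matches the original four-product-term form of Eq. 2.19.

# Exhaustive equivalence check of Eq. 2.19 against the expander form of Eq. 2.20.
from itertools import product

def f_original(a, b, c, d):       # Eq. 2.19: four product terms
    return ((not a) and c and d) or ((not b) and c and d) or (a and b) or (b and not c)

def f_expanded(a, b, c, d):       # Eq. 2.20: two product terms plus two expander terms
    e1 = not (a and b)            # shared expander term (A.B)'
    e2 = not ((not a) and c)      # shared expander term (A'.C)'
    return (e1 and c and d) or (e2 and b)

assert all(f_original(*v) == f_expanded(*v) for v in product([False, True], repeat=4))
print("Eq. 2.19 and Eq. 2.20 agree for all 16 input combinations")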

Figure 2.28 Expander logic and programmable inversion. An expander increases the
number of product terms available and programmable inversion allows you to reduce the
number of product terms you need.

The disadvantage of the shared expanders is the extra logic delay incurred because
of the second pass that you need to take through the product-term array. We usually do not
know before the logic tools assign logic to macrocells ( logic assignment ) whether we
need to use the logic expanders. Since we cannot predict the exact timing in advance, the
Altera MAX architecture is not strictly deterministic. However, once we do know whether a signal has
to go through the array once or twice, we can simply and accurately predict the delay. This
is a very important and useful feature of the Altera MAX architecture.

The expander terms are sometimes called helper terms when you use a PAL. If you
use helper terms in a 22V10, for example, you have to go out to the chip I/O pad and then
back into the programmable array again, using two-pass logic.

Figure 2.29 Use of programmed inversion to simplify logic: (a) The function F = A · B' + A ·
C' + A · D' + A' · C · D requires four product terms (P1–P4) to implement while (b) the
complement, F ' = A · B · C · D + A' · D' + A' · C' requires only three product terms (P1–P3).

Another common feature of complex PLDs, shown in Figure 2.28, is programmable output
inversion. Programming one input of the XOR gate at the macrocell output allows you to
choose whether or not to invert the output (a '1' selects inversion, a '0' leaves the output unchanged).
This programmable inversion can reduce the required number of product terms by using
a de Morgan equivalent representation instead of a conventional sum-of-products form, as
shown in Figure 2.29.

As an example of using programmable inversion, consider the function

F = A · B′ + A · C′ + A · D′ + A′ · C · D
Eq (2.21)
which requires four product terms—one too many for a three-wide OR array.

If we generate the complement of F instead,

F ′ = A · B · C · D + A′ · D′ + A′ · C′
Eq (2.22)

this has only three product terms. To create F we invert F ', using programmable inversion.
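Again this is only De Morgan's theorem at work; the following Python sketch (ours, exhaustive over all 16 input combinations) verifies that inverting the three-product-term complement recovers the original four-product-term function F.

# Check that the programmed inversion of the complement gives the same F.
from itertools import product

def f_direct(a, b, c, d):          # four product terms: A.B' + A.C' + A.D' + A'.C.D
    return (a and not b) or (a and not c) or (a and not d) or ((not a) and c and d)

def f_via_inversion(a, b, c, d):   # three product terms, then the programmed inversion
    f_complement = (a and b and c and d) or ((not a) and not d) or ((not a) and not c)
    return not f_complement

assert all(f_direct(*v) == f_via_inversion(*v) for v in product([False, True], repeat=4))
print("Programmed inversion implements the same F with one fewer product term")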

Figure 2.30 The Altera MAX architecture. (a) Organization of logic and interconnect. (b) A
MAX family LAB (Logic Array Block). (c) A MAX family macrocell. The macrocell details
vary between the MAX families—the functions shown here are closest to those of the MAX
9000 family macrocells.

Figure 2.30 shows an Altera MAX macrocell and illustrates the architectures of several
different product families. The implementation details vary among the families, but the
basic features (a wide programmable-AND array, a narrow fixed-OR array, logic expanders,
and programmable inversion) are very similar. Each family has the following individual
characteristics:
 A typical MAX 5000 chip has: 8 dedicated inputs (with both true and complement
forms); 24 inputs from the chipwide interconnect (true and complement); and
either 32 or 64 shared expander terms (single polarity). The MAX 5000 LAB looks
like a 32V16 PLD (ignoring the expander terms).
 The MAX 7000 LAB has 36 inputs from the chipwide interconnect and 16 shared
expander terms; the MAX 7000 LAB looks like a 36V16 PLD.
 The MAX 9000 LAB has 33 inputs from the chipwide interconnect and 16 local
feedback inputs (as well as 16 shared expander terms); the MAX 9000 LAB looks
like a 49V16 PLD.

PROGRAMMABLE ASIC I/O CELLS


All programmable ASICs contain some type of input/output cell (I/O cell). These
I/O cells handle driving logic signals off-chip, receiving and conditioning external inputs,
and such things as electrostatic protection. This section explains the different types of I/O
cells that are used in programmable ASICs and their functions.

The following are the different types of I/O requirements:

 DC output. Driving a resistive load at DC or low frequency (less than 1 MHz).
Example loads are light-emitting diodes (LEDs), relays, small motors, and so on. Can
we supply an output signal with enough voltage, current, power, or energy? (A rough
feasibility check is sketched after this list.)
 AC output. Driving a capacitive load with a high-speed (greater than 1 MHz) logic
signal off-chip. Example loads are other logic chips, a data or address bus, ribbon
cable. Can we supply a valid signal fast enough?
 DC input. Example sources are a switch, sensor, or another logic chip. Can we
correctly interpret the digital value of the input?
 AC input. Example sources are high-speed logic signals (higher than 1 MHz) from
another chip. Can we correctly interpret the input quickly enough?
 Clock input. Examples are system clocks or signals on a synchronous bus. Can we
transfer the timing information from the input to the appropriate places on the chip
correctly and quickly enough?
 Power input. We need to supply power to the I/O cells and the logic in the core,
without introducing voltage drops or noise. We may also need a separate power
supply to program the chip.
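As promised in the DC-output item above, here is a rough feasibility check in Python for driving an LED from an I/O cell; all of the voltage, current, and resistor values are assumptions chosen for illustration, not figures from any data sheet.

# Can an I/O cell drive an LED through a series resistor? (hypothetical values)
VOH_MIN  = 3.0    # worst-case output-high voltage of the I/O cell, volts (assumed)
VF_LED   = 1.8    # LED forward voltage drop, volts (assumed)
R_SERIES = 220.0  # external series resistor, ohms (assumed)
IOH_MAX  = 0.008  # maximum current the I/O cell may source, amps (assumed 8 mA)

i_led = (VOH_MIN - VF_LED) / R_SERIES        # Ohm's law across the series resistor
ok = i_led <= IOH_MAX
print(f"LED current = {i_led * 1e3:.1f} mA ->",
      "within the assumed drive capability" if ok else "exceeds the assumed drive capability")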

These issues are common to all FPGAs (and all ICs), so the design of FPGA I/O cells
is driven by the I/O requirements as well as by the programming technology.

PROGRAMMABLE ASIC INTERCONNECTS


All FPGAs contain some type of programmable interconnect. The structure and
complexity of the interconnect is largely determined by the programming technology and
the architecture of the basic logic cell. The raw material that we have to work with in
building the interconnect is aluminum-based metallization. The first programmable ASICs
were constructed using two layers of metal; newer programmable ASICs use three or more
layers of metal interconnect.

ACTEL ACT
The Actel ACT family interconnect scheme shown in Figure 2.31 is similar to a channeled
gate array. The channel routing uses dedicated rectangular areas of fixed size within the
chip called wiring channels (or just channels). The horizontal channels run across the
chip, and similar vertical channels run over the top of the basic logic cells, the Logic
Modules. Within the horizontal or vertical channels, wires run horizontally or vertically,
respectively, in tracks. Each track holds one wire. The capacity of a fixed wiring channel is equal to the
number of tracks it contains. Figure 2.32 shows a detailed view of the channel and the
connections to each Logic Module—the input stubs and output stubs.

Figure 2.31 The interconnect architecture used in an Actel ACT family FPGA. ( Source:
Actel.)

Figure 2.32 ACT 1 horizontal and vertical channel architecture. (Source: Actel.)

In a channeled gate array the designer decides the location and length of the
interconnect within a channel. In an FPGA the interconnect is fixed at the time of
manufacture. To allow programming of the interconnect, Actel divides the fixed
interconnect wires within each channel into various lengths or wire segments. We call this
segmented channel routing, a variation on channel routing. Antifuses join the wire
segments. The designer then programs the interconnections by blowing antifuses and
making connections between wire segments; unwanted connections are left
unprogrammed. A statistical analysis of many different layouts determines the optimum
number and the lengths of the wire segments.

The ACT 1 interconnection architecture uses 22 horizontal tracks per channel for
signal routing with three tracks dedicated to VDD, GND, and the global clock (GCLK),
making a total of 25 tracks per channel. Horizontal segments vary in length from four
columns of Logic Modules to the entire row of modules (Actel calls these long segments
long lines ).

Four Logic Module inputs are available to the channel below the Logic Module and
four inputs to the channel above the Logic Module. Thus eight vertical tracks per Logic
Module are available for inputs (four from the Logic Module above the channel and four
from the Logic Module below). These connections are the input stubs.

The single Logic Module output connects to a vertical track that extends across the
two channels above the module and across the two channels below the module. This is the
output stub. Thus module outputs use four vertical tracks per module (counting two tracks
from the modules below, and two tracks from the modules above each channel). One
vertical track per column is a long vertical track ( LVT ) that spans the entire height of the
chip (the 1020 contains some segmented LVTs). There are thus a total of 13 vertical tracks
per column in the ACT 1 architecture (eight for inputs, four for outputs, and one for an
LVT).
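The track counts above are simple bookkeeping; the small Python sketch below just re-derives the 25 horizontal tracks per channel and 13 vertical tracks per column from the figures quoted in the text.

# Horizontal tracks per ACT 1 channel
signal_tracks = 22          # tracks available for signal routing
dedicated     = 3           # VDD, GND, and the global clock GCLK
print("horizontal tracks per channel:", signal_tracks + dedicated)   # -> 25

# Vertical tracks per column of Logic Modules
input_stubs   = 4 + 4       # four inputs to the channel above, four to the channel below
output_stubs  = 2 + 2       # the output stub crosses two channels above and two below
long_vertical = 1           # one long vertical track (LVT) per column
print("vertical tracks per column:", input_stubs + output_stubs + long_vertical)  # -> 13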

XILINX LCA
Figure 2.33 shows the hierarchical Xilinx LCA interconnect architecture.

 The vertical lines and horizontal lines run between CLBs.


 The general-purpose interconnect joins switch boxes (also known as magic boxes or
switching matrices).
 The long lines run across the entire chip. It is possible to form internal buses using
long lines and the three-state buffers that are next to each CLB.
 The direct connections (not used on the XC4000) bypass the switch matrices and
directly connect adjacent CLBs.
 The Programmable Interconnection Points (PIPs) are programmable pass
transistors that connect the CLB inputs and outputs to the routing network.
 The bidirectional (BIDI) interconnect buffers restore the logic level and logic
strength on long interconnect paths.

Figure 2.33 Xilinx LCA interconnect. (a) The LCA architecture (notice the matrix element
size is larger than a CLB). (b) A simplified representation of the interconnect resources.
Each of the lines is a bus.

ALTERA MAX 5000 & 7000


Altera MAX 5000 devices (except the EPM5032, which has only one LAB) and all
MAX 7000 devices use a Programmable Interconnect Array (PIA), shown in Figure 2.34.
The PIA is a cross-point switch for logic signals traveling between LABs. The advantage of
this architecture (which uses a fixed number of connections) over programmable
interconnection schemes (which use a variable number of connections) is the fixed routing
delay. An additional benefit of such a large, regular interconnect structure is simpler and
faster placement-and-routing software.

Figure 2.34 A simplified block diagram of the Altera MAX interconnect scheme. (a) The PIA
(Programmable Interconnect Array) is deterministic—delay is independent of the path
length. (b) Each LAB (Logic Array Block) contains a programmable AND array. (c)
Interconnect timing within a LAB is also fixed.

Figure 2.34 (a) illustrates that the delay between any two LABs, tPIA, is fixed. The delay
between LAB1 and LAB2 (which are adjacent) is the same as the delay between LAB1 and
LAB6 (on opposite corners of the die). It may seem rather strange to slow down all
connections to the speed of the longest possible connection—a large penalty to pay to
achieve a deterministic architecture. However, it gives Altera the opportunity to highly
optimize all of the connections since they are completely fixed.
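The point about determinism can be illustrated with a toy timing model in Python; the delay numbers and the per-segment model for the competing architecture are invented purely for contrast, not taken from any data sheet.

T_PIA = 4.0   # ns, fixed LAB-to-LAB delay through the PIA (assumed value)

def pia_delay(src_lab, dst_lab):
    # Crossbar (PIA) style: the delay does not depend on where the LABs sit.
    return T_PIA

def segmented_delay(src_lab, dst_lab, per_segment=1.5):
    # Segmented style: delay grows with the number of segments crossed (toy model).
    return per_segment * abs(src_lab - dst_lab)

print(pia_delay(1, 2), pia_delay(1, 6))               # -> 4.0 4.0 (deterministic)
print(segmented_delay(1, 2), segmented_delay(1, 6))   # -> 1.5 7.5 (path dependent)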

PROGRAMMABLE ASIC DESIGN SOFTWARE


There are five components of a programmable ASIC or FPGA: (1) the programming
technology, (2) the basic logic cell, (3) the I/O cell, (4) the interconnect, and (5) the design
software that allows you to program the ASIC. The design software is much more closely
tied to the FPGA architecture than is the case for other types of ASICs.

The sequence of steps for FPGA design is similar to the sequence discussed in “ASIC
Design Flow.” As for any ASIC, a designer needs design-entry software, a cell library, and
physical-design software. Each of the FPGA vendors sells design kits that include all the
software and hardware that a designer needs. Many of these kits use design-entry software
produced by a different company; designers often buy that software from the FPGA vendor.
This is called an original equipment manufacturer (OEM) arrangement—similar to
buying a car with a stereo manufactured by an electronics company but labeled with the
automobile company’s name. Design entry uses cell libraries that are unique to each FPGA
vendor. All of the FPGA vendors produce their own physical-design software so they can
tune the algorithms to their own architecture.

Unfortunately, there are no standards in FPGA design. Thus, for example, Xilinx calls
its 2:1 MUX an M2_1, with inputs labeled D0, D1, and S0 and output O. Actel calls a 2:1
MUX an MX2, with inputs A, B, and S and output Y. This problem is not peculiar to Xilinx
and Actel; each ASIC vendor names its logic cells, buffers, pads, and so on in a different
manner. Consequently, designers may not be able to transfer a netlist from one ASIC
vendor's library to another. Worse than this, designers may not even be able to transfer a
design between two FPGA families made by the same FPGA vendor!

One solution to the lack of standards for cell libraries is to use a generic cell library,
independent from any particular FPGA vendor. For example, most of the FPGA libraries
include symbols that are equivalent to TTL 7400 logic series parts. The FPGA vendor’s own
software automatically handles the conversion from schematic symbols to the logic cells of
the FPGA.
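One way to picture the generic-library idea is a simple name-mapping table; the Python sketch below uses the Xilinx and Actel MUX names quoted above, but the mapping function, the generic pin names, and the netlist syntax it prints are our own illustration, not any vendor's format.

# Map a generic 2:1 MUX instance onto vendor-specific cell and pin names.
VENDOR_MAP = {
    "xilinx": {"cell": "M2_1", "pins": {"IN0": "D0", "IN1": "D1", "SEL": "S0", "OUT": "O"}},
    "actel":  {"cell": "MX2",  "pins": {"IN0": "A",  "IN1": "B",  "SEL": "S",  "OUT": "Y"}},
}

def map_mux2(vendor, connections):
    """connections: generic pin name -> net name. Returns one netlist-style line."""
    entry = VENDOR_MAP[vendor]
    pins = ", ".join(f".{entry['pins'][pin]}({net})" for pin, net in connections.items())
    return f"{entry['cell']} u1 ({pins});"

nets = {"IN0": "a", "IN1": "b", "SEL": "sel", "OUT": "y"}
print(map_mux2("xilinx", nets))   # M2_1 u1 (.D0(a), .D1(b), .S0(sel), .O(y));
print(map_mux2("actel", nets))    # MX2 u1 (.A(a), .B(b), .S(sel), .Y(y));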

Schematic entry is not the only method of design entry for FPGAs. Some designers
are happier describing control logic and state machines in terms of state diagrams and logic
equations. A solution to some of the problems with schematic entry for FPGA design is to
use one of several hardware description languages ( HDL s) for which there are some
standards. There are two sets of languages in common use. One set has evolved from the
design of programmable logic devices (PLDs). The ABEL (pronounced “able”), CUPL
(“cupple”), and PALASM (“pal-azzam”) languages are simple and easy to learn. These
languages are useful for describing state machines and combinational logic. The other set of
HDLs includes VHDL and Verilog, which are higher-level and are more complex but are
capable of describing complete ASICs and systems.

After completing design entry and generating a netlist, the next step is simulation.
Two types of simulators are normally used for FPGA design. The first is a logic simulator for
behavioral, functional, and timing simulation. This tool can catch any design errors. The
designer provides input waveforms to the simulator and checks to see that the outputs are
as expected. At this point, with a nondeterministic architecture, logic-path delays are only
estimates, since the wiring delays will not be known until after physical design (place-and-
route) is complete. Designers then add, or back-annotate, the post-layout timing information
to the post-layout netlist (also called a back-annotated netlist). This is followed by a
post-layout timing simulation.

The second type of simulator, the type most often used in FPGA design, is a timing-
analysis tool. A timing analyzer is a static simulator and removes the need for input
waveforms. Instead the timing analyzer checks for critical paths that limit the speed of
operation—signal paths that have large delays caused, say, by a high fanout net. Designers
can set a certain delay restriction on a net or path as a timing constraint; if the actual delay
is longer, this is a timing violation. In most design systems we can return to design entry
and tag critical paths with attributes before completing the place-and-route step again. The
next time we use the place-and-route software it will pay special attention to those signals
we have labeled as critical in order to minimize the routing delays associated with those
signals. The problem is that this iterative process can be lengthy and sometimes
nonconvergent. Each time timing violations are fixed, others appear. This is especially a
problem with place-and-route software that uses random algorithms (and forms a chaotic
system). More complex (and expensive) logic synthesizers can automate this iterative stage
of the design process. The critical path information is calculated in the logic synthesizer,
and timing constraints are created in a feedforward path (this is called forward-annotation)
to direct the place-and-route software.
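In the same spirit as the timing analyzer described above, the Python sketch below finds the longest path in a small delay-annotated graph and compares it with a constraint; the net names, delays, and constraint are all hypothetical.

# Minimal static timing check: longest combinational path versus a constraint.
def longest_delay(graph, node, cache=None):
    """graph: node -> list of (successor, delay in ns). Worst-case delay from node."""
    if cache is None:
        cache = {}
    if node not in cache:
        successors = graph.get(node, [])
        cache[node] = 0.0 if not successors else max(
            delay + longest_delay(graph, nxt, cache) for nxt, delay in successors)
    return cache[node]

netlist = {                         # hypothetical post-layout delays
    "in": [("g1", 1.2), ("g2", 0.8)],
    "g1": [("g3", 2.0)],
    "g2": [("g3", 3.5)],            # a high-fanout net with a large delay
    "g3": [("out", 0.6)],
    "out": [],
}

CONSTRAINT_NS = 4.5                 # timing constraint set by the designer (assumed)
worst = longest_delay(netlist, "in")
print(f"critical path = {worst:.1f} ns ->",
      "meets the constraint" if worst <= CONSTRAINT_NS else "timing violation")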

Although some FPGAs are reprogrammable, it is not a good idea to rely on this fact.
It is very tempting to program the FPGA, test it, make changes to the netlist, and then keep
programming the device until it works. This process is much more time-consuming and
much less reliable than performing thorough simulation. It is quite possible, for example, to
get a chip working in an experimental fashion without really knowing why. The danger
here is that the design may fail under some other set of operating conditions or
circumstances. Simulation is the proper way to catch and correct these potential disasters.

REFERENCES

 J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits: A Design Perspective, 2nd ed., Upper Saddle River, NJ: Prentice Hall, 2003.
 M. J. S. Smith, Application-Specific Integrated Circuits, Reading, MA: Addison-Wesley, 1997.
 N. H. E. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, 3rd ed., Pearson Education.
 [Elmore48] W. Elmore, “The transient response of damped linear networks with particular regard to wideband amplifiers,” J. Applied Physics, vol. 19, no. 1, Jan. 1948, pp. 55–63.
 [Mead80] C. Mead and L. Conway, Introduction to VLSI Systems, Reading, MA: Addison-Wesley, 1980.
 [Sutherland99] I. Sutherland, B. Sproull, and D. Harris, Logical Effort: Designing Fast CMOS Circuits, San Francisco, CA: Morgan Kaufmann, 1999.
 [Wakerly00] J. Wakerly, Digital Design Principles and Practices, 3rd ed., Upper Saddle River, NJ: Prentice Hall, 2000.
