
Principles of ASIC

Design
By,

PREMANANDA B.S.
premanandabs@gmail.com
Modules
• ASICs & FPGAs
• CMOS Logic
• ASIC Library Design
• Programmable ASIC
• Programmable ASIC Logic Cells
• Programmable ASIC I/O Cells
• Programmable ASIC Interconnect
Cont.,
• Layout
• Layout Compaction
• Logic Synthesis
• Simulation
• ASIC Construction
• System Partitioning
• Floor planning
• Placement
Cont.,
• Global Routing
• Detailed Routing
• Circuit Extraction and DRC
• Clock
• Deep-submicron Effects & Signal Integrity
• Testing & Verification
• EDA Tools in ASIC Flow
• The Seven Key Challenges of ASIC
Design
Module Objectives
• The course enables the student to understand the
complete ASIC (and FPGA) design flow.
• Different FPGA architectures are considered.
• Building on a basic knowledge of the VLSI design flow,
the module is designed to enable the student to
understand mainly the physical design aspects of ASIC
design.
• Particular emphasis is placed on understanding the various
algorithms used in physical design.
• An insight is also given into emerging trends such as
deep-submicron (DSM) effects.
• As a case study, the complete ASIC flow for an application
will be covered.
Agenda
• Introduction to ASICs
• Types of Design Styles
• Structured ASICs
• VLSI/ASIC design methodology
• ASIC & FPGA design flow
• ASIC Cell Libraries
• FPGA Vs ASIC
• FPGA Vs ASIC design flow comparison
Introduction to ASICs
• An application-specific integrated circuit (ASIC) is an
integrated circuit (IC) built for one particular use.
• An ASIC (“a-sick”), as the name implies, is a silicon chip
hardwired to meet the specific application needs of one
electronic product or range of products.
• An application could be a microprocessor, cell phone,
modem, router, etc.
• Each ASIC has its own architecture and needs
to support its own protocol requirements.
• Today’s ASIC may hold a complete system on a single chip,
often called a System on a Chip (SoC).
• The cost functions for an ASIC chip are typically the
"Area, Timing and Power" targets.
History of Integration
• Small-scale integration (SSI, ~10 gates per chip, ’60s)
• Medium-scale integration (MSI, ~100–1,000 gates per chip, ’70s)
• Large-scale integration (LSI, ~1,000–10,000 gates per chip, ’80s)
• Very large-scale integration (VLSI, ~10,000–100,000 gates per chip, ’90s)
• Ultra large-scale integration (ULSI, ~1M–10M gates per chip)
History of Technology
• Bipolar technology and transistor–transistor logic (TTL)
preceded metal-oxide-silicon (MOS) technology because
it was difficult to make metal-gate n-channel transistors.
• MOS (nMOS or NMOS) technology required fewer
masking steps, was denser, and consumed less power
than equivalent bipolar ICs.
• In the early 1980s the aluminum gates of the transistors
were replaced by polysilicon gates.
• The introduction of complementary MOS (CMOS) greatly
reduced power.
• The feature size is the smallest shape you can make on
a chip and is measured in microns (µm) or lambda (λ).
Origin of ASICs:
• The standard parts, initially used to design micro-
electronic systems, were gradually replaced with a
combination of glue logic, custom ICs, dynamic random
access memory (DRAM) and static RAM (SRAM).

History of ASICs:
• The IEEE Custom Integrated Circuits Conference (CICC)
and IEEE International ASIC Conference document the
development of ASICs.

• Application-specific standard products (ASSPs) are
a cross between standard parts and ASICs.
Examples of ASICs
• Examples of ICs that are not ASICs include:
– Standard parts such as memory chips (ROMs, RAMs)
– Microprocessors
– TTL-equivalent ICs at different levels of integration
• Examples of ICs that are ASICs include:
– A chip for a toy that talks
– A chip for a satellite
– A chip designed to handle the interface between
memory and a microprocessor for a workstation CPU
– A chip containing a microprocessor as a cell together
with other logic
The Goal of ASIC Designer
 Meet the market requirement:
• Satisfying the customer need
• Beating the competition
• Increasing the functionality
• Reducing the cost
 Achieved by:
• Using next-generation silicon technologies
• New design concepts and tools
• High-level integration
Types of ASICs
Different types of ASICs are:
• Full-custom ASICs
• Semi-custom ASICs
– Standard-cell–based ASICs (CBIC)
– Gate-array–based ASICs
• Channeled gate arrays
• Channelless gate arrays
• Structured gate arrays
• Programmable ASICs
– Programmable Logic Devices (PLD)
– Field Programmable Gate Array (FPGA)
Full-Custom ASICs
• All mask layers are customized in a full-custom ASIC.
• It only makes sense to design a full-custom IC if there
are no libraries available.
• In a full-custom ASIC an engineer designs some or all of
the logic cells, circuits or layout specifically for one ASIC.
• Bipolar technology has historically been more widely
used for full-custom analog design because of its
improved precision.
• CMOS technology is increasingly used for analog
functions because:
– CMOS is the most widely available IC technology.
– Increased levels of integration require mixing analog and digital
functions on the same IC.
• Microprocessors were exclusively full-custom, but
designers are increasingly turning to semi-custom ASIC
techniques in this area too.
• Other examples of full-custom ICs or ASICs are
requirements for high-voltage (automobile),
analog/digital (communications), or sensors and
actuators.
• Full-custom offers the highest performance and lowest
part cost (smallest die size).
• The disadvantages are increased design time, complexity,
design expense, and highest risk.
• Full-custom ICs are most expensive to manufacture and
to design.
• The manufacturing lead time is typically eight weeks.
Semi-Custom ASICs
• In a semi-custom ASIC the standard logic
cells are pre-designed and only some of the mask
layers are customized.
• Using pre-designed cells from a cell library
makes design easier for the designers.
• Types of semi-custom ASICs are:
– Standard-cell–based ASICs
– Gate-array–based ASICs
Standard-Cell–Based ASICs
• A cell-based ASIC (CBIC—“sea-bick”) uses pre-
designed logic cells known as standard cells.
• These cells may also be known as megacells, megafunctions,
full-custom blocks, system-level macros (SLMs), fixed blocks,
cores, or functional standard blocks (FSBs).
• The advantage of CBICs is that designers save time and
money and reduce risk by using a pre-designed, pre-tested
and pre-characterized standard-cell library.
• The disadvantages are the time or expense of designing
or buying the standard-cell library and the time needed
to fabricate all layers of the ASIC for each new design.
• The structure of a CBIC is shown in the figure.
• The important features of
CBIC are:
– All mask layers are
customized—transistors and
interconnect
– Custom blocks can be
embedded
– Manufacturing lead time is
about eight weeks

In datapath (DP) logic we may use a datapath compiler
and a datapath library. Cells such as ALUs are pitch-
matched to each other to improve timing and density.
Gate-Array–Based ASICs
• In a gate array (GA) based ASIC the transistors
are predefined on the silicon wafer.
• The predefined pattern of transistors on a GA is
the base array.
• The smallest element that is replicated to make
the base array is the base (primitive) cell.
• Only the top few layers of metal, which define the
interconnect between transistors, are defined by
the designer using custom masks; hence it is often
called a masked gate array (MGA).
• The logic cells in a gate array library are often
called macros.
• Gate arrays are sometimes referred to as prediffused
arrays.
• Using wafers prefabricated up to the metallization
steps reduces the time needed to make an MGA:
the turnaround time is a few days to a couple of
weeks.
• There are three types of MGAs:
– Channeled gate arrays
– Channelless gate arrays
– Structured gate arrays
Channeled Gate Array
• In a channeled gate array we leave space
between the rows of transistors for wiring.
• The space for interconnect between rows of
cells is fixed in height.
• The important features of channeled gate array
are:
– Only the interconnect is customized
– The interconnect uses predefined spaces between
rows of base cells
– Manufacturing lead time is between two days and two
weeks
Channeled Gate Array
Channelless Gate Array
• A channelless gate array (channel-free gate
array, sea-of-gates array or SOG array) is as
shown.
• The important features of a channelless gate array are:
– Only some (the top few) mask layers are customized (the
interconnect)
– Manufacturing lead time is between two days and two weeks
• In a channelless gate array there are no predefined areas
set aside for routing between cells.
• When we use an area of transistors for routing, we do
not make any contacts to the devices lying underneath.
• The amount of logic that can be implemented in given
silicon area is higher for channelless GAs than for
channeled GAs.
• Customizing the contact layer in a channelless GA
allows us to increase the density of gate-array cells.
Structured Gate Array
• A structured gate array can be either channeled or
channelless, but it includes or embeds a custom block.
• An embedded gate array or structured gate array
(masterslice or masterimage) combines some of the
features of CBICs and MGAs.
• An embedded gate array is as shown in the figure.
• The important features of structured GA are:
– Only the interconnect is customized
– Custom blocks (the same for each design) can be
embedded
– Manufacturing lead time is between two days and two
weeks

• An embedded gate array gives the improved
area efficiency and increased performance of a
CBIC but with the lower cost and faster
turnaround of an MGA.
• The disadvantage is that the embedded function is
fixed; this makes implementation of memory
difficult and inefficient.
Programmable Logic Devices
• Examples and types of PLDs are:
– Read-only memory (ROM)
– Programmable ROM or PROM
– Electrically programmable ROM, or EPROM
– Erasable PLD (EPLD)
– Electrically erasable PROM, or EEPROM
– UV-erasable PROM, or UVPROM
– Mask-programmable ROM
– A mask-programmed PLD usually uses bipolar technology
– Logic arrays may be either a Programmable Array Logic (PAL)
or a programmable logic array (PLA); both have an AND plane
and an OR plane.
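The AND-plane/OR-plane structure mentioned above can be sketched behaviorally. This is a minimal illustration, not a model of any particular device; the function `pla_eval` and its term encoding are invented for the sketch:

```python
# Sketch of a PLA: each product term ANDs a chosen subset of (possibly
# inverted) inputs (the AND plane); each output ORs a subset of product
# terms (the OR plane).

def pla_eval(inputs, and_plane, or_plane):
    """inputs: dict name -> 0/1.
    and_plane: list of product terms, each a list of (name, polarity).
    or_plane: list of outputs, each a list of product-term indices."""
    terms = []
    for term in and_plane:
        val = 1
        for name, pol in term:
            val &= inputs[name] if pol else 1 - inputs[name]
        terms.append(val)
    return [max((terms[i] for i in out), default=0) for out in or_plane]

# XOR of a, b as a sum of products: a·b' + a'·b
and_plane = [[("a", 1), ("b", 0)], [("a", 0), ("b", 1)]]
or_plane = [[0, 1]]
print(pla_eval({"a": 1, "b": 0}, and_plane, or_plane))  # [1]
print(pla_eval({"a": 1, "b": 1}, and_plane, or_plane))  # [0]
```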
• A PLD is as shown in the figure.

• The following are the important features that all PLDs
have in common:
– No customized mask layers or logic cells
– Fast design turnaround
– A single large block of programmable interconnect
– A matrix of logic macrocells that usually consist of
programmable array logic followed by a flip-flop or latch
Field-Programmable Gate Arrays
• A field-programmable gate array (FPGA) is also known
as a complex PLD.
• The essential characteristics of an FPGA are:
– None of the mask layers are customized
– A method for programming the basic logic cells and the
interconnect
– The core is a regular array of programmable basic logic cells
that can implement combinational as well as sequential logic
(flip-flops)
– A matrix of programmable interconnect surrounds the basic
logic cells
– Programmable I/O cells surround the core
– Design turnaround is a few hours
An FPGA die is shown in the figure.
Structured ASICs
• Structured ASICs have the bottom metal layers fixed and
only the top layers can be designed by the customer.
• Structured ASICs are custom devices that approach the
performance of today's Standard-Cell ASIC while
dramatically simplifying the design complexity.
• Structured ASICs offer designers a set of devices with
specific, customizable metal layers along with predefined
metal layers, which can contain the underlying pattern of
logic cells, memory, and I/O.
• Supporters claim that structured ASICs can cut
development cost by 25% compared to a standard ASIC
in a complex application, and cut unit cost by about 90%
versus a complex FPGA.
• The hybrid technology of “structured ASICs” resides in
the middle ground between application-specific and
programmable-silicon devices, and seeks to provide the
best of both worlds.
• By incorporating prefabricated elements common to
many designs (and the needs of more customers),
structured ASIC devices reduce heavy, up-front non-
recurring engineering (NRE) costs associated with cell-
based ASICs.
• This also helps shorten development time. Of course,
selection of specific elements to be prefabricated is
crucial.
• Structured ASICs reportedly approach the performance
level of cell-based ASIC devices.
VLSI Design Methodology
• Design entry & Simulation (Front End)
– Schematic Entry
– HDL Entry
– State Machine Diagram Entry
• Synthesis
• Place & route (Back End)
• Verification
• Fabrication
ASIC Design Methodology
• Most ASICs are designed using a RTL/Synthesis based
methodology.
• Design details are captured in a simulatable description of
the hardware at the register-transfer level (RTL).
• The most common RTL languages are Verilog HDL and VHDL.
• Automatic synthesis is used to turn the RTL into a gate-
level description (i.e., AND and OR gates, etc.).
• Physical Design tools are then used to turn the gate-
level design into a set of chip masks (for photolitho-
graphy) or a configuration file for downloading to an
FPGA.
• At each stage, extensive verification is performed.
(Figure: design entry via schematic editor, HDL editor, or FSM entry, feeding the synthesis tool)
ASIC Design Flow
• A design flow is a sequence of steps to design an ASIC
– Design entry: Using a hardware description language (HDL) or
schematic entry.
– Logic synthesis: Produces a netlist (logic cells and their
connections).
– System partitioning: Divide a large system into ASIC-sized
pieces.
– Prelayout simulation: Check to see if the design functions
correctly.
– Floorplanning: Arrange the blocks of the netlist on the chip.
– Placement: Decide the locations of cells in a block.
– Routing: Make the connections between cells and blocks.
– Extraction: Determine the resistance and capacitance of the
interconnect.
– Postlayout simulation: Check to see the design still works with
the added loads of the interconnect.
ASIC Design Flow
FPGA Design Flow
Xilinx FPGA design flow
ASIC Cell Libraries
• For MGAs and CBICs we have three choices:
 Use a design kit from the ASIC vendor
– usually a phantom library: the cells are empty boxes, or
phantoms; you hand off your design to the ASIC vendor and they
perform phantom instantiation
 Buy an ASIC-vendor library from a library vendor
– involves a buy-or-build decision
– you need a qualified cell library (qualified by the ASIC foundry)
– if you own the masks (the tooling) you have a customer-owned
tooling (COT) solution
 Build your own cell library in-house
– involves a complex library development process
– Complex and very expensive
• Each cell in an ASIC cell library must contain the
following:
– physical layout
– behavioral model
– Verilog/VHDL model
– timing model
– test strategy
– characterization
– circuit extraction
– Process control monitors (PCMs)
– cell schematic
– cell icon
– layout versus schematic (LVS) check
– logic synthesis
– retargeting
– wire-load model
– Routing model
FPGA
or
ASIC?
Field-Programmable Gate Arrays
Faster time-to-market:
• No layout, masks or other manufacturing steps are
needed for FPGA design.
• Ready-made FPGAs are available: just burn your HDL code
into the FPGA and you are done!

No NRE (non-recurring expenses):
• This cost is typically associated with an ASIC design.
• For an FPGA there is no NRE.
• FPGA tools are cheap (sometimes free; you only need to
buy the FPGA, that's all!).
• For an ASIC you pay a huge NRE and the tools are
expensive (running into crores!).
Simpler design cycle:
• This is due to software that handles much of the routing,
placement, and timing.
• Manual intervention is less.
• The FPGA design flow eliminates the complex and time-
consuming floorplanning, place-and-route, and timing-
analysis stages.

Field Reprogramability:
• A new bitstream (i.e., your program) can be uploaded
remotely, instantly.
• An FPGA can be reprogrammed in a snap, while an ASIC
can take $50,000 and more than 4-6 weeks to make the
same changes.
• FPGA costs start from a couple of dollars to several
hundreds or more depending on the hardware features.
More predictable project cycle:
• The FPGA design flow eliminates potential re-spins,
wafer capacity constraints, etc., since the design logic
is already synthesized and verified in the FPGA device.

Reusability:
• Reusability of FPGA is the main advantage.
• Prototype of the design can be implemented on FPGA
which could be verified for almost accurate results so
that it can be implemented on an ASIC.
• If the design has faults, change the HDL code, regenerate
the bitstream, program the FPGA, and test again.
• Modern FPGAs are reconfigurable both partially and
dynamically.
• FPGAs are good for prototyping and limited production. If
you are going to make 100-200 boards it isn't worth
making an ASIC.
• Generally FPGAs are used for lower-speed, lower-
complexity and lower-volume designs.
• But today's FPGAs even run at 500 MHz with superior
performance.
• With unprecedented logic density increases and a host
of other features, such as embedded processors, DSP
blocks, clocking, and high-speed serial at ever lower
price, FPGAs are suitable for almost any type of design.
• FPGA synthesis is much easier than ASIC synthesis.
• In an FPGA you need not do floorplanning; the tool can do it
efficiently. In an ASIC you have to do it.
• Unlike ASICs, FPGAs have special hardware such as
block RAM, DCM modules, MACs, memories,
high-speed I/O, embedded CPUs, etc. built in, which can be
used to get better performance.
• Modern FPGAs are packed with features.
• Advanced FPGAs usually come with phase-locked
loops, low-voltage differential signaling, clock data recovery,
more internal routing, high speed, hardware multipliers
for DSP, memory, programmable I/O, IP cores and
microprocessor cores.
• Remember PowerPC (hard core) and MicroBlaze (soft core)
in Xilinx, and ARM (hard core) and Nios (soft core) in Altera.
• There are FPGAs available now with built-in ADCs! Using
all these features designers can build a system on a
chip. Now, do you really need an ASIC?
FPGA Design Disadvantages
• Power consumption in an FPGA is higher.
– You don't have much control over power
optimization. This is where the ASIC wins the race!
• You have to use the resources available in the
FPGA.
– Thus the FPGA limits the design size.
• Good only for low-quantity production.
– As quantity increases, cost per product stays high
compared to an ASIC implementation.
ASIC Design Advantages
Cost....cost....cost....Lower unit costs:
• For very high volume designs the cost comes out
to be very low.
• At larger volumes an ASIC design proves to be
cheaper than implementing the design in an FPGA.

Speed...speed...speed....ASICs are faster than
FPGAs:
• ASIC gives design flexibility.
• This gives enormous opportunity for speed
optimization.
Low power....Low power....Low power:
• An ASIC can be optimized for the required low power.
• Several low-power techniques, such as power gating,
clock gating, multi-Vt cell libraries, pipelining, etc., are
available to achieve the power target.
• This is where the FPGA fails badly! Can you think of a cell
phone that has to be charged for every call? Never!
Low-power ASICs help the battery live a longer life!

 In an ASIC you can implement analog circuits and
mixed-signal designs.
 This is generally not possible in an FPGA.
 In an ASIC, DFT (design for test) structures are inserted.
 In an FPGA, DFT is not carried out (for an FPGA there is no need for DFT!)
ASIC Design Disadvantages
Time-to-market:
• Some large ASICs can take a year or more to design.
• A good way to shorten development time is to make
prototypes using FPGAs and then switch to an ASIC.
Design Issues:
• In an ASIC you must take care of DFM issues, signal
integrity issues and many more.
• In an FPGA you don't have all these because the FPGA
designer takes care of them. (Don't forget an FPGA is
an IC itself, designed by ASIC design engineers!)
Expensive Tools:
• ASIC design tools are very expensive.
– You spend a huge amount on NRE.
Unit Cost Analysis
Exploding ASIC NRE Cost
Time to Market
System Reconfigurability
• Lack of reconfigurability is a huge opportunity cost of ASICs.
• FPGAs offer flexible life-cycle management.
Design Cycle
Profit Model
Comparison Between ASIC
Technologies
• We’ll compare the most popular types of ASICs:
an FPGA, an MGA, and a CBIC.
• Example of an ASIC part cost:
– A 0.5 µm, 20k-gate array might cost 0.01-0.02 cents per
gate (for more than 10,000 parts), or $2-$4 per part,
but an equivalent FPGA might be $20.
• When does it make sense to use a more
expensive part? This is what we shall examine
next.
Product Cost
• In a product cost there are fixed costs and variable
costs (the number of products sold is the sales volume):
total product cost = fixed product cost + (variable product cost × products sold)
• In a product made from parts the total cost for any part is:
total part cost = fixed part cost + (variable cost per part × volume of parts)
• For example, suppose we have the following costs:
– FPGA: $21,800 (fixed) $39 (variable)
– MGA: $86,000 (fixed) $10 (variable)
– CBIC $146,000 (fixed) $8 (variable)
• Then we can calculate the following break-even volumes:
– FPGA/MGA ≈ 2000 parts
– FPGA/CBIC ≈ 4000 parts
– MGA/CBIC ≈ 20,000 parts
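The break-even arithmetic can be checked in a few lines. This is a sketch using the example figures from the slide; note the exact crossovers are rounded on the slide, and with these inputs the MGA/CBIC crossover actually works out to 30,000 parts:

```python
# Two cost curves fixed + variable*volume cross where
# volume = (fixed2 - fixed1) / (variable1 - variable2).

def break_even(fixed_a, var_a, fixed_b, var_b):
    return (fixed_b - fixed_a) / (var_a - var_b)

fpga = (21_800, 39)   # (fixed $, variable $/part)
mga = (86_000, 10)
cbic = (146_000, 8)

print(round(break_even(*fpga, *mga)))    # 2214 parts (slide rounds to 2000)
print(round(break_even(*fpga, *cbic)))   # 4006 parts (slide rounds to 4000)
print(round(break_even(*mga, *cbic)))    # 30000 parts
```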
Break-even graph
ASIC Fixed Costs
Examples of fixed costs are:
– training cost for a new (EDA) system
– hardware and software cost
– productivity
– production test and design for test
– programming costs for an FPGA
– nonrecurring-engineering (NRE)
– test vectors and test-program development cost
– pass (turn or spin)
– the profit model, which represents the profit
flow during the product lifetime
ASIC Variable Costs
Factors affecting variable costs are:
– wafer size and cost
– gate density and gate utilization
– die size and die cost
– die per wafer
– defect density
– yield
– profit margin (depends on fab or fabless)
– price per gate
– part cost
ASIC Vs FPGA
The Performance Cube
(Figure: performance cube with delay, power, and cost axes; "smaller is better")
FPGA Vs ASIC
Design Flow Comparison
• The FPGA design flow eliminates the
– complex and time-consuming floorplanning, place and
route, timing analysis, and mask / re-spin stages of
the project since the design logic is already
synthesized to be placed onto an already verified,
characterized FPGA device.
• However, when needed, Xilinx provides the
advanced floorplanning, hierarchical design, and
timing tools to allow users to maximize
performance for the most demanding designs.
FPGA and ASIC Design Flows
Fundamentally Similar
• Scan insertion and clock tree synthesis are not required
in FPGA design flows because of the inherent design of
Altera FPGAs.
• In FPGA design flows, place-and-route is performed by
the customer using the FPGA vendor place-and-route
tools.
• In ASIC design, place-and-route and physical design
verification such as crosstalk analysis between internal
device signals may be performed by the customer or
handed off to an ASIC foundry for implementation.
• Developing ASICs requires careful design and
placement of I/O cells to support the latest complex I/O
standards with good signal integrity on all pins.
• ASIC testing and fault coverage are important parts of
the ASIC development process.
• Testing involves the desired design functionality and the
design of the ASIC, and uses boundary scan insertion,
built-in-self-test (BIST), signature analysis, Iddq, and
automatic test pattern generation (ATPG) techniques.
• FPGAs already include boundary scan logic as opposed
to ASIC design flow where you must insert boundary
scan logic and simulate it on top of the actual design
logic.
• FPGAs have already been extensively tested during
manufacturing.
• In FPGA design flows, you can focus on testing design
functionality and timing requirements and do not need to
perform device design tests such as crosstalk analysis.
• Altera’s FPGAs include advanced, low-skew
clock networks for clock distribution within the
device.
• The FPGA designer gives up the ASIC
designer's freedom to implement fully custom
clock networks.
• The pre-defined clock tree structure in an FPGA
enormously simplifies the design process and
satisfies a majority of applications.
• By using FPGA devices in place of ASICs, you
can potentially reduce the complexity of the
design process as well as significantly reduce
cost.
• The back-end design of an ASIC device involves a wide
variety of complex tasks, including placement, physical
optimization, clock tree synthesis, signal integrity analysis
and routing using different EDA software tools.
• When compared to ASIC devices, the physical, or back-
end design of FPGA devices is very simple and is
accomplished with a single software tool (Quartus II).
• The Quartus II software is a fully integrated, architecture
independent package offering a full spectrum of logic
design capabilities for designing with FPGAs.
• The following sections discuss and compare each of the
tasks involved in the design flows for both FPGA and
ASIC devices.
CMOS
Logic
Agenda
• CMOS logic
• Logic levels
• Transmission Gates
• Sequential Logic Cells
• Datapath Logic Cells
• Adders
• Multipliers
• Other Arithmetic Systems
• Other Datapath Operators
• I/O Cells
• ESD
CMOS Logic
Logic Levels
• VSS is a strong '0'; VDD is a strong '1'.
• Degraded logic levels:
VDD − Vtn is a weak '1'; VSS − Vtp (Vtp is negative) is a weak '0'.
Naming of complex CMOS combinational
logic cells

Drive Strength
We ratio a cell to adjust its drive strength, making
βn = βp to create equal rise and fall times.
Transmission Gates
CMOS transmission gate (TG, TX gate, pass gate, coupler)
Sequential Logic Cells
CMOS flip-flop
Datapath Logic Cells
Full Adder (FA):
• Parity function: '1' for an odd number of '1's
• Majority function: '1' if the majority of the inputs are '1'
S[i] = SUM(A[i], B[i], CIN)
COUT = MAJ(A[i], B[i], CIN)
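The SUM/MAJ view of the full adder can be checked directly in a few lines (an illustrative sketch, not library code; `full_adder` is a made-up name):

```python
# Bit-level full adder in the slide's parity/majority formulation.

def full_adder(a, b, cin):
    s = a ^ b ^ cin                          # parity: '1' for an odd number of '1's
    cout = (a & b) | (a & cin) | (b & cin)   # majority: '1' if two or more are '1'
    return s, cout

# exhaustive check against integer addition
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            assert a + b + cin == s + 2 * cout
print("ok")
```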
Adders
Carry-Save Adder
• The CSA cell CSA(A1[i], A2[i], A3[i], CIN, S1[i], S2[i], COUT)
has three outputs.
• In a CSA the carries are saved at each stage and shifted
left onto the bus S1.
• There is thus no carry propagation, and the delay of a
CSA is constant.
• At the output of a CSA we need to add the S1 bus (all the
saved carries) and the S2 bus (all the sums) to get an n-
bit result using a final stage.
• The n-bit sum is encoded in the two buses S1 and S2 in
the form of parity and majority functions.
• The last stage sums two input buses using a carry-
propagate adder.
• Registering the CSA stages by adding vectors of flip-
flops reduces the adder delay to that of the slowest
adder stage.
• By using registers between the stages of
combinational logic we use pipelining to increase the
speed, paying a price of increased area and latency.
• It takes a few clock cycles to fill the pipeline, but once it
is filled, we get output for every clock cycle.
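One carry-save stage can be sketched behaviorally as follows (operating on whole integers for brevity; `csa_stage` is an invented name, not a library cell):

```python
# One carry-save stage: three n-bit operands in, two out (sum vector S2
# and carry vector S1 shifted left). No carry propagates through the
# stage, so its delay is one full adder regardless of width.

def csa_stage(a1, a2, a3):
    s2 = a1 ^ a2 ^ a3                               # bitwise sums
    s1 = ((a1 & a2) | (a1 & a3) | (a2 & a3)) << 1   # saved carries, shifted left
    return s1, s2

a1, a2, a3 = 13, 7, 9
s1, s2 = csa_stage(a1, a2, a3)
# the final carry-propagate stage adds S1 and S2 to get the true sum
print(s1 + s2 == a1 + a2 + a3)  # True
```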
Carry-Bypass Adder
• In an RCA every stage has to wait to make its carry
decision until the previous stage has calculated its carry.
• The idea is to bypass the critical path.
• To bypass the carries for bits 4-7 (stages 5-8) of an adder
we can compute BYPASS = P[4]·P[5]·P[6]·P[7] and then
use a MUX as follows:
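The bypass can be sketched behaviorally (invented helper names, not a gate netlist; propagates here are P[i] = A[i] xor B[i]):

```python
# When every propagate in the group is 1, the group's carry-out equals
# its carry-in, so a MUX forwards c_in around the slow ripple chain.

def ripple_group(a_bits, b_bits, c_in):
    c = c_in
    for a, b in zip(a_bits, b_bits):        # LSB first
        c = (a & b) | (c & (a ^ b))
    return c

def bypass_group(a_bits, b_bits, c_in):
    p = [a ^ b for a, b in zip(a_bits, b_bits)]
    bypass = all(p)                         # BYPASS = product of group propagates
    return c_in if bypass else ripple_group(a_bits, b_bits, c_in)

# every bit propagates, so the MUX forwards c_in without waiting
assert bypass_group([1, 0, 1, 0], [0, 1, 0, 1], 1) == 1
assert bypass_group([1, 0, 1, 0], [0, 1, 0, 1], 0) == 0
print("ok")
```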
Carry-Skip Adder
• Instead of checking the propagate signals, we
can check the inputs.
• We can compute
SKIP = (A[i − 1] ⊕ B[i − 1]) + (A[i] ⊕ B[i]) and
then use a 2:1 MUX to select C[i].
Carry-Lookahead Adder
• Consider recursively, for i = 1, C[i] = G[i] + P[i]·C[i − 1],
which gives C[1] = G[1] + P[1]·C[0].
• The Brent-Kung adder reduces the delay and increases
the regularity of the carry-lookahead scheme.
• The 4-bit CLA using the carry-lookahead generator cell
(CLG) is as shown.
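The lookahead recursion can be unrolled behaviorally (an illustrative sketch; `cla_carries` and the bit-list encoding are invented for the example):

```python
# Carry lookahead: generate G[i] = A[i]·B[i], propagate P[i] = A[i] xor B[i],
# then unroll C[i+1] = G[i] + P[i]·C[i] so every carry depends only on the
# inputs and C[0].

def cla_carries(a_bits, b_bits, c0):
    g = [a & b for a, b in zip(a_bits, b_bits)]   # generate
    p = [a ^ b for a, b in zip(a_bits, b_bits)]   # propagate
    c = [c0]
    for i in range(len(a_bits)):
        c.append(g[i] | (p[i] & c[i]))            # C[i+1] = G[i] + P[i]·C[i]
    return c

# 4-bit example, LSB first: a = 1011b, b = 1101b
print(cla_carries([1, 1, 0, 1], [1, 0, 1, 1], 0))  # [0, 1, 1, 1, 1]
```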
Carry-Select Adder
• Carry-select adder duplicates two small adders
for the cases CIN='0' and CIN='1' and then uses
a MUX to select the case that we need
• A carry-select adder is often used as the fast
adder in a datapath library because its layout is
regular.
• Extending the idea of the carry-select adder we can
design the conditional-sum adder.
• An n-bit conditional-sum adder that uses n
single-bit conditional adders, together with a tree
of 2:1 MUXes, is as shown.
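Behaviorally, an 8-bit carry-select adder built from two 4-bit groups looks like this (a sketch with invented helper names, not datapath-library code):

```python
# Carry select: the upper group is added twice in parallel, once for
# CIN=0 and once for CIN=1; the lower group's real carry-out picks the
# correct precomputed result with a MUX.

def add_group(a, b, cin, width=4):
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width   # (sum bits, carry-out)

def carry_select8(a, b, cin):
    lo_sum, lo_carry = add_group(a & 0xF, b & 0xF, cin)
    hi0 = add_group(a >> 4, b >> 4, 0)    # upper group assuming CIN = 0
    hi1 = add_group(a >> 4, b >> 4, 1)    # upper group assuming CIN = 1
    hi_sum, _ = hi1 if lo_carry else hi0  # MUX on the real carry
    return (hi_sum << 4) | lo_sum

print(hex(carry_select8(0x3C, 0x25, 0)))  # 0x61
```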
• Figure below shows the normalized delay and area
figures for a set of pre-designed datapath adders.
Multipliers
• The holes or dots are the outputs of one stage (and the
inputs of the next).
• At each stage we have the following three choices:
(1) sum three outputs using a full adder (denoted by a box
enclosing three dots);
(2) sum two outputs using a half adder (a box with two dots);
(3) pass the outputs directly to the next stage.
• The two outputs of an adder are joined by a diagonal line
(full adders use black dots, half adders white dots).
• The object of the game is to choose (1), (2), or (3) at
each stage to maximize the performance of the
multiplier.
• In tree-based multipliers there are two ways to do this:
working forward and working backward.
• In a Wallace-tree multiplier we work forward from the
multiplier inputs, compressing the number of signals to
be added at each stage.
Other Arithmetic Systems
Residue Number System
Other Datapath Operators
• A subtracter is similar to an adder.
• A barrel shifter rotates or shifts an input bus by a
specified amount.
• A leading-one detector is used with a normalizing (left-
shift) barrel shifter to align mantissas in floating-point
numbers.
• The output of a priority encoder is the binary-encoded
position of the leading one in an input.
• An accumulator is an adder/subtracter and a register.
• An incrementer/decrementer.
• The all-zeros detectors and all-ones detectors.
Symbols for datapath elements
I/O Cells
• Three-state bidirectional output buffer is as shown
below:
Electrostatic Discharge
• The gate oxide in CMOS transistors is extremely thin; this
leaves the gate oxide of the cell input transistors
susceptible to breakdown from static electricity (ESD).
• Sometimes this problem is called electrical overstress
(EOS).
• Electrostatic discharge is defined as the transfer of charge
between bodies at different electric potentials.
• The amount of charge created by contact charging is
affected by the area of contact, speed of separation, relative
humidity, and other factors.
• To protect I/O cells from ESD, the input pads are normally
tied to device structures that clamp the input voltage to
below the gate breakdown voltage.
• I/O cells use transistors with a special ESD
implant that increases breakdown voltage and
provides protection.
• Damage from ESD can
– cause complete device failure through parametric shifts
– weaken the device by locally heating, melting or
otherwise damaging oxides, junctions or device
components
• Remedies:
– Special circuitry with ESD protection diodes,
guard rings, etc.
• There are several ways to model the capability of an
I/O cell to withstand EOS.
• The human-body model (HBM) represents an
ESD event by a 100 pF capacitor discharging through a
1500 Ω resistor.
– Typically the voltages generated are in the range of 2-4 kV.
• The machine model (MM) represents an ESD event
by a 200 pF capacitor (typically charged to 200 V)
discharging through a 25 Ω resistor, corresponding
to a peak initial current of nearly 10 A.
• The charged-device model (CDM) represents the
problem when an IC package itself is charged.
– If the maximum charge on a package is 3 nC and the
package capacitance to ground is 1.5 pF, this event is
simulated by charging a 1.5 pF capacitor to 2 kV and
discharging it through a 1 Ω resistor.
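The numbers quoted for the three models follow from first-order formulas (peak current ≈ V/R, voltage on a capacitor V = Q/C); a quick arithmetic check:

```python
# First-order sanity check of the ESD model figures quoted above.

hbm_peak = 4000 / 1500        # HBM at 4 kV through a 1500 ohm resistor
mm_peak = 200 / 25            # MM: 200 V through a 25 ohm resistor
cdm_voltage = 3e-9 / 1.5e-12  # CDM: V = Q/C for 3 nC on 1.5 pF

print(f"HBM peak current ~ {hbm_peak:.1f} A")
print(f"MM peak current  ~ {mm_peak:.0f} A")   # 8 A, i.e. "nearly 10 A"
print(f"CDM precharge    ~ {cdm_voltage:.0f} V")  # 2000 V, matching the 2 kV figure
```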
ASIC
LIBRARY
DESIGN
Agenda
• Transistors as Resistors
• Transistor Parasitic Capacitance
• Logical Effort
• Library-Cell Design
• Gate-Array Design
• Standard-Cell Design
• Cell Compilers
 ASIC design uses predefined and pre-characterized cells
from a library.
 So we need to design or buy a cell library.
 A knowledge of ASIC library design is not necessary but
makes it easier to use library cells effectively.
Transistors as Resistors
Transistor Parasitic Capacitance
• Junction Capacitance
• Overlap Capacitance
• Gate Capacitance
• Input Slew Rate
Logical Effort
• The delay equation is the sum of three terms,
d = f + p + q or
delay = effort delay + parasitic delay + nonideal delay
• The effort delay f is the product of logical effort, g, and
electrical effort, h:
f = gh
• Thus, delay = logical effort x electrical effort + parasitic
delay + nonideal delay.
• R and C will change as we scale a logic cell, but the RC
product stays the same.
• Logical effort is independent of the size of a logic cell.
• We can find logical effort by scaling a logic cell to have
the same drive as a 1X minimum-size inverter.
• Then the logical effort, g, is the ratio of the input
capacitance, Cin, of the 1X logic cell to Cinv.
• Inverter has the smallest logical effort and
intrinsic delay of all static CMOS gates.
• Logical effort of a gate presents the ratio of its
input capacitance to the inverter capacitance
when sized to deliver the same current.
• Logical effort is the ratio of input capacitance of
a gate to the input capacitance of an inverter
with the same output current.
• Logical effort increases with the gate complexity.
• Logical effort is a function of topology,
independent of sizing.
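As a small illustrative sketch of this delay model (the g and p values are the standard ones for a 2:1 PMOS:NMOS ratio; the table and function names are invented, and the nonideal term q is ignored):

```python
# Linear delay model d = g*h + p, in normalized delay units.
GATES = {
    "inv":   {"g": 1.0,       "p": 1.0},
    "nand2": {"g": 4.0 / 3.0, "p": 2.0},
    "nor2":  {"g": 5.0 / 3.0, "p": 2.0},
}

def delay(gate, h):
    """Normalized delay of `gate` driving electrical effort h = Cout/Cin."""
    cell = GATES[gate]
    return cell["g"] * h + cell["p"]

print(delay("inv", 4))     # fanout-of-4 inverter: 1*4 + 1 = 5.0
print(delay("nand2", 4))   # (4/3)*4 + 2 = 7.33
```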
• Transistor-level schematics of an inverter, 2-input NAND,
and 2-input NOR sized for equal drive (r = 2) are shown.
• Logical efforts:
– Inverter: g = 1
– 2-input NAND: g = 4/3
– 2-input NOR: g = 5/3
Example – 8-input AND
Logical Effort of Gates
• Normalized delay d versus fan-out h, comparing a 2-input
NAND (tpNAND) with an inverter (tpINV):
– 2-input NAND: g = 4/3, p = 2, d = (4/3)h + 2
– Inverter: g = 1, p = 1, d = h + 1
Logical Effort of Gates
• The delay-versus-fanout plot separates the two components
of delay: the effort delay (the slope, set by g) and the
intrinsic delay (the intercept, set by p).
– 2-input NAND: g = 4/3, p = 2
– Inverter: g = 1, p = 1
Predicting Delay
Logical Area and Logical Efficiency
Multistage Cells
Add Branching Effort
• Branching effort:
b = (Con-path + Coff-path) / Con-path
Multistage Networks
• Stage effort: fi = gihi
• Path electrical effort: H = Cout/Cin
• Path logical effort: G = g1g2…gN
• Branching effort: B = b1b2…bN
• Path effort: F = GHB = f1f2…fN
• Path delay: D = Σdi = Σpi + Σfi
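These definitions compose into a few lines of arithmetic. A sketch, assuming every stage is then sized for equal stage effort (`path_delay` is an invented helper, not a tool function):

```python
from math import prod

def path_delay(g, p, c_in, c_out, b=None):
    """Minimum path delay (normalized) when every stage has equal effort."""
    n = len(g)
    b = b if b is not None else [1.0] * n   # branching efforts, 1 if none
    F = prod(g) * (c_out / c_in) * prod(b)  # path effort F = G * H * B
    f_hat = F ** (1.0 / n)                  # optimum equal stage effort
    return n * f_hat + sum(p)

# Four-stage example used later in these slides (parasitics set to zero):
print(path_delay([1.0, 5/3, 5/3, 1.0], [0, 0, 0, 0], 1.0, 5.0))  # ~7.72
```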
Optimum Delay
Optimum Number of Stages
• Chain of N inverters each with equal stage effort, f = gh
• Total path delay is Nf=Ngh=Nh, since g=1 for an inverter.
• To drive a path electrical effort H, hN=H, or Nlnh = lnH
• Delay, Nh = hlnH/lnh
• Since lnH is fixed, we can only vary h/ln(h)
• h/ln(h) is a shallow function with a minimum at h = e =
2.718
• Total delay is Ne = elnH
• Figure shows us how to minimize delay regardless of
area or power and neglecting parasitic and non-ideal
delays.
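The h = e result can also be seen numerically. A throwaway sketch (H = 64 is an arbitrary example value):

```python
# Delay of an N-stage inverter chain driving path electrical effort H;
# each stage has h = H**(1/N), g = 1, and parasitics are ignored.
def chain_delay(H, N):
    return N * H ** (1.0 / N)

H = 64.0
delays = {N: chain_delay(H, N) for N in range(1, 9)}
best = min(delays, key=delays.get)
# ln(64) is about 4.16, so the optimum lands at N = 4 (h near e = 2.718)
print(best, round(delays[best], 2))
```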
• Calculate the path delay (ignoring p & q) for the
following logic chain. Given logical ratio = 2.
• Find g1, g2, g3, g4 and h1, h2, h3, h4 for the
four-stage chain (stages 1–4 driving a load CL).
• g1 = 1, g2 = 7/3, g3 = 5/3, g4 = 1
• h1 = 7/3, h2 = 5/7, h3 = 3/5, h4 = CL
• Delay = Σ gihi = 5 + CL
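The sum can be checked directly (a throwaway sketch; p and q ignored as stated):

```python
# Verify Delay = 5 + CL for the four-stage chain above.
def total_delay(CL):
    g = [1.0, 7.0 / 3.0, 5.0 / 3.0, 1.0]
    h = [7.0 / 3.0, 5.0 / 7.0, 3.0 / 5.0, CL]
    return sum(gi * hi for gi, hi in zip(g, h))

# The first three stage efforts sum to 7/3 + 5/3 + 1 = 5,
# so the total is 5 + CL:
print(total_delay(3.0))   # 8.0
```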
Example: Optimize Path
• A four-stage path drives a load of 5 units from an input
capacitance of 1 unit; the intermediate stage sizes are
a, b, and c, with logical efforts g1 = 1, g2 = 5/3,
g3 = 5/3, g4 = 1.
• Effective fanout, H = 5
• G = 25/9
• F = GHB = 125/9 = 13.9 (B = 1)
• fi = F^(1/4) = 1.93
• c = 5g4/fi = 2.59
• b = 2.59g3/fi = 2.23
• a = 2.23g2/fi = 1.93
• Delay = 4fi = 7.72
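Re-deriving these numbers (B = 1 since there is no branching; parasitics ignored):

```python
# Optimize the four-stage path: Cin = 1, Cout = 5.
g2, g3, g4 = 5.0 / 3.0, 5.0 / 3.0, 1.0
G = 1.0 * g2 * g3 * g4          # path logical effort, 25/9
H = 5.0                         # path electrical effort, Cout/Cin
F = G * H                       # path effort, ~13.9
f = F ** (1.0 / 4.0)            # optimum stage effort, ~1.93
c = 5.0 * g4 / f                # ~2.59
b = c * g3 / f                  # ~2.23
a = b * g2 / f                  # ~1.93
print(round(4 * f, 2))          # path delay, ~7.72
```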
Summary
Library-Cell Design
• A big problem in library design is dealing with design
rules.
• The optimum cell layout for each process generation
changes because the design rules for each ASIC
vendor’s process are always slightly different even for
the same generation of technology.
• For example, two companies may have very similar 0.35
μm CMOS process technologies, but the third-level
metal spacing might be slightly different.
• If a cell library is to be used with both processes, we
could construct the library by adopting the most stringent
rules from each process.
• A library constructed in this fashion may not be
competitive with one that is constructed specifically for
each process.
• Even though ASIC vendors prize their design rules as
secret, it turns out that they are similar except for a few
details.
• Unfortunately, it is the details that stop us moving
designs from one process to another.
• Unless we are a very large customer it is difficult to have
an ASIC vendor change or waive design rules for us.
• We would like all vendors to agree on a common set of
design rules.
• Layout of library cells is either hand-crafted or uses
some form of symbolic layout.
• Symbolic layout is usually performed in one of two ways:
– using either interactive graphics or
– a text layout language
• Shapes are represented by simple lines or rectangles,
known as sticks or logs, in symbolic layout.
• The actual dimensions of the sticks or logs are
determined after layout is completed in a post-
processing step.
• An alternative to graphical symbolic layout uses a text
layout language, similar to a programming language
such as C, that directs a program to assemble layout.
• The spacing and dimensions of the layout shapes are
defined in terms of variables rather than constants.
• These variables can be changed after symbolic layout is
complete to adjust the layout spacing to a specific
process.
• Mapping symbolic layout to a specific process
technology uses 10–20 percent more area than hand-
crafted layout (though this can then be further reduced to
5–10 percent with compaction).
• Most symbolic layout systems do not allow 45° layout
and this introduces a further area penalty.
• As libraries get larger and the capability to quickly move
libraries and ASIC designs between different generations
of process technologies becomes more important, the
advantages of symbolic layout may outweigh the
disadvantages.
Gate-Array Design
• Each logic cell or macro in a gate-array library is pre-
designed using fixed tiles of transistors known as the
gate-array base cell.
• Arrangement of base cells across a whole chip in a
complete GA is gate-array base (base).
• ASIC vendors offer a selection of bases, with a different
total number of transistors on each base.
• We isolate the transistors on a GA from one another
either with thick field oxide (in oxide-isolated GAs) or by
using other transistors that are wired permanently off (in
gate-isolated GAs).
• Channeled or channelless GAs may use either oxide or
gate isolation.
The construction of a gate-isolated gate array
(a) The one-track-wide base cell containing one p-channel and
one n-channel transistor.
(b) The center base cell is isolating the base cells on either side
from each other.
(c) The base cell is 21 tracks high (there are 21 horizontal
tracks in the cell).
• The space needed for a vertical interconnect is called a
vertical track.
• The logic cells are isolated from each other in gate-
isolated GAs by connecting transistor gates to the supply
bus, hence name gate isolation.
• If the gate of an n-channel transistor is connected to VSS,
we isolate the regions of n-diffusion on each side of that
transistor.
• If the gate of a p-channel transistor is connected to VDD,
we isolate adjacent p-diffusion regions.
• Oxide-isolated GAs often contain four transistors in the
base cell:
– Two n-channel transistors share an n-diffusion strip
– Two p-channel transistors share a p-diffusion strip.
• The base cells are isolated from each other using oxide
isolation.
• An oxide-isolated gate-array base cell is shown:
• Two base cells, each contains eight transistors and two
well contacts
• The p-channel and n-channel transistors are each 4
tracks high (corresponding to the width of the transistor)
• The cell is 12 tracks high (8–12 is typical for a modern
library)
• The base cell is 7 tracks wide
Standard-Cell Design
• Layout of a standard cell from a standard-cell library is
as shown.
• Each standard cell in a library is rectangular with the
same height but different widths.
• The bounding box (BB) of a logic cell is the smallest
rectangle that encloses all of the geometry of the cell.
• The cell BB is normally determined by the well layers.
• Cell connectors or terminals (the logical connectors)
must be placed on the cell abutment box (AB).
• The physical connector must normally overlap the
abutment box slightly, usually by at least 1 λ, to assure
connection without leaving a tiny space between the
ends of two wires.
• The standard cells are constructed so they can all be
placed next to each other horizontally with the cell ABs
touching.
• When a library developer creates a gate-array, standard-
cell, or datapath library, there is a trade-off between
using wide, high-drive transistors that result in large cells
with high-speed performance and using smaller
transistors that result in smaller cells that consume less
power.
• A performance-optimized library with large cells might be
used for ASICs in a high-performance workstation, for
example.
• An area-optimized library might be used in an ASIC for a
battery-powered portable computer.
Cell Compilers
• The process of hand crafting circuits and layout for a full-
custom IC is a tedious, time-consuming, and error-prone
task.
• There are two types of automated layout assembly tools,
often known as a silicon compilers.
• The first type produces a specific kind of circuit, a RAM
compiler or multiplier compiler, for example.
• The second type of compiler is more flexible, usually
providing a programming language that assembles or
tiles layout from an input command file, but this is full-
custom IC design.
• RAM compilers are available that produce single-port
RAM as well as dual-port RAMs, and multiport RAMs.
PROGRAMMABLE
ASICs
Agenda
• The Antifuse
• Static RAM
• EPROM and EEPROM Technology
• FPGA Economics
• Programmable ASIC technologies
comparison
• All FPGAs have a regular array of basic logic cells that
are configured using a programming technology.
• The chip inputs and outputs use special I/O logic cells
that are different from the basic logic cells.
• A programmable interconnect scheme forms the wiring
between the two types of logic cells.
• Finally, the designer uses custom software, tailored to
each programming technology and FPGA architecture,
to design and implement the programmable connections.
• The programming technology in an FPGA determines
the type of basic logic cell and the interconnect scheme.
• The logic cells and interconnection scheme, in turn,
determine the design of the input and output circuits as
well as the programming scheme.
Programming Technologies
• Currently, there are four programming
technologies for FPGAs:
– Antifuse
– Static RAM cells
– EPROM transistor
– EEPROM transistor
The Antifuse
• An antifuse is the opposite of a normal fuse.
• Antifuses are made with a modified CMOS process
having an extra step.
• This step creates a very thin insulating layer which
separates two conducting layers.
• This thin insulating layer is fused by applying a high
voltage across the conducting layers.
• Such high voltage can be destructive for CMOS logic
circuit.
• Non-volatile (Permanent).
• Requires extra programming circuitry, including a
programming transistor.
• Actel antifuse is as shown.
(a) A cross section.
(b) A simplified drawing.
(c) Contact.
• Actel calls its antifuse a programmable low-impedance
circuit element (PLICE).
• Figure shows a poly–diffusion antifuse with an oxide–
nitride–oxide (ONO) dielectric sandwich of: silicon
dioxide (SiO2) grown over the n-type antifuse diffusion,
a silicon nitride (Si3N4) layer, and another thin SiO2 layer.
• The layered ONO dielectric results in a tighter spread of
blown antifuse resistance values than using a single
oxide dielectric.
• The effective electrical thickness is equivalent to 10nm of
SiO2 (Si3N4 has a higher dielectric constant than SiO2, so
the actual thickness is less than 10 nm).
• Sometimes this device is called a fuse even though it is
an antifuse, and both terms are often used
interchangeably.
• The fabrication process and the programming current
control the average resistance of a blown antifuse, but
values vary.
Metal–Metal Antifuse
• There are two advantages of a metal–metal antifuse
over a poly–diffusion antifuse.
– The first is that connections to a metal–metal antifuse
are direct to the metal wiring layers.
• Connections from a poly–diffusion antifuse to the wiring
layers require extra space and create additional parasitic
capacitance.
– The second advantage is that the direct connection to
the low-resistance metal layers makes it easier to use
larger programming currents to reduce the antifuse
resistance.
• The size of an antifuse is limited by the resolution of the
lithography equipment used to make ICs.
• The Actel antifuse connects diffusion and polysilicon,
and both these materials are too resistive for use as
signal interconnects.
• To connect the antifuse to the metal layers requires
contacts that take up more space than the antifuse itself,
reducing the advantage of the small antifuse size.
• However, the antifuse is so small that it is normally the
contact and metal spacing design rules that limit how
closely the antifuses may be packed rather than the size of
the antifuse itself.
• An antifuse is resistive and the addition of contacts adds
parasitic capacitance.
• The intrinsic parasitic capacitance of an antifuse is small
but to this we must add the extrinsic parasitic capacitance
that includes the capacitance of the diffusion and poly
electrodes and connecting metal wires.
• These unwanted parasitic elements can add considerable
RC interconnect delay if the number of antifuses connected
in series is not kept to an absolute minimum.
• Clever routing techniques are crucial to antifuse-based
FPGAs.
Static RAM
• Xilinx SRAM (static RAM) configuration cell is
constructed from two cross-coupled inverters and uses
standard CMOS process.
• Used in reconfigurable hardware.
• A programmable read-only memory (PROM) can be
used to hold the configuration.
SRAM based programming
• Completely reusable - no limit concerning re-
programmability
• Pass gate closes when a “1” is stored in the SRAM cell
allows iterative prototyping.
• Volatile memory - power must be maintained.
• Large area - five transistor SRAM cell plus pass gate.
• Memory cells distributed throughout the chip.
• Fast re-programmability (tens of milliseconds)
• Standard CMOS process required.
• Advantage: reprogram.
• Disadvantage: keep power supplied.
EPROM and EEPROM Technology
An EPROM transistor
(a) With a high (>12V) programming voltage, VPP, applied
to the drain, electrons gain enough energy to “jump” onto
the floating gate (gate1).
(b) Electrons stuck on gate1 raise the threshold voltage so
that the transistor is always off for normal operating
voltages.
(c) UV light provides enough energy for the electrons stuck
on gate1 to “jump” back to the bulk, allowing the
transistor to operate normally.
Facts and keywords: Altera MAX 5000 EPLDs and Xilinx
EPLDs both use UV-erasable
• electrically programmable read-only memory (EPROM)
• hot-electron injection or avalanche injection
• floating-gate avalanche MOS (FAMOS)
FPGA Economics
• Xilinx part-naming convention
• Not all parts are available in all packages
• Some parts are packaged with fewer leads than
I/Os
Programmable ASIC technologies
Summary
• All FPGAs have the following key
elements:
– The programming technology
– The basic logic cells
– The I/O logic cells
– Programmable interconnect
– Software to design and program the FPGA
PROGRAMMABLE
ASIC LOGIC
CELLS
Agenda
• Actel ACT
• Xilinx LCA
• Altera FLEX
• Altera MAX
• Comparisons
• All programmable ASICs or FPGAs contain a
basic logic cell replicated in a regular array
across the chip (analogous to a base cell in an
MGA).
• There are the following three different types of
basic logic cells:
– multiplexer based
– look-up table based
– programmable array logic
• The choice among these depends on the
programming technology.
Actel ACT
ACT 1 Logic Module
The Actel ACT architecture
(a) Organization of the basic logic cells.
(b) The ACT 1 Logic Module (LM, the Actel
basic logic cell).
• The ACT 1 family uses just one type of LM.
• ACT 2 and ACT 3 FPGA families both use two
different types of LM
(c) An example LM implementation using pass
transistors (without any buffering).
(d) An example logic macro.
• Connect logic signals to some or all of the LM
inputs, the remaining inputs to VDD or GND
Shannon’s Expansion Theorem
Multiplexer Logic as Function Generators
The 16 logic functions of 2 variables:
• 2 of 16 functions are not very interesting (F='0' and F='1')
• There are 10 functions that we can implement using just
one 2:1 MUX
• 6 functions are useful: INV, BUF, AND, OR, AND1-1,
NOR1-1
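The count of 10 can be confirmed by brute force. An illustrative sketch (row ordering and names are my own) that wires each terminal of a single 2:1 MUX to A, B, 0, or 1:

```python
from itertools import product

# Truth tables over (A,B) as 4-bit tuples, rows ordered
# (A,B) = (0,0), (0,1), (1,0), (1,1).
SIGNALS = {
    "0": (0, 0, 0, 0),
    "1": (1, 1, 1, 1),
    "A": (0, 0, 1, 1),
    "B": (0, 1, 0, 1),
}

def mux(d0, d1, sel):
    """Row-wise 2:1 MUX: output = d0 where sel = 0, d1 where sel = 1."""
    return tuple(a if s == 0 else b for a, b, s in zip(d0, d1, sel))

funcs = {mux(d0, d1, sel)
         for d0, d1, sel in product(SIGNALS.values(), repeat=3)}
nontrivial = funcs - {SIGNALS["0"], SIGNALS["1"]}
print(len(nontrivial))   # 10 of the 16 two-variable functions
```

The four missing functions are XOR, XNOR, NAND, and NOR, which need more than one MUX (or inverted inputs).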
ACT 2 and ACT 3 Logic Modules
• ACT 1 requires 2 LMs per flip-flop: with unknown
interconnect capacitance.
• ACT 2 and ACT 3 use two types of LMs, one includes a
D flip-flop.
• ACT 2 C-Module is similar to the ACT 1 LM but can
implement five-input logic functions.
• combinatorial module implements combinational logic
(blame MMI for the misuse of terms).
• ACT 2 S-Module (sequential module) contains a C-
Module and a sequential element.
Timing Model and Critical Path
Speed Grading
• Speed grading (or speed binning) uses a
binning circuit.
• Measure tPD = (tPLH + tPHL)/2 — and use the fact
that properties match across a chip.
• Actel speed grades are based on 'Std' speed
grade:
– '1' speed grade is approximately 15 percent faster
than 'Std'
– '2' speed grade is approximately 25 percent faster
than 'Std'
– '3' speed grade is approximately 35 percent faster
than 'Std'.
Actel Logic Module Analysis
• Actel uses a fine-grain architecture which allows you to
use almost all of the FPGA.
• Synthesis can map logic efficiently to a fine-grain
architecture.
• Physical symmetry simplifies place-and-route (swapping
equivalent pins on opposite sides of the LM to ease
routing).
• Matched to small antifuse programming technology.
• LMs balance efficiency of implementation and efficiency
of utilization.
• A simple LM reduces performance, but allows fast and
robust place-and-route.
Xilinx LCA
XC3000 CLB
• A 32-bit look-up table (LUT).
• CLB propagation delay is fixed (the LUT access
time) and independent of the logic function.
– 7 inputs to the XC3000 CLB
– 5 CLB inputs (A–E), and 2 flip-flop outputs (QX and
QY)
• 2 outputs from the LUT (F and G).
– Since a 32-bit LUT requires only five variables to form
a unique address, there are several ways to use the
LUT
The Xilinx XC3000 CLB
• It has 5 logic inputs (A, B, C, D, E), a data input (DI), a
clock input (K), a clock enable (EC), a direct reset (DR)
and two outputs (X and Y).
• The trapezoidal blocks are programmable multiplexers.
• These can be programmed using configuration memory
cell to select the inputs.
• Each block with letter M attached to the multiplexer
represents 1 bit configuration memory cell.
• Use 5 of the 7 possible inputs (A–E, QX, QY) with the
entire 32-bit LUT (the CLB outputs (F and G) are then
identical).
• The combinational function block contains RAM memory
cells and can be programmed to realize any function of 5
variables or 2 functions of 4 variables.
• The functions are stored in the truth table form.
• This block can be operated in 3 different mode namely,
FG, F and FGM modes.
• Split the 32-bit LUT in half to implement 2 functions of 4
variables each:
• FG mode generates two functions of four variables
each.
– One variable (A) must be common to both functions.
– The next two variables can be chosen from B, C, QX, and QY.
– The remaining variable can be either D or E.
• F mode can generate one function of 5 variables:
– Three inputs are A, D, and E.
– The remaining 2 variables are selected from B, C, QX, and QY.
• FGM mode uses a multiplexer at the output with E as a
select line to select one of 2 functions of 4 variables.
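The LUT-as-truth-table idea behind these modes can be sketched in a few lines (an illustrative model only, not Xilinx's circuit; all names are invented):

```python
# A 32-bit LUT: the five inputs form an address, and the stored bit
# at that address is the function value.
def make_lut5(f):
    """Tabulate a 5-input boolean function into 32 entries."""
    return [f((i >> 4) & 1, (i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1)
            for i in range(32)]

def lut_read(table, a, b, c, d, e):
    return table[(a << 4) | (b << 3) | (c << 2) | (d << 1) | e]

# F-mode style use: one function of all 5 variables, e.g. a 5-input AND:
and5 = make_lut5(lambda a, b, c, d, e: a & b & c & d & e)
print(lut_read(and5, 1, 1, 1, 1, 1))   # 1

# FG mode corresponds to letting one input select a 16-entry half-table,
# so the two halves hold two independent 4-variable functions.
```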
XC4000 Logic Block
Two 4-input Functions, Registered Output and
a two input Functions
5-input Function, Combinational Output
XC5200 Logic Block
• The logic cell is similar to the CLBs in the
XC3000/4000 CLBs, but simpler.
• The XC5200 logic cell contains a four-input LUT, a
flip-flop, and MUXes to handle signal switching.
• The arithmetic carry logic is separate from the
LUTs.
• A limited capability to cascade functions is
provided to gang two logic cells in parallel to
provide equivalent of a five-input LUT.
• The Xilinx XC5200 family basic logic cell (LC) is
as shown:
Xilinx CLB Analysis
• The use of a LUT has advantages and disadvantages:
– An inverter is as slow as a five-input NAND
– A LUT simplifies timing of synchronous logic
– Matched to large SRAM programming technology
• Xilinx uses two speed-grade systems:
– Maximum guaranteed toggle rate of a CLB flip-flop (in MHz) as a
suffix—higher is faster
• Example: Xilinx XC3020-125 has a toggle frequency of 125MHz
– Delay time of the combinational logic in a CLB in ns—lower is
faster
• Example: XC4010-6 has tILO = 6.0 ns
– Correspondence between grade and tILO is fairly accurate for the
XC2000, XC4000, and XC5200 but not for the XC3000
Xilinx LCA timing model
Altera FLEX
• The basic logic cell, the Logic Element (LE), used in the
FLEX 8000 series of FPGAs is shown:
(a) Chip floorplan
(b) Logic Array Block (LAB)
(c) Details of the Logic Element (LE)
• Apart from cascade logic, the FLEX cell resembles the
XC5200 LC architecture.
• The FLEX LE uses a four-input LUT, a flip-flop, cascade
logic, and carry logic.
• Eight LEs are stacked to form a Logic Array Block (LAB).
The Altera FLEX architecture
The Altera MAX architecture
• The Altera MAX architecture (the macrocell details vary
between the MAX families—the functions shown here
are closest to those of the MAX 9000 family macrocells)
(a) Organization of logic and interconnect
(b) LAB (Logic Array Block)
(c) Macrocell
• Features:
– Logic expanders and expander terms (helper terms) increase
term efficiency
– Shared logic expander (shared expander, intranet) and parallel
expander (internet)
– Deterministic architecture allows deterministic timing before logic
assignment
– Any use of two-pass logic breaks deterministic timing
– Programmable inversion increases term efficiency
Logic Expanders
• A registered PAL with i inputs, j product terms,
and k macrocells.
• Features and keywords:
– product-term line
– programmable array logic
– bit line
– word line
– programmable-AND array (or product-term array)
– pull-up resistor
– wired-logic
– wired-AND
– macrocell
– 22V10 PLD
Altera MAX Timing Model
Logic cells used by programmable ASICs
Comparisons
PROGRAMMABLE
ASIC
I/O CELLS
Agenda
• DC Output
• AC Output
• DC Input
• AC Input
• Clock Input
• Xilinx XC4000 I/O Block
• Xilinx LCA Timing Model
• Other I/O Cells
DC Output
• A circuit to drive a small electric motor (0.5A) using ASIC
I/O buffers.
• Work from the outputs to the inputs.
• The 470 Ω resistors drop up to 5 V if an output buffer
current approaches 10 mA, reducing the drive to the
output transistors.
Totem-Pole Output and Clamp Diodes
AC Output
Supply Bounce
Supply bounce
• The voltage drop across RS and LS causes a spike on
the GND net, changing the value of VSS, leading to a
problem known as supply bounce.
• A substantial current IOL may flow in the resistance, RS,
and inductance, LS, that are between the on-chip GND
net and the off-chip, external ground connection.
(a) As the pull-down device, M1, switches, it causes the
GND net (value Vss) to bounce
(b) The supply bounce is dependent on the output slew
rate
(c) Ground bounce can cause other output buffers to
generate a logic glitch
(d) Bounce can also cause errors on other inputs
Transmission Lines
DC Input
A switch input
(a) A pushbutton switch connected to an input buffer with a
pull-up resistor
(b) As the switch bounces several pulses may be
generated
• We might have to debounce this signal using an SR flip-
flop or small state machine
Noise Margins
(a) Transfer characteristics of a CMOS inverter with the
lowest switching threshold
(b) The highest switching threshold
(c) A graphical representation of CMOS logic thresholds
(d) Logic thresholds at the inputs and outputs of a logic
gate or an ASIC
(e) The switching thresholds viewed as a plug and socket
(f) CMOS plugs fit CMOS sockets and the clearances are
the noise margins
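In equation form, using the standard definitions (the threshold values below are assumed for illustration, not taken from a datasheet):

```python
# Noise margins from the four logic thresholds:
#   NM_L = V_IL - V_OL   (low-level margin)
#   NM_H = V_OH - V_IH   (high-level margin)
def noise_margins(v_ol, v_il, v_ih, v_oh):
    return v_il - v_ol, v_oh - v_ih

# Assumed 5 V CMOS-style thresholds, purely for illustration:
nml, nmh = noise_margins(v_ol=0.1, v_il=1.35, v_ih=3.15, v_oh=4.9)
print(round(nml, 2), round(nmh, 2))   # 1.25 1.75
```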
Mixed-Voltage Systems
AC Input
• A synchronizer is built from two flip-flops in cascade,
and greatly reduces the effective values of tc and T0 over
a single flip-flop.
• The penalty is an extra clock cycle of latency.
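The benefit of cascading can be quantified with the usual metastability MTBF estimate built on the tc and T0 parameters mentioned above (the formula is the standard one; the parameter values below are assumed purely for illustration):

```python
from math import exp

# MTBF(tr) = exp(tr / tc) / (T0 * f_clk * f_data)
# tr: settling time available; tc, T0: flip-flop metastability parameters.
def mtbf(tr, tc, T0, f_clk, f_data):
    return exp(tr / tc) / (T0 * f_clk * f_data)

one_ff = mtbf(tr=5e-9, tc=0.5e-9, T0=1e-10, f_clk=100e6, f_data=10e6)
two_ff = mtbf(tr=15e-9, tc=0.5e-9, T0=1e-10, f_clk=100e6, f_data=10e6)
# One extra 10 ns clock period of settling multiplies MTBF by e**20,
# roughly a factor of 5e8 - the reason cascading two flip-flops works:
print(two_ff / one_ff)
```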
Clock Input
• Once the clock signal reaches the chip, we adjust the
logic level and then distribute the clock signal around
the chip as needed.
• FPGAs normally provide special clock buffers and
networks.
• Minimize the clock delay (or latency) and clock skew.
• Figure shows part of the I/O timing model for a Xilinx
XC4005-6:
(a) Timing model
(b) A simplified view of clock distribution
(c) Timing diagram
• Xilinx eliminates the variable internal delay tPG, by
specifying a pin-to-pin setup time, tPSUFmin=2ns.
Registered Inputs
Registered Output
(a) Timing model with values for an XC4005-6 programmed
with the fast slew-rate option
(b) Timing diagram
Xilinx XC3000 I/O Block
• The I/O pad connects to one of the pins on the IC
package so that external signals can be input to or
output from the array of logic cells.
• To use the cell as an input, the 3-state control must be
set to place the tristate buffer, which drives the output
pin, in the high-impedance state.
• To use the cell as an output, the tristate buffer must be
enabled.
• Flip-flops are provided so that input and output values
can be stored within the IOB.
• The flip-flops are bypassed when direct input or output is
desired.
• Two clock lines (CK1 and CK2) can be programmed to
connect to either flip-flop.
• The input flip-flop can be programmed to act as an edge
triggered D flip-flop or as a transparent latch.
• Even if the I/O pin is not used, I/O flip-flops can still be
used to store data.
• An OUT signal coming from the logic array first goes
through an XOR gate, where it is either complemented or
not, depending on how the OUT-INVERT bit is
programmed.
• The OUT signal can be stored in the flip-flop if desired.
• Depending on how the OUTPUT-SELECT bit is
programmed, either the OUT signal or the flip-flop
output goes to the output buffer.
• If the 3-STATE signal is 1 and the 3-STATE INVERT bit
is 0, the output buffer has a high-impedance output.
• Otherwise, the buffer drives the output signal to the I/O
pad.
• When the I/O pad is used as an input, the output buffer
must be in the high-impedance state.
• An external signal coming into the I/O pad goes through
a buffer and then to the input of a D flip-flop.
• The buffer output provides a DIRECT IN signal to the
logic array.
• The input signal can be stored in the D flip-flop, which
provides the REGISTERED IN signal to the logic array.
• The input threshold can be programmed to respond to
either TTL or CMOS signal levels.
• SLEW RATE bit controls the rate at which the output
signal can change
• When the output drives an external device, reduction of
the slew rate is desirable to reduce the induced noise
that can occur when the output changes rapidly.
• When the PASSIVE PULL-UP bit is set, the pull-up
resistor is connected to the I/O pad; this internal pull-up
resistor can be used to avoid floating inputs.
Xilinx XC4000 I/O Block
The outputs contain features that allow you to perform the
following:
• Switch between a totem-pole and a complementary
output.
• Include a passive pull-up or pull-down with a typical
resistance of about 50 kΩ.
• Invert the three-state control.
• Include a flip-flop or latch or a direct connection in the
output path.
• Control the slew rate of the output.
The features on the inputs allow you to:
• Configure the input buffer with TTL or CMOS thresholds.
• Include a flip-flop or latch or a direct connection in the
input path.
• Switch in a delay to eliminate an input hold time.
Xilinx LCA Timing Model (XC5200)
Other I/O Cells
• A simplified block diagram of the Altera I/O Control
Block (IOC) used in the MAX 5000 and MAX 7000
series.
• The I/O pin feedback allows the I/O pad to be isolated
from the macrocell.
• It is thus possible to use a LAB without using up an I/O
pad (as you often have to do using a PLD such as a
22V10).
• The PIA is the chipwide interconnect.
• A simplified block diagram of the Altera I/O
Element (IOE), used in the FLEX 8000 and 10k
series.
• The MAX 9000 IOC (I/O Cell) is similar.
• The Fast Track Interconnect bus is the chip-wide
interconnect.
• The Peripheral Control Bus (PCB) is used for
control signals common to each IOE.
PROGRAMMABLE
ASIC
INTERCONNECT
Agenda
• Actel ACT
• Xilinx LCA
• Xilinx EPLD
• Altera MAX 5000 and 7000
• Altera MAX 9000
• Altera FLEX
Design Goals for Programmable Interconnect:
• Provide adequate flexibility
• Use configuration memory efficiently
• Balance bisection bandwidth
• Minimize delays
– propagation and fanout delay
– switched element delay
• The structure and complexity of the interconnect is
largely determined by the programming technology and
the architecture of the basic logic cell.
• The first programmable ASICs were constructed using
two layers of metal; newer programmable ASICs use
three or more layers of metal interconnect.
Actel ACT
The interconnect architecture used in an Actel ACT family
FPGA is shown.
• The channel routing uses dedicated rectangular
areas of fixed size within the chip called wiring
channels.
• Within the horizontal or vertical channels wires
run horizontally or vertically, respectively, within
tracks.
• Actel divides the fixed interconnect wires within
each channel into various lengths or wire
segments.
• The designer then programs the interconnections
by blowing antifuses and making connections
between wire segments.
Routing Resources
• 22 horizontal tracks per channel for signal
routing, with three tracks dedicated to VDD,
GND, and the global clock (GCLK).
• Eight vertical tracks per Logic Module are
available for inputs.
• A vertical track that extends across the two
channels above the module and across the two
channels below the module is an output stub.
• One vertical track per column is a long vertical
track (LVT) that spans the entire height of the
chip.
Elmore’s Constant
Measuring the delay of a net
(a) An RC tree
(b) The waveforms as a result of closing the switch at t=0
• The time constant tDi is often called the Elmore delay
and is different for each node.
• I call tDi the Elmore time constant as a reminder that, if
we approximate Vi by an exponential waveform, the
delay of the RC tree using 0.35/0.65 trip points is
approximately tDi seconds.
• Elmore’s constant enables analysis of RC networks to
estimate the delays due to interconnect.
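For the special case of an unbranched RC ladder, the Elmore sum takes a particularly simple form (a sketch with invented names; a general RC tree needs the shared-path-resistance form of the sum):

```python
# Elmore delay at the far end of an RC ladder R1-C1-R2-C2-...-Rn-Cn:
# each capacitor is weighted by the total resistance from the root to it.
def elmore_ladder(r, c):
    total, r_accum = 0.0, 0.0
    for ri, ci in zip(r, c):
        r_accum += ri          # resistance from the root to this node
        total += r_accum * ci
    return total

# Two equal segments, R = 1 kohm and C = 100 fF each:
# 1e3*1e-13 + 2e3*1e-13 = 300 ps
print(elmore_ladder([1e3, 1e3], [1e-13, 1e-13]))
```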
RC Delay in Antifuse Connections
Actel routing model
Antifuse Parasitic Capacitance
• The number of antifuses connected to the horizontal and
vertical lines for the Actel architecture is shown.
• Each column contains 13 vertical signal tracks and each
channel contains 25 horizontal tracks.
• A connection to the diffusion of an Actel antifuse has a
parasitic capacitance due to the diffusion junction.
• The polysilicon of the antifuse has a parasitic capacitance
due to the thin oxide; these capacitances are
approximately equal.
• Thus, assuming the channels are fully populated with antifuses:
– An input stub (1 channel) connects to 25 antifuses
– An output stub (4 channels) connects to 100 (25x4) antifuses
– An LVT (A1010, 8 channels) connects to 200 (25x8) antifuses
– An LVT (A1020, 14 channels) connects to 350 (25x14) antifuses
– A four-column horizontal track connects to 52 (13x4) antifuses
– A 44-column horizontal track connects to 572 (13x44) antifuses
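These counts are simple track-crossing products, as a quick sketch confirms (helper names invented):

```python
# Fully populated Actel channel: 25 horizontal tracks per channel,
# 13 vertical tracks per column, one antifuse at each crossing.
H_TRACKS, V_TRACKS = 25, 13

def stub_antifuses(channels):
    """Antifuses on a vertical segment spanning `channels` channels."""
    return H_TRACKS * channels

def horiz_antifuses(columns):
    """Antifuses on a horizontal segment spanning `columns` columns."""
    return V_TRACKS * columns

print(stub_antifuses(1))    # input stub: 25
print(stub_antifuses(4))    # output stub: 100
print(stub_antifuses(8))    # 8-channel LVT: 200
print(horiz_antifuses(44))  # 44-column horizontal track: 572
```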
ACT 2 and ACT 3 Interconnect
• The ACT 1 architecture uses two antifuses for routing
nearby modules, three antifuses to join horizontal
segments and four antifuses to use a horizontal or
vertical long track.
• The ACT 2 and ACT 3 architectures use increased
interconnect resources over ACT 1 device.
• This reduces further the number of connections that
need more than two antifuses.
• Delay is also reduced by decreasing the population of
antifuses in the channels, and by decreasing the antifuse
resistance of certain critical antifuses.
• The channel density is the absolute minimum number of
tracks needed in a channel to make a given set of
connections.
• ACT 2 devices have 36 horizontal tracks per channel
rather than the 22 available in the ACT 1 architecture.
• The horizontal track segments in an ACT 3 device
range from a module pair to the full channel length.
• Vertical tracks are:
– Input with a two-channel span (one up, one down)
– Output with a four-channel span (two up, two down)
– Long vertical track (LVT)
• Four LVTs shared by each column pair.
• ACT 2/3 LMs can accept five inputs, rather than four
inputs for the ACT 1 modules, thus ACT 2/3 LMs need
an extra two vertical tracks per channel.
• The number of tracks per column thus increases from
13 to 15 in the ACT 2/3 architecture.
Xilinx LCA
Xilinx LCA interconnect
• The vertical lines and horizontal lines run between
CLBs.
• The general-purpose interconnect joins switch boxes
(also known as magic boxes or switching matrices).
• The long lines run across the entire chip. It is possible to
form internal buses using long lines and the three-state
buffers that are next to each CLB.
• The direct connections (not used on XC4000) bypass
the switch matrices and directly connect adjacent CLBs.
• The Programmable Interconnection Points (PIPs) are
programmable pass transistors that connect the CLB
inputs and outputs to the routing network.
• Bidirectional (BIDI) interconnect buffers restore the
logic level and logic strength on long interconnect paths.
Components of interconnect delay in a Xilinx LCA array
(a) A portion of the interconnect around the CLBs
(b) A switching matrix
(c) A detailed view inside the switching matrix showing
the pass-transistor arrangement
(d) The equivalent circuit for the connection between
nets 6 and 20 using the matrix
(e) A view of the interconnect at a Programmable
Interconnection Point (PIP)
(f) and (g) The equivalent schematic of a PIP connection
(h) The complete RC delay path
Xilinx 4000 Interconnect Details
• Interconnects of XC4000 device are arranged in
horizontal and vertical channels.
• Each channel contains some number of wire segments,
they are:
 Single length lines:
– They span a single CLB
– Provide highest interconnect flexibility and offer fast routing
– Acquire delay whenever line passes through switch matrix
– They are not suitable for routing signal for long distance
 Double length lines:
– They span two CLB so that each line is twice as long as single
length lines
– Provide faster signal routing over intermediate distance
 Long lines:
– Long lines form a grid of metal interconnect segments that run
entire length or width of the array
– They are for high fan-out and nets with critical delay
Xilinx EPLD
• The Xilinx family uses an interconnect bus known as
Universal Interconnection Module (UIM) to distribute
signals within the FPGA.
• CG is the fixed gate capacitance of the EPROM device.
• CD is the fixed drain parasitic capacitance of the EPROM
device.
• CW is the variable horizontal bus capacitance.
• CB is the variable vertical bus capacitance.
• The UIM is as shown:
(a) A simplified block diagram of the UIM. The UIM bus
width, n, varies from 68 (XC7236) to 198 (XC73108)
(b) The UIM is actually a large programmable AND array
(c) The parasitic capacitance of the EPROM cell
Altera MAX 5000 and 7000
Altera MAX 9000
The Altera MAX 9000 interconnect scheme
(a) A 4x5 array of Logic Array Blocks (LABs), the same
size as the EMP9400 chip
(b) A simplified block diagram of the interconnect
architecture showing the connection of the Fast Track
buses to a LAB
Altera FLEX
Summary
• The RC product of the parasitic elements of an antifuse
and a pass transistor are not too different.
• However, an SRAM cell is much larger than an antifuse
which leads to coarser interconnect architectures for
SRAM-based programmable ASICs.
• The EPROM device lends itself to large wired-logic
structures.
• These differences in programming technology lead to
different architectures:
– The antifuse FPGA architectures are dense and regular.
– The SRAM architectures contain nested structures of
interconnect resources.
– The complex PLD architectures use long interconnect lines but
achieve deterministic routing.
Layout
Agenda
• CMOS Process Layers
• Well
• Active
• Poly
• Select
• Poly Contact
• Active Contact
• Metal
• Via
CMOS Process Layers
1. Well

Rule   Description                                            Lambda   Microns
1.1    Minimum width                                          12       3.6u
1.2    Minimum spacing between wells at different potential   18       5.4u
1.3    Minimum spacing between wells at same potential        6        1.8u
Well
Minimum width
Well
Minimum spacing between wells at different potential
Well
Minimum spacing between wells at same potential
2. Active

Rule   Description                                            Lambda   Microns
2.1    Minimum width                                          3        0.9u
2.2    Minimum spacing                                        3        0.9u
2.3    Source/drain active to well edge                       6        1.8u
2.4    Substrate/well contact active to well edge             3        0.9u
2.5    Minimum spacing between active of different implant    0 or 4   0 or 1.2u
Active
Minimum width
Active
Minimum spacing
Active
Source/drain active to well edge
Active
Substrate/well contact active to well edge
Active
Minimum spacing between active of different implant
3. Poly

Rule   Description                        Lambda   Microns
3.1    Minimum width                      2        0.6u
3.2    Minimum spacing                    3        0.9u
3.3    Minimum gate extension of active   2        0.6u
3.4    Minimum active extension of poly   3        0.9u
3.5    Minimum field poly to active       1        0.3u
Poly
Minimum width
Poly
Minimum spacing
Poly
Minimum gate extension of active
Poly
Minimum active extension of poly
Poly
Minimum field poly to active
4. Select

Rule   Description                                       Lambda   Microns
4.1    Minimum select spacing to channel of transistor   3        0.9u
4.2    Minimum select overlap of active                  2        0.6u
4.3    Minimum select overlap of contact                 1        0.3u
4.4    Minimum select width and spacing                  2        0.6u
Select
Minimum select spacing to channel of transistor
Select
Minimum select overlap of active
Select
Minimum select overlap of contact
Select
Minimum select width and spacing
5. Poly Contact

Rule   Description                                     Lambda   Microns
5.1    Exact contact size                              2x2      0.6u x 0.6u
5.2b   Minimum poly overlap                            1        0.3u
5.3    Minimum contact spacing                         3        0.9u
5.4    Minimum spacing to gate of transistor           2        0.6u
5.5b   Minimum spacing to other poly                   5        1.5u
5.6b   Minimum spacing to active (single contact)      2        0.6u
5.7b   Minimum spacing to active (multiple contacts)   3        0.9u
Poly-Contact
Exact contact size
Poly-Contact
Minimum contact spacing
Poly-Contact
Minimum spacing to gate of transistor
Poly-Contact
Minimum spacing to other poly
Poly-Contact
Minimum spacing to active, single contact
Poly-Contact
Minimum spacing to active, multiple contacts
6. Active Contact

Rule   Description                                         Lambda   Microns
6.1    Exact contact size                                  2x2      0.6u x 0.6u
6.2b   Minimum active overlap                              1        0.3u
6.3    Minimum contact spacing                             3        0.9u
6.4    Minimum spacing to gate of transistor               2        0.6u
6.5b   Minimum spacing to diffusion active                 5        1.5u
6.6b   Minimum spacing to field poly (single contact)      2        0.6u
6.7b   Minimum spacing to field poly (multiple contacts)   3        0.9u
6.8b   Minimum spacing to poly contact                     4        1.2u
Active-Contact
Exact contact size
Active-Contact
Minimum active overlap
Active-Contact
Minimum contact spacing
Active-Contact
Minimum spacing to gate of transistor
Active-Contact
Minimum spacing to diffusion active
Active-Contact
Minimum spacing to field poly, single contact
Active-Contact
Minimum spacing to field poly, multiple contacts
Active-Contact
Minimum spacing to poly contact
7. Metal-1

Rule   Description                      Lambda   Microns
7.1    Minimum width                    3        0.9u
7.2a   Minimum spacing                  3        0.9u
7.3    Minimum overlap of any contact   1        0.3u
Metal-1
Minimum width
Metal-1
Minimum spacing
Metal-1
Minimum overlap of any contact
8. Metal-2

Rule   Description               Lambda   Microns
9.1    Minimum width             3        0.9u
9.2b   Minimum spacing           3        0.9u
9.3    Minimum overlap of via1   1        0.3u
Metal-2
Minimum width
Metal-2
Minimum spacing
Metal-2
Minimum overlap of via1
9. Metal-3

Rule   Description                Lambda   Microns
15.1   Minimum width              5        1.5u
15.2   Minimum spacing            3        0.9u
15.3   Minimum overlap of Via-2   2        0.6u
Metal-3
Minimum width
Metal-3
Minimum spacing
Metal-3
Minimum overlap of Via-2
10. Via-1

Rule   Description                  Lambda   Microns
8.1    Exact size                   2x2      0.6u x 0.6u
8.2    Minimum spacing              3        0.9u
8.3    Minimum overlap by Metal-1   1        0.3u
8.4    Minimum spacing to contact   2        0.6u
Via-1
Exact size
Via-1
Minimum spacing
Via-1
Minimum overlap by Metal-1
Via-1
Minimum spacing to contact
11. Via-2

Rule   Description                  Lambda   Microns
14.1   Exact size                   2x2      0.6u x 0.6u
14.2   Minimum spacing              3        0.9u
14.3   Minimum overlap by Metal-2   1        0.3u
14.4   Minimum spacing to Via-1     2        0.6u
Via-2
Exact size
Via-2
Minimum spacing
Via-2
Minimum overlap by Metal-2
Via-2
Minimum spacing to Via-1
Agenda
• Design Rules
• Intra-Layer Design Rules
• Transistor Layout
• Via’s and Contacts
• Select Layer
• CMOS Design Rules
• CMOS Inverter Layout
Design Rules
• Interface between designer and process
engineer.
• Guidelines for constructing process
masks.
• Unit dimension: Minimum line width:
– scalable design rules: lambda parameter
– absolute dimensions (micron rules)
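The lambda approach makes a rule set portable: choosing lambda fixes every dimension. A small sketch using the well rules from the previous section, with an assumed lambda of 0.3 microns:

```python
LAMBDA_UM = 0.3  # assumed microns per lambda for this process

WELL_RULES = {  # rule name: dimension in lambda (from the well table)
    "minimum width": 12,
    "spacing, wells at different potential": 18,
    "spacing, wells at same potential": 6,
}

def in_microns(rule_lambda, lam=LAMBDA_UM):
    """Convert a scalable (lambda) rule to an absolute (micron) rule."""
    return round(rule_lambda * lam, 4)

for name, lam_val in WELL_RULES.items():
    print(f"{name}: {lam_val} lambda = {in_microns(lam_val)} um")
# minimum width: 12 lambda = 3.6 um, and so on
```

Retargeting the same layout to a smaller process is then, to first order, just a change of `LAMBDA_UM`.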
Intra-Layer Design Rules
Transistor Layout
Via’s and Contacts
Select Layer
CMOS Design Rules
• Refer to the other attachment.
CMOS Inverter Layout
Layout Compaction
Agenda
• Introduction
• Example
• Layout Constraints
• Different Approaches to Compaction
– Virtual Grid Compaction
– Constraint-based Compaction
• Constraint Graph Generation
• Problem formulations
• Equality Constraints
• Compaction Enhancements
Introduction
• After P&R, the layout is functionally complete.
• Some vacant space may still be present in the
layout.
– Due to non-optimality of P&R.
• Compaction = removing vacant space.
– Improves cost, performance, and yield.
• Key for high-performance full-custom layouts.
• For standard cells, only the channel heights may be
minimized.
– But channel compactors are near-optimal.
• Compaction tries to minimize total layout area while:
– Retaining speed
– Not violating design rules
– Respecting designer-specified constraints
• Three ways to minimize the layout area:
– Reducing inter-feature space.
• Check spacing design rules
– Reducing feature size.
• Check size rules
– Reshaping features.
• Electrical connectivity must be preserved
Example
Layout Constraints
• For a given layout instance, all features may be
described by a set of placement constraints.
• These layout constraints are imposed by design
rules.
– Min spacing (separation) design rules
Expressing constraints
• Assuming a left-to-right ordering, we can capture constraints
by inequalities.
• Example: if c is to the right of b, and they have to be 2
units apart: c ≥ b + 2
Different Approaches to Compaction
• One dimensional Vs two-dimensional
compaction.
1-D compaction
– Components moved only in x- or y-direction
– Efficient algorithms available
2-D compaction
– Components may be moved both in x- and y-direction
– More effective compaction
– NP-hard
• Determining how x and y should interact to reduce area is
hard!
Of historical interest: virtual-grid-based compaction vs.
constraint-graph-based compaction.
Virtual Grid Compaction:
• Every feature is assumed to lie on a virtual grid line.
• Features are required to stay at grid locations during
compaction, so there is no distortion of the topology.
• Compact by finding minimum possible spacing between
each adjacent pair of grids.
– Min spacing is given by worst-case design-rule for any feature
on the grid
• Advantage: Algorithm is simple and fast.
• Disadvantage: results in a larger area.
Constraint-based Compaction
Constraint Graph Generation
• Constraint graph generation is the most time-consuming
part of constraint-based compaction; approached naively it
is O(n²).
– In the worst case, there may be a design-rule constraint
between every pair of shapes in the layout.
• In reality only a small local subset of constraints are
needed.
• First generate connectivity (grouping) constraints.
• In generating separation constraints, need to define a set
of non-redundant constraints.
– Shadow-propagation algorithm
– Scan-line algorithm
How do we solve these constraints?
Use graphical structure
• We can construct a constraint graph to capture layout
constraints
Problem formulation-1
• Use a labeled directed graph G = <V,E>.
• Vertices represent layout objects.
• Edges represent constraints between vertices.
• Labels represent constraint values.
• Now what do we do with this?
Problem formulation-2
• Goal of 1-D compaction is to generate a minimum width
layout.
• Determination of minimum width is equivalent to solving
a longest path problem.
• The longest path from source to a vertex is the
coordinate of the vertex.
• In practical layouts, the constraints graphs are very local.
– Most edges represent very local constraints in the layout.
• Theoretically, run time is O(|V|+|E|).
• Practically, run time is close to linear in |V|, the size of
the layout!
Problem formulation-3
Compute the longest path in a graph G = <V,E,constraints,Origin>
(constraints is a set of labels, Origin is the super source of the DAG)
Forward_prop(W)
{
for each vertex v in W
for each edge <v,w> from v
value(w) = max(value(w), value(v) + constraint(<v,w>))
if all incoming edges of w have been traversed
add w to W
}
Longest path(G)
Forward_prop(Origin)
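A minimal runnable version of the forward propagation above; edge weights are the minimum-spacing constraints, and the returned value of each vertex is its compacted coordinate (the example graph is illustrative):

```python
from collections import defaultdict, deque

def longest_paths(edges, origin):
    """Longest path from origin in a DAG of (u, v, weight) constraints.

    Each edge (u, v, w) encodes the inequality v >= u + w; the longest
    path from the super source gives each vertex its final coordinate.
    """
    graph = defaultdict(list)
    indeg = defaultdict(int)
    nodes = {origin}
    for u, v, w in edges:
        graph[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))
    value = {n: 0 for n in nodes}
    queue = deque([origin])
    while queue:
        u = queue.popleft()
        for v, w in graph[u]:
            value[v] = max(value[v], value[u] + w)  # relax v >= u + w
            indeg[v] -= 1
            if indeg[v] == 0:  # all incoming edges traversed
                queue.append(v)
    return value

# a >= origin + 0, b >= a + 2, c >= a + 3, c >= b + 2
coords = longest_paths([("o", "a", 0), ("a", "b", 2),
                        ("a", "c", 3), ("b", "c", 2)], "o")
print(coords)  # c is placed at 4, via the longer path through b
```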
First Elaboration: Equality Constraints-1
• Inequality constraints capture min spacing design rules.
• Need also to capture grouping constraints:
– Features from same circuit component
– Need to be moved together
• Grouping constraints are described by equality
constraints.
First Elaboration: Equality Constraints-2
Reflecting Equality Constraints
Alternative Approach to Equality
Constraints
Constraint-Based Compaction
Approach
Build constraint graph
if equality constraint
add to equality constraint graph
if inequality constraint
find distinguished representatives of each
vertex in constraint
add constraint between distinguished
representatives
Check for cycles in inequality constraint graph
If cycles exist terminate with error
Solve equality constraint graph
Solve inequality constraint graph
Compaction Enhancements
• 1-D constraint-based compaction problem can be
formulated optimally and computationally efficiently.
• In real circuits what we want is often more complex than
can be captured in simple linear inequalities of the form:
• e ≥ c + 1
• Or equalities of the form:
• u = t – 3
• For example:
– Wire length minimization
– Spacing, slack distribution
– Jog introduction
• Improving area minimization using:
– 1½ D compaction
– 2D compaction
Wire-Length Minimization
• Features not on the critical path will be pulled towards
a layout edge because they are given their minimal
legal spacing.
• May lead to increased wire length and slower circuit
performance.
• Can re-distribute ‘slack’ (the available empty space) to
the features not on the critical path.
One-and-a-half Dimensional Compaction
• Key idea: provide enough lateral movements to blocks
during compaction to resolve interferences.
• Algorithm starts with a partially compacted layout (two
applications of 1-D compaction).
• Maintain two lists:
– floor
– ceiling
• Floor is a list of blocks visible from the top.
• Ceiling is the list of blocks visible from the bottom.
• Select the lowest block in the ceiling and move it to the
floor, maximizing the gap between the floor and ceiling.
• Continue until all blocks have been moved from ceiling to
floor.
1-Dimensional Compaction in 2D
True Two-Dimensional Compaction
Logic synthesis
Agenda
• What is Synthesis?
• Why Synthesis?
• Logic Synthesis
• HDL Synthesis
• Optimization
• Two-Level Logic Optimization
• Multi-Level Logic Optimization
• Timing Optimization
What is Synthesis?
• It is a process of automatically transforming a
hardware description from a higher level of
abstraction to lower level of abstraction, or from
behavioral to structural domain.
• Translate HDL descriptions into logic gate
networks (structural domain) in a particular
library.
• Advantages are:
– Reduce time to generate netlists
– Easier to retarget designs from one technology to
another
– Reduce debugging effort
• Requirement:
– Robust HDL synthesizers
• Aim of synthesis: To get a physical realization
of a digital circuit starting with an abstract
functional description.
• The transformations have to be done under
some constraints.
• Constraints are:
– Area
– Delay
– Power
– Testability
Abstraction levels and domains of
hardware descriptions
Why Synthesis?
• Increasing complexity of designs.
– Multi-million gate designs
• Time to market.
• Multiple target technologies.
• Conflicting constrains.
• Correct by construction.
Logic Synthesis
• Logic synthesis provides a link between an HDL and a
netlist similarly to the way that a C compiler provides a
link between C code and machine language.
• C was developed for use with compilers, but HDLs
were not developed for use with logic-synthesis tools.
• Verilog was designed as a simulation language.
• VHDL was designed as a documentation and
description language.
• Both Verilog and VHDL were developed in the early
1980s, well before the introduction of commercial logic
synthesis software.
• Because these HDLs are now being used for purposes
for which they were not intended, the state of the art in
logic synthesis falls far short of that for computer-
language compilers.
• Logic synthesis forces designers to use a subset of both
Verilog and VHDL.
• This makes using logic synthesis more difficult rather
than less difficult.
• The current state of synthesis software is rather like
learning a foreign language, and then having to talk to a
five-year-old.
• When talking to a logic-synthesis tool using an HDL, it is
necessary to think like hardware, anticipating the netlist
that logic synthesis will produce.
• Designers use graphic or text design entry to create an
HDL behavioral model, which does not contain any
references to logic cells.
• State diagrams, graphical datapath descriptions, truth
tables, RAM/ROM templates, and gate-level schematics
may be used together with an HDL description.
• Once a behavioral HDL model is complete, two items are
required to proceed:
– A logic synthesizer (software and documentation)
– A cell library (the logic cells NAND gates and such) that is called
the target library
• Most synthesis software companies produce only software.
• Most ASIC vendors produce only cell libraries.
• The behavioral model is simulated to check that the
design meets the specifications and then the logic
synthesizer is used to generate a netlist, a structural
model, which contains only references to logic cells.
• There is no standard format for the netlists that logic
synthesis produces, but EDIF is widely used.
• Some logic-synthesis tools can also create structural
HDL (Verilog, VHDL, or both).
• Following logic synthesis the design is simulated again,
and the results are compared with the earlier behavioral
simulation.
• Layout for any type of ASIC may be generated from the
structural model produced by logic synthesis.
• Using logic synthesis we move from a behavioral model
to a structural model.
• Logic synthesis optimizes the implementation cost of
Boolean logic.
• Cost:
– area, delay, power.
• Optimization can be:
– Technology independent optimization (logic optimization)
– Technology dependent optimization (technology mapping)
HDL Synthesis
Domain Translation
Optimization
• Technology-independent optimization (logic
optimization):
– Work on Boolean expression equivalent
– Estimate size based on # of literals
– Use simple delay models
• Technology-dependent optimization (technology
mapping/library binding):
– Map Boolean expressions into a particular cell library
– May perform some optimizations in addition to simple
mapping
– Use more accurate delay models based on cell
structures
Two-Level Logic Optimization
• Two-level logic representations
– Sum-of-product form
– Product-of-sum form
• Two-level logic optimization
– Key technique in logic optimization
– Many efficient algorithms to find a near minimal
representation in a practical amount of time
– In commercial use for several years
– Minimization criteria: number of product terms
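One classic way to find such a near-minimal sum-of-products form is the Quine-McCluskey procedure; a compact sketch that merges minterms into prime implicants (the (value, mask) implicant encoding is my own choice for the sketch, not from the source):

```python
from itertools import combinations

def primes(minterms):
    """Prime implicants of a Boolean function via Quine-McCluskey merging.

    Implicants are (value, mask) pairs: a 1 bit in mask marks a
    don't-care position, i.e. a literal eliminated by merging.
    """
    current = {(m, 0) for m in minterms}
    prime = set()
    while current:
        merged, nxt = set(), set()
        for a, b in combinations(sorted(current), 2):
            if a[1] == b[1]:                         # same don't-care bits
                diff = a[0] ^ b[0]
                if diff and diff & (diff - 1) == 0:  # differ in one bit
                    nxt.add((a[0] & b[0], a[1] | diff))
                    merged.update((a, b))
        prime |= current - merged  # unmerged implicants are prime
        current = nxt
    return prime

def literals(implicant, nbits):
    # Cost estimate: number of literals left in the product term.
    _, mask = implicant
    return nbits - bin(mask).count("1")

# f(a,b,c) = sum of minterms (0,1,2,3,7) minimizes to a' + b.c
print(primes({0, 1, 2, 3, 7}))  # {(0, 3), (3, 4)}
```

Here (0, 3) is the single-literal term a' covering minterms 0 to 3, and (3, 4) is the two-literal term b.c covering minterms 3 and 7.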
Multi-Level Logic Optimization
• Translate a combinational circuit to meet performance or
area constraints:
– Two-level minimization
– Common factors or kernel extraction
– Common expression re-substitution
• In commercial use for several years.
• Technology independent.
• Decomposition/Restructuring:
– Algebraic
– Functional
• Node optimization
– Two-level logic optimization techniques are used
Example:
Technology Mapping
• Goal: translation of a technology-independent representation
(Boolean networks) of a circuit into a circuit in a given
technology (standard cells) with optimal cost.
• Optimization criteria:
– Minimum area
– Minimum delay
– Meeting specified timing constraints
– Meeting specified timing constraints with minimum area
• Usage:
– Technology mapping after technology independent logic
optimization
– Technology translation
Timing Optimization
• There is always a trade-off between area and
delay.
• Optimize timing to meet delay spec. with
minimum area.
Multi-Level Vs Two-Level
Summary
• A logic synthesizer may contain over 500K lines of code.
• With such a complex system, complex inputs, and little
feedback at the output, there is a danger of garbage in,
garbage out.
• Ask: What do I expect to see at the output? Does the
output make sense? If you cannot answer these questions,
you should simplify the input (reduce the width of the
buses, simplify or partition the code, and so on).
• The worst thing you can do is write and simulate a huge
amount of code, read it into the synthesis tool, and try
and optimize it all at once with the default settings.
• With experience it is possible to recognize what the logic
synthesizer is doing by looking at the number of cells,
their types, and the drive strengths.
• For example, if there are many minimum drive strength
cells on the critical path it is usually an indication that the
synthesizer has room to increase speed by substituting
cells with stronger drive.
• This is not always true; sometimes a higher-drive cell
may actually slow down the circuit.
• This is because the larger cell adds load capacitance
without adding enough drive to make up for it.
• This is why logical effort is a useful measure.
• Because interconnect delay is increasingly dominant, it
is important to begin the physical design steps as early
as possible.
• Ideally floorplanning and logic synthesis should be
completed at the same time.
• This ensures that the estimated interconnect delays are
close to the actual delays after routing is complete.
Simulation
Agenda
• Introduction
• How Logic Simulation Works
• Types of Simulation
– Behavioral Simulation
– Functional Simulation
• Formal Verification
– Static Timing Analysis
– Gate-Level Simulation
– Switch-Level Simulation
– Transistor-Level Simulation
Introduction
• Simulation is used at many stages during
ASIC design.
• Initial prelayout simulations include logic-
cell delays but no interconnect delays.
• Estimates of capacitance may be included
after completing logic synthesis, but only
after physical design it is possible to
perform an accurate postlayout simulation.
How Logic Simulation Works
• The most common type of digital simulator is an event-
driven simulator.
• When a circuit node changes in value, the time, the node,
and the new value are collectively known as an event.
• The event is scheduled by putting it in an event queue or
event list.
• When the specified time is reached, the logic value of
the node is changed.
• The change affects logic cells that have this node as an
input.
• All of the affected logic cells must be evaluated, which
may add more events to the event list.
• The simulator keeps track of the current time, the current
time step, and the event list that holds future events.
• For each circuit node the simulator keeps a record of the
logic state and the strength of the source or sources
driving the node.
• When a node changes logic state, whether as an input or
as an output of a logic cell, this causes an event.
• An interpreted-code simulator uses the HDL model as
data, interpreting it directly rather than compiling it into
the simulator structure, and then executes the model.
• This type of simulator usually has a short compile time
but a longer execution time compared to other types of
simulator.
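A minimal event-driven simulator can be sketched in a few lines: an event queue ordered by time, a node-state table, and re-evaluation of the fanout gates on each change. The NAND netlist and the 2 ns gate delay below are illustrative only:

```python
import heapq

def simulate(gates, inputs, t_end=100):
    """Minimal event-driven, fixed-delay logic simulation.

    gates: output net -> (function, input nets, delay)
    inputs: list of (time, net, value) stimulus events
    Returns the waveform as a list of (time, net, value) changes.
    """
    fanout = {}
    for out, (_, ins, _) in gates.items():
        for n in ins:
            fanout.setdefault(n, []).append(out)
    state = {}
    events = list(inputs)
    heapq.heapify(events)                     # the event list / queue
    waveform = []
    while events:
        t, net, val = heapq.heappop(events)   # earliest event first
        if t > t_end or state.get(net) == val:
            continue                          # no change: schedule nothing
        state[net] = val
        waveform.append((t, net, val))
        for g in fanout.get(net, ()):         # evaluate affected gates
            fn, ins, delay = gates[g]
            new = fn(*(state.get(i, 0) for i in ins))
            heapq.heappush(events, (t + delay, g, new))
    return waveform

# y = NAND(a, b) with a 2 ns delay
gates = {"y": (lambda a, b: 1 - (a & b), ["a", "b"], 2)}
stim = [(0, "a", 1), (0, "b", 0), (5, "b", 1)]
print(simulate(gates, stim))
# [(0, 'a', 1), (0, 'b', 0), (2, 'y', 1), (5, 'b', 1), (7, 'y', 0)]
```

Note how the unchanged-value check suppresses events, which is what keeps event-driven simulation fast on large, mostly idle circuits.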
Types of Simulation
• Simulators are usually divided into the following
categories or simulation modes.
– Behavioral simulation
– Functional simulation
– Static timing analysis
– Gate-level simulation
– Switch-level simulation
– Transistor-level or circuit-level simulation
• This list is ordered from high-level to low-level simulation
(high-level being more abstract, and low-level being
more detailed).
• Proceeding from high-level to low-level simulation, the
simulations become more accurate, but they also
become progressively more complex and take longer to
run.
• While it is just possible to perform a behavioral-level
simulation of a complete system, it is impossible to
perform a circuit-level simulation of more than a few
hundred transistors.
• There are several ways to create a simulation model of a
system.
Behavioral Simulation
• One method models large pieces of a system as black
boxes with inputs and outputs.
• This type of simulation (often using VHDL or
Verilog) is called behavioral simulation.
• Thus, behavioral simulation can only tell us whether our
design will not work; it cannot tell us that real
hardware will work.
• We use logic synthesis to produce a structural
model from a behavioral model.
Functional Simulation
• To find the critical path using logic simulation
requires simulating all possible input transitions
and then sifting through the output to find the
critical path.
• Vector-based simulation (or dynamic simulation)
can show us that our design functions correctly
hence the name functional simulation.
• Functional simulation ignores timing and
includes unit-delay simulation, which sets delays
to a fixed value (for example, 1 ns).
• However, functional simulation does not work
well if we wish to find the critical path.
• Once a behavioral or functional simulation predicts that
a system works correctly, the next step is to check the
timing performance.
• At this point a system is partitioned into smaller ASICs
and a timing simulation is performed for each ASIC
separately (otherwise the simulation run times become
too long).
• One class of timing simulators employs timing analysis
that analyzes logic in a static manner, computing the
delay times for each path, this is called static timing
analysis because it does not require the creation of a
set of test (or stimulus) vectors (difficult for a large
ASIC).
• Timing analysis works best with synchronous systems
whose maximum operating frequency is determined by
the longest path delay between successive flip-flops.
• The path with the longest delay is the critical path.
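A static timing analyzer is essentially a longest-path calculation over the gate netlist, with no stimulus vectors needed; a sketch that computes arrival times and slack (the cell delays and clock period below are made up):

```python
def static_timing(cells, primary_inputs, t_clock):
    """Arrival times and output slack for a combinational netlist.

    cells: output net -> (list of input nets, cell delay)
    Arrival of a net = max arrival of its inputs + cell delay;
    slack = clock period - arrival at a primary output.
    """
    arrival = {pi: 0.0 for pi in primary_inputs}

    def arr(net):  # memoized longest-path arrival (netlist is a DAG)
        if net not in arrival:
            ins, d = cells[net]
            arrival[net] = max(arr(i) for i in ins) + d
        return arrival[net]

    # Primary outputs: nets that drive no other cell.
    outs = [n for n in cells if all(n not in c[0] for c in cells.values())]
    slack = {o: t_clock - arr(o) for o in outs}
    return arrival, slack

cells = {"n1": (["a", "b"], 1.0),
         "n2": (["n1", "c"], 2.0),
         "out": (["n2"], 1.0)}
arrival, slack = static_timing(cells, ["a", "b", "c"], t_clock=5.0)
print(arrival["out"], slack["out"])  # 4.0 1.0
```

A negative slack would flag the critical path as violating the clock period; note that, as the text warns, nothing here checks whether the path can actually be activated (a false path).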
• Logic simulation or gate-level simulation can also be
used to check the timing performance of an ASIC.
• In a gate-level simulator a logic gate or logic cell (NAND,
NOR, and so on) is treated as a black box modeled by a
function whose variables are the input signals.
• The function may also model the delay through the logic
cell.
• Setting all the delays to unit value is the equivalent of
functional simulation.
• Timing information for most gate-level simulators is
calculated once, before simulation, using a delay
calculator.
• This works as long as the logic cell delays and signal
ramps do not change.
• In addition to pin-to-pin timing differences there is a
timing difference depending on state.
• What if the timing simulation provided by a black-box model
of a logic gate is not accurate enough?
• The next, more detailed, level of simulation is switch-
level simulation which models transistors as switches
on or off.
• Switch-level simulation can provide more accurate timing
predictions than gate-level simulation, but without the
ability to use logic-cell delays as parameters of the
models.
• The most accurate, but also the most complex and time-
consuming, form of simulation is transistor-level
simulation.
• A transistor-level simulator requires models of
transistors, describing their nonlinear voltage and current
characteristics.
• Each type of simulation normally uses a different
software tool.
• A mixed-mode simulator permits different parts
of an ASIC simulation to use different simulation
modes.
• For example, a critical part of an ASIC might be
simulated at the transistor level while another
part is simulated at the functional level.
• Be careful not to confuse this kind of mixed-mode
(mixed-level) simulation with a mixed analog/digital
simulator.
Formal Verification
• Using logic synthesis we move from a behavioral model
to a structural model.
• How are we to know (other than by trusting the logic
synthesizer) that the two representations are the same?
• We have already seen that we may have to alter the
original reference model because the HDL acceptable to
a synthesis tool is a subset of HDL acceptable to
simulators.
• Formal verification can prove, in the mathematical
sense, that two representations are equivalent.
• If they are not, the software can tell us why and how two
representations differ.
Static Timing Analysis
• A timing analyzer answers the question: What is the
longest delay in my circuit?
• The timing analyzer gives us only the critical path and its
delay.
• A timing analyzer does not give us the input vectors that
will activate the critical path.
• In fact input vectors may not exist to activate the critical
path.
• Paths that are impossible to activate are known as false
paths.
• Timing analysis is essential to ASIC design but has
limitations.
• A timing-analysis tool is more of a logic calculator than a
logic simulator.
Gate-Level Simulation
• The simulator is adding capacitance to the outputs of
each of the logic cells to model the parasitic net
capacitance (interconnect capacitance or wire
capacitance) that will be present in the physical layout.
• The model that predicts these values is known as a wire-
load model, wire-delay model, or interconnect model.
• Changing the wire-load model parameters to zero and
repeating the simulation changes the critical-path delay,
which agrees exactly with the logic-synthesizer timing
analysis.
• This emphasizes that the net capacitance may contribute
a significant delay.
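A wire-load model is typically a fanout-indexed table of estimated net capacitance fed into a linear delay equation; a sketch with invented table values (the numbers are illustrative, not from any library):

```python
# Hypothetical wire-load table: estimated net capacitance (pF) by fanout.
WIRE_LOAD_PF = {1: 0.02, 2: 0.035, 3: 0.05, 4: 0.07}

def cell_delay_ns(intrinsic_ns, drive_kohm, pin_cap_pf, fanout):
    """Linear delay model: intrinsic + Rdrive * (pin caps + wire cap)."""
    # Extrapolate linearly beyond the table, as wire-load models often do.
    wire_pf = WIRE_LOAD_PF.get(fanout, 0.07 + 0.02 * (fanout - 4))
    load_pf = fanout * pin_cap_pf + wire_pf
    return intrinsic_ns + drive_kohm * load_pf  # kohm * pF = ns

# The same gate with and without the wire-load contribution:
with_wire = cell_delay_ns(0.1, 2.0, 0.01, 2)   # 0.21 ns
no_wire = 0.1 + 2.0 * (2 * 0.01)               # 0.14 ns
print(with_wire, no_wire)
```

Zeroing the table reproduces the logic-synthesizer timing, which is the comparison described above: the difference is the delay contributed by estimated net capacitance.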
Switch-Level Simulation
• The switch-level simulator is a more detailed
level of simulation than other simulators.
• A switch-level simulator keeps track of voltage
levels as well as logic levels, and it may do this
in several ways.
• The simulator may use a large set of discrete values, or
the value of a node may be allowed to vary continuously.
• Switch-level simulation is required to check the
behavior of circuits that may not always have
nodes that are driven or that use logic that is not
complementary.
Transistor-Level Simulation
• When we need to simulate a logic circuit with more
accuracy than provided by switch-level simulation.
• Then we turn to simulators that can solve circuit
equations exactly, given models for the nonlinear
transistors, and predict the analog behavior of the node
voltages and currents in continuous time.
• This type of transistor-level simulation or circuit-level
simulation is costly in computer time.
• It is impossible to simulate more than a few hundred
logic cells using a circuit-level simulator.
• Virtually all circuit-level simulators used for ASIC design
are commercial versions of the SPICE.
Summary
• Behavioral simulation includes no timing
information and can tell you only if your design
will not work.
• Pre-layout simulation of a structural model can
give you estimates of performance, but finding a
critical path is difficult because you need to
construct input vectors to exercise the model.
• Static timing analysis is the most widely used
form of simulation.
– It is convenient because you do not need to create
input vectors.
– Its limitation is that it can produce false paths:
critical paths that may never be activated.
• Formal verification is a powerful adjunct to
simulation to compare two different
representations and formally prove if they are
equal.
– It cannot prove your design will work.
• Switch-level simulation is required to check
the behavior of circuits that may not always have
nodes that are driven or that use logic that is not
complementary.
• Transistor-level simulation is used when you
need to know the analog, rather than the digital,
behavior of circuit voltages.
ASIC
CONSTRUCTION
Agenda
• Physical Design
• CAD Tools
• Methods and Algorithms
• Timing-driven floorplanning and placement
design flow
• System Partitioning
• Floorplanning
• Placement
• Routing
Physical Design
• The physical design of ASICs is normally
divided into system partitioning, floorplanning,
placement, and routing.
• A microelectronic system is the town and the
ASICs are the buildings.
– System partitioning corresponds to town planning.
– ASIC floorplanning is the architect's job.
– Placement is done by the builder.
– Routing is done by the electrician.

• We shall design most, but not all, ASICs using
these design steps.
• The steps may be performed in a
slightly different order, iterated or
omitted depending on the type and
size of the system and its ASICs.
• As the focus shifts from logic to
interconnect, floorplanning
assumes an important role.
• Each of the steps shown in the
figure must be performed and
each depends on previous step.
• However, the trend is toward
completing these steps in a
parallel fashion and iterating,
rather than in a sequential manner.
• We must first apply system partitioning to divide a
microelectronics system into separate ASICs.
• In floorplanning we estimate sizes and set the initial
relative locations of the various blocks in our ASIC.
• At the same time we allocate space for clock and power
wiring and decide on the location of the I/O and power
pads.
• Placement defines the location of the logic cells within
the flexible blocks and sets aside space for the
interconnect to each logic cell.
• Placement for a gate-array or standard-cell design
assigns each logic cell to a position in a row.
• For an FPGA, placement chooses which of the fixed
logic resources on the chip are used for which logic cells.
• Floorplanning and placement are closely related and are
sometimes combined in a single CAD tool.
• Routing makes the connections between logic cells.
• Routing is a hard problem by itself and is normally split
into two distinct steps, called global and local routing.
• Global routing determines where the interconnections
between the placed logic cells and blocks will be
situated.
• Only the routes to be used by the interconnections are
decided in this step, not the actual locations of the
interconnections within the wiring areas.
• Global routing is called loose routing for this reason.
• Local routing joins the logic cells with interconnections.
• Information on which interconnection areas to use
comes from the global router.
• Only at this stage of layout do we finally decide on the
width, mask layer, and exact location of the
interconnections.
• Local routing is also known as detailed routing.
CAD Tools
• In order to develop a CAD tool it is necessary to
convert each of the physical design steps to a
problem with well-defined goals and objectives.
• The goals for each physical design step are the
things we must achieve.
• The objectives for each step are things we
would like to meet on the way to achieving the
goals.
• Some examples of goals and objectives for
each of the ASIC physical design steps are as
follows:
• System partitioning:
– Goal: Partition a system into a number of
ASICs.
– Objective: Minimize the number of external
connections between the ASICs, keep each
ASIC smaller than a maximum size.

• Floorplanning:
– Goal: Calculate the sizes of all the blocks and
assign them locations.
– Objective: Keep the highly connected blocks
physically close to each other.
• Placement:
– Goal: Assign the interconnect areas and the
location of all the logic cells within the flexible
blocks.
– Objective: Minimize the ASIC area and the
interconnect density.
• Global routing:
– Goal: Determine the location of all the
interconnect.
– Objective: Minimize the total interconnect
area used.
• Detailed routing:
– Goal: Completely route all the interconnect on
the chip.
– Objective: Minimize the total interconnect
length used.

• There is no magic recipe involved in the
choice of the ASIC physical design steps.
• Floorplanning and placement are often
thought of as one step and in some tools
placement and routing are performed
together.
Methods and Algorithms
• A CAD tool needs methods or algorithms to generate a
solution to each problem using a reasonable amount of
computer time.
• Often there is no best solution possible to a particular
problem, and the tools must use heuristic algorithms, or
rules of thumb, to try and find a good solution.
• To solve each of the ASIC physical design steps we
require:
– a set of goals and objectives, a way to measure the goals and
objectives
– an algorithm or method to find a solution that meets the goals
and objectives
• The term algorithm is usually reserved for a method that
always gives a solution.
• We need to know how practical any algorithm is.
• We say the complexity of an algorithm is O(f(n)).
• The function f(n) is usually one of the following kinds:
– f(n) = constant
– f(n) = log n
– f(n) = n
– f(n) = n log n
– f(n) = n²
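To make the difference between these growth rates concrete, here is a small sketch (not from the text) tabulating f(n) for a few problem sizes:

```python
import math

# Tabulate the common complexity functions for increasing problem size n.
rows = []
for n in (10, 100, 1000, 10000):
    rows.append((n, math.log2(n), n * math.log2(n), n * n))
    print(f"n={n:>6}  log n={rows[-1][1]:6.1f}  "
          f"n log n={rows[-1][2]:12.0f}  n^2={rows[-1][3]:>12}")
```

Going from 1,000 to 10,000 cells multiplies an O(n²) run time by 100, but an O(n log n) run time by only about 13, which is why near-linear algorithms matter for large networks.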
• As designers attempt to achieve a desired ASIC
performance they make a continuous trade-off between
speed, area, power, and several other factors.
• Presently CAD tools are not smart enough to be able to
do this alone.
• Current CAD tools are only capable of finding a solution
subject to a few, very simple, objectives.
Timing-driven floorplanning and
placement design flow
• The flow consists of the following steps:
– Design entry
– Synthesis
– Initial floorplan
– Synthesis with load constraints
– Timing-driven placement
– Synthesis with in-place optimization
– Detailed placement
• Design entry:
– The input is a logical description with no physical
information.

• Synthesis:
– The initial synthesis contains little or no information
on any interconnect loading.
– The output of the synthesis tool (EDIF netlist) is the
input to the floorplanner.

• Initial floorplan:
– From the initial floorplan interblock capacitances are
input to the synthesis tool as load constraints and
intrablock capacitances are input as wire-load tables.
• Synthesis with load constraints:
– At this point the synthesis tool is able to resynthesize
the logic based on estimates of the interconnect
capacitance each gate is driving.
– The synthesis tool produces a forward annotation file
to constrain path delays in the placement step.

• Timing-driven placement:
– After placement using constraints from the synthesis
tool, the location of every logic cell on the chip is fixed
and accurate estimates of interconnect delay can be
passed back to the synthesis tool.
• Synthesis with in-place optimization (IPO):
– The synthesis tool changes the drive strength of
gates based on the accurate interconnect delay
estimates from the floorplanner without altering the
netlist structure.

• Detailed placement:
– The placement information is ready to be input to
the routing step.
System Partitioning
Agenda
• Introduction
• Measuring Connectivity
• A Simple Partitioning Example
• Partitioning Methods
• Constructive Partitioning
• Iterative Partitioning Improvement
– The Kernighan Lin Algorithm
– The Fiduccia Mattheyses Algorithm
– The Ratio-Cut Algorithm
– The Look-ahead Algorithm
– Simulated Annealing
Introduction
• Microelectronic systems typically consist of
many functional blocks.
• If a functional block is too large to fit in one
ASIC, we may have to split, or partition, the
function into pieces using goals and objectives
that we need to specify.
• We can use CAD tools to help us with this type
of system partitioning.
• System partitioning requires goals and
objectives, methods and algorithms to find
solutions, and ways to evaluate these solutions.
• The levels of partitioning:
– System
– Board
– Chip

• The goal of partitioning is to divide this part of the
system so that each partition is a single ASIC.
• To do this we may need to take into account any
or all of the following objectives:
– A maximum size for each ASIC
– A maximum number of ASICs
– A maximum number of connections for each ASIC
– A maximum number of total connections between all
ASICs
Measuring Connectivity
• To measure connectivity we need some help from the
mathematics of graph theory.
• Figure (a) shows a circuit schematic, netlist, or network.
• The network consists of circuit modules A-F.
• Equivalent terms for a circuit module are a cell, logic cell,
macro, or a block.
• A cell or logic cell usually refers to a small logic gate, but
can also be a collection of other cells.
• Macro refers to gate-array cells.
• Block is usually a collection of gates or cells.
• Each logic cell has electrical connections between the
terminals (connectors or pins).
• Figure 1 shows Networks, graphs, and partitioning.
(a) A network containing circuit logic cells and nets.
(b) The equivalent graph with vertexes and edges.
– For example: logic cell D maps to node D in the graph; net 1
maps to the edge (A, B) in the graph.
– Net 3 (with three connections) maps to three edges in the graph:
(B, C), (B, F), and (C, F).

(c) Partitioning a network and its graph.
– A network with a net cut that cuts two nets.
(d) Network graph showing the corresponding edge cut.
– The net cutset in (c) contains two nets, but the corresponding
edge cutset in (d) contains four edges.
– This means a graph is not an exact model of a network for
partitioning purposes.
Figure 1:
• The network can be represented as the mathematical
graph shown in Figure (b).
• A graph is like a spider's web: it contains vertexes (or
vertices) A-F (also known as graph nodes or points) that
are connected by edges.
• A graph vertex corresponds to a logic cell.
• An electrical connection (a net or a signal) between two
logic cells corresponds to a graph edge.
• Figure (c) shows a network with nine logic cells A-I.
• A connection, for example between logic cells A and B in
Figure (c), is written as net (A, B).
• Figure (d) shows a possible division, called a cutset.
• We say that there is a net cutset (for the network) and
an edge cutset (for the graph).
• The connections between the two ASICs are external
connections; the connections inside each ASIC are
internal connections.
• Notice that the number of external connections is not
modeled correctly by the network graph.
• When we divide the network into two by drawing a line
across connections, we make net cuts.
• The resulting set of net cuts is the net cutset.
• The number of net cuts we make corresponds to the
number of external connections between the two
partitions.
• When we divide the network graph into the same
partitions we make edge cuts and we create the edge
cutset.
• Note that nets and graph edges are not equivalent when
a net has more than two terminals.
• Thus the number of edge cuts made when we partition a
graph into two is not necessarily equal to the number of
net cuts in the network.
• The differences between nets and graph edges is
important when we consider partitioning a network by
partitioning its graph.
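The net/edge discrepancy can be seen in a few lines of Python. This sketch (using a hypothetical network, not the one in Figure 1) models each n-terminal net as a clique of edges, the usual graph approximation:

```python
from itertools import combinations

def net_cuts(nets, partition_a):
    """Count nets with terminals on both sides of the cut."""
    a = set(partition_a)
    return sum(1 for net in nets
               if any(t in a for t in net) and any(t not in a for t in net))

def edge_cuts(nets, partition_a):
    """Count graph edges cut when each n-terminal net becomes
    a clique of n*(n-1)/2 edges (the usual graph model)."""
    a = set(partition_a)
    return sum(1 for net in nets
               for u, v in combinations(net, 2)
               if (u in a) != (v in a))

# Hypothetical network: net (B, C, F) has three terminals.
nets = [("A", "B"), ("B", "C", "F"), ("C", "D"), ("E", "F")]
part_a = {"A", "B", "C"}
print(net_cuts(nets, part_a))   # 2 nets cut
print(edge_cuts(nets, part_a))  # 3 edges cut in the graph model
```

The three-terminal net (B, C, F) is cut once as a net but contributes two cut edges, so the edge count overstates the true number of external connections.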
A Simple Partitioning Example
• Figure 2 (a) shows a simple network we need to partition.
• There are 12 logic cells, labeled A-L, connected by 12
nets (labeled 1-12).
• At this level, each logic cell is a large circuit block and
might be RAM, ROM, an ALU, and so on.
• Each net might also be a bus, but, for the moment, we
assume that each net is a single connection and all nets
are weighted equally.
• The goal is to partition our simple network into ASICs.
• Our objectives are the following:
– Use no more than three ASICs.
– Each ASIC is to contain no more than four logic cells.
– Use the minimum number of external connections for each ASIC.
– Use the minimum total number of external connections.
• Figure 2 (a): We wish to partition this network into three
ASICs with no more than four logic cells per ASIC.
• Figure 2 (b) shows a partitioning with five external
connections;
– two of the ASICs have three pins;
– the third has four pins
• A partitioning with five external connections (nets 2, 4, 5,
6, and 8) the minimum number.
• Figure 2 (c): A constructed partition using logic cell C
as a seed. It is difficult to get from this local minimum,
with seven external connections (2, 3, 5, 7, 9, 11, 12),
to the optimum solution of (b).
Partitioning Methods
• Two types of algorithms are used:
– Constructive partitioning
– Iterative partitioning improvement
• Constructive partitioning, which uses a set of
rules to find a solution.
• Iterative partitioning improvement (or iterative
partitioning refinement), which takes an existing
solution and tries to improve it.
• Often we apply iterative improvement to a
constructive partitioning.
Constructive Partitioning
• The most common constructive partitioning algorithms
use seed growth or cluster growth.
• A simple seed-growth algorithm for constructive
partitioning consists of the following steps:
1. Start a new partition with a seed logic cell.
2. Consider all the logic cells that are not yet in a partition.
Select each of these logic cells in turn.
3. Calculate a gain function, g(m), that measures the
benefit of adding logic cell m to the current partition.
One measure of gain is the number of connections
between logic cell m and the current partition.
4. Add the logic cell with the highest gain g(m) to
the current partition.
5. Repeat the process from step 2. If you reach
the limit of logic cells in a partition, start again at
step 1.
• We may choose different gain functions
according to our objectives (but we have to be
careful to distinguish between connections and
nets).
• The algorithm starts with the choice of a seed
logic cell (seed module, or just seed).
• The logic cell with the most nets is a good choice
as the seed logic cell.
• We can also use a set of seed logic cells known
as a cluster, or clique, a term borrowed from
graph theory.
• A clique of a graph is a subset of nodes where
each pair of nodes is connected by an edge,
like your group of friends at school where
everyone knows everyone else in your clique.
• In some tools you can use schematic pages (at
the leaf or lowest hierarchical level) as a starting
point for partitioning.
• If you use a high-level design language, you can
use a Verilog module or VHDL entity/architecture
as seeds.
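The seed-growth steps above can be sketched in Python. The network here is a hypothetical five-cell chain (not from the text), and the gain g(m) counts connections between cell m and the current partition, as in step 3:

```python
def seed_grow(nets, all_cells, seed, max_size):
    """Constructive partitioning by seed growth (a sketch).

    The gain g(m) is the number of connections between cell m and
    the current partition; the highest-gain free cell is added at
    each step (ties broken alphabetically for repeatability)."""
    partition = {seed}
    free = set(all_cells) - partition

    def gain(m):
        return sum(1 for net in nets if m in net
                   for t in net if t != m and t in partition)

    while free and len(partition) < max_size:
        best = max(sorted(free), key=gain)
        if gain(best) == 0:
            break  # no connected cell left: start a new seed (step 1)
        partition.add(best)
        free.remove(best)
    return partition

# Hypothetical chain network A-B-C-D-E, seeded at cell C.
nets = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
print(seed_grow(nets, "ABCDE", "C", max_size=3))
```

The alphabetical tie-break is an arbitrary choice for this sketch; a real tool would use a more refined gain function or look-ahead to break ties.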
Iterative Partitioning Improvement
• The most common iterative improvement algorithms are
based on interchange and group migration.
• The process of interchanging (swapping) logic cells in
an effort to improve the partition is an interchange
method.
• If the swap improves the partition, we accept the trial
interchange; otherwise we select a new set of logic cells
to swap.
• There is a limit to what we can achieve with a partitioning
algorithm based on simple interchange.
• For example, Figure (c) shows a partitioning of the
network of part a using a constructed partitioning
algorithm with logic cell C as the seed.
• To get from the solution shown in part (c) to the solution
of part (b), which has a minimum number of external
connections, requires a complicated swap.
• The three pairs: D and F, J and K, C and L need to be
swapped all at the same time.
• It would take a very long time to consider all possible
swaps of this complexity.
• A simple interchange algorithm considers only one
change and rejects it immediately if it is not an
improvement.
• Algorithms of this type are greedy algorithms in the
sense that they will accept a move only if it provides
immediate benefit.
• Such shortsightedness leads an algorithm to a local
minimum from which it cannot escape.
• Group migration consists of swapping groups of logic
cells between partitions.
• The group migration algorithms are better than simple
interchange methods at improving a solution but are
more complex.
• Almost all group migration methods are based on the
powerful and general Kernighan Lin algorithm (KL
algorithm) that partitions a graph.
• The problem of dividing a graph into two pieces,
minimizing the nets that are cut, is the min-cut problem,
a very important one in VLSI design.
• The KL algorithm can be applied to many different
problems in ASIC design.
• We shall examine the algorithm next and then see how
to apply it to system partitioning.
The Kernighan Lin Algorithm
• Consider a network with 2m nodes (where m is an
integer), each of equal size.
• External edges cross between partitions, internal edges
are contained inside a partition.
• If we assign a cost to each edge of the network graph,
we can define the cost matrix C = [c_ij], where c_ij = c_ji
and c_ii = 0.
• If all connections are equal in importance, the elements
of the cost matrix are 1 or 0, and in this special case we
usually call the matrix the connectivity matrix.
• Costs higher than 1 could represent the number of wires
in a bus, multiple connections to a single logic cell, or
nets that we need to keep close for timing reasons.
• Figure below illustrates some of the terms and definitions
needed to describe the KL algorithm.
• (a) An example network graph
• (b) The connectivity matrix, C
• The column and rows are labelled to help you see how
the matrix entries correspond to the node numbers in the
graph.
• For example, C17 (column 1, row 7) equals 1 because
nodes 1 and 7 are connected.
• In this example all edges have an equal weight of 1, but
in general the edges may have different weights.
• Suppose we already have split a network into two
partitions, A and B, each with m nodes (perhaps using a
constructed partitioning).
• Our goal now is to swap nodes between A and B with
the objective of minimizing the number of external edges
connecting the two partitions.
• Each external edge may be weighted by a cost, and our
objective corresponds to minimizing a cost function that
we shall call the total external cost, cut cost, or cut
weight, W:
W = Σ c_ab (summed over all a in A and b in B)

• In Figure (a) the cut weight is 4 (all the edges have
weights of 1).
• In order to simplify the measurement of the change in cut
weight when we interchange nodes, we need some more
definitions.
• First, for any node a in partition A, we define an external
edge cost, which measures the connections from node a
to B:
Ea = Σ c_ab (summed over all b in B)
• For example, in Figure (a) E1 = 1, and E3 = 0.
• Second, we define the internal edge cost to measure the
internal connections to a:
Ia = Σ c_aa' (summed over all a' in A)
• So, in Figure (a), I1 = 0, and I3 = 2.
• We define the edge costs for partition B in a similar way
(so E8 = 2, and I8 = 1).
• The cost difference is the difference between external
and internal edge costs:
Da = Ea - Ia
• Thus, in Figure (a) D1 = 1, D3 = -2, and D8 = 1.

• Now pick any node in A, and any node in B.
• If we swap these nodes, a and b, we need to
measure the reduction in cut weight, which we
call the gain, g.
• We can express g in terms of the edge costs as
follows:
g = Da + Db - 2c_ab
• The last term accounts for the fact that a and b
may be connected.
• In Figure (a), if we swap nodes 1 and 6, then g = 2.
• If we swap nodes 2 and 8, then g = 1.
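These definitions translate directly into code. The graph below is a hypothetical six-node example (not the figure's network), chosen so that nodes 2 and 3 each sit in the "wrong" partition:

```python
# Hypothetical 6-node graph: edges (0,1), (0,3), (1,3), (2,4), (2,5), (4,5).
# Connectivity matrix C with c_ij = c_ji and c_ii = 0.
C = [[0, 1, 0, 1, 0, 0],
     [1, 0, 0, 1, 0, 0],
     [0, 0, 0, 0, 1, 1],
     [1, 1, 0, 0, 0, 0],
     [0, 0, 1, 0, 0, 1],
     [0, 0, 1, 0, 1, 0]]
A, B = {0, 1, 2}, {3, 4, 5}

def E(a, other):           # external edge cost of node a
    return sum(C[a][x] for x in other)

def I(a, own):             # internal edge cost of node a
    return sum(C[a][x] for x in own if x != a)

def D(v, own, other):      # cost difference D = E - I
    return E(v, other) - I(v, own)

def gain(a, b):            # g = Da + Db - 2*c_ab for swapping a and b
    return D(a, A, B) + D(b, B, A) - 2 * C[a][b]

print(D(2, A, B), D(3, B, A))  # 2 2: each node sits in the wrong half
print(gain(2, 3))              # 4: swapping removes all four cut edges
```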
• The KL algorithm finds a group of node pairs to swap
that increases the gain even though swapping individual
node pairs from that group might decrease the gain.
• First we pretend to swap all of the nodes a pair at a
time.
• Pretend swaps are like studying chess games when you
make a series of trial moves in your head.
The algorithm is:
1. Find two nodes, ai from A and bi from B, so that the
gain from swapping them is a maximum. The gain is
gi = Dai + Dbi - 2c_aibi.
2. Next pretend swap ai and bi even if the gain gi is zero or
negative, and do not consider ai and bi eligible for being
swapped again.
3. Repeat steps 1 and 2 a total of m times until all the
nodes of A and B have been pretend swapped. We are
back where we started, but we have ordered pairs of
nodes in A and B according to the gain from
interchanging those pairs.
4. Now we can choose which nodes we shall actually
swap. Suppose we only swap the first n pairs of nodes
that we found in the preceding process. In other words
we swap nodes X = {a1, a2, ..., an} from A with nodes
Y = {b1, b2, ..., bn} from B. The total gain would be:
Gn = g1 + g2 + ... + gn
5. We now choose the n corresponding to the maximum
value of Gn.
• If the maximum value of Gn > 0, then we swap the sets
of nodes X and Y and thus reduce the cut weight by Gn.
• We use this new partitioning to start the process again at
the first step.
• If the maximum value of Gn = 0, then we cannot improve
the current partitioning and we stop.
• We have found a locally optimum solution.
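A sketch of one KL pass over a hypothetical six-node graph follows (steps numbered as in the text; sorted() is only to make the pair scan repeatable):

```python
from itertools import accumulate

def kl_pass(C, A, B):
    """One pass of the KL algorithm (steps 1-5, a sketch):
    pretend-swap every pair, then commit the prefix of swaps
    with the maximum cumulative gain Gn (if Gn > 0)."""
    curA, curB = set(A), set(B)
    freeA, freeB = set(A), set(B)
    order, gains = [], []
    for _ in range(len(A)):
        best = None
        for a in sorted(freeA):
            for b in sorted(freeB):
                Da = sum(C[a][x] for x in curB) - sum(C[a][x] for x in curA if x != a)
                Db = sum(C[b][x] for x in curA) - sum(C[b][x] for x in curB if x != b)
                g = Da + Db - 2 * C[a][b]
                if best is None or g > best[0]:
                    best = (g, a, b)
        g, a, b = best                      # step 1: maximum-gain pair
        gains.append(g); order.append((a, b))
        freeA.discard(a); freeB.discard(b)  # step 2: lock the pair
        curA.remove(a); curA.add(b)         # pretend swap
        curB.remove(b); curB.add(a)
    G = list(accumulate(gains))             # step 4: cumulative gains Gn
    Gmax = max(G)
    if Gmax <= 0:
        return set(A), set(B), 0            # locally optimal: stop
    n = G.index(Gmax) + 1                   # step 5: best prefix length
    swapA = {a for a, _ in order[:n]}; swapB = {b for _, b in order[:n]}
    return (set(A) - swapA) | swapB, (set(B) - swapB) | swapA, Gmax

# Hypothetical 6-node graph with cut weight 4 before the pass.
C = [[0, 1, 0, 1, 0, 0],
     [1, 0, 0, 1, 0, 0],
     [0, 0, 0, 0, 1, 1],
     [1, 1, 0, 0, 0, 0],
     [0, 0, 1, 0, 0, 1],
     [0, 0, 1, 0, 1, 0]]
newA, newB, g = kl_pass(C, {0, 1, 2}, {3, 4, 5})
print(newA, newB, g)  # {0, 1, 3} {2, 4, 5} 4
```

Note that the pass keeps pretend-swapping even after the gains turn negative; only the best prefix is committed, which is how the algorithm escapes simple-interchange local minima.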
• Figure below shows an example of partitioning a graph
using the KL algorithm.
• Each completion of steps 1 through 5 is a pass through
the algorithm.
• Kernighan and Lin found that typically 2 to 4 passes
were required to reach a solution.
• FIGURE: Partitioning a graph using the KL algorithm.
• (a) Shows how swapping node 1 of partition A with node
6 of partition B results in a gain of g = 1.
• (b) A graph of the gain resulting from swapping pairs of
nodes.
• (c) The total gain is equal to the sum of the gains
obtained at each step.
• The most important feature of the KL algorithm is that
we are prepared to consider moves even though they
seem to make things worse.
• The KL algorithm works well for partitioning graphs.
• However, there are the following problems that
we need to address before we can apply the
algorithm to network partitioning:
– It minimizes the number of edges cut, not the
number of nets cut.
– It does not allow logic cells to be different sizes.
– It is expensive in computation time.
– It does not allow partitions to be unequal or find the
optimum partition size.
– It does not allow for selected logic cells to be fixed
in place.
– The results are random.
– It does not directly allow for more than two
partitions.
• To implement a net-cut partitioning rather than an
edge-cut partitioning, we can just keep track of the
nets rather than the edges.
• We can no longer use a connectivity or cost matrix to
represent connections, though.
• To represent nets with multiple terminals in a network
accurately, we can extend the definition of a network
graph.
• Figure next shows how a hypergraph with a special
type of vertex, a star, and a hyperedge, represents a
net with more than two terminals in a network.
• FIGURE: A hypergraph.
(a) The network contains a net y with three terminals.
(b) In the network hypergraph we can model net y by a
single hyperedge (B, C, D) and a star node.
Now there is a direct correspondence between wires or
nets in the network and hyperedges in the graph.
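A sketch of the star transformation: every net with more than two terminals gets a synthetic star vertex (the star-vertex names here are invented for illustration):

```python
def to_star_graph(nets):
    """Model each multi-terminal net with a star node and spokes,
    so cutting the net is counted once no matter how many
    terminals lie on each side (a sketch of the hypergraph model)."""
    edges = []
    for i, net in enumerate(nets):
        if len(net) == 2:
            edges.append(tuple(net))     # two-terminal net: ordinary edge
        else:
            star = f"*net{i}"            # synthetic star vertex
            edges += [(star, t) for t in net]
    return edges

print(to_star_graph([("A", "B"), ("B", "C", "D")]))
# [('A', 'B'), ('*net1', 'B'), ('*net1', 'C'), ('*net1', 'D')]
```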
Summary
• In the KL algorithm, the internal and external
edge costs have to be calculated for all the
nodes before we can select the nodes to be
swapped.
• Then we have to find the pair of nodes that give
the largest gain when swapped.
• This requires an amount of computer time that
grows as n² log n for a graph with 2n nodes.
• This n² dependency is a major problem for
partitioning large networks.
The Fiduccia Mattheyses Algorithm
• The FM algorithm is an extension to the KL algorithm
that addresses the differences between nets and edges
and also reduces the computational effort.
• The key features of this algorithm are the following:
1. Only one logic cell, the base logic cell, moves at a
time.
– In order to stop the algorithm from moving all the logic cells to
one large partition, the base logic cell is chosen to maintain
balance between partitions.
– The balance is the ratio of total logic cell size in one partition to
the total logic cell size in the other.
– Altering the balance allows us to vary the sizes of the partitions.
2. Critical nets are used to simplify the gain calculations.
– A net is a critical net if it has an attached logic cell that, when
swapped, changes the number of nets cut.
– It is only necessary to recalculate the gains of logic cells on
critical nets that are attached to the base logic cell.

3. The logic cells that are free to move are stored in a
doubly linked list.
– The lists are sorted according to gain.
– This allows the logic cells with maximum gain to be found
quickly.

• These techniques reduce the computation time so that it
increases only slightly more than linearly with the
number of logic cells in the network, a very important
improvement.
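Feature 3 is the key to the FM algorithm's speed. A minimal sketch of the idea, with a dict of lists standing in for the classic array of doubly linked gain buckets with a max-gain pointer:

```python
from collections import defaultdict

class GainBuckets:
    """Cells grouped by gain so a maximum-gain free cell is found
    quickly (a sketch; real FM uses an array of doubly linked
    lists indexed by gain, with a max-gain pointer)."""
    def __init__(self):
        self.buckets = defaultdict(list)
        self.max_gain = None

    def add(self, cell, gain):
        self.buckets[gain].append(cell)
        if self.max_gain is None or gain > self.max_gain:
            self.max_gain = gain

    def pop_best(self):
        # Walk the max-gain pointer down past empty buckets.
        while self.max_gain is not None and not self.buckets[self.max_gain]:
            lower = [g for g in self.buckets if self.buckets[g] and g < self.max_gain]
            self.max_gain = max(lower) if lower else None
        if self.max_gain is None:
            return None
        return self.buckets[self.max_gain].pop()

b = GainBuckets()
for cell, g in [("A", 1), ("B", 3), ("C", 3), ("D", -2)]:
    b.add(cell, g)
print(b.pop_best())  # C: a cell with the maximum gain, 3
```

When a base cell moves, only cells on its critical nets change gain, so only those cells need to be re-bucketed; this is what keeps the per-move cost low.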
Comparison between KL & FM
Algorithms
The Ratio-Cut Algorithm
• The ratio-cut algorithm removes the restriction of
constant partition sizes.
• The cut weight W for a cut that divides a network into
two partitions, A and B, is given by:
W = Σ c_ab (summed over all a in A and b in B)
• The KL algorithm minimizes W while keeping partitions A
and B the same size.
• The ratio of a cut is defined as:
R = W / (|A| × |B|)
• In this equation |A| and |B| are the sizes of partitions A
and B.
• The size of a partition is equal to the number of nodes it
contains (also known as the set cardinality).
• The cut that minimizes R is called the ratio cut.
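The ratio is simple to compute; this sketch (with invented cut weights and sizes) shows why it can favor a natural unequal split over a forced equal one:

```python
def ratio(W, size_a, size_b):
    """Ratio of a cut: R = W / (|A| * |B|)."""
    return W / (size_a * size_b)

# A hypothetical 10-node network: a balanced cut of weight 4
# versus an unbalanced cut of weight 2.
print(ratio(4, 5, 5))   # 0.16
print(ratio(2, 2, 8))   # 0.125 -> the unbalanced cut has the lower ratio
```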
• The original description of the ratio-cut algorithm uses
ratio cuts to partition a network into small, highly
connected groups.
• Then you form a reduced network from these groups:
each small group of logic cells forms a node in the
reduced network.
• Finally, you use the FM algorithm to improve the
reduced network.
The Look-ahead Algorithm
• Both the KL and FM algorithms consider only the
immediate gain to be made by moving a node.
• When there is a tie between nodes with equal gain, there
is no mechanism to make the best choice.
• This is like playing chess looking only one move ahead.
• Figure below shows an example of two nodes that have
equal gains, but moving one of the nodes will allow a
move that has a higher gain later.
• FIGURE illustrates an example of network partitioning
that shows the need to look ahead when selecting logic
cells to be moved between partitions.
• Partitions (a), (b), and (c) show one sequence of moves.
• Partitions (d), (e), and (f) show a second sequence.
• The partitioning in (a) can be improved by moving node
2 from A to B with a gain of 1.
• The result of this move is shown in (b).
• This partitioning can be improved by moving node 3 to B,
again with a gain of 1.
• The partitioning shown in (d) is the same as (a).
• We can move node 5 to B with a gain of 1 as shown in
(e), but now we can move node 4 to B with a gain of 2.
• We call the gain for the initial move the first-level gain.
• Gains from subsequent moves are then second-level
and higher gains.
• We can define a gain vector that contains these gains.
• Figure shows how the first-level and second-level gains
are calculated.
• Using the gain vector allows us to use a look-ahead
algorithm in the choice of nodes to be swapped.
• This reduces both the mean and variation in the number
of cuts in the resulting partitions.
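Choosing by gain vector amounts to comparing candidates lexicographically. A tiny sketch using the node-2/node-5 situation described above (the tuples hold first- and second-level gains):

```python
# Tie-breaking with a gain vector: compare (first-level, second-level)
# gains lexicographically instead of the first-level gain alone.
candidates = {
    "node2": (1, 1),   # gain 1 now, then only a gain-1 move available
    "node5": (1, 2),   # gain 1 now, then a gain-2 move available
}
best = max(candidates, key=lambda n: candidates[n])
print(best)  # node5
```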
• Normally we wish to divide a system into more than two
pieces.
• We can do this by recursively applying the algorithms.
• For example, if we wish to divide a system network into
three pieces, we could apply the FM algorithm first, using
a balance of 2:1, to generate two partitions, with one
twice as large as the other.
• Then we apply the algorithm again to the larger of the
two partitions, with a balance of 1:1, which will give us
three partitions of roughly the same size.
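The arithmetic of the 2:1-then-1:1 recursion can be sketched as follows (sizes only; a real flow would run the FM algorithm with the stated balance at each step):

```python
def three_way_sizes(n):
    """Recursive bisection of n cells into three near-equal partitions:
    first a 2:1 split, then a 1:1 split of the larger piece."""
    large = round(n * 2 / 3)   # the 2 part of the 2:1 split
    small1 = n - large         # the 1 part
    a = large // 2             # 1:1 split of the larger piece
    b = large - a
    return sorted([small1, a, b])

print(three_way_sizes(12))  # [4, 4, 4]
```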
Simulated Annealing
• A different approach to solving large graph problems
(and other types of problems) that arise in VLSI layout,
including system partitioning, uses the simulated-
annealing algorithm.
• Simulated annealing takes an existing solution and then
makes successive changes in a series of random
moves.
• Each move is accepted or rejected based on an energy
function, calculated for each new trial configuration.
• The minimums of the energy function correspond to
possible solutions.
• The best solution is the global minimum.
• In an interchange strategy we accept the new trial
configuration only if the energy function decreases,
which means the new configuration is an improvement.
• However, in the simulated-annealing algorithm, we
accept the new configuration even if the energy function
increases for the new configuration which means things
are getting worse.
• The probability of accepting a worse configuration is
controlled by the exponential expression exp(-ΔE/T),
where ΔE is the resulting increase in the energy
function.
• The parameter T is a variable that we control and
corresponds to the temperature in the annealing of a
metal cooling (this is why the process is called simulated
annealing).
• We accept moves that seemingly take us away from a
desirable solution to allow the system to escape from a
local minimum and find other, better, solutions.
• The name for this strategy is hill climbing.
• As the temperature is slowly decreased, we decrease
the probability of making moves that increase the
energy function.
• Finally, as the temperature approaches zero, we refuse
to make any moves that increase the energy of the
system and the system falls and comes to rest at the
nearest local minimum.
• Hopefully, the solution that corresponds to the minimum
we have found is a good one.
• The critical parameter governing the behaviour of the
simulated-annealing algorithm is the rate at which the
temperature T is reduced.
• This rate is known as the cooling schedule.
• Often we set a parameter α that relates the
temperatures Ti and Ti+1 at the ith and (i+1)th
iterations: Ti+1 = αTi.
• To find a good solution, a local minimum close to the
global minimum, requires a high initial temperature and a
slow cooling schedule.
• This results in many trial moves and very long computer
run times.
• If we are prepared to wait a long time, simulated
annealing is useful because we can guarantee that we
can find the optimum solution.
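A minimal sketch of the algorithm on a toy one-dimensional energy function (the function, starting point, and cooling parameters are all invented for illustration):

```python
import math
import random

def simulated_annealing(cost, neighbor, state, T0=10.0, alpha=0.99, steps=5000):
    """Sketch of simulated annealing: geometric cooling T_{i+1} = alpha*T_i
    and acceptance probability exp(-dE/T) for uphill (hill-climbing) moves."""
    T = T0
    best = state
    for _ in range(steps):
        trial = neighbor(state)
        dE = cost(trial) - cost(state)
        # Always accept improvements; accept worse moves with
        # probability exp(-dE/T), which shrinks as T cools.
        if dE <= 0 or random.random() < math.exp(-dE / T):
            state = trial
            if cost(state) < cost(best):
                best = state
        T *= alpha                          # the cooling schedule
    return best

# Toy 1-D energy function: global minimum at x = 0, with cosine
# ripples that create local minima on the way down.
random.seed(1)
energy = lambda x: x * x + 10 * (1 - math.cos(x))
result = simulated_annealing(energy, lambda x: x + random.uniform(-1, 1), 25.0)
print(round(result, 2))
```

With a slow schedule the run should settle near the global minimum at x = 0; a faster schedule risks freezing in one of the rippled local minima, which mirrors the trade-off between run time and solution quality described above.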
Floorplanning
Agenda
• Introduction
• Floorplanning Goals and Objectives
• Measurement of delay in floorplanning
• Floorplanning Tools
• Channel Definition
• I/O and Power Planning
• Clock Planning
Introduction
• Floorplanning is the mapping between the logical
description (the netlist) and the physical
description (the floorplan).
• Floorplanning gives early feedback: thinking of
layout at early stages may suggest valuable
architectural modifications; floorplanning also
aids in estimating delay due to wiring.
• Floorplanning fits very well in a top-down design
strategy, the step-wise refinement strategy also
propagated in software design.
• Floorplanning precedes placement.
• The input to the floorplanning step is the output
of system partitioning and design entry: a netlist.
• The netlist is a logical description of the ASIC;
the floorplan is a physical description of an
ASIC.
• Inputs to the floorplanning problem:
– A set of blocks, hard or soft.
– Pin locations of hard blocks.
– A netlist describing circuit blocks, the logic cells
within the blocks, and their connections.

• The output of the placement step is a set of
directions for the routing tools.
Floorplanning Goals and Objectives
• The goals of floorplanning are to:
– arrange the blocks on a chip,
– decide the location of the I/O pads,
– decide the location and number of the power pads,
– decide the type of power distribution,
– decide the location and type of clock distribution.

• The objectives of floorplanning are to:
– minimize the chip area,
– minimize delay (reduce wirelength for critical nets),
– maximize routability (minimize congestion).
Measurement of delay in floorplanning

• In floorplanning we wish to predict the interconnect
delay before we complete any routing.
• Delay is dependent on resistance and capacitance.
• The parasitics associated with interconnect are not yet
known, i.e., the interconnect capacitance (wiring or
routing capacitance) and interconnect resistance.
• Only the fanout (FO) of a net and the size of the block
are known.
• Interconnect capacitance is therefore estimated from
predicted-capacitance tables (wire-load tables).
Predicted capacitance.
(a) Interconnect lengths as a function of fanout (FO)
and circuit-block size.
(b) Wire-load table.
There is only one capacitance value for each fanout
(typically the average value).
(c) The wire-load table predicts the capacitance and
delay of a net (with a considerable error).
Net A and net B both have a fanout of 1, both have
the same predicted net delay, but net B in fact has a
much greater delay than net A in the actual layout (of
course we shall not know what the actual layout is
until much later in the design process).
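The lookup described above can be sketched in a few lines. All table values, pin capacitances, and driver resistances below are invented for illustration; the point is that the table sees only the fanout, which is why nets A and B get the same predicted delay.

```python
# Hypothetical wire-load estimation sketch: given a net's fanout, look up an
# average interconnect capacitance from a wire-load table and estimate the
# net delay as R_driver * (C_wire + C_pins). Values are illustrative only.

# wire-load table: fanout -> average interconnect capacitance (pF)
WIRE_LOAD_TABLE = {1: 0.02, 2: 0.035, 3: 0.05, 4: 0.07}

def estimate_net_delay(fanout, pin_cap_pf=0.01, driver_res_kohm=2.0):
    """Predict net delay (ns) from fanout alone, before any routing exists."""
    # Clamp to the largest table entry when the fanout exceeds the table.
    wire_cap = WIRE_LOAD_TABLE.get(fanout, WIRE_LOAD_TABLE[max(WIRE_LOAD_TABLE)])
    total_cap = wire_cap + fanout * pin_cap_pf   # pF
    return driver_res_kohm * total_cap           # kΩ · pF = ns

# Nets A and B both have fanout 1, so the table predicts identical delays,
# even though the real routed lengths may differ greatly.
print(estimate_net_delay(1))  # same value for both nets
```

This is the source of the "considerable error" noted in the caption: the single average capacitance per fanout cannot distinguish a short net from a long one.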
Floorplanning Tools
• Figure 1 (a) shows an initial random floorplan generated
by a floorplanning tool.
• Two of the blocks, A and C in this example, are
standard-cell areas.
• These are flexible (or variable) blocks because, although
their total area is fixed, their shape and connector
locations may be adjusted during the placement step.
• The dimensions and connector locations of the other
fixed blocks (perhaps RAM, ROM, compiled cells, or
megacells) can only be modified when they are created.
• We may force logic cells to be in selected flexible blocks
by seeding.
FIGURE 1: Floorplanning a cell-based ASIC.
(a) Initial floorplan generated by the floorplanning tool.
Two of the blocks are flexible (A and C) and contain
rows of standard cells (unplaced).
A pop-up window shows the status of block A.
(b) An estimated placement for flexible blocks A and C.
The connector positions are known and a rat’s nest
display shows the heavy congestion below block B.
(c) Moving blocks to improve the floorplan.
(d) The updated display shows the reduced congestion
after the changes.
Figure 1
• We choose seed cells by name.
• For example, ram_control* would select all logic cells
whose names started with ram_control to be placed in
one flexible block.
• The special symbol, usually ' * ', is a wildcard symbol.
• Seeding may be hard or soft.
– A hard seed is fixed and not allowed to move during the
remaining floorplanning and placement steps.
– A soft seed is an initial suggestion only and can be altered if
necessary by the floorplanner.

• We may also use seed connectors within flexible blocks,
forcing certain nets to appear in a specified order, or
location, at the boundary of a flexible block.
• The floorplanner can complete an estimated placement
to determine the positions of connectors at the
boundaries of the flexible blocks.
• Figure (b) illustrates a rat's nest display of the
connections between blocks.
• Connections are shown as bundles between the centers
of blocks or as flight lines between connectors.
• Figure (c) and (d) show how we can move the blocks in
a floorplanning tool to minimize routing congestion.

• We need to control the aspect ratio of our floorplan
because we have to fit our chip into the die cavity (a
fixed-size hole, usually square) inside a package.
• Figure 2 (a) - (c) show how we can rearrange our chip to
achieve a square aspect ratio.
• Figure (c) also shows a congestion map, another form of
routability display.
• There is no standard measure of routability.
FIGURE 2: Congestion analysis.
(a) The initial floorplan with a 2:1.5 die aspect ratio.
(b) Altering the floorplan to give a 1:1 chip aspect ratio.
(c) A trial floorplan with a congestion map.
• Blocks A and C have been placed so that we know the
terminal positions in the channels.
• Shading indicates the ratio of channel density to the
channel capacity.
• Dark areas show regions that cannot be routed
because the channel congestion exceeds the
estimated capacity.
(d) Resizing flexible blocks A and C alleviates congestion.
Figure 2
• Generally the interconnect channels (or wiring channels)
have a certain channel capacity; that is, they can handle
only a fixed number of interconnects.
• One measure of congestion is the difference between
the number of interconnects that we actually need, called
the channel density, and the channel capacity.
• Another measure, shown in Figure (c), uses the ratio of
channel density to the channel capacity.
• With practice, we can create a good initial placement by
floorplanning and a pictorial display.
• This is one area where the human ability to recognize
patterns and spatial relations is currently superior to a
computer program's ability.
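Both congestion measures above reduce to simple arithmetic on per-channel numbers. A minimal sketch, with invented channel densities and capacities:

```python
# Sketch of the two congestion measures described above, assuming we already
# know each channel's density (interconnects needed) and capacity.

def congestion_overflow(density, capacity):
    """Measure 1: how many interconnects exceed the channel capacity."""
    return max(0, density - capacity)

def congestion_ratio(density, capacity):
    """Measure 2: ratio of channel density to capacity (>1 means unroutable)."""
    return density / capacity

channels = {"ch1": (5, 8), "ch2": (9, 8)}   # name -> (density, capacity)
for name, (d, c) in channels.items():
    print(name, congestion_overflow(d, c), round(congestion_ratio(d, c), 2))
```

The ratio form is what a congestion map shades: dark regions are channels whose ratio exceeds 1.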
Channel Definition
• During the floorplanning step we assign the
areas between blocks that are to be used for
interconnect.
• This process is known as channel definition or
channel allocation.
• Figure 3 shows a T-shaped junction between
two rectangular channels and illustrates why we
must route the stem (vertical) of the T before the
bar.
• The general problem of choosing the order in which to
route the rectangular channels is called channel
ordering.
Figure 3: Routing a T-junction between two channels in two-
level metal. The dots represent logic cell pins.
(a) Routing channel A (the stem of the T) first allows us to adjust
the width of channel B.
(b) If we route channel B first (the top of the T), this fixes the
width of channel A.
We have to route the stem of a T-junction before we route the top.
• Figure 4 shows a floorplan of a chip containing several
blocks.
• Suppose we cut along the block boundaries slicing the
chip into two pieces (Figure a).
• Then suppose we can slice each of these pieces into
two.
• If we can continue in this fashion until all the blocks are
separated, then we have a slicing floorplan (Figure b).
• Figure (c) shows how the sequence we use to slice the
chip defines a hierarchy of the blocks.
• Reversing the slicing order ensures that we route the
stems of all the channel T-junctions first.
Figure 4
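One way to see why a slicing floorplan fixes the channel order is to record the cuts in a tree and route in reverse slicing order, so every T-junction stem is routed before its bar. The nested-tuple encoding below is an illustrative assumption, not a standard data structure:

```python
# A slicing floorplan recorded as a binary tree: each internal node is a cut
# (and its channel), each leaf a block name. Routing the channels in reverse
# slicing order routes all T-junction stems before their bars.

floorplan = ("cut1",
             ("cut2", "A", "B"),
             ("cut3", "C", ("cut4", "D", "E")))

def slicing_sequence(tree, seq=None):
    """Pre-order walk: the order in which the cuts are made."""
    if seq is None:
        seq = []
    if isinstance(tree, tuple):
        cut, left, right = tree
        seq.append(cut)
        slicing_sequence(left, seq)
        slicing_sequence(right, seq)
    return seq

def channel_routing_order(tree):
    """Route channels in reverse slicing order (stems before bars)."""
    return list(reversed(slicing_sequence(tree)))

print(channel_routing_order(floorplan))  # last cut made is routed first
```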
• Figure 5 shows a floorplan that is not a slicing structure.
• We cannot cut the chip all the way across with a knife
without chopping a circuit block in two.
• This means we cannot route any of the channels in this
floorplan without routing all of the other channels first.
• We say there is a cyclic constraint in this floorplan.
• There are two solutions to this problem.
– One solution is to move the blocks until we obtain a slicing
floorplan.
– The other solution is to allow the use of L-shaped, rather than
rectangular, channels (or areas with fixed connectors on all sides,
called switch boxes).

• We need an area-based router rather than a channel
router to route L-shaped regions or switch boxes.
Figure 5: Cyclic constraints.
(a) A nonslicing floorplan with a cyclic constraint that prevents
channel routing.
(b) In this case it is difficult to find a slicing floorplan without
increasing the chip area.
(c) This floorplan may be sliced (with initial cuts 1 or 2) and has
no cyclic constraints, but it is inefficient in area use and will be
very difficult to route.
• Figure 6 (a) displays the floorplan of the ASIC.
• We can remove the cyclic constraint by moving the
blocks again, but this increases the chip size.
• Figure (b) shows an alternative solution.
• We merge the flexible standard cell areas A and C.
• We can do this by selective flattening of the netlist.
• Sometimes flattening can reduce the routing area
because routing between blocks is usually less
efficient than routing inside the row-based blocks.
• Figure (b) shows the channel definition and routing
order for our chip.
Figure 6: Channel definition and ordering.
(a) We can eliminate the cyclic constraint by merging the
blocks A and C.
(b) A slicing structure.
I/O and Power Planning
• A silicon chip or die is mounted on a chip carrier inside a
chip package.
• Connections are made by bonding the chip pads to
fingers on a metal lead frame that is part of the package.
• The metal lead-frame fingers connect to the package
pins.
• A die consists of a logic core inside a pad ring.
• Figure 7 (a) shows a pad-limited die and Figure (b) shows
a core-limited die.
• On a pad-limited die we use tall, thin pad-limited pads,
which maximize the number of pads we can fit around the
outside of the chip.
• On a core-limited die we use short, wide core-limited pads.
Figure 7: Pad-limited and core-limited die.
(a) A pad-limited die: The number of pads determines the die size
(b) A core-limited die: The core logic determines the die size.
(c) Using both pad-limited pads and core-limited pads for a
square die.
• Figure (c) shows how we can use both types of pad to
change the aspect ratio of a die to be different from that
of the core.
• Special power pads are used for the positive supply, or
VDD, power buses (or power rails) and the ground or
negative supply, VSS or GND.
• Usually one set of VDD/VSS pads supplies one power
ring that runs around the pad ring and supplies power to
the I/O pads only.
• Another set of VDD/VSS pads connects to a second
power ring that supplies the logic core.
• We sometimes call the I/O power dirty power, since it
has to supply large transient currents to the output
transistors.
• We keep dirty power separate to avoid injecting noise
into the internal-logic power (clean power).
• I/O pads also contain special circuits to protect against
electrostatic discharge (ESD).
• These circuits can withstand very short high-voltage
(several kilovolt) pulses that can be generated during
human or machine handling.
• Depending on the package design, the type and
positioning of down bonds may be fixed.
• This means we need to fix the position of the chip pad
for down bonding using a pad seed.
• If we make an electrical connection between the
substrate and a chip pad, or to a package pin, it must be
to VDD (n -type substrate) or VSS (p -type substrate).
• This substrate connection (for the whole chip) employs a
down bond (or drop bond) to the carrier.
• We have several options:
– We can dedicate one (or more) chip pad(s) to down bond to the
chip carrier.
– We can make a connection from a chip pad to the lead frame
and down bond from the chip pad to the chip carrier.
– We can make a connection from a chip pad to the lead frame
and down bond from the lead frame.
– We can down bond from the lead frame without using a chip
pad.
– We can leave the substrate and/or chip carrier unconnected.

• A double bond connects two pads to one chip-carrier
finger and one package pin.
• We can do this to save package pins or reduce the
series inductance of bond wires by parallel connection of
the pads.
• A multiple-signal pad or pad group is a set of pads.
• For example, an oscillator pad usually comprises a set of
two adjacent pads that we connect to an external crystal.
• The oscillator circuit and the two signal pads form a
single logic cell.
• Another common example is a clock pad.
• Some foundries allow a special form of corner pad
(normal pads are edge pads) that squeezes two pads
into the area at the corners of a chip using a special two-
pad corner cell, to help meet bond-wire angle design
rules (see Figure b and c).
• To reduce the series resistive and inductive impedance
of power supply networks, it is normal to use multiple
VDD and VSS pads.
• The output pads can easily consume most of the power
on a CMOS ASIC, because the load on a pad is much
larger than typical on-chip capacitive loads.
• Depending on the technology it may be necessary to
provide dedicated VDD and VSS pads for every few
simultaneously switching outputs (SSOs).
• Design rules set how many SSOs can be used per
VDD/VSS pad pair.
• These dedicated VDD/VSS pads must follow groups of
output pads as they are seeded or planned on the
floorplan.
• With some chip packages this can become difficult
because design rules limit the location of package pins
that may be used for supplies (due to the differing series
inductance of each pin).
• Using a pad mapping we translate the logical pad in a
netlist to a physical pad from a pad library.
• We need to have separate horizontal, vertical, left-
handed, and right-handed pad cells in the library with
appropriate logical to physical pad mappings.
• If we mix pad-limited and core-limited edge pads in the
same pad ring, this complicates the design of corner
pads.
• In this case a corner pad also becomes a pad-format
changer, or hybrid corner pad.
• In single-supply chips we have one VDD net and one
VSS net, both global power nets.
• It is also possible to use mixed power supplies or
multiple power supplies.
• Figure 8 (a) and (b) show magnified views of the
southeast corner of the example chip and the different
types of I/O cells.
• Figure (c) shows a stagger-bond arrangement using two
rows of I/O pads.
• In this case the design rules for bond wires (the spacing
and the angle at which the bond wires leave the pads)
become very important.
• Figure (d) shows an area-bump bonding arrangement
(also known as flip-chip, solder-bump or C4).
• Though the bonding pads are located in the center of the
chip, the I/O circuits are still often located at the edges of
the chip because of difficulties in power supply
distribution and integrating I/O circuits together with logic
in the center of the die.
FIGURE 8: Bonding pads.
(a) This chip uses both pad-limited and core-limited pads.
(b) A hybrid corner pad.
(c) A chip with stagger-bonded pads.
(d) An area-bump bonded chip (or flip-chip).
• The chip is turned upside down and solder bumps
connect the pads to the lead frame.
• In an MGA the pad spacing and I/O-cell spacing are
fixed; each pad occupies a fixed pad slot (or pad site).
• This means that the properties of the pad I/O are also
fixed but, if we need to, we can parallel adjacent output
cells to increase the drive.
• To increase flexibility further the I/O cells can use a
separation, the I/O-cell pitch, that is smaller than the pad
pitch.
• For example, three 4 mA driver cells can occupy two pad
slots.
• Then we can use two 4 mA output cells in parallel to
drive one pad, forming an 8 mA output pad as shown in
Figure 9.
• This arrangement also means the I/O pad cells can be
changed without changing the base array.
• This is useful as bonding techniques improve and the
pads can be moved closer together.

FIGURE 9: Gate-array I/O pads.
(a) Cell-based ASICs may contain pad cells of different
sizes and widths.
(b) A corner of a gate-array base.
(c) A gate-array base with different I/O-cell and pad
pitches.
Figure 9
• Figure 10 shows two possible power distribution
schemes.
• The long direction of a rectangular channel is the
channel spine.
• Some automatic routers may require that metal lines
parallel to a channel spine use a preferred layer (either
m1, m2, or m3).
• Since we can have both horizontal and vertical channels,
we may have the situation shown in Figure 10, where we
have to decide whether to use a preferred layer or the
preferred direction for some channels.
• This may or may not be handled automatically by the
routing software.
Figure 10
FIGURE 10: Power distribution.
(a) Power distributed using m1 for VSS and m2 for VDD.
– This helps minimize the number of vias and layer crossings
needed but causes problems in the routing channels.
(b) In this floorplan m1 is run parallel to the longest side
of all channels, the channel spine.
– This can make automatic routing easier but may increase the
number of vias and layer crossings.
(c) An expanded view of part of a channel (interconnect
is shown as lines).
– If power runs on different layers along the spine of a channel,
this forces signals to change layers.
(d) A closeup of VDD and VSS buses as they cross.
– Changing layers requires a large number of via contacts to
reduce resistance.
Clock Planning
• Figure 11 (a) shows a clock spine routing scheme with
all clock pins driven directly from the clock driver.
• MGAs and FPGAs often use this fishbone type of clock
distribution scheme.
• Figure (b) shows a clock spine for a cell-based ASIC.
• Figure (c) shows the clock-driver cell, often part of a
special clock-pad cell.
• Figure (d) illustrates clock skew and clock latency.
• Since all clocked elements are driven from one net with
a clock spine, skew is caused by differing interconnect
lengths and loads.
• If the clock-driver delay is much larger than the
interconnect delays, a clock spine achieves minimum
skew but with long latency.
Figure 11
FIGURE 11: Clock distribution.
(a) A clock spine for a gate array.
(b) A clock spine for a cell-based ASIC.
(c) A clock spine is usually driven from one or more clock-
driver cells.
– Delay in the driver cell is a function of the number of stages and
the ratio of output to input capacitance for each stage (taper).
(d) Clock latency and clock skew.
– We would like to minimize both latency and skew.
• The delay through a chain of CMOS gates is minimized
when the ratio between the input capacitance C1 and the
output capacitance C2 is about 3 (more precisely, e ≈ 2.7).
• The fastest way to drive a large load is to use a chain of
buffers with their input and output loads chosen to
maintain this ratio.
• We can design a tree of clock buffers so that the taper of
each stage is e ≈ 2.7 by using a fanout of three at each
node, as shown in Figure 12 (a) and (b).
• The clock tree, shown in Figure (c), uses the same
number of stages as a clock spine, but with a lower peak
current for the inverter buffers.
• Figure (c) illustrates that we now have another problem:
we need to balance the delays through the tree carefully
to minimize clock skew.
• We have to balance the clock arrival times at all of the
leaf nodes to minimize clock skew.
• Designing a clock tree that balances the rise and fall
times at the leaf nodes has the beneficial side-effect of
minimizing the effect of hot-electron wearout.
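The taper rule above can be turned into a small sizing calculation. The capacitance values below are invented for illustration:

```python
# Sketch of sizing a clock-buffer chain: for a load C_L driven from an input
# capacitance C_1, the delay-optimal chain uses a taper of about e ≈ 2.718
# per stage (a fanout of 3 is the practical choice). Values are illustrative.
import math

def num_stages(c_load, c_in, taper=math.e):
    """Number of buffer stages so each stage's fanout is roughly `taper`."""
    return max(1, round(math.log(c_load / c_in, taper)))

def stage_taper(c_load, c_in, n):
    """Actual per-stage capacitance ratio for an n-stage chain."""
    return (c_load / c_in) ** (1.0 / n)

c_in, c_load = 0.01, 10.0     # pF: total load is 1000x the first input
n = num_stages(c_load, c_in)
print(n, round(stage_taper(c_load, c_in, n), 2))
```

Rounding to an integer number of stages is why the practical taper ends up near 3 rather than exactly e.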
Figure 12
(a) Minimum delay is achieved when the taper of successive
stages is about 3.
(b) Using a fanout of three at successive nodes.
(c) A clock tree for a cell-based ASIC
Summary
• Floorplanning initializes the physical design
process.
• Floorplanning is the center of ASIC design
operations for all types of ASIC.
• There are many factors to be considered during
floorplanning:
– minimizing connection length and signal delay
between blocks,
– arranging fixed blocks and reshaping flexible blocks
to occupy the minimum die area,
– organizing the interconnect areas between blocks,
– planning the power, clock, and I/O distribution.
• The handling of some of these factors may be
automated using CAD tools, but many still need
to be dealt with by hand.
• In both floorplanning and placement, with
minimum interconnect length as an objective, it
is necessary to find the shortest total path length
connecting a set of terminals.
• This path is the minimum rectilinear Steiner tree
(MRST), which is hard to find.
• The alternative, for both floorplanning and
placement, is to use simple approximations to
the length of the MRST (usually the
half-perimeter measure).
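The half-perimeter measure is simple to compute: for each net, take half the perimeter of the bounding box of its terminals. A sketch with invented terminal coordinates:

```python
# Half-perimeter wirelength (HPWL) approximation to the MRST. For 2- and
# 3-terminal nets HPWL equals the MRST length; for larger nets it is a
# lower bound. The coordinates below are illustrative.

def hpwl(terminals):
    """Half-perimeter of the bounding box of a net's (x, y) terminals."""
    xs = [x for x, _ in terminals]
    ys = [y for _, y in terminals]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def total_wirelength(nets):
    """Sum of HPWL over all nets: the usual floorplanning/placement estimate."""
    return sum(hpwl(t) for t in nets.values())

nets = {"n1": [(0, 0), (3, 2)], "n2": [(1, 1), (4, 1), (2, 5)]}
print(total_wirelength(nets))  # 5 + (3 + 4) = 12
```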
Placement
Agenda
• Introduction
• Placement Goals and Objectives
• Placement Algorithms
• Min-cut Placement
• Iterative Placement Improvement
• Simulated Annealing
• Timing-Driven Placement Methods
• A Simple Placement Example
Introduction
• Placement is the problem of automatically assigning
correct positions on the chip to predesigned cells, such
that some cost function is optimized.
• Inputs: a set of fixed cells/modules and a netlist.
• Different design styles create different placement
problems.
• E.g., building-block, standard-cell, gate-array placement.
• Building block: The cells to be placed have arbitrary
shapes.
• Standard cells are designed in such a way that power
and clock connections run horizontally through the cell
and other I/O leaves the cell from the top or bottom
sides.
• CBIC, MGA, and FPGA architectures all have rows of
logic cells separated by interconnect; these are row-
based ASICs.
• The cells are placed in rows.
• Sometimes feedthrough cells are added to ease wiring.

• Figure 1 shows an example of the interconnect structure
for a CBIC.
• Interconnect runs in horizontal and vertical directions in
the channels and in the vertical direction by crossing
through the logic cells.
• Figure (a) A two-level metal CBIC floorplan.
• Figure (b) A channel from the flexible block A.
• This channel has a channel height equal to the
maximum channel density of 7 (there is room for seven
interconnects to run horizontally in m1).
• Figure (c) illustrates the fact that it is possible to use
over-the-cell (OTC) routing in areas that are not blocked.
• OTC routing is complicated by the fact that the logic cells
themselves may contain metal on the routing layers.
Figure 1
• Figure 2 shows the interconnect structure of a two-level
metal MGA.
(a) A small two-level metal gate array.
(b) Routing in a block.
(c) Channel routing showing channel density and
channel capacity.
• The channel height on a gate array may only be
increased in increments of a row.
• If the interconnect does not use up all of the channel, the
rest of the space is wasted.
• The interconnect in the channel runs in m1 in the
horizontal direction with m2 in the vertical direction.
Figure 2
• Most ASICs currently use two or three levels of metal for
signal routing.
• With two layers of metal, we route within the rectangular
channels using:
– The first metal layer for horizontal routing, parallel to the channel
spine,
– The second metal layer for the vertical direction,
– If there is a third metal layer it will normally run in the horizontal
direction again.

• The maximum number of horizontal interconnects that
can be placed side by side, parallel to the channel spine,
is the channel capacity.
• Vertical interconnect uses feedthroughs to cross the
logic cells.
• Here are some commonly used terms with explanations:
– An unused vertical track (or just track) in a logic cell is
called an uncommitted feedthrough.
– A vertical strip of metal that runs from the top to bottom
of a cell (for double-entry cells), but has no connections
inside the cell, is also called a feedthrough or jumper.
– Two connectors for the same physical net are electrically
equivalent connectors.
– A dedicated feedthrough cell is an empty cell that can
hold one or more vertical interconnects. These are used
if there are no other feedthroughs available.
– A feedthrough pin or terminal is an input or output that
has connections at both the top and bottom of the
standard cell.
– A spacer cell is used to fill space in rows so that the
ends of all rows in a flexible block may be aligned to
connect to power buses.
Placement Goals and Objectives
• The goal of a placement tool is to arrange all the logic
cells within the flexible blocks on a chip.
• Ideally, the objectives of the placement step are to:
– Guarantee the router can complete the routing step
– Minimize all the critical net delays
– Make the chip as dense as possible
• We may also have the following additional objectives:
– Minimize power dissipation
– Minimize cross talk between signals
• The most commonly used placement objectives are one
or more of the following:
– Minimize the total estimated interconnect length
– Meet the timing requirements for critical nets
– Minimize the interconnect congestion
Placement Algorithms
• What does a placement algorithm try to
optimize?
– the total area,
– the total wire length,
– the number of horizontal/vertical wire segments
crossing a line.

• Constraints:
– the placement should be routable (no cell overlaps;
no density overflow),
– timing constraints are met (some wires should always
be shorter than a given length).
• Popular placement algorithms are:
– Constructive algorithms: once the position of a cell
is fixed, it is not modified anymore.
• Cluster growth, min cut, etc.
– Iterative algorithms: intermediate placements are
modified in an attempt to improve the cost function.
• Force-directed method, etc.
– Nondeterministic approaches: Simulated
annealing, genetic algorithm, etc.
• Most approaches combine multiple elements:
– Constructive algorithms are used to obtain an initial
placement.
– The initial placement is followed by an iterative
improvement phase.
– The results can further be improved by simulated
annealing.
• A constructive placement method uses a set of
rules to arrive at a constructed placement.
• The most commonly used methods are
variations on the min-cut algorithm.
• The other commonly used constructive
placement algorithm is the eigenvalue method.
• As in system partitioning, placement usually
starts with a constructed solution and then
improves it using an iterative algorithm.
• In most tools we can specify the locations and
relative placements of certain critical logic cells
as seed placements.
Min-cut Placement
• The min-cut placement method uses successive
application of partitioning.
1. Cut the placement area into two pieces.
2. Swap the logic cells to minimize the cut cost.
3. Repeat the process from step 1, cutting smaller
pieces until all the logic cells are placed.
• Usually we divide the placement area into bins.
• The size of a bin can vary, from a bin size equal to the
base cell (for a gate array) to a bin size that would hold
several logic cells.
• We can start with a large bin size, to get a rough
placement, and then reduce the bin size to get a final
placement.
• The basic idea is to split the circuit into two subcircuits.
• Recursive bipartitioning of the circuit leads to a min-cut
placement.
FIGURE 3 shows the Min-cut placement.
(a) Divide the chip into bins using a grid.
(b) Merge all connections to the center of each bin.
(c) Make a cut and swap logic cells between bins to
minimize the cost of the cut.
(d) Take the cut pieces and throw out all the edges that
are not inside the piece.
(e) Repeat the process with a new cut and continue until
we reach the individual bins.
Figure 3
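The min-cut loop of Figure 3 can be sketched as a recursion. The balanced split below is a placeholder for a real min-cut partitioner (e.g., Kernighan-Lin), and the region coordinates are illustrative:

```python
# Recursive min-cut placement sketch: bipartition the cells, assign each half
# to one half of the region, and recurse until each bin holds one cell.

def bipartition(cells):
    """Placeholder for a min-cut partitioner: balanced split by order."""
    half = len(cells) // 2
    return cells[:half], cells[half:]

def min_cut_place(cells, region, placement=None):
    """region = (x0, y0, x1, y1); always cut the longer side of the region."""
    if placement is None:
        placement = {}
    x0, y0, x1, y1 = region
    if len(cells) == 1:
        placement[cells[0]] = ((x0 + x1) / 2, (y0 + y1) / 2)  # bin center
        return placement
    left, right = bipartition(cells)
    if (x1 - x0) >= (y1 - y0):          # vertical cut
        xm = (x0 + x1) / 2
        min_cut_place(left, (x0, y0, xm, y1), placement)
        min_cut_place(right, (xm, y0, x1, y1), placement)
    else:                               # horizontal cut
        ym = (y0 + y1) / 2
        min_cut_place(left, (x0, y0, x1, ym), placement)
        min_cut_place(right, (x0, ym, x1, y1), placement)
    return placement

print(min_cut_place(["A", "B", "C", "D"], (0, 0, 4, 4)))
```

A real tool would replace `bipartition` with a partitioner that minimizes the number of nets crossing each cut.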
Iterative Placement Improvement
• An iterative placement improvement algorithm takes an
existing placement and tries to improve it by moving the
logic cells.
• There are two parts to the algorithm:
– The selection criteria that decides which logic cells to try moving.
– The measurement criteria that decides whether to move the
selected cells.

• There are several interchange or iterative exchange
methods that differ in their selection and measurement
criteria:
– pairwise interchange,
– force-directed interchange,
– force-directed relaxation,
– force-directed pairwise relaxation.
• All of these methods usually consider only pairs of logic
cells to be exchanged.
• A source logic cell is picked for trial exchange with a
destination logic cell.
• The most widely used methods use group migration,
especially the Kernighan-Lin algorithm.
• The pairwise-interchange algorithm is similar to the
interchange algorithm used for iterative improvement in
the system partitioning step:
1. Select the source logic cell at random.
2. Try all the other logic cells in turn as the destination
logic cell.
3. Use any of the measurement methods we have
discussed to decide on whether to accept the
interchange.
4. The process repeats from step 1, selecting each logic
cell in turn as a source logic cell.
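The four steps above can be sketched with a toy netlist and a half-perimeter cost function; all names and coordinates below are invented:

```python
# Pairwise-interchange sketch: pick a source cell, try every other cell as
# the destination, and keep a swap only when it lowers total wirelength.

def hpwl(pts):
    xs = [p[0] for p in pts]; ys = [p[1] for p in pts]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def cost(placement, nets):
    return sum(hpwl([placement[c] for c in net]) for net in nets)

def pairwise_interchange(placement, nets):
    improved = True
    while improved:
        improved = False
        cells = list(placement)
        for src in cells:                    # step 1/4: each cell in turn
            for dst in cells:                # step 2: try every destination
                if src == dst:
                    continue
                before = cost(placement, nets)
                placement[src], placement[dst] = placement[dst], placement[src]
                if cost(placement, nets) < before:   # step 3: keep if better
                    improved = True
                else:                        # otherwise undo the swap
                    placement[src], placement[dst] = placement[dst], placement[src]
    return placement

nets = [("A", "B"), ("B", "C")]
placement = {"A": (0, 0), "B": (2, 0), "C": (1, 0)}
print(pairwise_interchange(placement, nets))  # B moves between A and C
```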
• Figure (a) and (b) show how we can extend pairwise
interchange to swap more than two logic cells at a time.
• If we swap X logic cells at a time and find a locally
optimum solution, we say that solution is X-optimum.
• The neighborhood exchange algorithm is a modification
to pairwise interchange that considers only destination
logic cells in a neighborhood: cells within a certain
distance, e, of the source logic cell.
• Limiting the search area for the destination logic cell to
the e-neighborhood reduces the search time.
• Neighborhoods are also used in some of the force-
directed placement methods.
• Imagine identical springs connecting all the logic cells we
wish to place.
• The number of springs is equal to the number of
connections between logic cells.
• The effect of the springs is to pull connected logic cells
together.
• The more highly connected the logic cells, the stronger
the pull of the springs.
• The force on a logic cell i due to logic cell j is given by
Hooke’s law, which says the force of a spring is
proportional to its extension:
Fij = - cij xij
• The vector component xij is directed from the center of
logic cell i to the center of logic cell j.
• The vector magnitude is calculated as either the
Euclidean or Manhattan distance between the logic cell
centers.
• The cij form the connectivity or cost matrix (the matrix
element cij is the number of connections between logic
cell i and logic cell j).
• If we want, we can also weight the cij to denote critical
connections.
• Figure 4 illustrates the force-directed placement
algorithm.
• Nets such as clock nets, power nets, and global reset
lines have a huge number of terminals.
• The force-directed placement algorithms usually make
special allowances for these situations to prevent the
largest nets from snapping all the logic cells together.
FIGURE 4: Force-directed placement.
(a) A network with nine logic cells.
(b) We make a grid (one logic cell per bin).
(c) Forces are calculated as if springs were attached to the
centers of each logic cell for each connection.
– The two nets connecting logic cells A and I correspond
to two springs.
(d) The forces are proportional to the spring extensions.
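The spring force of Figure 4 reduces to a weighted sum of displacement vectors, with weights taken from the connectivity matrix. A sketch with an invented connectivity matrix and positions:

```python
# Sketch of the Hooke's-law force described above: the pull on cell i from
# cell j is proportional to c_ij (connectivity-matrix entry) times the vector
# from i to j, so highly connected cells are pulled together most strongly.

def force_on(i, positions, connectivity):
    """Net spring force pulling cell i toward its connected cells."""
    xi, yi = positions[i]
    fx = fy = 0.0
    for j, (xj, yj) in positions.items():
        if j == i:
            continue
        c = connectivity.get((i, j), 0) + connectivity.get((j, i), 0)
        fx += c * (xj - xi)        # attraction along the vector i -> j
        fy += c * (yj - yi)
    return fx, fy

positions = {"A": (0.0, 0.0), "B": (4.0, 0.0), "I": (2.0, 2.0)}
connectivity = {("A", "I"): 2}     # two nets connect A and I: two springs
print(force_on("A", positions, connectivity))
```

The zero-force location of a cell (where this sum vanishes) is its ideal position; force-directed algorithms move or swap cells to approach it.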
• Figure 5 illustrates the different kinds of force-directed
placement algorithms.
• The force-directed interchange algorithm uses the force
vector to select a pair of logic cells to swap.
• In force-directed relaxation a chain of logic cells is
moved.
• The force-directed pairwise relaxation algorithm swaps
one pair of logic cells at a time.
• We reach a force-directed solution when we minimize
the energy of the system, corresponding to minimizing
the sum of the squares of the distances separating logic
cells.
• Force-directed placement algorithms thus also use a
quadratic cost function.
FIGURE 5: Force-directed iterative placement
improvement.
(a) Force-directed interchange.
(b) Force-directed relaxation.
(c) Force-directed pairwise relaxation.
Simulated Annealing
• Because simulated annealing requires many iterations, it
is critical that the placement objectives be easy and fast
to calculate.
• The optimum connection pattern, the MRST, is difficult to
calculate.
• Using the half-perimeter measure corresponds to
minimizing the total interconnect length.
• Applying simulated annealing to placement, the algorithm
is as follows:
1. Select logic cells for a trial interchange, usually at random.
2. Evaluate the objective function E for the new placement.
3. If ΔE is negative or zero, then exchange the logic cells. If ΔE is
positive, then exchange the logic cells with a probability of
exp(-ΔE/T).
4. Go back to step 1 for a fixed number of times, and then lower the
temperature T according to a cooling schedule: Tn+1 = 0.9 Tn.
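The four-step loop can be sketched directly. The toy netlist, temperatures, and move counts below are illustrative choices, not recommended values:

```python
# Simulated-annealing placement sketch: random pair swaps, accept all
# improvements, accept degradations with probability exp(-dE/T), and cool
# with T <- 0.9 * T. The cost is total half-perimeter wirelength.
import math
import random

def hpwl(pts):
    xs = [p[0] for p in pts]; ys = [p[1] for p in pts]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def cost(placement, nets):
    return sum(hpwl([placement[c] for c in net]) for net in nets)

def anneal(placement, nets, t=10.0, moves_per_t=50, t_min=0.01, seed=0):
    rng = random.Random(seed)
    cells = list(placement)
    while t > t_min:
        for _ in range(moves_per_t):             # step 4: fixed count per T
            a, b = rng.sample(cells, 2)          # step 1: random pair
            before = cost(placement, nets)
            placement[a], placement[b] = placement[b], placement[a]
            d_e = cost(placement, nets) - before  # step 2: evaluate dE
            # step 3: keep if dE <= 0, else keep with probability exp(-dE/T)
            if d_e > 0 and rng.random() >= math.exp(-d_e / t):
                placement[a], placement[b] = placement[b], placement[a]
        t *= 0.9                                 # cooling schedule
    return placement

nets = [("A", "B"), ("B", "C"), ("C", "D")]
placement = {"A": (0, 0), "B": (3, 0), "C": (1, 0), "D": (2, 0)}
result = anneal(placement, nets)
print(cost(result, nets))
```

At high temperature most uphill moves are accepted (escaping local minima); as T falls, the loop degenerates into greedy improvement.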
Timing-Driven Placement Methods
• Minimizing delay is becoming more and more important
as a placement objective.
• There are two main approaches:
– net based
– path based

• We can use net weights in our algorithms.
• The problem is to calculate the weights.
• One method finds the n most critical paths (using a
timing-analysis engine, possibly in the synthesis tool).
• The net weights might then be the number of times each
net appears in this list.
• Another method to find the net weights uses the zero-
slack algorithm.
• Figure 6 (a) shows a circuit with primary inputs at which
we know the arrival times (actual times) of each signal.
• We also know the required times for the primary outputs:
the points in time at which we want the signals to be
valid.
• We can work forward from the primary inputs and
backward from the primary outputs to determine arrival
and required times at each input pin for each net.
• The difference between the required and arrival times at
each input pin is the slack time (the time we have to
spare).
• The zero-slack algorithm adds delay to each net until the
slacks are zero, as shown in Figure 6 (b).
• The net delays can then be converted to weights or
constraints in the placement.
FIGURE 6: The zero-slack algorithm.
(a) The circuit with no net delays.
(b) The zero-slack algorithm adds net delays (at the
outputs of each gate, equivalent to increasing the gate
delay) to reduce the slack times to zero.
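The arrival/required/slack bookkeeping that the zero-slack algorithm builds on can be sketched for a simple gate chain. The gate delays and required time below are invented:

```python
# Sketch of the slack computation behind the zero-slack algorithm: arrival
# times propagate forward from the primary inputs, required times propagate
# backward from the primary outputs, and slack = required - arrival. The
# algorithm then adds net delay until every slack is zero.

gates = [("g1", 2.0), ("g2", 3.0)]   # (name, gate delay) in series
t_arrival_in = 0.0                   # arrival time at the primary input
t_required_out = 9.0                 # required time at the primary output

# Forward pass: arrival time at each gate output.
arrival = {}
t = t_arrival_in
for name, delay in gates:
    t += delay
    arrival[name] = t

# Backward pass: required time at each gate output.
required = {}
t = t_required_out
for name, delay in reversed(gates):
    required[name] = t
    t -= delay

slack = {name: required[name] - arrival[name] for name, _ in gates}
print(slack)  # spare time at each node; zero-slack adds net delay to consume it
```

Once each net's allowed extra delay is known, it is converted into a weight or constraint for the placer.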
A Simple Placement Example
• Figure 7 shows an example network and placements to
illustrate the measures for interconnect length and
interconnect congestion.
• Figure (b) and (c) illustrate the meaning of total routing
length, the maximum cut line in the x -direction, the
maximum cut line in the y -direction, and the maximum
density.
• In this example we have assumed that the logic cells are
all the same size, connections can be made to terminals
on any side, and the routing channels between each
adjacent logic cell have a capacity of 2.
• Figure (d) shows what the completed layout might look
like.
FIGURE 7: A simple placement example.
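The interconnect measures used in this example can be computed directly from a placement. The sketch below (with an invented four-cell placement) uses the bounding box of each net as the length estimate and, under one plausible reading of "maximum cut line in the x-direction", counts the nets straddled by each vertical cut between adjacent columns.

```python
def placement_measures(cells, nets):
    """cells: name -> (x, y) grid position; nets: list of lists of
    cell names. Returns (total bounding-box length, max x-cut)."""
    boxes = []
    for net in nets:
        xs = [cells[c][0] for c in net]
        ys = [cells[c][1] for c in net]
        boxes.append((min(xs), max(xs), min(ys), max(ys)))
    # half-perimeter (bounding-box) estimate of each net's length
    total = sum((x2 - x1) + (y2 - y1) for x1, x2, y1, y2 in boxes)
    # a vertical cut between columns c and c+1 crosses every net
    # whose x-span straddles it
    max_x = max(x for x, _ in cells.values())
    max_cut_x = max(sum(x1 <= c < x2 for x1, x2, _, _ in boxes)
                    for c in range(max_x))
    return total, max_cut_x
```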
Summary
• Placement follows the floorplanning step and is more
automated.
• It consists of organizing an array of logic cells within a
flexible block.
• The criterion for optimization may be minimum
interconnect area, minimum total interconnect length, or
performance.
• Because interconnect delay in a submicron CMOS
process dominates logic-cell delay, planning of
interconnect will become more and more important.
• Instead of completing synthesis before starting
floorplanning and placement, we have to use synthesis
and floorplanning/placement tools together to achieve an
accurate estimate of timing.
Routing
• Once the designer has floorplanned a chip and
the logic cells within the flexible blocks have
been placed, it is time to make the connections
by routing the chip.
• Given a placement, and a fixed number of metal
layers, find a valid pattern of horizontal and
vertical wires that connect the terminals of the
nets.
• This is still a hard problem that is made easier
by dividing it into smaller problems.
• Routing is usually split into global routing
followed by detailed routing.
Global Routing
Agenda
• Introduction
• Global Routing Goals & Objectives
• Measurement of Interconnect Delay
• Global Routing Methods
• Global Routing Between Blocks
• Global Routing Inside Flexible Blocks
• Timing-Driven Methods
• Back-annotation
Introduction
• The details of global routing differ slightly
between cell-based ASICs, gate arrays, and
FPGAs, but the principles are the same in each
case.
• A global router does not make any connections,
it just plans them.
• We typically global route the whole chip (or large
pieces if it is a large chip) before detail routing
the whole chip (or the pieces).
• There are two types of areas to global route:
– inside the flexible blocks,
– between blocks.
• The input to the global router is:
– a floorplan that includes the locations of all the fixed
and flexible blocks,
– the placement information for flexible blocks and the
locations of all the logic cells.
• Floorplanning and placement both assume that
interconnect may be put anywhere on a
rectangular grid, since at this point nets have not
been assigned to the channels, but the global
router must use the wiring channels and find the
actual path.
• Often the global router needs to find a path that minimizes the delay between two terminals; this is not necessarily the same as finding the shortest total path length for a set of terminals.
Global Routing Goals & Objectives
• The goal of global routing is to
– Provide complete instructions to the detailed router on
where to route every net.
• The objectives of global routing are one or more of the following:
– Minimize the total interconnect length.
– Maximize the probability that the detailed router can
complete the routing.
– Minimize the critical path delay.
Measurement of Interconnect
Delay
Figure 1: Measuring the delay of a net.
(a) A simple circuit with an inverter A driving a net with a
fanout of two.
Voltages V1, V2, V3, and V4 are the voltages at
intermediate points along the net.
(b) The layout showing the net segments (pieces of
interconnect).
(c) The RC model with each segment replaced by a
capacitance and resistance.
The ideal switch and pull-down resistance Rpd model the
inverter A.
• The global router does not know how much of the routing
is on which of the layers, or how many vias will be used
and of which type, or how wide the metal lines will be.
• It may be possible to estimate how much interconnect
will be horizontal and how much is vertical.
• This knowledge does not help much if horizontal
interconnect may be completed in either m1 or m3 and
there is a large difference in parasitic capacitance
between m1 and m3, for example.
• When the global router attempts to minimize
interconnect delay, there is an important difference
between a path and a net.
• The path that minimizes the delay between two terminals
on a net is not necessarily the same as the path that
minimizes the total path length of the net.
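For an RC model like the one in Figure 1(c), a common delay estimate is the Elmore formula: the delay from driver to a sink is the sum, over each segment on the path, of the segment resistance times all capacitance downstream of that segment. A sketch follows; the tree topology and R/C values in the example are invented.

```python
def elmore_delay(children, cap, root, sink):
    """children: node -> list of (child, segment_resistance);
    cap: node -> capacitance lumped at that node.
    Returns the Elmore delay estimate from root to sink."""
    def downstream_cap(n):  # capacitance at n and everything below it
        return cap.get(n, 0) + sum(downstream_cap(c)
                                   for c, _ in children.get(n, []))

    def path(n):  # list of (child, R) edges from n down to the sink
        if n == sink:
            return []
        for c, r in children.get(n, []):
            p = path(c)
            if p is not None:
                return [(c, r)] + p
        return None

    # Elmore: sum over path edges of R * (capacitance downstream of edge)
    return sum(r * downstream_cap(c) for c, r in path(root))
```

For the two-segment example in the test, the delay is R1·(C1 + C2) + R2·C2, illustrating why the minimum-delay path to one sink can differ from the minimum-total-length tree for the whole net.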
Global Routing Methods
• Global routing cannot use the interconnect-length
approximations, such as the half-perimeter measure,
that were used in placement.
• What is needed now is the actual path and not an
approximation to the path length.
• Many of the methods used in global routing are still
based on the solutions to the tree on a graph problem.
• One approach to global routing takes each net in turn
and calculates the shortest path using tree on graph
algorithms with the added restriction of using the
available channels.
• This process is known as sequential routing.
• As a sequential routing algorithm proceeds, some
channels will become more congested since they hold
more interconnects than others.
• In the case of FPGAs and channeled gate arrays, the
channels have a fixed channel capacity and can only
hold a certain number of interconnects.
• There are two different ways that a global router
normally handles this problem:
– Using order-independent routing, a global router proceeds by
routing each net, ignoring how crowded the channels are.
• Whether a particular net is processed first or last does not matter,
the channel assignment will be the same.
– In order-independent routing, after all the interconnects are
assigned to channels, the global router returns to those channels
that are the most crowded and reassigns some interconnects to
other, less crowded, channels.
• Alternatively, a global router can consider the number of
interconnects already placed in various channels as it
proceeds.
• In this case the global routing is order-dependent: the routing is still sequential, but now the order of processing the nets will affect the results.
• Iterative improvement or simulated annealing may be
applied to the solutions found from both order-
dependent and order-independent algorithms.
• This is implemented in the same way as for system
partitioning and placement.
• A constructed solution is successively changed, one
interconnect path at a time, in a series of random
moves.
• In contrast to sequential global-routing methods, which
handle nets one at a time, hierarchical routing handles
all nets at a particular level at once.
• Rather than handling all of the nets on the chip at the
same time, the global-routing problem is made more
tractable by dividing the chip area into levels of
hierarchy.
• By considering only one level of hierarchy at a time the
size of the problem is reduced at each level.
• There are two ways to traverse the levels of hierarchy.
– Starting at the whole chip, or highest level, and proceeding down
to the logic cells is the top-down approach.
– The bottom-up approach starts at the lowest level of hierarchy
and globally routes the smallest areas first.
Global Routing Between Blocks
• Figure 2 illustrates the global-routing problem for a cell-
based ASIC.
• Each edge in the channel-intersection graph in Figure (c)
represents a channel.
• The global router is restricted to using these channels.
• The weight of each edge in the graph corresponds to the
length of the channel.
• The global router plans a path for each interconnect using
this graph.
FIGURE 2: (a) A cell-based ASIC with numbered channels
(b) The channels form the edges of a graph.
(c) The channel-intersection graph.
Each channel corresponds to an edge on a graph whose
weight corresponds to the channel length.
Figure 2
• Figure 3 shows an example of global routing for a net
with five terminals, labeled A1 through F1, for the cell-
based ASIC shown in Figure 2.
• If a designer wishes to use minimum total interconnect
path length as an objective, the global router finds the
minimum-length tree shown in Figure (b).
• This tree determines the channels the interconnects will
use.
• For example, the shortest connection from A1 to B1
uses channels 2, 1, and 5 (in that order).
• This is the information the global router passes to the
detailed router.
• Figure (c) shows that minimizing the total path length
may not correspond to minimizing the path delay
between two points.
FIGURE 3: Finding paths in global routing.
(a) A cell-based ASIC (from Figure 2) showing a single
net with a fanout of four (five terminals).
(b) The terminals are projected to the center of the
nearest channel, forming a graph.
(c) The minimum-length tree does not necessarily
correspond to minimum delay.
Global Routing Inside Flexible
Blocks
• Figure 4 (a) shows the routing resources on a sea-of-
gates or channelless gate array.
• The gate array base cells are arranged in 36 blocks, each block containing an array of 8-by-16 gate-array base cells, making a total of 4608 base cells.
• The horizontal interconnect resources are the routing
channels that are formed from unused rows of the gate-
array base cells, as shown in Figure (b) and (c).
• The vertical resources are feedthroughs.
• For example, the logic cell shown in Figure (d) is an
inverter that contains two types of feedthrough.
FIGURE 4: Gate-array global routing.
(a) A small gate array.
(b) An enlarged view of the routing. The top channel uses three
rows of gate-array base cells, the other channels use only one.
(c) A further enlarged view showing how the routing in the
channels connects to the logic cells.
(d) One of the logic cells, an inverter.
(e) There are seven horizontal wiring tracks available in one row of gate-array base cells; the channel capacity is thus 7.
• The inverter logic cell uses a single gate-array base cell
with terminals (or connectors) located at the top and
bottom of the logic cell.
• The inverter input pin has two electrically equivalent
terminals that the global router can use as a
feedthrough.
• The output of the inverter is connected to only one
terminal.
• The remaining vertical track is unused by the inverter
logic cell, so this track forms an uncommitted
feedthrough.
• The terms landing pad (because we say that we drop a via to a landing pad), pick-up point, connector, terminal, pin, and port are used for the connection to a logic cell.
• The term pick-up point refers to the physical pieces of metal/polysilicon in the logic cell to which the router connects.
• In a 3-level metal process, the global router may be able to connect anywhere within an area; this is an area pick-up point.
• The term pin refers to the connection on a logic
schematic icon (a dot, square box, or whatever symbol is
used), rather than layout.
• Thus the difference between a pin and a connector is that
we can have multiple connectors for one pin.
• The term port is used when we are using text (EDIF
netlists or HDLs, for example) to describe circuits.
• Figure (e) shows a gate-array base cell with seven
horizontal tracks.
• In a gate array, we can have a channel with a capacity of 7, 14, 21, ... horizontal tracks, but not values in between.
• Figure 5 shows the inverter macro for the sea-of-gates
array shown in Figure 4.
• Figure (a) shows the base cell.
• Figure (b) shows how the internal inverter wiring on m1 leaves one vertical track free as a feedthrough in a two-level metal process (connectors placed at the top and bottom of the cell).
• In a three-level metal process the connectors may be
placed inside the cell abutment box (Figure c).
• Figure 6 shows the global routing for the sea-of-gates
array.
• We divide the array into nonoverlapping routing bins (or
just bins, also called global routing cells or GRCs ), each
containing a number of gate-array base cells.
FIGURE 5: The gate-array inverter from Figure 4 (d).
(a) An oxide-isolated gate-array base cell, showing the
diffusion and polysilicon layers.
(b) The metal and contact layers for the inverter in a 2LM
(two-level metal) process.
(c) The router's view of the cell in a 3LM process.
• A large routing bin reduces the size of the routing
problem, and a small routing bin allows the router to
calculate the wiring capacities more accurately.
• Figure 6 (a) shows a routing bin that is 2-by-4 gate-array
base cells.
• The logic cells occupy the lower half of the routing bin.
• The upper half of the routing bin is the channel area,
reserved for wiring.
• The global router calculates the edge capacities for this
routing bin, including the vertical feedthroughs.
• The global router then determines the shortest path for
each net considering these edge capacities.
• The path, described by a series of adjacent routing bins,
is passed to the detailed router.
FIGURE 6: Global routing a gate array.
(a) A single global-routing cell (GRC or routing
bin) containing 2-by-4 gate-array base cells.
– For this choice of routing bin the maximum horizontal
track capacity is 14, the maximum vertical track
capacity is 12.
– The routing bin labeled C3 contains three logic cells,
two of which have feedthroughs marked 'f'.
– This results in the edge capacities shown.
(b) A view of the top left-hand corner of the gate
array showing 28 routing bins.
– The global router uses the edge capacities to find a
sequence of routing bins to connect the nets.
Figure 6
Timing-Driven Methods
• Minimizing the total path length using a Steiner tree
does not necessarily minimize the interconnect delay of
a path.
• Alternative tree algorithms apply in this situation, most
using the Elmore constant as a method to estimate the
delay of a path.
• As in timing-driven placement, there are two main
approaches to timing-driven routing:
– net-based
– path-based
• Path-based methods are more sophisticated.
• For example, if there is a critical path from logic cell A
to B to C, the global router may increase the delay due
to the interconnect between logic cells A and B if it can
reduce the delay between logic cells B and C.
• Placement and global routing tools may or may not
use the same algorithm to estimate net delay.
• If these tools are from different companies, the
algorithms are probably different.
• There is no use performing placement to minimize
predicted delay if the global router uses completely
different measurement methods.
• Companies that produce floorplanning and placement tools make sure that the output is compatible with different routing tools, often to the extent of using different algorithms to target different routers.
Back-annotation
• After global routing is complete it is possible to accurately
predict what the length of each interconnect in every net
will be after detailed routing (probably to within 5 %).
• The global router can give us not just an estimate of the
total net length, but the resistance and capacitance of
each path in each net.
• This RC information is used to calculate net delays.
• We can back-annotate this net delay information to the
synthesis tool for in-place optimization or to a timing
verifier to make sure there are no timing surprises.
• Differences in timing predictions at this point arise due to
the different ways in which the placement algorithms
estimate the paths and the way the global router actually
builds the paths.
Detailed Routing
Agenda
• Introduction
• Detailed Routing Goals & Objectives
• Measurement of Channel Density
• Algorithms
– Left-Edge Algorithm
• Constraints and Routing Graphs
• Area-Routing Algorithms
• Multilevel Routing
• Timing-Driven Detailed Routing
Introduction
• The global routing step determines the channels to be
used for each interconnect.
• Using this information the detailed router decides the
exact location and layers for each interconnect.
• Figure 1(a) shows typical metal rules.
• These rules determine the m1 routing pitch (track pitch,
track spacing, or just pitch).
• We can set the m1 pitch to one of three values:
– via-to-via ( VTV ) pitch (or spacing),
– via-to-line ( VTL or line-to-via ) pitch, or
– line-to-line ( LTL ) pitch.
FIGURE 1: The metal routing pitch.
(a) An example of λ-based metal design rules for m1 and via1 (m1/m2 via).
(b) Via-to-via pitch for adjacent vias.
(c) Via-to-line (or line-to-via) pitch for nonadjacent vias.
(d) Line-to-line pitch with no vias.
• The same choices apply to the m2 and other metal
layers if they are present.
• Via-to-via spacing allows the router to place vias
adjacent to each other.
• Via-to-line spacing is hard to use in practice because it
restricts the router to nonadjacent vias.
• Using line-to-line spacing prevents the router from
placing a via at all without using jogs and is rarely used.
• Via-to-via spacing is the easiest for a router to use and
the most common.
• Using either via-to-line or via-to-via spacing means that
the routing pitch is larger than the minimum metal pitch.
Figure 2: Vias.
• In a two-level metal CMOS ASIC technology we complete the wiring using the two different metal layers for the horizontal and vertical directions, one layer for each direction; this is Manhattan routing.
• Thus, for example, if terminals are on the m2 layer, then we route the vertical branches in a channel using m2 and the horizontal trunks using m1.
• Figure 3 shows that, although we may choose a preferred
direction for each metal layer, this may lead to problems
in cases that have both horizontal and vertical channels.
• In these cases we define a preferred metal layer in the
direction of the channel spine.
• In Figure 3, the logic cell connectors are on m2, so any vertical channel has to use vias at every logic cell location.
Figure 3
• By changing the orientation of the metal directions in
vertical channels, we can avoid this, and instead we only
need to place vias at the intersection of horizontal and
vertical channels.
FIGURE 3: An expanded view of part of a cell-based ASIC.
(a) Both channel 4 and channel 5 use m1 in the
horizontal direction and m2 in the vertical direction.
If the logic cell connectors are on m2 this requires vias to
be placed at every logic cell connector in channel 4.
(b) Channel 4 and 5 are routed with m1 along the
direction of the channel spine (the long direction of the
channel).
Now vias are required only for nets 1 and 2, at the
intersection of the channels.
• Figure 4 shows an imaginary logic cell with connectors.
• This cell has connectors at the top and bottom of the cell
(normal for cells intended for use with a two-level metal
process) and internal connectors (normal for logic cells
intended for use with a three-level metal process).
• The interconnect and connections are drawn to scale.
• Double-entry logic cells intended for two-level metal
routing have connectors at the top and bottom of the
logic cell, usually in m2.
• Logic cells intended for processes with three or more
levels of metal have connectors in the center of the cell,
again usually on m2.
• Logic cells may use both m1 and m2 internally, but the
use of m2 is usually minimized.
FIGURE 4: The different types of connections that can be
made to a cell.
• The router normally uses a simplified view of the
logic cell called a phantom.
• The phantom contains only the logic cell
information that the router needs:
– the connector locations, types, and names;
– the abutment and bounding boxes;
– enough layer information to be able to place cells
without violating design rules;
– a blockage map: the locations of any metal inside the cell that blocks routing.
• Figure 5 illustrates some terms used in the detailed
routing of a channel.
FIGURE 5: Terms used in channel routing.
(a) A channel with four horizontal tracks.
(b) An expanded view of the left-hand portion of the
channel showing (approximately to scale) how the m1
and m2 layers connect to the logic cells on either side of
the channel.
(c) The construction of a via1 (m1/m2 via).
• The channel spine in Figure is horizontal with terminals
at the top and the bottom, but a channel can also be
vertical.
• In either case terminals are spaced along the longest
edges of the channel at given, fixed locations.
Figure 5
• Terminals are usually located on a grid defined by the
routing pitch on that layer (we say terminals are either
on-grid or off-grid).
• We make connections between terminals using
interconnects that consist of one or more trunks running
parallel to the length of the channel and branches that
connect the trunk to the terminals.
• If more than one trunk is used, the trunks are connected
by doglegs.
• Connections exit the channel at pseudo terminals.
• The trunk and branch connections run in tracks (equi-
spaced, like railway tracks).
• If the trunk connections use m1, the horizontal track
spacing (usually just called the track spacing for channel
routing) is equal to the m1 routing pitch.
Detailed Routing Goals & Objectives
• The goal of detailed routing is to complete all the
connections between logic cells.
• The most common objective is to minimize one
or more of the following:
– The total interconnect length and area.
– The number of layer changes that the connections
have to make.
– The delay of critical paths.
• Minimizing the number of layer changes
corresponds to minimizing the number of vias
that add parasitic resistance and capacitance to
a connection.
• In some cases the detailed router may not be
able to complete the routing in the area
provided.
• In the case of a cell-based ASIC or sea-of-gates
array, it is possible to increase the channel size
and try the routing steps again.
• A channeled gate array or FPGA has fixed
routing resources and in these cases we must
start all over again with floorplanning and
placement, or use a larger chip.
Measurement of Channel Density
• We can describe a channel-routing problem by
specifying two lists of nets:
– One for the top edge of the channel,
– One for the bottom edge.
• The position of the net number in the list gives the
column position.
• The net number zero represents a vacant or unused
terminal.
• Figure 6 shows a channel with the numbered terminals
to be connected along the top and the bottom of the
channel.
• The number of nets that cross a line drawn vertically
anywhere in a channel is called local density.
• We call the maximum local density of the channel the
global density or sometimes just channel density.
• Figure 6 has a channel density of 4.
• Channel density is an important measure in routing; it tells a router the absolute fewest number of horizontal interconnects that it needs at the point where the local density is highest.
• In two-level routing (all the horizontal interconnects run
on one routing layer) the channel density determines the
minimum height of the channel.
• The channel capacity is the maximum number of
interconnects that a channel can hold.
• If the channel density is greater than the channel
capacity, that channel definitely cannot be routed.
FIGURE 6: The definitions of local channel density and
global channel density.
Lines represent the m1 and m2 interconnect in the
channel to simplify the drawing.
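The two density measures defined above can be computed directly from the top and bottom terminal lists (0 = vacant terminal). The channel in the example below is invented.

```python
def channel_density(top, bottom):
    """top, bottom: lists of net numbers along each channel edge
    (0 = vacant). Returns (global density, local density per column)."""
    spans = {}  # net -> (leftmost column, rightmost column)
    for row in (top, bottom):
        for col, net in enumerate(row):
            if net == 0:
                continue
            lo, hi = spans.get(net, (col, col))
            spans[net] = (min(lo, col), max(hi, col))
    ncols = max(len(top), len(bottom))
    # local density: nets whose horizontal span crosses each column
    local = [sum(lo <= c <= hi for lo, hi in spans.values())
             for c in range(ncols)]
    return max(local), local
```

The global density returned here is the lower bound on the number of horizontal tracks a two-level channel router needs.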
Algorithms
• We start discussion of routing methods by simplifying the
general channel-routing problem.
• The restricted channel-routing problem limits each net in
a channel to use only one horizontal segment.
• The channel router uses only one trunk for each net.
• This restriction has the effect of minimizing the number
of connections between the routing layers.
• This is equivalent to minimizing the number of vias used
by the channel router in a two-layer metal technology.
• Minimizing the number of vias is an important objective
in routing a channel, but it is not always practical.
• Sometimes constraints will force a channel router to use
jogs or other methods to complete the routing.
Left-Edge Algorithm
• The LEA applies to two-layer channel routing, using one
layer for the trunks and the other layer for the branches.
• The LEA proceeds as follows:
1. Sort the nets according to the leftmost edges of the nets' horizontal segments.
2. Assign the first net on the list to the first free track.
3. Assign the next net on the list that will fit to the track.
4. Repeat this process from step 3 until no more nets will
fit in the current track.
5. Repeat steps 2 - 4 until all nets have been assigned to
tracks.
6. Connect the net segments to the top and bottom of the
channel.
FIGURE 7: Example illustrating the Left-edge algorithm.
(a) Sorted list of segments.
(b) Assignment to tracks.
(c) Completed channel route.
• The algorithm works as long as none of the branches touch, which may occur if there are terminals in the same column belonging to different nets.
• In this situation we have to make sure that the trunk that
connects to the top of the channel is placed above the
lower trunk.
• Otherwise two branches will overlap and short the nets
together.
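The steps above amount to greedy interval packing and can be sketched as follows; the net spans in the example are invented, and vertical constraints are ignored, as in the restricted channel-routing problem.

```python
def left_edge(nets):
    """nets: net name -> (leftmost, rightmost) column of its trunk.
    Returns net name -> track number (0 = first track)."""
    order = sorted(nets, key=lambda n: nets[n][0])  # step 1: sort by left edge
    track_right = []   # rightmost occupied column of each track
    assignment = {}
    for net in order:
        left, right = nets[net]
        for t, r in enumerate(track_right):
            if left > r:               # net fits after the last trunk in t
                track_right[t] = right
                assignment[net] = t
                break
        else:                          # no existing track fits: open a new one
            track_right.append(right)
            assignment[net] = len(track_right) - 1
    return assignment
```

With no vertical constraints, this greedy packing uses the minimum number of tracks, equal to the channel density.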
Constraints and Routing Graphs
• Two terminals that are in the same column in a channel
create a vertical constraint.
• The terminal at the top of the column imposes a vertical
constraint on the lower terminal.
• Draw a graph showing the vertical constraints imposed
by terminals.
• The nodes in a vertical-constraint graph represent
terminals.
• A vertical constraint between two terminals is shown by
an edge of the graph connecting the two terminals.
• A graph that contains information in the direction of an
edge is a directed graph.
• The arrow on the graph edge shows the direction of the constraint, pointing to the lower terminal, which is constrained.
FIGURE 8: Routing graphs.
(a) Channel with a global density of 4.
(b) The vertical constraint graph.
– If two nets occupy the same column, the net at the top of the
channel imposes a vertical constraint on the net at the bottom.
– For example, net 2 imposes a vertical constraint on net 4.
– Thus the interconnect for net 4 must use a track above net 2.
(c) Horizontal-constraint graph.
– If the segments of two nets overlap, they are connected in the
horizontal-constraint graph.
– This graph determines the global channel density.
• We can also define a horizontal constraint and a
corresponding horizontal-constraint graph.
• If the trunk for net 1 overlaps the trunk of net 2, then we
say there is a horizontal constraint between net 1 and
net 2.
• Unlike a vertical constraint, a horizontal constraint has
no direction.
• Figure (c) shows an example of a horizontal-constraint graph and shows a group of 4 nets (numbered 3, 5, 6, and 7) that must all overlap.
• Since this is the largest such group, the global channel
density is 4.
• If there are no vertical constraints at all in a channel, we
can guarantee that the LEA will find the minimum
number of routing tracks.
• There is also an arrangement of vertical constraints that
none of the algorithms based on the LEA can cope with.
• In Figure 9 (a) net 1 is above net 2 in the first column of
the channel.
• Thus net 1 imposes a vertical constraint on net 2.
• Net 2 is above net 1 in the last column of the channel.
• Then net 2 also imposes a vertical constraint on net 1.
• It is impossible to route this arrangement using two
routing layers with the restriction of using only one trunk
for each net.
• If we construct the vertical-constraint graph for this
situation, shown in Figure (b), there is a loop or cycle
between nets 1 and 2.
Figure 9
• If there is any such vertical-constraint cycle (or cyclic
constraint) between two or more nets, the LEA will fail.
• A dogleg router removes the restriction that each net
can use only one track or trunk.
• Figure (c) shows how adding a dogleg permits a channel
with a cyclic constraint to be routed.
• The addition of a dogleg, an extra trunk, in the wiring of a
net can resolve cyclic vertical constraints.
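Building the vertical-constraint graph from the terminal lists and detecting the cyclic case of Figure 9 can be sketched as follows (0 = vacant terminal; the channels in the test are invented):

```python
from collections import defaultdict

def vertical_constraints(top, bottom):
    """Directed edges top_net -> bottom_net for each shared column."""
    return {(t, b) for t, b in zip(top, bottom)
            if t != 0 and b != 0 and t != b}

def has_cyclic_constraint(edges):
    """Depth-first search for a cycle in the vertical-constraint graph."""
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
    state = {}  # node -> "visiting" or "done"

    def dfs(u):
        state[u] = "visiting"
        for v in succ[u]:
            if state.get(v) == "visiting":
                return True            # back edge: a constraint cycle
            if v not in state and dfs(v):
                return True
        state[u] = "done"
        return False

    return any(u not in state and dfs(u) for u in list(succ))
```

When this check reports a cycle, a plain left-edge router must give up; a dogleg router breaks the cycle by splitting one net into two trunks.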
Area-Routing Algorithms
• There are many algorithms used for the detailed routing
of general-shaped areas.
• The first group we shall cover and the earliest to be used
historically are the grid-expansion or maze-running
algorithms.
• A second group of methods, which are more efficient,
are the line-search algorithms.
• Figure 10 illustrates the Lee maze-running algorithm.
• The goal is to find a path from X to Y, i.e., from the start to the finish, avoiding any obstacles.
• The algorithm is often called wave propagation because
it sends out waves, which spread out like those created
by dropping a stone into a pond.
FIGURE 10: The Lee maze-running algorithm.
• The algorithm finds a path from source (X) to target (Y)
by emitting a wave from both the source and the target
at the same time.
• Successive outward moves are marked in each bin.
• Once the target is reached, the path is found by
backtracking (if there is a choice of bins with equal
labeled values, we choose the bin that avoids changing
direction).
• The original form of the Lee algorithm uses a single
wave.
• Algorithms that use lines rather than waves to search for
connections are more efficient than algorithms based on
the Lee algorithm.
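A minimal single-wave Lee router over a grid of routing bins can be sketched as below (0 = free bin, 1 = obstacle; the grid in the test is invented). The wave labels each reachable bin with its distance from the source, and the path is recovered by backtracking from the target.

```python
from collections import deque

def lee_route(grid, src, dst):
    """grid: list of rows, 0 = free, 1 = obstacle; src, dst: (row, col).
    Returns the list of bins on a shortest path, or None if unroutable."""
    rows, cols = len(grid), len(grid[0])
    label = {src: 0}
    q = deque([src])
    while q and dst not in label:      # wave expansion (breadth-first)
        r, c = q.popleft()
        for nr, nc in ((r+1, c), (r-1, c), (r, c+1), (r, c-1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in label):
                label[(nr, nc)] = label[(r, c)] + 1
                q.append((nr, nc))
    if dst not in label:
        return None                    # no path: the net is unroutable
    path, (r, c) = [dst], dst          # backtrack along decreasing labels
    while (r, c) != src:
        for nr, nc in ((r+1, c), (r-1, c), (r, c+1), (r, c-1)):
            if label.get((nr, nc)) == label[(r, c)] - 1:
                r, c = nr, nc
                break
        path.append((r, c))
    return path[::-1]
```

The breadth-first expansion guarantees a shortest path if one exists, at the cost of labeling a large portion of the grid, which is exactly the memory cost that line-search algorithms avoid.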
Figure 11 illustrates the Hightower algorithm, a line-search algorithm (or line-probe algorithm):
1. Extend lines from both the source and target toward
each other.
2. When an extended line, known as an escape line,
meets an obstacle, choose a point on the escape line
from which to project another escape line at right angles
to the old one.
This point is the escape point.
3. Place an escape point on the line so that the next
escape line just misses the edge of the obstacle.
Escape lines emanating from the source and target
intersect to form the path.
FIGURE 11: Hightower area-routing algorithm.
(a) Escape lines are constructed from source (X) and
target (Y) toward each other until they hit obstacles.
(b) An escape point is found on the escape line so that
the next escape line perpendicular to the original misses
the next obstacle.
The path is complete when escape lines from source
and target meet.
• The Hightower algorithm is faster and requires less memory than methods based on the Lee algorithm.
Multilevel Routing
• Using two-layer routing, if the logic cells do not contain
any m2, it is possible to complete some routing in m2
using over-the-cell (OTC) routing.
• Sometimes poly is used for short connections in the
channel in a two-level metal technology; this is known as
2.5-layer routing.
• Using a third level of metal in three-layer routing, there is
a choice of approaches.
• Reserved-layer routing restricts all the interconnect on
each layer to flow in one direction in a given routing area
(for example, in a channel, either parallel or
perpendicular to the channel spine).
• Unreserved-layer routing moves in both horizontal and
vertical directions on a given layer.
• Most routers use reserved routing.
• Reserved three-level metal routing offers another choice:
Either use m1 and m3 for horizontal routing (parallel to
the channel spine), with m2 for vertical routing (HVH
routing) or use VHV routing.
• Since the logic cell interconnect usually blocks most of
the area on the m1 layer, HVH routing is normally used.
• It is also important to consider the pitch of the layers
when routing in the same direction on two different
layers.
• Using HVH routing it is preferable for the m3 pitch to be
a simple multiple of the m1 pitch.
• Figure 12 shows an example of three-layer channel
routing.
• In this diagram the m2 and m3 routing pitch is set to
twice the m1 routing pitch.
• Routing density can be increased further if all the routing pitches can be made equal, a difficult process challenge.
• With three or more levels of metal routing it is possible to
reduce the channel height in a row-based ASIC to zero.
• All of the interconnect is then completed over the cell.
• If all of the channels are eliminated, the core area (logic
cells plus routing) is determined solely by the logic-cell
area.
FIGURE 12: Three-level channel routing.
Timing-Driven Detailed Routing
• In detailed routing the global router has already set the
path the interconnect will follow.
• At this point little can be done to improve timing except
to reduce the number of vias, alter the interconnect width
to optimize delay, and minimize overlap capacitance.
• The gains here are relatively small, but for very long
branching nets even small gains may be important.
• For high-frequency clock nets it may be important to
shape and chamfer (round) the interconnect to match
impedances at branches and control reflections at
corners.
Agenda
• Final Routing Steps
• Special Routing
• Clock Routing
• Power Routing
• Circuit Extraction and DRC
• SPF, RSPF, and DSPF
• Design Checks
Final Routing Steps
• If the congestion-estimation algorithms in the floorplanning tool perfectly reflected the algorithms used by the global and detailed routers, routing completion would be guaranteed.
• Often, the detailed router will not be able to completely
route all the nets.
• These problematical nets are known as unroutes.
• Routers handle this situation in one of two ways:
– The first method leaves the problematical nets
unconnected.
– The second method completes all interconnects anyway
but with some design-rule violations.
• If there are many unroutes the designer needs to
discover the reason and return to the floorplanner and
change channel sizes (for a cell-based ASIC) or increase
the base-array size (for a gate array).
• Returning to the global router and changing bin sizes or
adjusting the algorithms may also help.
• In drastic cases it may be necessary to change the
floorplan.
• If just a handful of difficult nets remain to be routed,
some tools allow the designer to perform hand edits
using a rip-up and reroute router (sometimes this is done
automatically by the detailed router as a last phase in the
routing procedure anyway).
• This capability also permits engineering change
orders (ECO) corresponding to the little yellow
wires on a PCB.
• One of the last steps in routing is via removal: the detailed router looks to see if it can eliminate any vias by changing layers or making other modifications to the completed routing.
• Vias can contribute a significant amount to the
interconnect resistance.
• Routing compaction can then be performed as
the final step.
Special Routing
• The routing of nets that require special attention,
clock and power nets for example, is normally
done before detailed routing of signal nets.
• The architecture and structure of these nets is
performed as part of floorplanning, but the sizing
and topology of these nets is finalized as part of
the routing step.
• Clock Routing and Power Routing are
considered next.
Clock Routing
• Gate arrays normally use a clock spine (a regular grid),
eliminating the need for special routing.
• The clock distribution grid is designed at the same time
as the gate-array base to ensure a minimum clock skew
and minimum clock latency given power dissipation and
clock buffer area limitations.
• Cell-based ASICs may use either a clock spine, a clock
tree, or a hybrid approach.
• Figure 1 shows how a clock router may minimize clock
skew in a clock spine by making the path lengths, and
thus net delays, to every leaf node equal using jogs in
the interconnect paths if necessary.
• More sophisticated clock routers perform clock-tree
synthesis and clock-buffer insertion.
FIGURE 1: Clock routing.
(a) A clock network for the cell-based ASIC.
(b) Equalizing the interconnect segments between CLK
and all destinations (by including jogs if necessary)
minimizes clock skew.
• More sophisticated clock routers perform clock-tree
synthesis (automatically choosing the depth and
structure of the clock tree) and clock-buffer insertion
(equalizing the delay to the leaf nodes by balancing
interconnect delays and buffer delays).
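The path-equalization idea above can be illustrated with a short sketch. This is a simplification that assumes a purely linear delay-per-unit-length model; real clock routers balance full RC delays, and the delay constant and path lengths below are hypothetical.

```python
# Sketch of skew reduction by padding shorter paths with extra ("jog")
# wirelength, under an assumed linear delay model. The constant below is
# illustrative only; real routers work with extracted RC delays.
DELAY_PER_UM = 0.002  # ps of delay per um of interconnect (assumed)

def extra_jog_lengths(path_lengths_um):
    """Extra length to add to each path so all paths match the longest."""
    longest = max(path_lengths_um)
    return [longest - length for length in path_lengths_um]

paths = [1200.0, 950.0, 1100.0]   # routed lengths to three leaf nodes
jogs = extra_jog_lengths(paths)   # [0.0, 250.0, 100.0]
skew_before_ps = (max(paths) - min(paths)) * DELAY_PER_UM
```

After the jogs are added, every path has the same length and, in this simplified model, the skew between leaf nodes is zero.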
• The clock tree may contain multiply-driven nodes (more
than one active element driving a net).
• The net delay models that we have used break down in
this case and we may have to extract the clock network
and perform circuit simulation, followed by back-
annotation of the clock delays to the netlist and the bus
currents to the clock router.
• The sizes of the clock buses depend on the current they
must carry.
• One source of unpredictable clock skew is hot-electron wearout, which gradually changes transistor and buffer delays over the life of the chip.
• Another factor contributing to unpredictable clock skew is
changes in clock-buffer delays with variations in power-
supply voltage due to data-dependent activity.
• This activity-induced clock skew can easily be larger
than the skew achievable using a clock router.
• For example, there is little point in using software
capable of reducing clock skew to less than 100 ps if,
due to fluctuations in power-supply voltage when part of
the chip becomes active, the clock-network delays
change by 200 ps.
• The power buses supplying the buffers driving the clock
spine carry direct current (unidirectional current or DC),
but the clock spine itself carries alternating current
(bidirectional current or AC).
• Difference between electromigration failure rates due to
AC and DC leads to different rules for sizing clock buses.
Power Routing
• Each of the power buses has to be sized according to
the current it will carry.
• Too much current in a power bus can lead to a failure
through a mechanism known as electromigration.
• The required power-bus widths can be estimated
automatically from library information, from a separate
power simulation tool, or by entering the power-bus
widths to the routing software by hand.
• Many routers use a default power-bus width so that it is
quite easy to complete routing of an ASIC without even
knowing about this problem.
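As a hedged sketch of the width estimate described above, the following assumes a single maximum-current-density (electromigration) rule; both constants are illustrative, not taken from any real process.

```python
# Sketch of power-bus width estimation from the current a bus must carry.
# J_MAX_MA_PER_UM and MIN_WIDTH_UM are assumed, illustrative values.
J_MAX_MA_PER_UM = 1.0  # assumed electromigration limit: 1 mA per um of width
MIN_WIDTH_UM = 0.5     # assumed minimum metal width

def power_bus_width_um(current_ma):
    """Smallest width that keeps current density under the assumed limit."""
    return max(current_ma / J_MAX_MA_PER_UM, MIN_WIDTH_UM)

power_bus_width_um(25.0)  # a 25 mA bus needs at least 25 um of metal
```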
• Some CMOS processes have maximum metal-width
rules (or fat-metal rules).
• This is because stress, which occurs especially at the corners of the die during die attach (mounting the die on the chip carrier), can cause large metal areas to lift.
• A solution to this problem is to place slots in the wide
metal lines.
• These rules depend on the ASIC vendor's level of experience.
• To determine the power-bus widths we need to
determine the bus currents.
• The largest problem is emulating the system's operating conditions.
• Input vectors to test the system are not necessarily
representative of actual system operation.
• Gate arrays normally use a regular power grid as part of
the gate-array base.
• The gate-array logic cells contain two fixed-width power
buses inside the cell, running horizontally on m1.
• The horizontal m1 power buses are then strapped in a
vertical direction by m2 buses, which run vertically
across the chip.
• The resistance of the power grid is extracted and
simulated with SPICE during the base-array design to
model the effects of IR drops under worst-case
conditions.
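The IR-drop effect mentioned above can be sketched for the simplest case: one power-bus finger fed from a single end, where every segment carries all the current drawn downstream of it. The per-segment resistance and cell currents are illustrative values, not from any real grid.

```python
# Worst-case IR-drop estimate along one power-bus finger fed from one end:
# the drop accumulates toward the far end because each segment carries all
# downstream current. Values below are invented for illustration.
def ir_drop_at_far_end(r_per_segment_ohm, tap_currents_a):
    drop, downstream = 0.0, sum(tap_currents_a)
    for i in tap_currents_a:
        drop += r_per_segment_ohm * downstream  # I*R drop across this segment
        downstream -= i                          # this cell's current taps off
    return drop

ir_drop_at_far_end(0.05, [0.002] * 10)  # ten cells drawing 2 mA each
```

Real power grids are two-dimensional meshes fed from many pads, which is why the text describes extracting the grid and simulating it with SPICE rather than hand calculation.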
• Standard cells are constructed in a similar fashion to
gate-array cells, with power buses running horizontally in
m1 at the top and bottom of each cell.
• A row of standard cells uses end-cap cells that connect
to the VDD and VSS power buses placed by the power
router.
• Power routing of cell-based ASICs may include the option to add vertical m2 straps at specified intervals.
• Alternatively the number of standard cells that can be
placed in a row may be limited during placement.
• The power router forms an interdigitated comb structure,
minimizing the number of times a VDD or VSS power
bus needs to change layers.
• This is achieved by routing with a routing bias on
preferred layers.
• For example, VDD may be routed with a left-and-down
bias on m1, with VSS routed using right-and-up bias on
m2.
• Using three or more layers of metal for routing, it is
possible to eliminate some of the channels completely.
• In these cases we complete all the routing in m2 and m3
on top of the logic cells using connectors placed in the
center of the cells on m1.
• If we can eliminate the channels between cell rows, we
can flip rows about a horizontal axis and abut adjacent
rows together.
• If the power buses are at the top (VDD) and bottom
(VSS) of the cells in m1 we can abut or overlap the
power buses (joining VDD to VDD and VSS to VSS in
alternate rows).
• Power distribution schemes are also a function of
process and packaging technology.
Circuit Extraction and DRC
• After detailed routing is complete, the exact length and
position of each interconnect for every net is known.
• Now the parasitic capacitance and resistance associated
with each interconnect, via, and contact can be
calculated.
• This data is generated by a circuit-extraction tool in one
of the formats described next.
• It is important to extract the parasitic values that will be
on the silicon wafer.
• The mask data or CIF widths and dimensions that are
drawn in the logic cells are not necessarily the same as
the final silicon dimensions.
SPF, RSPF, and DSPF
• The standard parasitic format (SPF), describes
interconnect delay and loading due to parasitic
resistance and capacitance.
• There are three different forms of SPF:
• Two of them (regular SPF and reduced SPF) contain the
same information, but in different formats, and model the
behavior of interconnect;
• Third form of SPF (detailed SPF) describes the actual
parasitic resistance and capacitance components of a
net.
• Figure next shows the different types of simplified
models that regular and reduced SPF support.
FIGURE : The regular and reduced standard parasitic
format (SPF) models for interconnect.
(a) An example of an interconnect network with fanout.
The driving-point admittance of the interconnect network
is Y(s).
(b) The SPF model of the interconnect.
(c) The lumped-capacitance interconnect model.
(d) The lumped-RC interconnect model.
(e) The PI segment interconnect model.
• The load at the output of gate A is represented by one of
three models: lumped-C, lumped-RC, or PI segment.
• The pin-to-pin delays are modeled by RC delays.
• The key features of regular and reduced SPF are
as follows:
– The loading effect of a net as seen by the driving gate
is represented by choosing one of three different RC
networks: lumped-C, lumped-RC, or PI segment.
– The pin-to-pin delays of each path in the net are
modeled by a simple RC delay (one for each path).
This can be the Elmore constant for each path, but it
need not be.
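Since the Elmore constant is the most common choice for these pin-to-pin RC delays, here is a hedged sketch of it: each resistance along the path from driver to pin is weighted by all the capacitance downstream of it. The segment values are invented for illustration.

```python
# Sketch of the Elmore delay for one path of a net, modeled as an RC
# ladder from driver to pin. Segment values below are illustrative.
def elmore_delay_s(segments):
    """segments: list of (R_ohm, C_farad) pairs ordered driver -> pin."""
    delay = 0.0
    downstream_c = sum(c for _, c in segments)
    for r, c in segments:
        delay += r * downstream_c  # this R sees all remaining downstream C
        downstream_c -= c
    return delay

# Two segments of 100 ohm and 50 fF each give 15 ps for this path.
elmore_delay_s([(100.0, 50e-15), (100.0, 50e-15)])
```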
• An example regular SPF file for just one net would use the PI segment model shown in Figure (e).
• The reduced SPF (RSPF) contains the same
information as regular SPF, but uses the SPICE
format.
• The detailed SPF (DSPF) shows the resistance and
capacitance of each segment in a net, again in a SPICE
format.
• There are no models or assumptions on calculating the
net delays in this format.
• Here is an example DSPF file that describes the
interconnect shown in Figure (a).
• Figure (b) illustrates the meanings of the DSPF terms:
– Instance PinName,
– InstanceName,
– PinName,
– NetName,
– SubNodeName.
• Since the DSPF represents every interconnect segment, DSPF files can be very large (hundreds of MB).
FIGURE: The detailed standard parasitic format (DSPF) for
interconnect representation.
(a) An example network with two m2 paths connected to a
logic cell, INV1. The grid shows the coordinates.
(b) The equivalent DSPF circuit corresponding to the DSPF
file in the text.
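To make the segment-by-segment nature of DSPF concrete, here is an illustrative sketch that totals the parasitics in a DSPF-like SPICE fragment. The fragment and net names are invented, and the parsing handles only bare R/C element lines, not the full DSPF grammar.

```python
# Totaling parasitics from an invented DSPF-like SPICE fragment: each C
# line is a grounded capacitance on a subnode, each R line a resistance
# between subnodes. Only simple element lines are handled here.
dspf_fragment = """
*|NET n1 1.9e-14
C1 n1:1 0 8.0e-15
C2 n1:2 0 1.1e-14
R1 n1:1 n1:2 120.0
""".strip().splitlines()

total_c = sum(float(line.split()[-1])
              for line in dspf_fragment if line.startswith("C"))
total_r = sum(float(line.split()[-1])
              for line in dspf_fragment if line.startswith("R"))
# total_c matches the *|NET header capacitance; total_r is 120 ohms
```

A delay calculator, not the format itself, decides how these raw R and C values become net delays, which is the point made above about DSPF carrying no delay model.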
Design Checks
• ASIC designers perform two major checks before
fabrication.
– The first check is a design-rule check (DRC).
– The other check is a layout versus schematic (LVS).
• DRC ensures that nothing has gone wrong in the process of assembling the logic cells and routing.
• The DRC may be performed at two levels.
• Since the detailed router normally works with logic-cell
phantoms, the first level of DRC is a phantom-level DRC,
which checks for shorts, spacing violations, or other
design-rule problems between logic cells.
• This is principally a check of the detailed router.
• If we have access to the real library-cell layouts, we can
instantiate the phantom cells and perform a second-level
DRC at the transistor level.
• This is principally a check of the correctness of the
library cells.
• The Cadence Dracula software is one de facto standard in this area, and you will often hear reference to a Dracula deck, which consists of the Dracula code describing an ASIC vendor's design rules.
• Sometimes ASIC vendors will give their Dracula decks to
customers so that the customers can perform the DRCs
themselves.
• The LVS check ensures that what is about to be committed to silicon is what is really wanted.
• An electrical schematic is extracted from the
physical layout and compared to the netlist.
• This closes a loop between the logical and
physical design processes and ensures that
both are the same.
 The first problem with an LVS check is that the
transistor-level netlist for a large ASIC forms an
enormous graph.
 LVS software essentially has to match this graph
against a reference graph that describes the design.
 Ensuring that every node corresponds exactly to a
corresponding element in the schematic (or HDL
code) is a very difficult task.
 The first step is normally to match certain key nodes.
 The second problem with an LVS check is
creating a true reference.
 The starting point may be HDL code or a schematic.
 Logic synthesis, test insertion, clock-tree synthesis,
logical-to-physical pad mapping, and several other
design steps each modify the netlist.
 The reference netlist may not be what we wish to
fabricate.
 In this case designers increasingly resort to formal verification, which extracts a Boolean description of the function of the layout and compares it to a known-good HDL description.
Summary
• The most important points in this chapter are:
– Routing is divided into global and detailed routing.
– Routing algorithms should match the placement
algorithms.
– Routing is not complete if there are unroutes.
– Clock and power nets are handled as special cases.
– Clock-net widths and power-bus widths must usually
be set by hand.
– DRC and LVS checks are needed before a design is
complete.
Clock
Agenda
• Considerations for clock signals
• Definitions
• Clock Design Considerations
• Different clock design techniques
• Clock trees
• Clock Tree Synthesis
• Effect of Clock Tree Synthesis
• Clock Tree Optimization
Considerations for clock signals
• The clock signal is used to define a time reference for
the movement of data within that system.
• Clock signals are often regarded as simple control
signals; however, these signals have some very special
characteristics and attributes.
• The clock distribution network distributes the clock
signal(s) from a common point to all the elements that
need it.
• The clock distribution network often takes a significant
fraction of the power consumed by a chip.
• The proper design of the clock distribution network
ensures that these critical timing requirements are
satisfied and that no race conditions exist.
• Since this function is vital to the operation of a
synchronous system, much attention has been given to
the characteristics of these clock signals and the
electrical networks used in their distribution.
• Clock signals are typically loaded with the greatest
fanout, travel over the greatest distances, and operate at
the highest speeds of any signal, either control or data,
within the entire synchronous system.
• The control of any differences and uncertainty in the
arrival times of the clock signals can severely limit the
maximum performance of the entire system and create
race conditions in which an incorrect data signal may
latch within a register.
• Significant power can be wasted in transitions within
blocks, even when their output is not needed.
• These observations have led to a power-saving technique called clock gating, which adds logic gates to the clock distribution tree so that portions of the tree can be turned off when not needed.
• Clock designs have significant impacts on both timing
convergence and timing yield.
 A carefully designed clock distribution network can reduce design-inherited clock skews: the discrepancies between designer-intended clock skews and achieved clock skews under perfect process conditions.
– This can improve circuit performance and timing convergence.
 A clock distribution network can also be optimized to
reduce process-induced clock skews, such that the
circuit would require less timing margin to tolerate
process related timing variations.
– This will also improve timing convergence and timing yield.
Definitions
• Balanced Clock Tree: The delays from the root of the clock
tree to leaves are almost same.
• Clock Distribution: The task of clock distribution is to
distribute the clock signal across the chip in order to
minimize the clock skew.
• Clock Buffer: To keep equal rise and fall delays of the clock
signal.
• Global Skew: Difference in clock timing paths b/w any
combination of two FFs in the design within the same clock
domain.
• Local Skew: Balances the skew only b/w related FF pairs. FFs are related only when one FF launches data that is captured by the other.
• Clock Scheduling: the process of assigning clock arrival times to sequential elements; it can greatly improve the timing yield of the manufactured chips by taking into account process-induced path delay and clock skew uncertainties.
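The global and local skew definitions above differ only in which flip-flop pairs are compared, which a short sketch makes explicit. The flip-flop names, arrival times, and the "related" launch/capture pair below are all invented for illustration.

```python
# Global vs. local skew from clock arrival times. Which FF pairs are
# "related" (launch/capture) would come from timing analysis; here it is
# simply assumed.
arrival_ps = {"FF1": 102.0, "FF2": 98.0, "FF3": 110.0}
related_pairs = [("FF1", "FF2")]  # FF1 launches data that FF2 captures

# Global skew: worst arrival-time difference over ANY pair in the domain.
global_skew = max(arrival_ps.values()) - min(arrival_ps.values())

# Local skew: worst difference over related (launch/capture) pairs only.
local_skew = max(abs(arrival_ps[a] - arrival_ps[b]) for a, b in related_pairs)
```

Here the global skew is 12 ps but the local skew is only 4 ps, which is why balancing only related pairs is a less pessimistic target.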
Clock Design Considerations
• Clock signal is global in nature, so clock nets are
usually very long.
– Significant interconnect capacitance and resistance
• So what are the techniques?
– Routing
• Clock tree versus clock mesh (grid)
• Balance skew and total wire length
– Buffer insertion
• Clock buffers to reduce clock skew, delay, and distortion in
waveform.
– Wire sizing
• To further tune the clock tree/mesh
• Clock tree is where wire sizing is really used in practice
Different clock design techniques
• Different clock design techniques are:
1) clock tree optimization,
2) clock scheduling,
3) clock tree topology optimization,
4) post-silicon clock tuning.
• The integration of these techniques offers the maximum
benefit on timing convergence and timing yield
improvements.
• A zero-skew clock tree optimization algorithm for
clock delay and power optimization is proposed.
– The optimized clock trees are zero-skew, or free from design-
inherited clock skews, and their process-induced clock skews
are reduced by clock delay minimization.
• A statistical timing-driven clock scheduling algorithm
then utilizes the statistical timing information for timing
yield improvement.
– False-path-aware gate-sizing methods are also investigated to
preserve more timing margin for clock scheduling.
• A partition-based clock tree topology optimization
algorithm that considers the circuit connectivity and
timing information is proposed.
– The clock trees based on the new topologies require less timing
margin for process-induced clock skews, which makes the timing
convergence easier to achieve.
• Timing yield can be further improved if clock arrival timing assignment can be performed for each manufactured chip separately; this can be achieved by using a post-silicon-tunable clock tree.
• Figure next demonstrates the integration of proposed
clock design techniques.
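The zero-skew and topology-optimization goals above both start from a balanced tree structure, which can be sketched in a toy form: recursively pair sinks so every leaf sits at the same depth. This is only the structural precondition for zero skew; real CTS tools also balance wire and buffer delays, which this sketch ignores entirely.

```python
# Toy balanced clock-tree topology: recursively split the sink list so
# that, for a power-of-two sink count, every leaf ends up at equal depth.
def build_balanced(sinks):
    if len(sinks) == 1:
        return sinks[0]
    mid = len(sinks) // 2
    return (build_balanced(sinks[:mid]), build_balanced(sinks[mid:]))

def leaf_depths(node, depth=0):
    if isinstance(node, tuple):  # internal branch point
        return leaf_depths(node[0], depth + 1) + leaf_depths(node[1], depth + 1)
    return [depth]

tree = build_balanced([f"FF{i}" for i in range(8)])
depths = leaf_depths(tree)  # all eight leaves at the same depth
```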
Clock trees
• A path from the clock source to clock sinks
Clock Source
FF FF FF FF FF FF FF FF FF FF
Clock Tree Synthesis
• The basics of CTS is to develop the interconnect that
connects the system clock into all the cells in the chip
that uses the clock.
• For CTS, major concerns are:
– Minimizing the clock skew,
– Optimizing clock buffers to meet skew specifications,
– Minimize clock-tree power dissipation.
• The primary job of CTS tools is to vary routing paths, placement of the clocked cells, and clock buffers to meet maximum skew specifications.
• For a balanced tree without buffers (before CTS), the clock line's capacitance increases exponentially as you move from the clocked elements toward the primary clock input.
• The goal of CTS is to minimize skew and insertion delay.
• After CTS hold slack should improve.
• The clock tree begins at the clock source defined in the .sdc file and ends at the stop pins (the clock pins of flip-flops).
• If the clock is divided, then separate skew analysis is necessary.
• Global skew achieves zero skew between two
synchronous pins without considering logic relationship.
• Local skew achieves zero skew between two
synchronous pins while considering logic relationship.
• If clock is skewed intentionally to improve setup slack
then it is known as useful skew.
• Adding buffers at the branching points of the tree
significantly lowers clock-interconnect capacitance,
because you can reduce clock-line width toward the root.
• When designing a clock tree, you need to consider
performance specifications that are timing-related.
• Clock-tree timing specifications include clock latency,
skew, and jitter.
• Non-timing specifications include power dissipation,
signal integrity…..
• Many clock-design issues affect multiple performance
parameters.
• For example, adding clock buffers to balance clock lines
and decrease skew may result in additional clock-tree
power dissipation.
• The factors that contribute to clock skew include loading mismatch at the clocked elements and mismatch in RC delay.
• Clock skew adds to cycle times, reducing the clock rate at
which a chip can operate.
• Typically, skew should be 10% or less of a chip's clock
cycle, meaning that for a 100-MHz clock, skew must be 1
nsec or less.
• High-performance designs may require skew to be 5% of
the clock cycle.
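The 10% rule of thumb above is a one-line calculation:

```python
# Skew budget as a fraction of the clock period. The 10% default is the
# rule of thumb from the text; high-performance designs may use 5%.
def skew_budget_ps(clock_mhz, fraction=0.10):
    period_ps = 1.0e6 / clock_mhz  # clock period in picoseconds
    return fraction * period_ps

skew_budget_ps(100.0)        # 1000 ps (1 ns) for a 100 MHz clock
skew_budget_ps(100.0, 0.05)  # 500 ps for a high-performance target
```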
• Effect of CTS:
– Lots of clock buffers are added.
– Congestion may increase.
– Non-clock tree cells may have been moved to non-ideal locations.
– Can introduce new timing violations.
Clock Tree Optimization
• In Clock Tree Optimization (CTO), clock can be
shielded so that noise is not coupled to other signals.
• But shielding increases area by 12 to 15%.
• Since the clock signal is global in nature the same metal
layer used for power routing is used for clock also.
• CTO is achieved by buffer sizing, gate sizing, buffer
relocation, level adjustment etc.
• We try to improve setup slack during pre-placement, placement, and post-placement optimization (the stages before CTS), while neglecting hold slack.
• In post placement optimization after CTS hold slack is
improved.
• As a result of CTS lot of buffers are added.
• Generally for 100k gates around 650 buffers are added.
Deep-Submicron
Effects
Agenda
• Introduction
• What Are Jitter, Noise, and Signal
Integrity?
• Early signals
• Basic Types of Tools
• Signal Integrity and Noise issue
Introduction
• Several independent sources forecast that in deep-
submicron (DSM) process geometries, 80 percent or
more of the delays of critical paths will be directly linked
to interconnect.
• This forecast is supported by the significant timing-
closure problems that are arising in current high-
performance IC designs.
• Based on this, both industry and academia sense the
need for a significant over-haul of current synthesis and
physical-design methodologies.
• In the emerging design scenario, traditional design flows
will no longer be viable for any size of module or block of
gates.
• Deep-submicron effects, particularly those relating to interconnects, have thus been billed as potential roadblocks to the continuation of Moore's law.
• Some effects, like the rising resistance-capacitance (RC)
delay of on-chip wiring, noise considerations, reliability
concerns, and increased power dissipation, manifest
themselves in both the devices (transistors) and the
interconnect, though not in ways previously anticipated.
• We should consider the effects of both devices and
interconnect.
• Such analyses show that interconnect delay actually decreases for DSM processes in a modular design approach.
• The physical explanations of these DSM effects shed
insight into this and other potential impacts on future
high-performance ASIC designs.
What Are Jitter, Noise,
and Signal Integrity?
• When a signal is transmitted and received, a physical
process called noise is always associated with it.
• Noise is basically any undesired signals added to the
ideal signal.
• The deviation of a noisy signal from its ideal can be
viewed from two aspects:
– timing deviation,
– amplitude deviation.
• Deviation in timing is defined as timing jitter (or simply jitter).
• Timing jitter affects system performance only when an
edge transition exists.
• Amplitude noise, in contrast, is present continuously and can affect system performance all the time.
• Signal integrity generally is defined as any deviation from the ideal waveform.
• As such, signal integrity contains both amplitude noise
and timing jitter in a broad sense.
• However, certain signal integrity signatures such as
overshoot, undershoot, and ringing may not be well
covered by either noise or jitter alone.
• Knowledge of signal integrity issues is no longer an
option for today’s high-speed systems designers.
• It is essential to the reliable operation of every high-
speed digital product.
• A solid foundation in concepts such as crosstalk, ground
bounce and power supply noise can help designers
identify potential signal integrity issues long before a
product goes to manufacturing and in the process reduce
NRE costs and preserve time-to-market schedules.
• By helping reduce time spent in prototype testing and
redesign, signal integrity analysis can help design teams
become more productive.
• Moreover, identifying signal integrity problems early in the
design, even when a board appears to function correctly,
pays multiple dividends down the road.
• By identifying signal integrity problems in a design that
successfully passes simulation checks, designers can
avoid additional prototype respins and in the process
reduce cost and shorten time-to-market.
Early signals
• How do you know you are facing signal integrity
problems in your designs?
• One good indication is lower product yields.
• Sudden loss of yield on a legacy design, intermittent
failures during testing or functional prototypes failing
electromagnetic compatibility (EMC) testing are reliable
indicators that signal integrity has become a serious
threat to product release schedules.
• Knowing the basic principles of signal integrity can go a
long way toward avoiding costly failures, improving
design efficiency and meeting time-to-market goals.
• Clock rates will continue to rise, component rise times
will be even faster and as a result the transmission line
behavior of PCB traces will have more impact on signal
quality.
• The effects of transmission line behavior on delay will
assume greater importance as designers struggle to
match circuit delays to ensure functionality in these high
speed systems.
• Crosstalk also poses a growing threat to system
performance as clock rates rise and systems begin to
experience unexplained interrupts, transient state
transitions, data dependent logic errors and sudden
system crashes.
• How can designers learn to prevent these potential
problems and avoid increased time-to-market, poor
product yields, decreased reliability and higher costs?
• The key is to build a solid foundation in the basic
principles of high-speed design and then organize the
design process and select the proper tools to deal with
signal integrity issues.
• This foundation should begin with the fundamentals of
transmission line behavior.
• Signal integrity problems often manifest themselves first
as timing problems.
• Ringing, overshoot and undershoot caused by
transmission line effects can threaten noise margins and
monotonicity.
• Designers should become familiar with signal
propagation behavior, including ringing and delay, the
transmission line characteristics of routed traces and the
effects of layer assignment.
• Accordingly, designers must become familiar with the
basics in controlled impedance and reflections.
• Typically, this should include the factors that affect
impedance and ways to control it, and how to calculate
reflections and select terminations to reduce ringing to
acceptable levels.
• Crosstalk or coupling is also becoming a major piece of
any signal integrity knowledge foundation.
• As edge rates move below 1ns, crosstalk can become a
significant problem.
• Sub-nanosecond edge rates bring high-frequency
harmonics that can easily couple into an adjacent trace,
resulting in crosstalk signals that can cause a receiver to
temporarily toggle.
• Similarly, other issues such as power/ground bounce
must be addressed.
• A basic signal integrity background must also include an
understanding of EMI, as well as how to develop design
rules to control its impact on a design.
• No overview of signal integrity would be complete
without a thorough review of the wide variety of signal
integrity tools available to designers.
• As clock and slew rates run ever higher and designers
attempt to squeeze every last bit of performance out of
their designs, timing margins often shrink dramatically.
• Simulation is required, since the slightest changes in
logic timing due to transmission line effects can have a
devastating impact on system performance.
Basic Types of Tools
• To identify these issues before building a prototype,
many design teams turn to any of a number of simulation
tools available on the market.
• Designers will find three basic types of tools on the
market.
 General-purpose mathematical calculation tools, such as
MathCad, allow the engineer to combine pictures,
equations and explanatory text in a single document and
using general purpose math libraries simulate most
digital transmission structures.
– These tools generally are highly flexible and can be used to
simulate a wide variety of scenarios.
– Their primary drawback is that the user must usually customize
the tool to model a particular application.
– That can be a time-consuming task and may make it difficult for
the designer to leverage existing data such as a trace layout
database.
 Another option is analog simulation tools, such as SPICE.
– These tools prove especially handy for simulating simple
transmission lines and for modeling the behavior of relatively
simple PCBs featuring a small number of traces.
– The GUI front-ends typically allow the user to draw a pictorial
representation of a circuit and input it to the simulator.
– But they typically do not support the importation of an existing
trace layout database, which is critical for post-layout analysis.
 The most feature-rich option for signal integrity simulation is specialized tools designed specifically for signal integrity analysis.
– An example is the HyperLynx tool suite from Innoveda Inc.
 Signal Integrity and Noise are Aggravated by
many issues:
– Faster chips -> higher edge rates
– Higher edge rates -> more coupled and supply noise
– Lower Vdd -> lower noise immunity
– Wire scaling -> more R, more capacitive coupling
(Ccoupled/Ctotal)
– Constant power @ lower Vdd -> more I -> more I-R
loss
– Higher I/O Vdd supplies -> increased coupled noise
– Increasing substrate currents -> more substrate noise
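The coupled-noise item in the list above (Ccoupled/Ctotal) follows from a simple capacitive divider, sketched here as a back-of-the-envelope estimate. The supply voltage and capacitance values are illustrative, and the model assumes a floating (undriven) victim, which overstates the noise on a driven net.

```python
# Capacitive crosstalk estimate: the voltage coupled onto a quiet victim
# scales with the coupling fraction Cc / (Cc + Cg). Values are invented.
def coupled_noise_v(vdd, c_coupling, c_ground):
    """Peak noise coupled onto a floating victim by a full-swing aggressor."""
    return vdd * c_coupling / (c_coupling + c_ground)

noise = coupled_noise_v(1.8, 30e-15, 70e-15)  # 30% coupling fraction
```

With 30 fF of coupling against 70 fF to ground, almost a third of the 1.8 V aggressor swing lands on the victim, which is why lower supply voltages and higher coupling fractions together erode noise margins.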
Summary
• Next-generation systems will not run at slower speeds.
• With standard ICs now featuring edge rates as fast as
300ps and system clocks commonly running at 500MHz
and above, it is a safe bet that issues such as ringing,
crosstalk, ground bounce and EMI are here to stay.
• To design for clean signal integrity and ensure the
development of reliable systems, a firm foundation in the
basics of signal integrity analysis will become an
increasingly crucial component in every designer’s skill
set.
Testing and
Verification
Testing                                  Verification
Hardware is involved                     No hardware is involved
Physical characteristics are tested      Physical characteristics are not tested
A tester is required                     EDA tools are required
Each fabricated chip has to be tested    The design is verified before fabrication
Expensive                                Less expensive
Manufacturing faults are targeted        Only functional faults are targeted
Verification
• ASIC technology requires strict verification
before manufacturing.
• In large complex system level designs many
more unexpected situations can occur.
• The designers can mix test-bench verification
strategies with in-circuit testing, thereby offering
faster verification without the cost of accelerated
simulators.
• Surveys of design engineers have found that test vector generation and vector-based verification are the least favorite parts of the system design process.
Verification can be:
– Functional verification
• Checks the functional correctness of the ASIC design.
• A test environment must be defined and developed per the specification.
• Functional verification can be accomplished using either:
– an HDL (e.g., Verilog) or an HVL (e.g., e, Vera, SystemC)
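The flavor of such a self-checking test environment can be sketched in plain Python standing in for an HVL. The 4-bit adder DUT model, golden reference, and stimulus generator below are invented for illustration; in a real flow the DUT would be the RTL running in a simulator:

```python
# Self-checking testbench sketch (Python stand-in for an HVL environment).
# The DUT model, reference model, and stimulus are illustrative assumptions.
import random

WIDTH = 4
MASK = (1 << WIDTH) - 1

def dut_adder(a, b):
    """Behavioral model of the design under test: a 4-bit wrapping adder."""
    return (a + b) & MASK

def reference_adder(a, b):
    """Golden reference model the DUT is checked against."""
    return (a + b) % (1 << WIDTH)

def run_testbench(num_vectors=1000, seed=0):
    """Drive random stimulus and compare DUT against the reference."""
    rng = random.Random(seed)
    for _ in range(num_vectors):
        a, b = rng.randrange(1 << WIDTH), rng.randrange(1 << WIDTH)
        got, expected = dut_adder(a, b), reference_adder(a, b)
        assert got == expected, f"Mismatch: {a}+{b} -> {got}, expected {expected}"
    return num_vectors

print(run_testbench(), "vectors passed")
```

The key idea is the same as in any HVL environment: constrained-random stimulus, a reference model, and an automatic checker, so no vectors need to be inspected by hand.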
– Formal verification
• A mathematical method to check a design's functional properties.
• Used to compare and verify design at different levels of
abstraction.
• Most common type is Equivalence Checking.
• Tools:
– FormalPro (from Mentor Graphics)
– Conformal (from Verplex)
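The idea behind equivalence checking can be illustrated with a toy "miter": two implementations are proved equal by showing their outputs never differ on any input. Production tools do this symbolically with BDDs or SAT rather than by enumeration; the two majority-gate implementations below are invented examples:

```python
# Toy equivalence check: exhaustively compare two implementations of the
# same function. Both implementations below are invented examples.
from itertools import product

def impl_a(a, b, c):
    # Sum-of-products form of majority-of-3
    return (a and b) or (a and c) or (b and c)

def impl_b(a, b, c):
    # Restructured ("optimized") form of the same function
    return (a and (b or c)) or (b and c)

def equivalent(f, g, n_inputs):
    # Miter-style check: the outputs must agree for every input vector
    return all(f(*v) == g(*v) for v in product([False, True], repeat=n_inputs))

print(equivalent(impl_a, impl_b, 3))  # both forms compute majority-of-3
```

Because the check compares two representations of the same logic rather than simulating behavior against a specification, it verifies that synthesis or optimization preserved the function, not that the function itself is correct.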
• Importance of Verification:
– Need for Verification at different stages of ASIC
development.
– Helps in building confidence in design and checks
functionality.
• The verification environment contains verification components and other verification models, checkers, and monitors that can be written in any Hardware Verification Language (HVL) to speed up development and create reusable components.
• But the performance of the whole simulation
environment is limited by simulator speed.
Fault Simulation in an ASIC
Design Flow
Fault simulation is used for the following reasons:
• Test-generation software is expensive, and many
designers still create test programs manually and
then grade the test vectors using fault simulation.
• Automatic test programs are not yet at the stage
where fault simulation can be completely omitted in
an ASIC design flow.
– We need fault simulation to add some vectors to test logic
not covered automatically, to check that test logic has been
inserted correctly, or to understand and correct fault
coverage problems.
• The reuse and automatic generation of large cells are essential to reduce the complexity of large ASIC designs.
– Megacells and embedded blocks (an embedded
microcontroller, for example) are normally provided with
canned test vectors that have already been fault simulated
and fault graded.
– The megacell has to be isolated during test to apply these
vectors and measure the response.
– Cell compilers for RAM, ROM, multipliers, and other regular
structures may also generate test vectors.
– Fault simulation is one way to check that the various
embedded blocks and their vectors have been correctly
glued together with the rest of the ASIC to produce a
complete set of test vectors and a test program.
• It is far too expensive to use a production tester to
debug a production test.
– One use of a fault simulator is to perform this function offline.
• Production testers are very expensive.
– There is a trend away from the use of test vectors to include
more of the test function on an ASIC.
– Some internal test logic structures generate test vectors in a
random or pseudorandom fashion.
– For these structures there is no known way to generate the
fault coverage.
– For these types of test structures we will need some type of
fault simulation to measure fault coverage and estimate
defect levels.
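As a toy illustration of fault grading, the sketch below injects every single stuck-at fault into a three-gate netlist (y = (a AND b) OR (NOT c), an invented example) and reports the fraction of faults detected by a given vector set:

```python
# Stuck-at fault simulation sketch for a tiny invented netlist.
def simulate(a, b, c, fault=None):
    """Evaluate y = (a & b) | ~c, optionally with one stuck-at fault injected."""
    def inj(net, v):
        return fault[1] if fault and fault[0] == net else v
    a, b, c = inj("a", a), inj("b", b), inj("c", c)
    n1 = inj("n1", a & b)   # internal net: a AND b
    n2 = inj("n2", 1 - c)   # internal net: NOT c
    return inj("y", n1 | n2)

NETS = ["a", "b", "c", "n1", "n2", "y"]
FAULTS = [(n, sv) for n in NETS for sv in (0, 1)]  # all single stuck-at faults

def fault_coverage(vectors):
    """Fraction of modeled faults whose effect reaches y for some vector."""
    detected = 0
    for f in FAULTS:
        if any(simulate(*v, fault=f) != simulate(*v) for v in vectors):
            detected += 1
    return detected / len(FAULTS)

vectors = [(0, 0, 0), (1, 1, 1), (0, 1, 1), (1, 0, 1)]
print(f"fault coverage: {fault_coverage(vectors):.0%}")  # these 4 vectors detect all 12 faults
```

Grading a manually written vector set this way is exactly the "grade the test vectors using fault simulation" step described above, just at microscopic scale.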
EDA Tools in ASIC
Flow
The Seven Key Challenges of
ASIC Design
Agenda
• Introduction
• Challenges of ASIC Design
– The Process Challenge
– The Complexity Challenge
– The Performance Challenge
– The Power Challenge
– The Density Challenge
– The Quality and Reliability Challenge
– The Productivity Challenge
• Recommendations for Reliable ASIC Design
Introduction
• The use of standard parts helps reduce product development and manufacturing costs and improves interoperability across vendors and technologies.
• Of course, companies still need to differentiate their
products, and one method is to integrate a customized
ASIC to support specialized functions.
• In creating these ASICs, the demanding requirements of
networking and communication applications typically
require the use of 0.18- or 0.13-micron process
technologies.
• Yet ASIC design practices that were successful on
earlier process generations are not sufficient to deal with
these complex, high-density solutions.
• Advanced tools and methodologies are needed to
address key challenges that have emerged as by-
products of complex designs and high-density scaling.
• The microelectronics industry is facing a major and
unprecedented shift in ASIC design requirements.
• Today’s advanced process generations have achieved
densities and complexities at which new effects come
into play.
• As a result, standard methodologies can no longer
deliver reliable designs on a predictable schedule.
• Advanced tools and methods have been developed to
deal with these problems, yet many are not yet widely
used in the ASIC industry.
• Altogether, there are seven key design challenges that
have arisen due to the effects of scaling.
• Each challenge is discussed in a section that follows:
Challenges of ASIC Design
• The Process Challenge
• The Complexity Challenge
• The Performance Challenge
• The Power Challenge
• The Density Challenge
• The Quality and Reliability Challenge
• The Productivity Challenge
1. The Process Challenge
 A variety of physical side effects began to emerge with
the 0.25-micron process generation, and have become
more challenging with each successive generation.
 A successful design flow must incorporate advanced
design tools to address each of these issues:
– RC Coupling (0.25 micron and smaller)
– Signal Integrity (0.18 micron and smaller)
– Transistor Leakage (0.13 micron and smaller)
– In-die Variation (0.13 micron and smaller)
– Soft Error Rate (0.13 micron and smaller)
RC Coupling:
• Interconnect resistance and capacitance parasitics can account for up to 80 percent of the delay on long top-level interconnect signals, resulting in thousands of potential timing violations.
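The scale of the problem follows from a first-order Elmore model: distributed wire delay grows with the square of length, on top of the driver charging the whole wire capacitance. A sketch in which the segment count and all r/c values are illustrative assumptions, not figures for any particular process:

```python
# Elmore delay of a long wire modeled as N lumped RC segments.
# All parameter values (wire r/c per mm, driver R, load C) are assumptions.
def elmore_delay(length_mm, n_seg=50, r_per_mm=100.0, c_per_mm=200e-15,
                 r_drv=1000.0, c_load=10e-15):
    r_seg = r_per_mm * length_mm / n_seg
    c_seg = c_per_mm * length_mm / n_seg
    delay = 0.0
    upstream_r = r_drv
    for _ in range(n_seg):
        upstream_r += r_seg               # resistance seen by this segment's cap
        delay += upstream_r * c_seg       # Elmore: sum of R_upstream * C
    delay += upstream_r * c_load          # charge the receiver's input cap
    return delay

for length in (1, 2, 4):
    print(f"{length} mm wire: {elmore_delay(length)*1e12:.0f} ps")
```

Because of the quadratic 0.5·R·C wire term, delay grows superlinearly with length, which is why long top-level routes accumulate such a disproportionate share of path delay.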
Signal Integrity:
• Cross capacitance between signals can cause a logic
error in a tight speed path.
• Even if early manufacturing runs are fully functional,
signal integrity problems can arise with even a small
amount of manufacturing process drift in later runs.
Transistor Leakage:
• As geometries and threshold voltages are scaled down, transistor leakage can become a significant percentage of the total power budget of a chip.
• Improved design techniques can reduce overall leakage
power.
In-die Variation:
• Variations in transistor strength across each die can
make it difficult to support a common clock domain
across high-frequency chips without proper margining
and block-to-block timing.
• Logical clock domains often have to be divided into
multiple physical clock domains that are linked with well-
controlled timing interfaces.
Soft Error Rate:
• The energy stored in static memory nodes has become
small enough that a stray alpha particle can cause a
latch to flip states.
• Advanced design tools can detect potential problems
and provide alternatives.
2. The Complexity Challenge
 Today’s most complex ASIC designs can have 5M or
more gates.
 Yet for a typical design with only 800K gates, it may take
a week to run a full back-end iteration using traditional,
flat design methods.
 Even a few unplanned iterations can add several weeks
to the design cycle, making schedules unpredictable.
 New hierarchical design techniques partition a design into
smaller physical blocks, which can be run independently
and in parallel.
 With this approach, a design iteration can be performed in about a day, and schedules become faster and more predictable.
 Advanced tools, methodologies, & expertise are required
to take advantage of hierarchical design strategies.
3. The Performance Challenge
 High-speed functions used to be supported on separate
chips.
 Now they are being integrated on the ASIC to improve
overall system performance and reduce product costs.
 The high-speed functions raise a variety of new issues
when clock frequencies go beyond 300–400 MHz.
 Hold time becomes significantly more important due to
in-die variation and the decreased number of gates
between flip-flops.
 Impedance, signal cross talk, and RC coupling become
increasingly serious concerns.
 Flip-chip die attach may be required to reduce die-
package interconnect inductance.
 Decoupling at die, package, or board should be designed
to eliminate power supply oscillation, and considerable
care must be taken in routing high-frequency signals.
 Routing delays can also become problematic, and must be
included, along with gate delays, for accurate timing
analysis.
 Interconnect performance can be improved significantly by
– Pre-routing critical signals so they have preferred routes
– Inserting repeaters over long signal paths
– Making efficient use of the additional metal layers that are
available in advanced process generations
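The repeater trade-off in the second point can be illustrated with a rough Elmore-style model: splitting a wire into k buffered segments trades k buffer delays against a k-fold reduction in the quadratic wire term. All parameter values below are assumptions for demonstration only:

```python
# Elmore-style delay of a wire driven as k equal repeater segments.
# Wire and buffer parameters are illustrative assumptions.
def segmented_delay(length_mm, k, r=100.0, c=200e-15,
                    r_buf=1000.0, c_buf=10e-15):
    seg = length_mm / k
    per_seg = (r_buf * (c_buf + c * seg)   # buffer drives its own cap + wire
               + 0.5 * r * seg * c * seg   # distributed wire delay of segment
               + r * seg * c_buf)          # wire drives next buffer's input
    return k * per_seg

# An unbuffered 10 mm run versus the best repeatered version:
best_k = min(range(1, 21), key=lambda k: segmented_delay(10, k))
print(f"best repeater count: {best_k}, "
      f"delay {segmented_delay(10, best_k)*1e9:.2f} ns "
      f"vs unbuffered {segmented_delay(10, 1)*1e9:.2f} ns")
```

Adding repeaters keeps helping until the accumulated buffer delay outweighs the shrinking wire term, which is why there is an optimum count rather than "the more the better".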
 Signal integrity analysis should also be included in early design iterations.
 New tools and methodologies are required to integrate
these options smoothly into the design process.
4. The Power Challenge
 Power supply voltages are dropping about 30 percent
with each new process generation.
 Yet power consumption is doubling with each generation
due to increases in frequency and logic density.
 Leakage power is also increasing because of shrinking
transistor threshold voltages.
 Falling power supply voltages and rising power consumption are causing dramatic increases in power supply current.
 The total power supply current for high-end designs may
now exceed 22 amperes.
 With these currents, even a little resistance in the
voltage distribution system can significantly impact
performance.
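The arithmetic behind that concern is simple. A sketch with assumed numbers (a 30 W core at 1.2 V, a 5 mΩ effective grid resistance); none of these figures come from a real design:

```python
# Back-of-the-envelope supply current and IR drop. Numbers are assumptions.
def supply_current(power_w, vdd):
    return power_w / vdd

def ir_drop(current_a, grid_resistance_ohm):
    return current_a * grid_resistance_ohm

vdd = 1.2
i = supply_current(30.0, vdd)    # 30 W core at 1.2 V -> 25 A
drop = ir_drop(i, 0.005)         # 5 milliohm effective grid resistance
print(f"I = {i:.1f} A, IR drop = {drop*1000:.0f} mV "
      f"({drop/vdd:.1%} of Vdd)")
```

Even a few milliohms in the grid costs a sizeable fraction of Vdd at these currents, and that lost headroom comes directly out of timing margin.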
 Since power issues are not addressed in a normal timing
analysis roll-up, new methodologies are needed to
accommodate the effects of leakage power and to
ensure power supply integrity.
 Each of the following options can be used, but they must
be efficiently integrated into design tools and processes:
 Process selection
 Cell library selection
 Flip Chip
 Logic Synthesis
 Clock Tree Design
 Power Grid
Process selection:
• Foundries now offer multiple options on the 0.13-micron
process generation, such as normal, low power, low
voltage, high Vt, and low Vt.
Cell library selection:
• Every library design involves a variety of unique design
trade-offs, so the choice of library can be critical.
Flip Chip:
• For designs with extreme core power requirements, lots
of I/O, or very high-performance I/O, flip-chip package
assembly can provide better signaling characteristics
and improved power distribution to the die.
Logic Synthesis:
• Most current design methodologies use standard
wireload models that can result in oversized local logic
that increases power consumption, or in undersized
drivers for longer signals.
• Advanced tools can optimize synthesis, sizing, and initial
cell placement to address these issues.
Clock Tree Design:
• Clock gating can be incorporated to turn off portions of
the clock tree for inactive logical blocks.
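The saving follows from the dynamic power relation P = α·C·V²·f, where gating lowers the effective activity factor α of the gated portion of the tree. A sketch with assumed capacitance, voltage, frequency, and activity values:

```python
# Dynamic power P = alpha * C * V^2 * f, and the effect of clock gating on
# the effective activity factor alpha. All numbers are assumptions.
def dynamic_power(alpha, c_total_f, vdd, freq_hz):
    return alpha * c_total_f * vdd**2 * freq_hz

p_ungated = dynamic_power(0.20, 2e-9, 1.2, 500e6)  # whole clock tree toggling
p_gated   = dynamic_power(0.08, 2e-9, 1.2, 500e6)  # 60% of activity gated off
print(f"ungated: {p_ungated:.3f} W, gated: {p_gated:.3f} W")
```

Since the gated blocks simply stop toggling, the reduction in clock power is directly proportional to the activity removed, with no voltage or frequency penalty.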
Power Grid:
• Power delivery can be improved by using upper layers to
form a power grid that can tie directly to cell row power.
5. The Density Challenge
 Layout assembly by abutment along with the use of
additional metal layers in advanced process generations
can significantly improve functional density in ASIC
designs.
 Other factors in optimizing density include:
– Datapath Centric Design Optimization
– Cell Library Selection
– Optimum Synthesis, Sizing, and Placement
Datapath Centric Design Optimization:
• Conventional place and route tools do not differentiate
between a datapath and its control.
• Advanced tools can be used to explore a variety of micro-architectural approaches, and to achieve placed-and-routed designs that rival hand-packed datapath techniques.
Cell Library Selection:
• Cell row height is a key parameter of cell library design,
and can have a significant impact on density.
• Additional factors complicate the issue, requiring
sophisticated analysis of library options.
• A library that works well on one design may not have the
characteristics that make it work well for the next design.
Optimum Synthesis, Sizing, and Placement:
• Density can be improved by optimizing the size and
placement of cells in blocks to minimize wire resources
and delays.
• Advanced tools can improve density as well as performance by performing gate-level synthesis and gate placement at the same time.
6. The Quality and Reliability Challenge
• Advanced manufacturing processes create new sources
of potential chip failure.
• One example is SER (soft error rate), logic errors caused
by alpha particles or other subatomic events in static
memory and flip-flops as well as in DRAM memory.
• In earlier process generations, similar physical effects
were negligible.
• Unless the design methodology can detect and resolve
these issues, an increasing number of parts may pass
debug tests and then fail in the field.
7. The Productivity Challenge
• Time-to-market remains a driving force in the communication industry, so ASIC development must stay off the critical path in system development schedules.
• Over the past 20 years, the complexity of ASIC design
has grown much faster than development capabilities.
• Significant progress has been made recently, but even the latest EDA tools cannot close this design productivity gap.
• To address this issue, ASIC suppliers need to:
– Develop more efficient, hierarchical flows that partition ASICs into
manageable segments that can be designed in parallel with faster
turnaround.
– Re-engineer methodologies, tools, and skill sets to address the
increasing number of deep sub-micron issues.
– Become proficient in rapidly integrating internal or commercially
available IP cores.
Recommendations for Reliable
ASIC Design
• Developers planning to integrate ASICs into their products
should look closely at the tools, methodologies, and
expertise of an ASIC vendor.
• If the ASIC will be implemented using 0.13-micron
process generation and beyond, advanced methods will
be required to ensure predictable, timely delivery of a
reliable product.
• Techniques and methods that worked previously may not
be successful on the next process.
• When assessing an ASIC vendor, each of the seven
challenges should be considered.
• The following questions offer a starting point for exploring
vendor capabilities for deep submicron design:
 How is the design partitioned to ensure development
efficiency?
 Is block partitioning different for synthesis than for place
and route?
 How is the optimal shape and size determined for each
block partition?
 How are boundary signal crossing points planned for
each block, so they support the shortest possible inter-
block routing and the best in-block cell placement for
global optimization?
 How are global repeaters placed and routed to maximize
productivity and timing convergence, while providing the
best possible interconnect performance?
 How are timing interfaces implemented between blocks,
& how is the timing rolled up to support accurate timing
analysis, keeping in mind such issues as in-die variation?
 How are top-level clock buffering and routing implemented, as opposed to local block buffering and routing?
 How is top-level power planned and interconnected to
block-level power and made electrically correct?
 What kind of interconnect resource sharing can be
achieved between top-level and block-level layers to
reduce congestive areas, and to increase speed on
critical speed paths?
 What type of pre-layout is performed to optimize critical
signals?
 How are signal integrity checks and fixes performed?
 What steps are taken to ensure signal, power, and timing
integrity for high-frequency functions?
The End