
Digital Physical Design:
Hierarchical and Low Power Implementation Flows
Nasim Farahini
farahini@kth.se
Outline
Overview: Digital design flow
Flat physical implementation flow (basic flow)
Hierarchical physical implementation flow
Low power issues
Low power physical implementation flow
Design for Manufacturing / Design for Yield
Sign-off
Overview: Digital Design Flow
System Specification
Architectural Design
Logic Synthesis
Physical Synthesis
Physical Verification / Sign-off
Fabrication
Packaging and Testing
Physical Design Cycle (Back End)
Inputs (results of front-end design):
Gate-level netlist
Timing constraints (SDC file)
Power constraints (CPF file)
Library of technology files:
LEF file: standard-cell layout info
Lib file: cell timing info
Stages: floorplanning, placement, clock tree synthesis, metal-wire routing
Output: GDSII -> mask after OPC -> mask for IC manufacturing
Physical Design Cycle
Design objectives:
Power (dynamic/static)
Timing (frequency)
Area (cost)
Yield (cost)
Challenges:
More complex systems; billions of transistors can be
placed on a single chip
Time-to-market
Power-constrained design
Source: Intel (ISSCC-03)
Physical design flows
CAD tools have improved on the basic implementation flow to
address these challenges:
Flat physical implementation flow (Basic Flow)
Used for small and non-power-critical designs
Hierarchical physical implementation flow
Used for complex systems
Divide and conquer method
Sub-designs can be implemented in parallel and in a team
Low power physical implementation flow
Aggressive power management techniques can be used
Flat Physical Implementation Flow

Standard cells: layouts of library cells including logic
elements like gates, flip-flops, and ALU functions.
The height of the cells is constant.
Physical Design Based on Standard-Cells
Flat Physical Implementation Flow

Floorplanning: Laying out chip
Power Planning: Connecting up power
Placement: Automated std-cell placement
CTS: Clock tree synthesis
Routing: Wiring up the chip
Layout Verification
Finishing: Metal Fill/Antenna fixing/
Via doubling/Wire Spreading
Full-chip Design Overview
The locations of the core, the I/O areas, the P/G pads, and the P/G grid
(Figure: core placement area; periphery (I/O) area; P/G rings, stripes, and grid; IP, ROM, and RAM macros)
Floorplanning
Define core area: (cells + utilization factor)
Place IO ring
The IO ring is often decided by front-end designers with some
input from physical design and packaging engineers.
Shape and arrange hierarchical blocks
Integrate hard IP efficiently
Predict and prevent congestion hotspots and critical timing
paths
(Example floorplan: PLL and RAMs out of the way in the corners; large routing channels; a sub-block; a single large core area for standard cells; pins kept away from corners)
Die size vs. initial standard-cell utilization trade-off
Utilization refers to the percentage of the core area that is taken
up by standard cells.
A typical starting utilization might be 70%.
Space between the cells is used for routing and buffer insertion.
Larger die => higher cost, higher power
High utilization can make it difficult to close a design:
Routing congestion
Negative impact during optimization and legalization stages
Solutions:
Run a quick trial route to check for routing congestion
Increase routing resources
Low std-cell utilization
High std-cell utilization
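The utilization relation above can be sketched numerically. A toy Python calculation; the cell area and 70% target below are illustrative, not from a real design:

```python
# Toy sketch: utilization = total standard-cell area / core area,
# so the core area needed is cell area / target utilization.
def core_area_um2(total_cell_area_um2, utilization):
    return total_cell_area_um2 / utilization

# 700,000 um^2 of cells at 70% utilization -> ~1,000,000 um^2 core;
# the remaining ~30% is left for routing and buffer insertion.
print(core_area_um2(700_000, 0.70))
```

Pushing utilization higher shrinks the die (and cost) but leaves less of that slack area, which is exactly where routing congestion appears.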
Power Planning
In this step we determine:
General grid structure (gating or multi-voltage?)
Number and location of power pads (per voltage)
Metal layers to be used (the top metal layers are typically used)
Width and spacing of stripes
Rings/no rings around the hard macros
Hierarchical block shielding
A denser power grid (trade-off):
Reduces the risk of power-related failures
Reduces the available signal routing tracks
May increase the number of metal-layer masks
Power Grid Creation: Macro Placement
Blocks with the highest performance and highest power consumption:
1- Close to border power pads (less IR drop)
2- Away from each other, fed by different I/O power pins (to prevent electromigration)
Placement
Cost components for standard-cell placement:
Area
Wire length
Timing -> timing-driven placement
Congestion -> congestion-driven placement
Clock -> clock gating
Power -> multi-voltage placement
Critical paths are determined using static timing analysis (STA).
In general there is a direct trade-off between congestion and timing:
timing-driven placement tries to shorten nets, whereas congestion-
driven placement tries to spread cells, thus lengthening nets.
Iterative placement trials should be performed to find a balance
between the different tool options/settings.
Traditional Placement
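Wire length, one of the placement cost components listed above, is commonly estimated per net as half-perimeter wirelength (HPWL). A minimal sketch; the pin coordinates are invented:

```python
# Half-perimeter wirelength (HPWL): the width + height of the
# bounding box enclosing all pins of a net.
def hpwl(pins):
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

print(hpwl([(0, 0), (4, 1), (2, 5)]))  # bbox is 4 wide, 5 tall -> HPWL = 9
```

A placer sums HPWL over all nets as a cheap proxy for routed wire length.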
General Concept of Clock Tree Synthesis
Optimization targets:
Skew
Power
Area (#buffers)
Slew rates
(Figure: unbuffered clock tree netlist vs. buffered/balanced clock tree)
+ Minimize total insertion delay (latency)
Clock Tree Synthesis
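The skew and insertion-delay metrics above can be illustrated with a toy calculation; the arrival times are made up:

```python
# Clock arrival times (ns) at three flip-flop sinks after buffering.
arrivals = {"ff1": 1.20, "ff2": 1.35, "ff3": 1.28}
latency = max(arrivals.values())          # total insertion delay to the slowest sink
skew = latency - min(arrivals.values())   # spread between fastest and slowest sink
print(latency, round(skew, 2))  # 1.35 0.15
```

CTS inserts and sizes buffers to drive the skew toward zero while keeping the latency (and buffer count) low.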
Routing Fundamentals
The goal is to realize the metal/copper connections between the pins
of standard cells and macros.
Input:
placed design
fixed number of metal/copper layers
Goal:
routed design that is DRC clean and meets setup/hold timing
Consists of two phases:
1. Global route: estimate the routing congestion
2. Detail route: assign the nets to the routing tracks
(Figure: horizontal and vertical routing tracks over a standard-cell pin)
Global Routing
(Figure: GCell grid with horizontal routing capacity = 9 tracks and vertical routing capacity = 9 tracks)
Global Routing
Input:
Cell and macro placement
Routing channel capacity per layer / per direction
Goal:
Perform fast, coarse grid routing through global
routing cells (GCells) while considering the following:
Wire length
Congestion
Timing
Noise / SI
Often used by placement engines to predict congestion in the
form of a trial route or virtual route
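Congestion in global routing is typically measured per GCell edge as routing demand versus capacity. A toy sketch; the capacity of 9 tracks matches the earlier figure, the demands are invented:

```python
capacity = 9                    # routing tracks per GCell edge
demands = [7, 9, 12, 5]         # estimated nets crossing four GCell edges
# Overflow = how many nets exceed the available tracks on each edge.
overflows = [max(0, d - capacity) for d in demands]
print(overflows)  # [0, 0, 3, 0] -> the third edge is congested by 3 nets
```

Edges with non-zero overflow are what a congestion map (trial route) highlights for the placer to fix.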
Detailed Routing
Assigns each net to a specific track and lays down the
actual metal traces
Makes long, straight traces and reduces the number of vias
Reduces cross-coupling capacitance
Solves DRC violations
Hierarchical Physical
Implementation Flow

Why create hierarchy?
Hierarchy provides tighter control of individual blocks because
the boundaries are well-defined.
You can eliminate data size issues and tool capacity limitations
Hierarchy reduces design times by
Reducing data size; faster runtime
Using the parallelism inherent in hierarchical
implementation; the system can be designed by a team
Hierarchy provides support for reuse.
The challenges compared to the flat design flow:
Much more difficult full-chip timing closure
More intensive design planning needed: repeater insertion,
timing constraint budgeting
What is a Hierarchical Design?
Hierarchical design can be divided into three general
stages:
Chip planning
Break the design down into block-level
designs to be implemented separately.
Implementation
This stage consists of two sub-stages:
Block implementation for a block-level design
Top-level implementation for a design based
on block-level design abstracts and timing models
Chip assembly
Connect all block-level designs into the final chip
(Figure: full-chip design split into Blk 1, Blk 2, and Blk 3, each through its own P&R flow, followed by full-chip timing & verification)
Top-Down vs. Bottom-up Hierarchical Flow
Top-down Flow:
Import the top level design as a flat design
Floorplan the design and define partitions
Pin assignment and time budgeting of the partitions based on
the top-level constraints
Block-level design size, pins, and standard-cell placements are
guided by the top-level floorplanning and I/O pad locations.
Bottom-up Flow:
It only consists of implementation and assembly stages.
The size, shape, and pin position of block-level designs will
drive the top-level floorplanning.
Each block in the design must be fully implemented; the blocks
are then imported as black boxes into the top level.
Logical Hierarchy vs. Physical Hierarchy
(Figure: Verilog hierarchy — module chip (in1, in2, in3, out1, out2, ...) containing module block1 (a, b, c, ...), which contains modules sb1, sb2, and sb3 — mapped onto chip-level pads and block-level pins in the physical floorplan)
The modules that correspond to the partitions need to
exist in the netlist.
Chip Planning for Hierarchical Design
Initialize floorplan and IOs
Specify the partitions
Power grid insertion
Clock planning
Feedthrough insertion
Quick placement
Trial route
Partition pin assignment
Timing Budgeting
Commit partition / Physical pushdown
Partitions are ready for block level implementation
Hierarchical Design: Specify Partitions / Plan Groups
The netlist must have partitions as top-level modules.
Partitions are generally sized according to a target initial
utilization: ~70% utilization, ~300k-700k instances
Channels or abutment
Rectilinear block shapes are possible
(Figure: abutment, channels, and rectilinear blocks)
Hierarchical Design: Clock Planning
Global clock trees (H-trees)
Can reduce total insertion delay and balance full chip skew
At least one endpoint per block
Distribution of other high fanout nets should also be
considered
Hierarchical Design: Feedthrough Insertion
For channelless designs or designs with limited channel
resources
Requires a change in the partition netlist
(Figure: Net1 and Net2 routed through partitions A, B, and C; feedthrough candidates are split into Net1a/Net1b and Net2a at new partition I/O pins)
Hierarchical Design: Partition Pin Assignment
Pin guides are created for every partition.
Pins are positioned based on the top-level
floorplanning, placement, and routing.
Objectives: reduce total wire length,
reduce congestion, high-quality top-level
routing
(Figure: pin guides along a partition's edges; pins at partition corners can make routing difficult)
Hierarchical Design: Timing Budgeting
Chip level constraints must be mapped correctly to block level
constraints
The design must be placed, trial routed and have pins assigned
before running budgeting
Block level constraints will be assigned as input or output delays
on I/O ports based on the estimated timing slack.
Sign-off must be done on full chip constraints, since budgeted
constraints are rough estimates only.
(Example: 1.5 ns of estimated delay outside the block boundary at port IN1 is budgeted as: set_input_delay 1.5 [get_port IN1])
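One simple way to derive such budgets, splitting the chip-level clock period between blocks in proportion to their estimated delays, can be sketched as follows. The block names and numbers are hypothetical, and real tools use richer slack-based budgeting:

```python
clock_period = 10.0                          # ns, chip-level constraint
est_delay = {"blockA": 3.0, "blockB": 2.0}   # trial-route delay estimates, ns
total = sum(est_delay.values())
# Each block's budget is its proportional share of the period.
budget = {b: clock_period * d / total for b, d in est_delay.items()}
print(budget)  # {'blockA': 6.0, 'blockB': 4.0}
```

Because these shares come from placement/trial-route estimates, sign-off must still be done against the full-chip constraints.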
Hierarchical Design: Commit Partition
and Block Level Implementation
Commit partition
Power nets and pre-routed signal routes are pushed-down into
the appropriate partition based on their physical location.
A physical database file (e.g. DEF), Verilog netlist, and constraint
file (SDC) are created for each new partition.
Block Level Implementation
Implementation based on the guidelines provided by chip-
level planning
The output of this phase is the P&R netlist and timing
model of the block.
These files are used in the chip assembly phase.
Hierarchical Design: Full-chip Timing Closure
Full-chip timing closure is typically a bottleneck for design
cycles.
The block-level P&R flow guarantees that the timing constraints
inside the block (flop-to-flop) are met.
The block-level P&R flow does not emphasize io-to-flop,
flop-to-io, and io-to-io timing paths, because budgeted
constraints are only estimates.
Interface logic models (ILMs) can
be used for full-chip timing closure.
Interface Logic Model (ILM)
An ILM is a technique to model blocks in hierarchical
implementation flows.
Logic that lies only on register-to-register paths inside a block
is not part of an ILM.
ILMs do not abstract; they simply discard what is not
required for modeling boundary timing.
This model is used to speed up timing analysis runs when the
full-chip design is too large.
(Figure: original netlist of the partition vs. its ILM — only the interface logic between ports X/Y and the boundary registers A/B is kept; purely internal register-to-register cells C are discarded)
Low Power Physical
Implementation Flow

Voltage Scaling for Low Power
Lowering VDD reduces power (P ∝ VDD^2) but also speed
(I_ds ∝ (VDD - Vth)^(1~2)).
Speed can be recovered with a low Vth, but leakage rises
exponentially (I_leakage ∝ e^(-c·Vth); roughly 12x per 100 mV of
Vth decrease, and it also depends on T): again a power problem.
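The P ∝ VDD² relation implies large dynamic-power savings from modest voltage scaling. A toy sketch; the capacitance, frequency, and activity values are invented:

```python
# Dynamic power: P = alpha * C_eff * VDD^2 * f (alpha = switching activity).
def dyn_power(c_eff, vdd, freq, activity=0.2):
    return activity * c_eff * vdd**2 * freq

p_full = dyn_power(1e-9, 1.0, 1e9)   # 1.0 V supply
p_half = dyn_power(1e-9, 0.5, 1e9)   # 0.5 V supply, same frequency
print(p_full / p_half)  # 4.0 -> halving VDD cuts dynamic power ~4x
```

This quadratic win is what motivates multi-voltage design, and the leakage penalty of the accompanying Vth reduction is what motivates power gating.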
Power Consumption and Reliability
1 out of 5 chips fails due to excessive power consumption.
Static power (leakage power) and dynamic power -> average-power problem
IR-drop / voltage drop -> addressed by the floorplan and the design of the grid
Electromigration (EM) -> power-density problem in the long run
IR-Drop
The drop in supply voltage over the length of the supply line
A resistance matrix of the power grid is constructed
The matrix is solved for the current source model at each node to
determine the IR-drop.
Static IR-Drop Analysis: The average current of each gate is
considered
Dynamic IR-Drop Analysis: The current of the gate as a function of
time is used (actual switching event is considered)
(Figure: supply voltage sagging across the grid with distance from the VDD pad)
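Static IR-drop can be sketched with the same resistance-network idea reduced to one dimension. The segment resistance and tap currents below are invented:

```python
# 1-D power rail: 4 taps, each drawing I_tap, fed from one pad through
# segments of resistance R_seg. All downstream current flows through
# each segment, so the drop accumulates toward the far end of the rail.
R_seg = 0.05   # ohms per rail segment
I_tap = 0.01   # amps drawn at each tap
VDD = 1.0
drops, v, remaining = [], VDD, 4 * I_tap
for _ in range(4):
    v -= remaining * R_seg           # drop across this segment
    drops.append(round(VDD - v, 4))  # cumulative drop at this tap
    remaining -= I_tap
print(drops)  # [0.002, 0.0035, 0.0045, 0.005] -> worst drop at the far tap
```

A real analysis solves the full 2-D grid's resistance matrix with per-gate current sources, but the trend is the same: cells farthest from the pads see the lowest supply.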
IR-Drop
(Figure: ideal vs. actual voltage level along the supply line; the actual level sags from the ideal, e.g. 3.0 V, toward a minimum tolerance level)
IR-drop effects:
Unpredictable performance (e.g. the effect of crosstalk is enlarged)
Logic failures due to reduced noise margins
Decreased performance (timing)
Excessive clock skew (clock drivers)
Electromigration (EM)
Electromigration refers to the gradual displacement of
the metal atoms of a conductor as a result of the current
flowing through that conductor
(transfer of electron momentum).
Can result in catastrophic failure due to either:
Open: void on a single wire
Short: bridging between two wires
Even without an open or short, EM can cause
performance degradation:
Increase/decrease in wire RC
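A standard quantitative model for EM lifetime (not covered in the slides, added here as context) is Black's equation, MTTF = A·J⁻ⁿ·exp(Ea/kT). The constants below are purely illustrative:

```python
import math

def mttf(j, A=1.0, n=2.0, Ea=0.9, T=378.0):
    # Black's equation: expected lifetime falls with current density J
    # (in A/cm^2) and with temperature T (in K); Ea is in eV.
    k = 8.617e-5  # Boltzmann constant, eV/K
    return A * j ** (-n) * math.exp(Ea / (k * T))

# Doubling the current density shortens the expected lifetime.
print(mttf(2e6) < mttf(1e6))  # True
```

This is why EM sign-off checks current density per wire against the foundry's limits rather than just total current.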
Power Reduction at Different Levels
System level:
Software/hardware power management
Voltage scaling / frequency scaling
Multiple voltage islands
Power-aware algorithms
IP selection (performance vs. power)
Implementation level:
Clock gating, logic structuring
Multi-Vth cell selection to reduce leakage
Multi-voltage islands
Power gating
Process level:
CMOS low-leakage process techniques: high-k, etc.
Modern Digital Low Power Flow
Low power logic implementation techniques:
1- Multi-voltage and power gating techniques
modify the netlist and connectivity and insert special cells
2- Use of a set of power constraint files (CPF/UPF), just
like timing constraint files
3- Clock gating
One integrated cell avoids glitches: the Integrated Clock Gate (ICG)
prevents glitch propagation to the gated GCLK
Low power logic implementation techniques
4- Operand isolation
No extra library cell is needed
Reduces dynamic power
5- Gate-level power optimization
Extra specialized standard cells are needed
Reduces dynamic power
Modern Digital Low Power Flow
Extra cells: 2 or more libraries are needed,
e.g. High-VT, Low-VT, and Standard-VT
Leakage Power Optimization
A multi-Vth library is the key factor
in leakage power optimization:
Use high-Vth cells on non-critical
paths to save power
Use low-Vth cells on critical paths
to improve timing
(Figure: delay vs. leakage current for low-, nominal-, and high-Vth cells)
Multi-Threshold
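The leakage side of this trade-off can be quantified with the exponential rule from the voltage-scaling slide (roughly 12x per 100 mV of Vth change); the fitted constant below is an assumption:

```python
import math

c = math.log(12) / 0.1          # fit so leakage changes ~12x per 100 mV
def leak_ratio(delta_vth):
    # Relative leakage after shifting Vth by delta_vth volts (I ~ e^(-c*Vth)).
    return math.exp(-c * delta_vth)

# Swapping a cell to a +100 mV high-Vth variant on a non-critical path:
print(round(1 / leak_ratio(0.1), 1))  # 12.0 -> ~12x less leakage
```

The delay cost of the same swap is what restricts high-Vth cells to paths with timing slack.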
Low power logic implementation techniques
6- Multi-Vth insertion strategies

Modern Digital Low Power Flow
Low power physical implementation:
Floorplanning and Power planning
Power Network synthesis (PNS)
Power Network Analysis (PNA)
Low power placement
Register clustering
Low power CTS
Minimizing clock tree capacitance
Low power routing

Low Power Techniques Supported by
Physical Implementation tools
7- Multi-voltage (reduces the dynamic power)
Multiple different core voltages in the same chip
8- Power gating (reduces the leakage power)
Coarse and fine grained
State retention mechanism
9- Dynamic voltage and frequency scaling
To adapt the power consumption and workload



Standard Databases, Low Power Cells
Additional cells which are required for low power techniques
Integrated clock gating cells
For standard clock gating
Level shifters
For Multi-Voltage implementation
Isolation cells
For Power Gating implementation
State retention registers
For Power Gating implementation
Always-on buffers
For Power Gating implementation
Power Gate Cells
Header/Footer Switches
For Power Gating implementation
Multi-Voltage Design
(Figure: core with IP, RAM, and ROM macros split into power domains PD1, PD2, and PD3)
Define power domains:
create power domain names
list the cells connected to VDD1, VDD2, GND1, ...
draw the power domains
Place macros, taking into account:
routing congestion
orientation
Manual placement is usually better than
automatic (take info from the front end)
Multi-Voltage Design
MV Level-Shifter Cells
(Figure: level shifters (LS) between 0.9 V, 0.7 V, and 1.08 V domains; the logic model of a shifter has VDD1/VDD2 power pins, VSS, IN, and OUT)
Additional cells:
Low-to-High level shifter
High-to-Low level shifter
Dual H-L and L-H level shifter
Multi-Voltage Design: Level Shifters
(Example P&G for a domain with level shifters: the LS region is powered by both VDD1 and VDD2, e.g. shifting between a 0.9 V domain and a 0.7 V or switched-off domain)
Additional Cells
Isolation cells
Retention flops
Power gates
Always-on buffers
Power Gating
(Figure: header switch between VDD and virtual VDDG, footer switch between VSS and virtual VSSG, both controlled by SLEEP)
Floorplan of a footer switch:
same height as standard cells, or double
Power Gating: Power Gates
Power switches are used to shut down unused areas.
Power Gating: Switch Layout - Ring Style
Sleep switches are located between the always-on power ring and a virtual
power ring (VVDD)
Easy to implement compared to grid style, with less impact on placement
and routing
Large IR-drop (switch resistance + thin VVDD net)
Used for power gating of hard IPs and small blocks
Does not support retention registers
Also called coarse-grained
(Figure: sleep transistors between the global VDD ring and the VVDD1/VVDD2 domains)
Power Gating: Switch Layout - Grid Style
VDD network all across the chip; virtual power networks in each
gated domain
Switches are placed in a grid connecting VDD and the VVDDs
Improved IR-drop characteristics because every switch drives a small
number of local instances
Large impact on placement and routing due to distributed switches
Supports retention registers
Also called fine-grained
(Figure: a switch grid tying global VDD to the interleaved VVDD1 and VVDD2 networks across the chip)
Power Gating: Isolation Cells
Isolation cells ensure the electrical and logical isolation of shut-down cells
from the active logic in a design.
When a block is shut down, its internal signals may transition to an
unknown, floating state -> incorrect functioning of the rest of the design
Prevent sneak paths for current to flow between power and ground if
cells driving the shut-down region are improperly designed
To be added to the input/output signals
of the shut-down region
Power Gating: Retention Registers
Retention registers have a built-in shadow high-Vth latch, which is
connected to an always-on supply
Comprehensive testing is required
Data is restored to the main register (low-Vth) a few
cycles after the block is awoken
(Figure: retention registers (RR) preserving state while the 1.08 V logic is turned off, controlled by a sleep signal from an always-on 0.7 V/0.9 V controller)
Power Gating: Retention Registers
(Example P&G for a domain with retention registers: the retention-register region is powered by both the switched VDD2 and an always-on supply, alongside the LS region on VDD1/VDD2)
Power Gating: Always-on Buffers
Signals crossing from one active area to another may need buffering
inside a powered-down block
Power control signals
Always-on VDD or VSS pins
are not directly connected to the power rails;
they are connected during routing to an unswitchable power/ground
(Figure: a normal inverter has power rails only, tied to VDD_local/VSS_local (on/off); an always-on inverter has power rails plus power pins tied to VDD_global/VSS_global (always on))
Design Considerations for
90nm Technology and Beyond

Processing Issues for <90nm Technology
For <90nm technologies, aggressive DRC and enhanced
DFM/DFY techniques should be applied to increase
manufacturability and design yield
Design Rule Check (DRC)
Guidelines about the geometry constraints for
constructing process masks
Design For Manufacturing (DFM)
Design techniques to allow the design to be
manufactured correctly
Design For Yield (DFY)
Design techniques used to improve yield

DRC, DFM and DFY
DRC (Design Rule Check)
Information like:
Routing layers: width, spacing , pitches
General Rules:
a) enclosure, b) space, c) overlap
d) width, e) extension
Specific rules
Antenna rules, metal density rules, minimum area
A compromise between performance and yield:
More conservative rules increase the probability of correct
circuit function (yield)
More aggressive rules increase circuit performance (area,
power, delay)
DRC: General Rules
(Figure: a) enclosure, b) space, c) overlap, d) width, e) extension)
DRC: Metal Density (Al)
When etching Al (with resist over an inter-level dielectric), low-density
regions etch longer and get overetched, so a minimum metal density is
needed.
Solution: add dummy metal structures in low-density areas to maintain
the minimum metal density.
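Density rules are checked per layout window. A toy sketch of the minimum-density check that triggers dummy-fill insertion; the 20% threshold is an assumed value, real limits come from the foundry deck:

```python
def needs_fill(metal_area, window_area, min_density=0.20):
    # True when the window's metal density is below the assumed minimum,
    # i.e. dummy metal structures must be added there.
    return metal_area / window_area < min_density

print(needs_fill(12.0, 100.0))  # True: 12% density -> add dummy metal
```

The same windowed ratio, checked against a maximum, drives the slotting/splitting rules for wide wires.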
DRC: Metal Density (Cu)
In copper processes (chemical-mechanical polishing over a tantalum
barrier layer and inter-level dielectric), both bounds apply: the Ta barrier
is hard to remove, so there must not be too much of it exposed (minimum
density), while the softness of Cu causes dishing of wide areas (maximum
density).
DRC: Max Metal Density
Fat-wire problem: cracks may occur due to thermal expansion stress
under large currents.
Solution 1: slots
Solution 2: split wires
(In GDSII, the slotted/split metal on Mx uses a different datatype)
DRC: Recommended Rules
Wire spreading
Avoid asymmetrical contacts
Layout guidelines for yield enhancement
Guidelines for optimal electrical model and silicon correlation
DRC Challenges
(Chart: count of design rules in the DRC runset, rising steeply across the 180, 130, 90, 65, and 45 nm nodes)
The number of design rules in the DRC runsets grows for newer
technology processes. Reasons:
- More metal layers
- Different spacing rules depending on width
- Recommended rules becoming general rules
DFM / DFY: Techniques
Redundant via insertion (multi-cut vias)
At 90 nm: a recommended rule (yield increase)
Some tools do concurrent redundant via
insertion
Can also be done afterwards (post-route fixing)
Place where possible
Via reduction
Minimize the total number of vias
A significant percentage of defects are traced to
via failures
DFM / DFY: Techniques
Wire straightening (reduce jogs)
Bent wires are particularly prone to greater
lithographic variations
Wire spreading
Spacing wires can reduce the probability of a particle defect causing chip
failure
(Figure: a particle defect causing a short between closely spaced wires; more space prevents the short)
Sign-off
Parasitic RC extraction
Advanced delay calculation & signal integrity analysis
Advanced IR drop and Electromigration analysis
Thermal map and influence on timing
Noise Analysis
Inter-die and intra-die variation
At 65 and 45 nm, the effects of inter-die and intra-die variations
become significant
Statistical analysis approaches are used to factor the variations in
Logic equivalence check
Send Verilog + SPEF (SDF) to frontend designers for final verifications
Layout verification
Design Rule Check (DRC)
Layout vs. Schematic (LVS)
Transfer to design finishing group
LVS (Layout vs. Schematic)
Extract the designed devices (nmos, pmos, n-well tap, ...)
Extract the connectivity between them
Build a netlist
Compare both netlists (layout-extracted vs. schematic)
Top-level labels are needed for
VDD, VSS, inputs, and outputs
(Figure: an inverter layout with vdd/IN/OUT/vss labels compared via LVS against its schematic)
Summary
Flat and hierarchical physical implementation flows were
discussed
Low power challenges and the standard low power physical
implementation flow were discussed
Processing issues for small technology nodes were
explained
Solutions to improve manufacturability and yield were
discussed