Buffer Insertion

Interconnect Optimizations
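A scaling primer
• Ideal process scaling:
– Device geometries shrink by s (= 0.7x)
• Device delay shrinks by s
– Wire geometries (width w, height h, spacing S) shrink by s
• $R/l$ : $\rho/(ws \cdot hs) = r/s^2$
• $C_c/l$ : $(hs)\,\varepsilon/(Ss) = C_c$
• $C/l$ : similar
• $R/l$ doubles, $C/l$ and $C_c/l$ unchanged
[Figure: transistor with terminals G, S, D; wire cross-section showing width w, height h, spacing S, length l]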
Interconnect role
• Short (local) interconnect
– Used to connect nearby cells
– Minimize wire C, i.e., use short min-width wires
• Medium to long-distance (global) interconnect
– Size wires to tradeoff area vs. delay
– Increasing width ⇒ capacitance increases, resistance decreases ⇒ need to find an acceptable tradeoff (the wire sizing problem)
• “Fat” wires
– Thicker cross-sections in higher metal layers
– Useful for reducing delays for global wires
– Inductance issues, sharing of limited resource
Cross-Section of A Chip
Block scaling
• Block area often stays same

– # cells, # nets doubles
– Wiring histogram shape invariant
• Global interconnect lengths don’t shrink

• Local interconnect lengths shrink by s
Interconnect delay scaling
• Delay of a wire of length l:
  $\tau_{int} = (rl)(cl) = rcl^2$ (first order)
• Local interconnects:
  $\tau_{int} : (r/s^2)(c)(ls)^2 = rcl^2$
– Local interconnect delay unchanged (compare to faster devices)
• Global interconnects:
  $\tau_{int} : (r/s^2)(c)(l)^2 = (rcl^2)/s^2$
– Global interconnect delay doubles – unsustainable!
• Interconnect delay increasingly more dominant
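A minimal sketch of the scaling arithmetic above, assuming normalized values r = c = l = 1 (these numbers are illustrative, not from the slides) and s = 0.7:

```python
def wire_delay(r, c, length):
    """First-order wire delay r*c*l^2, as on the slide."""
    return r * c * length ** 2

s = 0.7                      # scaling factor per generation
r, c, l = 1.0, 1.0, 1.0      # normalized values (assumption for illustration)

base = wire_delay(r, c, l)
# Local wire: resistance per unit length grows by 1/s^2, length shrinks by s.
local_ratio = wire_delay(r / s**2, c, l * s) / base
# Global wire: resistance per unit length grows by 1/s^2, length does not shrink.
global_ratio = wire_delay(r / s**2, c, l) / base

print(f"local delay ratio:  {local_ratio:.2f}")
print(f"global delay ratio: {global_ratio:.2f}")
```

The local ratio stays at 1 while the global ratio is 1/s², about 2 for s = 0.7, which is the "doubles" on the slide.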

Buffer Insertion For Delay Reduction
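Analysis of Simple RC Circuit
[Figure: voltage source $v_T(t)$ driving capacitor C through resistor R; current $i(t)$, capacitor voltage $v(t)$]
$$R \cdot i(t) + v(t) = v_T(t)$$
$$i(t) = \frac{d(Cv(t))}{dt} = C\,\frac{dv(t)}{dt}$$
$$\Rightarrow RC\,\frac{dv(t)}{dt} + v(t) = v_T(t)$$
• $v(t)$: state variable
• $v_T(t)$: input waveform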
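Analysis of Simple RC Circuit
• Step-input response:
$$RC\,\frac{dv(t)}{dt} + v(t) = v_0 u(t)$$
$$v(t) = K e^{-t/RC} + v_0 u(t)$$
• Match initial state:
$$v(0) = 0 \Rightarrow K + v_0 u(t) = 0 \Rightarrow K = -v_0$$
• Output response for step-input:
$$v(t) = v_0 (1 - e^{-t/RC})\,u(t)$$
[Figure: step input $v_0 u(t)$ and output response $v_0(1 - e^{-t/RC})u(t)$ rising toward $v_0$]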
Delays of Simple RC Circuit
• $v(t) = v_0(1 - e^{-t/RC})$: waveform under step input $v_0 u(t)$
• $v(t) = 0.5 v_0 \Rightarrow t = 0.69RC$
– i.e., delay = 0.69RC (50% delay)
• $v(t) = 0.1 v_0 \Rightarrow t = 0.1RC$
• $v(t) = 0.9 v_0 \Rightarrow t = 2.3RC$
– i.e., rise time = 2.2RC (if defined as time from 10% to 90% of Vdd)
• Commonly used metric $T_D = RC$ (= Elmore delay)
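The threshold times above follow from inverting the step response: $t = -RC \ln(1 - v/v_0)$. A small sketch that checks the constants, in units of RC (the function name is ours, not from the slides):

```python
import math

def time_to_fraction(fraction, rc=1.0):
    """Time for v(t) = v0*(1 - exp(-t/RC)) to reach fraction*v0."""
    return -rc * math.log(1.0 - fraction)

t10 = time_to_fraction(0.1)   # about 0.1 RC
t50 = time_to_fraction(0.5)   # ln 2, about 0.69 RC
t90 = time_to_fraction(0.9)   # ln 10, about 2.3 RC
rise_time = t90 - t10         # about 2.2 RC

print(f"t10={t10:.3f} t50={t50:.3f} t90={t90:.3f} rise={rise_time:.3f}")
```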

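Elmore Delay
• Driver is modeled as R
• Driver intrinsic gate delay t(B)
• Delay = $\sum_{\text{all } R_i} \; \sum_{\text{all } C_j \text{ downstream from } R_i} R_i C_j$
• Elmore delay at n2: R(B)*(C1+C2) + R(w)*C2
• Elmore delay at n1: R(B)*(C1+C2)
[Figure: driver B with resistance R(B) drives node n1 (capacitance C1); wire resistance R(w) connects n1 to node n2 (capacitance C2)]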
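A minimal sketch of the double sum for an RC chain such as the n1/n2 circuit above. The numeric values for R(B), R(w), C1, C2 are made up for illustration, and the intrinsic delay t(B) is left out:

```python
def elmore_chain(resistances, capacitances):
    """Elmore delay at each node of an RC chain.

    resistances[i] connects node i-1 (or the source, for i = 0) to node i;
    capacitances[i] is the grounded capacitance at node i.
    """
    delays = []
    delay = 0.0
    for i, r_i in enumerate(resistances):
        downstream = sum(capacitances[i:])   # all C_j downstream from R_i
        delay += r_i * downstream
        delays.append(delay)
    return delays

# Hypothetical values: R(B) = 2, R(w) = 3, C1 = 1, C2 = 4
d_n1, d_n2 = elmore_chain([2.0, 3.0], [1.0, 4.0])
print(d_n1, d_n2)   # R(B)*(C1+C2) and R(B)*(C1+C2) + R(w)*C2
```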
Elmore Delay
• For uniform wire of length x driving load C
– unit wire capacitance c
– unit wire resistance r
• $\text{delay} = \dfrac{(xr)(xc)}{2} + (xr)C$
• No matter how the wire is lumped, the Elmore delay is the same
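The lumping claim can be checked numerically. The sketch below assumes the wire is cut into equal π-segments (half of each segment's capacitance at each end); the parameter values are illustrative only:

```python
def wire_elmore(r, c, x, load, segments):
    """Elmore delay to the far end of a wire of length x modeled as
    `segments` equal pi-sections, driving capacitance `load`."""
    seg_r = r * x / segments
    seg_c = c * x / segments
    delay = 0.0
    for i in range(1, segments + 1):
        # Downstream of segment i: its own far-end half, all later segments, the load.
        downstream = seg_c / 2 + seg_c * (segments - i) + load
        delay += seg_r * downstream
    return delay

r, c, x, load = 0.5, 0.2, 10.0, 3.0          # illustrative values
closed_form = (x * r) * (x * c) / 2 + (x * r) * load
for n in (1, 2, 5, 50):
    print(n, wire_elmore(r, c, x, load, n), closed_form)
```

Every segment count gives the same value as the closed form (xr)(xc)/2 + (xr)C, up to floating-point rounding.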
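Delay for Buffer
[Figure: buffer b inserted between nodes u and v; C is the load seen at v, C(b) the load seen at u]
• $\text{delay}(u, v) = t(b) + R(b) \cdot C$
• $C(u) = C(b)$
– C(b): input capacitance
– R(b): driver resistance
– t(b): intrinsic buffer delay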
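Buffers Reduce Wire Delay
[Figure: wire of length x with driver resistance R and load C, split by a buffer (resistance R, input capacitance C) into two halves of length x/2, each with resistance rx/2 and capacitance cx/4 at each end]
• $t_{unbuf} = R(cx + C) + rx(cx/2 + C)$
• $t_{buf} = 2R(cx/2 + C) + rx(cx/4 + C) + t_b$
• $\Delta t = t_{buf} - t_{unbuf} = RC + t_b - rcx^2/4$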
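Setting the difference to zero gives the wire length beyond which one buffer pays off: $x > 2\sqrt{(RC + t_b)/(rc)}$. A short sketch of that break-even point, with all parameters set to 1 purely for illustration:

```python
import math

def t_unbuf(R, C, r, c, x):
    return R * (c * x + C) + r * x * (c * x / 2 + C)

def t_buf(R, C, r, c, x, tb):
    return 2 * R * (c * x / 2 + C) + r * x * (c * x / 4 + C) + tb

def critical_length(R, C, r, c, tb):
    """Length at which t_buf equals t_unbuf (from RC + tb - r*c*x^2/4 = 0)."""
    return 2 * math.sqrt((R * C + tb) / (r * c))

R = C = r = c = tb = 1.0                      # illustrative values
x_crit = critical_length(R, C, r, c, tb)
for x in (2.0, x_crit, 4.0):
    diff = t_buf(R, C, r, c, x, tb) - t_unbuf(R, C, r, c, x)
    print(f"x={x:.3f}  t_buf - t_unbuf = {diff:+.3f}")
```

Below the critical length the buffer adds delay; above it the buffer reduces delay.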

Combinational Logic Delay
[Figure: primary input, register, combinational logic, register, primary output; both registers driven by clock]
Combinational logic delay <= clock period

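Buffered global interconnects: Intuition
• Unbuffered wire of length l: interconnect delay = $r c l^2$
• Wire split by buffers into segments $l_1, l_2, l_3, \ldots, l_n$: interconnect delay = $\sum_i r c l_i^2 < r c l^2$ (where $l = \sum_j l_j$), since $\sum_j l_j^2 < (\sum_j l_j)^2$
• (Of course, account for buffer delay also)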

Optimal inter-buffer length
• First order (lumped parasitic, Elmore delay) analysis
[Figure: line of total length L divided by identical buffers into segments of length l]
– Rd: on-resistance of inverter
– Cg: gate input capacitance
– r, c: resistance, capacitance per micron
• Assume N identical buffers with equal inter-buffer length l (L = Nl)
$$T = N\left[R_d(C_g + cl) + rl\left(C_g + \frac{cl}{2}\right)\right] = L\left[\frac{rcl}{2} + (rC_g + R_d c) + \frac{R_d C_g}{l}\right]$$
• For minimum delay,
$$\frac{dT}{dl} = 0 \;\Rightarrow\; L\left[\frac{rc}{2} - \frac{R_d C_g}{l_{opt}^2}\right] = 0 \;\Rightarrow\; l_{opt} = \sqrt{\frac{2 R_d C_g}{rc}}$$
Optimal interconnect delay
• Substituting $l_{opt}$ back into the interconnect delay expression:
$$T_{opt} = L\left[\frac{rc\,l_{opt}}{2} + (rC_g + R_d c) + \frac{R_d C_g}{l_{opt}}\right]$$
$$= L\left[\frac{rc}{2}\sqrt{\frac{2 R_d C_g}{rc}} + (rC_g + R_d c) + \frac{R_d C_g}{\sqrt{2 R_d C_g / (rc)}}\right]$$
$$T_{opt} = L\left[\sqrt{2 R_d C_g\, rc} + (rC_g + R_d c)\right]$$
• Delay grows linearly with L (instead of quadratically)
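A quick numerical check of the optimal spacing and delay, using made-up parameter values (Rd = 2, Cg = 1, r = c = 1) that are not from the slides:

```python
import math

def buffered_delay(L, l, Rd, Cg, r, c):
    """T = L * [r*c*l/2 + (r*Cg + Rd*c) + Rd*Cg/l] for inter-buffer length l."""
    return L * (r * c * l / 2 + (r * Cg + Rd * c) + Rd * Cg / l)

def optimal_length(Rd, Cg, r, c):
    return math.sqrt(2 * Rd * Cg / (r * c))

def optimal_delay(L, Rd, Cg, r, c):
    return L * (math.sqrt(2 * Rd * Cg * r * c) + (r * Cg + Rd * c))

Rd, Cg, r, c = 2.0, 1.0, 1.0, 1.0             # illustrative values
l_opt = optimal_length(Rd, Cg, r, c)
print("l_opt =", l_opt)
for L in (10.0, 20.0, 40.0):
    print(L, buffered_delay(L, l_opt, Rd, Cg, r, c), optimal_delay(L, Rd, Cg, r, c))
```

The two delay columns agree, and doubling L doubles the delay, in contrast to the $rcL^2$ growth of an unbuffered wire.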
Total buffer count
[Figure: % cells used to buffer nets (0 to 80) at 90nm, 65nm, 45nm, 32nm; series: clk-buf, buf, tot-buf]
• Ever-increasing fractions of total cell count will be buffers
– 70% in 32nm
ITRS projections
[Figure: relative delay (log scale, 0.1 to 100) vs. feature size (250, 180, 130, 90, 65, 45, 32 nm) for gate delay (fanout 4), local interconnect (M1,2), global interconnect with repeaters, and global interconnect without repeaters. Source: ITRS, 2003]
Buffers Improve Slack
• RAT = Required Arrival Time
• Slack = RAT - Delay
• Without buffer (slackmin = -50):
– Sink 1: RAT = 300, Delay = 350, Slack = -50
– Sink 2: RAT = 700, Delay = 600, Slack = 100
• With buffer, decoupling capacitive load from critical path (slackmin = 50):
– Sink 1: RAT = 300, Delay = 250, Slack = 50
– Sink 2: RAT = 700, Delay = 400, Slack = 300
Timing Driven Buffering
Problem Formulation
• Given
– A Steiner tree
– RAT at each sink
– A buffer type
– RC parameters
– Candidate buffer locations
• Find buffer insertion solution such that the
slack at the driver is maximized
Candidate Buffering Solutions
Candidate Solution Characteristics
• Each candidate solution is associated with
– vi: a node
– ci: downstream capacitance
– qi: RAT
• If vi is a sink, ci is the sink capacitance
• Otherwise v is an internal node
Van Ginneken’s Algorithm
• Candidate solutions are propagated toward the source
• Dynamic Programming
Solution Propagation: Add Wire
• Wire of length x from (v2, c2, q2) to (v1, c1, q1)
• $c_2 = c_1 + cx$
• $q_2 = q_1 - rcx^2/2 - rxc_1$
• r: wire resistance per unit length
• c: wire capacitance per unit length
Solution Propagation: Insert Buffer
(v1, c1, q1)

(v1, c1b, q1b)
• c1b = Cb
• q1b = q1 – Rbc1 – tb
• Cb: buffer input capacitance
• Rb: buffer output resistance
• tb: buffer intrinsic delay
Solution Propagation: Merge
(v, cl , ql) (v, cr , qr)
• cmerge = cl + cr
• qmerge = min(ql , qr)
Solution Propagation: Add Driver
(v0, c0, q0)

(v0, c0d, q0d)
• q0d = q0 – Rdc0 = slackmin

• Rd: driver resistance
• Pick solution with max slackmin
Example of Solution Propagation
• r = 1, c = 1
• Rb = 1, Cb = 1, tb = 1
• Rd = 1
• Two wire segments of length 2: v3 to v2 and v2 to v1
• Start at the sink: (v1, 1, 20)
• Add wire: (v2, 3, 16)
• Insert buffer: (v2, 1, 12)
• Add wire: (v3, 5, 8) without buffer, (v3, 3, 8) with buffer
• Add driver: slack = 3 without buffer, slack = 5 with buffer
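The propagation rules can be sketched as three small functions and run on this example. Candidates are represented here as (c, q) tuples; the function names are ours, not from the slides:

```python
def add_wire(cand, x, r, c):
    """c2 = c1 + c*x ; q2 = q1 - r*c*x^2/2 - r*x*c1"""
    cap, q = cand
    return (cap + c * x, q - r * c * x * x / 2 - r * x * cap)

def insert_buffer(cand, Rb, Cb, tb):
    """c1b = Cb ; q1b = q1 - Rb*c1 - tb"""
    cap, q = cand
    return (Cb, q - Rb * cap - tb)

def add_driver(cand, Rd):
    """Slack at the driver: q0 - Rd*c0"""
    cap, q = cand
    return q - Rd * cap

r = c = 1
Rb, Cb, tb = 1, 1, 1
Rd = 1

at_v1 = (1, 20)
at_v2 = add_wire(at_v1, 2, r, c)                 # (3, 16)
at_v2_buf = insert_buffer(at_v2, Rb, Cb, tb)     # (1, 12)
at_v3 = add_wire(at_v2, 2, r, c)                 # (5, 8)
at_v3_buf = add_wire(at_v2_buf, 2, r, c)         # (3, 8)
slack_unbuffered = add_driver(at_v3, Rd)         # 3
slack_buffered = add_driver(at_v3_buf, Rd)       # 5
print(slack_unbuffered, slack_buffered)
```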

Example of Merging
[Figure: left candidates and right candidates combined into merged candidates]
Solution Pruning
• Two candidate solutions
– (v, c1, q1)
– (v, c2, q2)
• Solution 1 is inferior if
– c1 > c2 : larger load
– and q1 < q2 : tighter timing
Pruning When Insert Buffer
• They have the same load cap Cb, so only the one with max q is kept
Generating Candidates
[Figure: candidate generation, steps (1) to (3)]
From Dr. Charles Alpert

Pruning Candidates
[Figure: step (3) with candidates (a) and (b), then step (4)]
• Both (a) and (b) “look” the same to the source.
• Throw out the one with the worst slack
Candidate Example Continued
[Figure: steps (4) and (5)]
Candidate Example Continued
[Figure: step (5) after pruning]
• At driver, compute which candidate maximizes slack. Result is optimal.
Merging Branches
[Figure: left candidates and right candidates at a branch point]
Pruning Merged Branches
[Figure: merged branch candidates with the critical branch marked, shown with pruning]
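Van Ginneken Example
• Candidates are written (C, q)
• Start at the sink: (20,400)
• Add wire (C=10, d=150): (20,400) → (30,250)
• Insert buffer (C=5, d=30): (30,250) → (5,220)
• Add wire (C=15): (30,250) → (45,50) with d=200; (5,220) → (20,100) with d=120
• Insert buffer (C=5): (45,50) → (5,0) with d=50; (20,100) → (5,70) with d=30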
Van Ginneken Example Cont’d
• Candidates: (45,50), (5,0), (20,100), (5,70)
• (5,0) is inferior to (5,70). (45,50) is inferior to (20,100)
• Add wire (C=10): (20,100) → (30,10); (5,70) → (15,-10)
• Pick solution with largest slack, follow arrows to get solution
Basic Data Structure
(c1, q1) (c2, q2) (c3, q3)
• Moving right along the list: worse load cap, better timing
• Sorted list such that
– c1 < c2 < c3
– If there are no inferior candidates, q1 < q2 < q3
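Prune Solution List
• Candidates (c1, q1), (c2, q2), (c3, q3), (c4, q4) are sorted by increasing c
• Compare the most recently kept candidate (ci, qi) with the next candidate (cj, qj):
– qi < qj ? Yes: keep both and continue from j (e.g. after q1 < q2, test q2 < q3)
– No: prune j and compare i with the following candidate (e.g. prune 2, then test q1 < q3)
• One pass over the list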
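A minimal sketch of this pruning pass over (c, q) pairs. It sorts first, so on its own it is O(n log n); the linear-time bound applies when the list is already kept sorted. Treating equal c with lower q as inferior is our assumption, matching the "(5,0) is inferior to (5,70)" example:

```python
def prune(candidates):
    """Drop inferior (c, q) candidates.

    Returns survivors sorted so that both c and q strictly increase.
    """
    kept = []
    # Increasing load; for equal load the best q comes first.
    ordered = sorted(candidates, key=lambda cq: (cq[0], -cq[1]))
    for cap, q in ordered:
        # A candidate with no better q than the last kept one (which has
        # no larger c) is inferior and is pruned.
        if not kept or q > kept[-1][1]:
            kept.append((cap, q))
    return kept

# Candidates from the Van Ginneken example slide
survivors = prune([(45, 50), (5, 0), (20, 100), (5, 70)])
print(survivors)
```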
Pruning In Merging
• Left candidates: (cl1, ql1), (cl2, ql2), (cl3, ql3)
• Right candidates: (cr1, qr1), (cr2, qr2)
• Assume ql1 < ql2 < qr1 < ql3 < qr2
• Merged candidates:
– (cl1+cr1, ql1)
– (cl2+cr1, ql2)
– (cl3+cr1, qr1)
– (cl3+cr2, ql3)
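One way to get the additive (not multiplicative) merge is a two-pointer walk over the two pruned lists, always advancing the side whose q limits the merged candidate. This is a sketch under the assumption that both lists are sorted with c and q strictly increasing; the numbers are placeholders chosen to follow the q ordering on the slide:

```python
def merge(left, right):
    """Merge two pruned (c, q) lists in linear time.

    Each merged candidate has c = cl + cr and q = min(ql, qr).
    """
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        (cl, ql), (cr, qr) = left[i], right[j]
        merged.append((cl + cr, min(ql, qr)))
        # Advance the limiting side: pairing it with a larger c on the
        # other side would add load without improving q.
        if ql < qr:
            i += 1
        elif qr < ql:
            j += 1
        else:
            i += 1
            j += 1
    return merged

# ql1 < ql2 < qr1 < ql3 < qr2, as on the slide
left = [(1, 10), (2, 20), (3, 40)]
right = [(1, 30), (2, 50)]
merged = merge(left, right)
print(merged)
```

The result has four candidates, corresponding to (cl1+cr1, ql1), (cl2+cr1, ql2), (cl3+cr1, qr1), (cl3+cr2, ql3), rather than all six pairings.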
Van Ginneken Complexity
• Generate candidates from sinks to source
• Quadratic runtime
– Adding a wire does not change #candidates
– Adding a buffer adds only one new candidate
– Merging branches additive, not multiplicative
– Linear time solution list pruning
• Optimal for Elmore delay model

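Multiple Buffer Types
• r = 1, c = 1
• Rb1 = 1, Cb1 = 1, tb1 = 1
• Rb2 = 0.5, Cb2 = 2, tb2 = 0.5
• Rd = 1
• Two wire segments of length 2
• Start at the sink: (v1, 1, 20)
• Add wire: (v2, 3, 16)
• Insert buffer type 1: (v2, 1, 12)
• Insert buffer type 2: (v2, 2, 14)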