Interconnect Optimizations

A scaling primer
[Figure: transistor with terminals G, S, D, and a wire with width w, height h, length l and spacing S to its neighbor]

• Ideal process scaling:
– Device geometries shrink by s (= 0.7x)
• Device delay shrinks by s
– Wire geometries shrink by s
• R/l : $\rho/(ws \cdot hs) = r/s^2$
• Cc/l : $(hs) \cdot \varepsilon/(Ss) = C_c$
• C/l : similar
• R/l doubles, C/l and Cc/l unchanged
Interconnect role
• Short (local) interconnect
– Used to connect nearby cells
– Minimize wire C, i.e., use short min-width wires
• Medium to long-distance (global) interconnect
– Size wires to trade off area vs. delay
– Increasing width: capacitance increases, resistance decreases
– Need to find an acceptable trade-off (the wire sizing problem)
• “Fat” wires
– Thicker cross-sections in higher metal layers
– Useful for reducing delays for global wires
– Inductance issues, sharing of limited resource
Cross-Section of A Chip
Block scaling

• Block area often stays same


– # cells, # nets double
– Wiring histogram shape invariant

• Global interconnect lengths don’t shrink


• Local interconnect lengths shrink by s
Interconnect delay scaling
• Delay of a wire of length l :
$\tau_{int} = (rl)(cl) = rcl^2$ (first order)

• Local interconnects :
$\tau_{int} : (r/s^2)(c)(ls)^2 = rcl^2$

– Local interconnect delay unchanged (compare to faster devices)

• Global interconnects :
$\tau_{int} : (r/s^2)(c)(l)^2 = (rcl^2)/s^2$

– Global interconnect delay doubles – unsustainable!

• Interconnect delay increasingly more dominant
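
The scaling argument above can be checked numerically. The sketch below is only an illustration: the function names and the unit values r = c = l = 1 are assumptions, and it uses the first-order delay model r·c·l² from this slide.

```python
def wire_delay(r, c, length):
    """First-order wire delay: (r*l)*(c*l) = r*c*l^2."""
    return r * c * length ** 2


def scaled_delays(r, c, length, s=0.7):
    """Return (local, global) wire delay after one ideal scaling step.

    Resistance per unit length becomes r/s^2, capacitance per unit length
    is unchanged, local wires shrink to s*length, global wires keep length.
    """
    r_scaled = r / s ** 2
    local = wire_delay(r_scaled, c, s * length)
    global_ = wire_delay(r_scaled, c, length)
    return local, global_


# Hypothetical unit values, chosen only to expose the ratios.
base = wire_delay(1.0, 1.0, 1.0)
local, global_ = scaled_delays(1.0, 1.0, 1.0)
print(local / base)    # close to 1: local wire delay is unchanged
print(global_ / base)  # 1/s^2, roughly 2 for s = 0.7
```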


Buffer Insertion For Delay Reduction
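Analysis of Simple RC Circuit
[Figure: voltage source vT(t) driving a resistor R in series with a capacitor C; i(t) is the current, v(t) the capacitor voltage]

$R \cdot i(t) + v(t) = v_T(t)$

$i(t) = \frac{d(Cv(t))}{dt} = C\frac{dv(t)}{dt}$

$\Rightarrow RC\frac{dv(t)}{dt} + v(t) = v_T(t)$

Here v(t) is the state variable and vT(t) is the input waveform.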
Analysis of Simple RC Circuit
Step-input response: $RC\frac{dv(t)}{dt} + v(t) = v_0 u(t)$

$v(t) = K e^{-t/RC} + v_0 u(t)$

Match initial state: $v(0) = 0 \Rightarrow K + v_0 u(t) = 0$

Output response for step input: $v(t) = v_0 (1 - e^{-t/RC}) u(t)$

[Figure: step input v0u(t) and output response v0(1-e^{-t/RC})u(t), rising toward v0]
Delays of Simple RC Circuit
• $v(t) = v_0(1 - e^{-t/RC})$ is the waveform under step input $v_0 u(t)$

• $v(t) = 0.5v_0 \Rightarrow t = 0.69RC$
– i.e., delay = 0.69RC (50% delay)

• $v(t) = 0.1v_0 \Rightarrow t = 0.1RC$
• $v(t) = 0.9v_0 \Rightarrow t = 2.3RC$
– i.e., rise time = 2.2RC (if defined as time from 10% to 90% of Vdd)

• Commonly used metric TD = RC (= Elmore delay)
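
The constants above follow from inverting the step response. A minimal Python sketch, assuming a hypothetical time constant RC = 1 in arbitrary units (the function name is illustrative):

```python
import math


def time_to_fraction(rc, fraction):
    """Time for v(t) = v0*(1 - exp(-t/RC)) to reach fraction*v0."""
    return -rc * math.log(1.0 - fraction)


rc = 1.0  # hypothetical time constant, arbitrary units
t50 = time_to_fraction(rc, 0.5)   # 50% delay, ln(2)*RC
t10 = time_to_fraction(rc, 0.1)
t90 = time_to_fraction(rc, 0.9)
rise = t90 - t10                  # 10%-90% rise time, ln(9)*RC
print(t50, t10, t90, rise)
```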


Elmore Delay
• Driver is modeled as R
• Driver intrinsic gate delay t(B)
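• Delay = $\sum_{\text{all } R_i} \; \sum_{\text{all } C_j \text{ downstream from } R_i} R_i C_j$
• Elmore delay at n2 = R(B)*(C1+C2) + R(w)*C2
• Elmore delay at n1 = R(B)*(C1+C2)

[Figure: buffer B with output resistance R(B) drives node n1 (capacitance C1); wire resistance R(w) connects n1 to node n2 (capacitance C2)]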
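
The double sum can be evaluated on any RC tree in two passes. The sketch below is one possible formulation, not taken from the slides: the array-based tree encoding, the function name, and the numeric values R(B) = 2, C1 = 3, R(w) = 4, C2 = 5 are all assumptions made for illustration.

```python
def elmore_delays(parent, res, cap):
    """Elmore delay at every node of an RC tree.

    parent[i] is the parent of node i (None for the root),
    res[i] is the resistance between parent[i] and node i,
    cap[i] is the capacitance at node i.
    Nodes must be numbered so that parent[i] < i.
    """
    n = len(parent)
    # Pass 1 (leaves to root): total capacitance at or below each node.
    downstream = list(cap)
    for i in range(n - 1, 0, -1):
        downstream[parent[i]] += downstream[i]
    # Pass 2 (root to leaves): each resistor sees its downstream capacitance.
    delay = [0.0] * n
    for i in range(n):
        upstream = delay[parent[i]] if parent[i] is not None else 0.0
        delay[i] = upstream + res[i] * downstream[i]
    return delay


# Node 0: ideal source, node 1: n1, node 2: n2 (hypothetical values).
R_B, C1, R_w, C2 = 2.0, 3.0, 4.0, 5.0
delays = elmore_delays([None, 0, 1], [0.0, R_B, R_w], [0.0, C1, C2])
print(delays[1])  # R(B)*(C1+C2)
print(delays[2])  # R(B)*(C1+C2) + R(w)*C2
```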
Elmore Delay
• For uniform wire of length x driving a load C
– unit wire capacitance c
– unit wire resistance r

$delay = \frac{(xr)(xc)}{2} + (xr)C$

• No matter how the wire is lumped, the Elmore delay is the same
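
The lumping claim can be checked by splitting the wire into n equal segments and comparing against the closed form. This sketch assumes each segment is modeled as a pi section (half of the segment capacitance at each end); the function names and numeric values are illustrative only.

```python
import math


def elmore_pi_segments(r, c, x, load, n):
    """Elmore delay of a wire of length x split into n pi-model segments."""
    seg_r = r * x / n
    seg_c = c * x / n
    delay = 0.0
    for k in range(n):
        # Capacitance downstream of segment k's resistor: half of its own
        # capacitance, all of the remaining segments, and the load.
        downstream = seg_c / 2 + (n - 1 - k) * seg_c + load
        delay += seg_r * downstream
    return delay


def elmore_closed_form(r, c, x, load):
    """(xr)(xc)/2 + (xr)C from the slide."""
    return (x * r) * (x * c) / 2 + (x * r) * load


# Hypothetical values: r = 0.5, c = 2, x = 3, load C = 4.
for n in (1, 2, 5, 50):
    lumped = elmore_pi_segments(0.5, 2.0, 3.0, 4.0, n)
    print(n, math.isclose(lumped, elmore_closed_form(0.5, 2.0, 3.0, 4.0)))
```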
Delay for Buffer

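[Figure: buffer b with input node u and output node v driving a load C; seen from u, the buffer is replaced by its input capacitance C(b)]

$delay(u, v) = t(b) + R(b) \cdot C$

$C(u) = C(b)$

– C(b): input capacitance
– R(b): driver resistance
– t(b): intrinsic buffer delay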
Buffers Reduce Wire Delay

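[Figure: driver R and load C connected by a wire of length x; in the buffered case a buffer (output resistance R, input capacitance C, intrinsic delay tb) sits at the midpoint, so each half has resistance rx/2 and capacitance cx/4 at each end; ∆t plotted against x]

$t_{unbuf} = R(cx + C) + rx(cx/2 + C)$

$t_{buf} = 2R(cx/2 + C) + rx(cx/4 + C) + t_b$

$t_{buf} - t_{unbuf} = RC + t_b - rcx^2/4$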
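
Setting the difference to zero gives the wire length beyond which the midpoint buffer pays off, x = 2·sqrt((RC + tb)/(rc)). The following sketch evaluates both expressions; the function names and the unit parameter values are assumptions for illustration.

```python
import math


def t_unbuf(R, C, r, c, x):
    """Elmore delay of the unbuffered wire of length x."""
    return R * (c * x + C) + r * x * (c * x / 2 + C)


def t_buf(R, C, r, c, x, tb):
    """Elmore delay with one buffer at the midpoint of the wire."""
    return 2 * R * (c * x / 2 + C) + r * x * (c * x / 4 + C) + tb


def break_even_length(R, C, r, c, tb):
    """Length at which R*C + tb - r*c*x^2/4 = 0."""
    return 2 * math.sqrt((R * C + tb) / (r * c))


# Hypothetical unit parameters.
R, C, r, c, tb = 1.0, 1.0, 1.0, 1.0, 1.0
x0 = break_even_length(R, C, r, c, tb)
for x in (2.0, x0, 4.0):
    # Negative difference means the buffer reduces delay.
    print(x, t_buf(R, C, r, c, x, tb) - t_unbuf(R, C, r, c, x))
```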


Combinational Logic Delay

[Figure: primary input → register → combinational logic → register → primary output; both registers are driven by the same clock]

Combinational logic delay <= clock period


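Buffered global interconnects: Intuition
[Figure: unbuffered wire of length l]

Interconnect delay = $r \cdot c \cdot l^2$

[Figure: the same wire split by buffers into segments l1, l2, l3, ..., ln]

Now, interconnect delay = $\sum_i r \cdot c \cdot l_i^2 < r \cdot c \cdot l^2$ (where $l = \sum_j l_j$)

since $\sum_j (l_j^2) < (\sum_j l_j)^2$

(Of course, account for buffer delay also)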


Optimal inter-buffer length
• First order (lumped parasitic, Elmore delay) analysis
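[Figure: wire of total length L driven through a chain of identical inverters spaced l apart]
Rd – On resistance of inverter
Cg – Gate input capacitance
r, c – Resistance, capacitance per micron

• Assume N identical buffers with equal inter-buffer length l (L = Nl)

$T = N\left[R_d(C_g + cl) + rl(C_g + cl/2)\right] = L\left[\frac{rcl}{2} + (rC_g + R_d c) + \frac{R_d C_g}{l}\right]$

• For minimum delay,
$\frac{dT}{dl} = 0 \Rightarrow L\left[\frac{rc}{2} - \frac{R_d C_g}{l_{opt}^2}\right] = 0 \Rightarrow l_{opt} = \sqrt{\frac{2 R_d C_g}{rc}}$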
Optimal interconnect delay
• Substituting $l_{opt}$ back into the interconnect delay expression:

$T_{opt} = L\left[\frac{rc\,l_{opt}}{2} + (rC_g + R_d c) + \frac{R_d C_g}{l_{opt}}\right]$

$= L\left[\frac{rc}{2}\sqrt{\frac{2 R_d C_g}{rc}} + (rC_g + R_d c) + \frac{R_d C_g}{\sqrt{2 R_d C_g / rc}}\right]$

$T_{opt} = L\left[\sqrt{2 R_d C_g rc} + (rC_g + R_d c)\right]$
Delay grows linearly with L (instead of quadratically)
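
A quick numerical check of the derivation is sketched below. The function names and the parameter values (Rd, Cg, r, c, L) are hypothetical and were picked only to exercise the formulas; the buffer count N = L/l is treated as a continuous quantity, as in the first-order analysis.

```python
import math


def buffered_delay(L, l, Rd, Cg, r, c):
    """T = L*[r*c*l/2 + (r*Cg + Rd*c) + Rd*Cg/l] for inter-buffer length l."""
    return L * (r * c * l / 2 + (r * Cg + Rd * c) + Rd * Cg / l)


def optimal_spacing(Rd, Cg, r, c):
    """l_opt = sqrt(2*Rd*Cg / (r*c))."""
    return math.sqrt(2 * Rd * Cg / (r * c))


def optimal_delay(L, Rd, Cg, r, c):
    """T_opt = L*[sqrt(2*Rd*Cg*r*c) + (r*Cg + Rd*c)]."""
    return L * (math.sqrt(2 * Rd * Cg * r * c) + (r * Cg + Rd * c))


# Hypothetical values: ohms, farads, ohms/um, farads/um, um.
Rd, Cg, r, c, L = 1000.0, 2e-15, 0.1, 0.2e-15, 10000.0
l_opt = optimal_spacing(Rd, Cg, r, c)
print(l_opt)
print(buffered_delay(L, l_opt, Rd, Cg, r, c), optimal_delay(L, Rd, Cg, r, c))
# Doubling L doubles the optimally buffered delay.
print(optimal_delay(2 * L, Rd, Cg, r, c) / optimal_delay(L, Rd, Cg, r, c))
```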
Total buffer count
[Figure: % of cells used to buffer nets (0 to 80) vs. technology node (90nm, 65nm, 45nm, 32nm); series: clk-buf, buf, tot-buf]
• Ever-increasing fractions of total cell count will be buffers
– 70% in 32nm
ITRS projections

[Figure: relative delay (log scale, 0.1 to 100) vs. feature size (250, 180, 130, 90, 65, 45, 32 nm); series: gate delay (fanout 4), local interconnect (M1,2), global interconnect with repeaters, global interconnect without repeaters]

Source: ITRS, 2003
Buffers Improve Slack

RAT = Required Arrival Time
Slack = RAT - Delay

Without buffer:
– Sink 1: RAT = 300, Delay = 350, Slack = -50
– Sink 2: RAT = 700, Delay = 600, Slack = 100
– slackmin = -50

With buffer (decouple capacitive load from critical path):
– Sink 1: RAT = 300, Delay = 250, Slack = 50
– Sink 2: RAT = 700, Delay = 400, Slack = 300
– slackmin = 50
Timing Driven Buffering
Problem Formulation
• Given
– A Steiner tree
– RAT at each sink
– A buffer type
– RC parameters
– Candidate buffer locations
• Find buffer insertion solution such that the
slack at the driver is maximized
Candidate Buffering Solutions
Candidate Solution Characteristics

• Each candidate solution is associated with
– vi: a node
– ci: downstream capacitance
– qi: RAT

[Figure: two cases. If vi is a sink, ci is the sink capacitance; otherwise v is an internal node]
Van Ginneken’s Algorithm

Candidate solutions are propagated toward the source

Dynamic Programming
Solution Propagation: Add Wire

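[Figure: wire of length x from upstream node v2, with candidate (v2, c2, q2), to downstream node v1, with candidate (v1, c1, q1)]

• $c_2 = c_1 + cx$
• $q_2 = q_1 - rcx^2/2 - rxc_1$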
• r: wire resistance per unit length
• c: wire capacitance per unit length
Solution Propagation: Insert Buffer

[Figure: buffer inserted at v1 turns candidate (v1, c1, q1) into (v1, c1b, q1b)]

• $c_{1b} = C_b$
• $q_{1b} = q_1 - R_b c_1 - t_b$
• Cb: buffer input capacitance
• Rb: buffer output resistance
• tb: buffer intrinsic delay
Solution Propagation: Merge

(v, cl , ql) (v, cr , qr)

• cmerge = cl + cr
• qmerge = min(ql , qr)
Solution Propagation: Add Driver

(v0, c0, q0)


(v0, c0d, q0d)

• q0d = q0 – Rdc0 = slackmin


• Rd: driver resistance
• Pick solution with max slackmin
Example of Solution Propagation

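• r = 1, c = 1
• Rb = 1, Cb = 1, tb = 1
• Rd = 1
• Two wire segments of length 2 each: v3 (driver) to v2, and v2 to v1 (sink)

Sink: (v1, 1, 20)
Add wire: (v2, 3, 16)
Insert buffer at v2: (v2, 1, 12)
Add wire: (v2, 3, 16) → (v3, 5, 8) and (v2, 1, 12) → (v3, 3, 8)
Add driver: slack = 3 without the buffer, slack = 5 with the buffer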
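
The propagation rules from the preceding slides can be written as small functions and replayed on this example. In this sketch a candidate is reduced to a (c, q) pair and the node label is left out; that representation and the function names are assumptions made for illustration.

```python
def add_wire(cand, x, r, c):
    """Propagate (c1, q1) across a wire of length x."""
    cap, q = cand
    return (cap + c * x, q - r * c * x ** 2 / 2 - r * x * cap)


def insert_buffer(cand, Rb, Cb, tb):
    """Place a buffer in front of the candidate."""
    cap, q = cand
    return (Cb, q - Rb * cap - tb)


def merge(left, right):
    """Combine one left-branch and one right-branch candidate."""
    return (left[0] + right[0], min(left[1], right[1]))


def add_driver(cand, Rd):
    """Slack at the driver for this candidate."""
    cap, q = cand
    return q - Rd * cap


r, c = 1, 1
Rb, Cb, tb = 1, 1, 1
Rd = 1
sink = (1, 20)                                  # (v1, 1, 20)
at_v2 = add_wire(sink, 2, r, c)                 # c = 3, q = 16
buffered_v2 = insert_buffer(at_v2, Rb, Cb, tb)  # c = 1, q = 12
unbuf_v3 = add_wire(at_v2, 2, r, c)             # c = 5, q = 8
buf_v3 = add_wire(buffered_v2, 2, r, c)         # c = 3, q = 8
print(add_driver(unbuf_v3, Rd))  # slack 3 without the buffer
print(add_driver(buf_v3, Rd))    # slack 5 with the buffer
```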


Example of Merging

[Figure: left candidates and right candidates at a branch point are combined into merged candidates]
Solution Pruning
• Two candidate solutions
– (v, c1, q1)
– (v, c2, q2)
• Solution 1 is inferior if
– c1 > c2 : larger load
– and q1 < q2 : tighter timing
Pruning When Insert Buffer

All candidates with a newly inserted buffer have the same load cap Cb, so only the one with max q is kept
Generating Candidates
[Figure: candidate generation, steps (1) to (3)]

From Dr. Charles Alpert


Pruning Candidates
[Figure: step (3), with two candidates (a) and (b)]

Both (a) and (b) “look” the same to the source. Throw out the one with the worse slack.

[Figure: step (4), candidates after pruning]
Candidate Example Continued
[Figure: candidate generation from step (4) to step (5)]
Candidate Example Continued
[Figure: step (5), candidates after pruning]

At driver, compute which candidate maximizes slack. Result is optimal.
Merging Branches

[Figure: left candidates and right candidates meeting at a branch point]
Pruning Merged Branches

[Figure: merged candidates with pruning; the critical branch is marked]
Van Ginneken Example

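Candidates are (C, q) pairs; d is the delay added by each step.

Sink: (20,400)

Step 1:
– Wire (C=10, d=150): (20,400) → (30,250)
– Buffer (C=5, d=30): (30,250) → (5,220)

Step 2:
– Wire (C=15, d=200): (30,250) → (45,50)
– Buffer (C=5, d=50): (45,50) → (5,0)
– Wire (C=15, d=120): (5,220) → (20,100)
– Buffer (C=5, d=30): (20,100) → (5,70)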
Van Ginneken Example Cont’d
Candidates: (45,50), (5,0), (20,100), (5,70)

(5,0) is inferior to (5,70). (45,50) is inferior to (20,100)

Wire C=10:
– (20,100) → (30,10)
– (5,70) → (15,-10)

Pick solution with largest slack, follow arrows to get solution
Basic Data Structure

(c1, q1) (c2, q2) (c3, q3)
[Figure: moving right along the list, load cap gets worse and timing gets better]

Sorted list such that
• c1 < c2 < c3
• If there are no inferior candidates, q1 < q2 < q3
Prune Solution List
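Candidates in order of increasing c: (c1, q1) (c2, q2) (c3, q3) (c4, q4)

• q1 < q2 ?
– N: prune 2, then test q1 < q3 ? (N: prune 3, then test q1 < q4 ?)
– Y: keep 2, then test q2 < q3 ?
• q2 < q3 ?
– N: prune 3, then test q2 < q4 ? (N: prune 4)
– Y: keep 3, then test q3 < q4 ?
• q3 < q4 ?
– N: prune 4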
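
The flowchart amounts to a single scan that keeps a candidate only when its q beats the last kept one. The sketch below is one way to write it; the function name is illustrative, ties in c or q are treated as inferior (an assumption that goes slightly beyond the strict inequalities on the pruning slide), and the initial sort is included only so the function also accepts unsorted input.

```python
def prune(candidates):
    """Remove inferior (c, q) candidates.

    A candidate is dropped when another one has no larger c and no smaller q.
    The result is sorted by increasing c with strictly increasing q.
    """
    # Sort by increasing c; for equal c put the largest q first.
    ordered = sorted(candidates, key=lambda cq: (cq[0], -cq[1]))
    kept = []
    for c, q in ordered:
        # Keep only candidates that improve on the timing of the last kept one.
        if not kept or q > kept[-1][1]:
            kept.append((c, q))
    return kept


# Candidates from the Van Ginneken example before pruning.
print(prune([(45, 50), (5, 0), (20, 100), (5, 70)]))
```

On an already sorted list the scan itself is linear, which is the case the complexity argument relies on.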
Pruning In Merging

Assume ql1 < ql2 < qr1 < ql3 < qr2

Left candidates: (cl1, ql1), (cl2, ql2), (cl3, ql3)
Right candidates: (cr1, qr1), (cr2, qr2)

Merged candidates:
– (cl1+cr1, ql1)
– (cl2+cr1, ql2)
– (cl3+cr1, qr1)
– (cl3+cr2, ql3)
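
The merged list above comes from walking both lists with two pointers, so its length is at most the sum of the input lengths. A minimal sketch under the assumption that both inputs are already pruned (sorted by increasing c and increasing q); the function name and the numeric values are illustrative.

```python
def merge_lists(left, right):
    """Merge two pruned candidate lists at a branch point.

    Both lists hold (c, q) pairs sorted by increasing c and increasing q.
    """
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        cl, ql = left[i]
        cr, qr = right[j]
        merged.append((cl + cr, min(ql, qr)))
        # Advance the side that limits q: pairing it with a larger c from
        # the other side could only add load without improving q.
        if ql < qr:
            i += 1
        elif qr < ql:
            j += 1
        else:
            i += 1
            j += 1
    return merged


# Hypothetical values that satisfy ql1 < ql2 < qr1 < ql3 < qr2.
left = [(1, 10), (2, 20), (3, 40)]
right = [(1, 30), (2, 50)]
print(merge_lists(left, right))
```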
Van Ginneken Complexity
• Generate candidates from sinks to source
• Quadratic runtime
– Adding a wire does not change #candidates
– Adding a buffer adds only one new candidate
– Merging branches additive, not multiplicative
– Linear time solution list pruning

• Optimal for Elmore delay model
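
Putting the pieces together, a compact bottom-up pass over a routing tree might look like the sketch below. It is a simplified illustration, not the reference implementation: the dictionary-based tree encoding, all names, and the choice to return only the best slack (without recording buffer positions) are assumptions, and pruning uses a sort rather than the linear-time list handling the complexity bound assumes.

```python
def prune(cands):
    """Keep only non-inferior (c, q) pairs, sorted by increasing c and q."""
    kept = []
    for c, q in sorted(cands, key=lambda cq: (cq[0], -cq[1])):
        if not kept or q > kept[-1][1]:
            kept.append((c, q))
    return kept


def merge_lists(left, right):
    """Two-pointer merge of two pruned candidate lists."""
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        (cl, ql), (cr, qr) = left[i], right[j]
        merged.append((cl + cr, min(ql, qr)))
        if ql <= qr:
            i += 1
        if qr <= ql:
            j += 1
    return merged


def best_slack(tree, sinks, sites, root, r, c, Rb, Cb, tb, Rd):
    """Maximum driver slack over all buffer assignments.

    tree[node] lists (child, wire_length) pairs, sinks[node] is (cap, RAT),
    sites is the set of nodes where a buffer may be inserted.
    """
    def candidates(node):
        if node in sinks:
            cands = [sinks[node]]
        else:
            cands = None
            for child, x in tree[node]:
                # Add wire, then drop candidates that became inferior.
                branch = prune([(cap + c * x, q - r * c * x * x / 2 - r * x * cap)
                                for cap, q in candidates(child)])
                cands = branch if cands is None else merge_lists(cands, branch)
        if node in sites:
            # All buffered candidates share load Cb; keep the one with max q.
            best_q = max(q - Rb * cap - tb for cap, q in cands)
            cands = prune(cands + [(Cb, best_q)])
        return cands

    return max(q - Rd * cap for cap, q in candidates(root))


# Two-segment line from the solution propagation example.
tree = {"v3": [("v2", 2)], "v2": [("v1", 2)]}
sinks = {"v1": (1, 20)}
print(best_slack(tree, sinks, {"v2"}, "v3", 1, 1, 1, 1, 1, 1))  # buffer allowed at v2
print(best_slack(tree, sinks, set(), "v3", 1, 1, 1, 1, 1, 1))   # no buffer sites
```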


Multiple Buffer Types

• r = 1, c = 1
• Rb1 = 1, Cb1 = 1, tb1 = 1
• Rb2 = 0.5, Cb2 = 2, tb2 = 0.5
• Rd = 1
• Two wire segments of length 2 each

Sink: (v1, 1, 20)
Add wire: (v2, 3, 16)
Insert buffer type 1 at v2: (v2, 1, 12)
Insert buffer type 2 at v2: (v2, 2, 14)
