
ECE 465 High Level Design Strategies

Lecture Notes # 9
Shantanu Dutt Electrical & Computer Engineering University of Illinois at Chicago

Outline
Circuit Design Problem
Solution Approaches: Truth Table (TT) vs. Computational/Algorithmic
  (Yes, hardware, just like software, can implement any algorithm!)
  Flat vs. Divide-&-Conquer
  Divide-&-Conquer:
    Associative operations/functions
    General operations/functions

Other Design Strategies for fast circuits:
  Speculative computation
  Best of both worlds (best average and best worst-case)
  Pipelining

Summary

Circuit Design Problem


Design an 8-bit comparator that compares two 8-bit #s available in two registers A[7..0] and B[7..0], and that o/ps F = 1 if A > B and F = 0 if A <= B.

Approach 1: The TT approach -- write down a 16-input TT, derive a logic expression from it, minimize it, obtain a gate-based realization, etc.!

  A        B        F
  00000000 00000000 0
  00000000 00000001 0
  ...
  00000001 00000000 1
  ...
  11111111 11111111 0

Too cumbersome and time-consuming
Fraught with the possibility of human error
Difficult to formally prove correctness (i.e., proof w/o exhaustive testing)
Will generally have high hardware cost (including wiring) and delay

Circuit Design Problem (contd.)


Approach 2: Think computationally/algorithmically about what the ckt is supposed to compute.

Approach 2(a): Flat algorithmic approach:
  Note: A TT can be expressed as a sequence of if-then-elses:
    If A = 00000000 and B = 00000000 then F = 0
    else if A = 00000000 and B = 00000001 then F = 0
    ...
    else if A = 00000001 and B = 00000000 then F = 1
    ...
  Essentially a re-hashing of the TT -- same problems as the TT approach.

Circuit Design Problem: Strategy 1: Divide-&-Conquer

Approach 2(b): Structured algorithmic approach:
  Be more innovative: think of the structure/properties of the computational problem.
  E.g., think about whether the problem can be solved in a hierarchical or divide-&-conquer (D&C) manner:
[Figure: D&C tree -- root problem A is broken into subproblems A1 and A2 (and these in turn into A1,1, A1,2, A2,1, A2,2, ...); the solutions to A1 and A2 are stitched up to form the complete soln to A. Recurse until the subprob size is s.t. a TT-based design is doable.]

D&C approach: See if the problem can be broken up into 2 or more smaller subproblems whose solutions can be stitched up to give a soln. to the parent prob. Do this recursively for each large subprob until the subprobs are small enough for TT-based solutions. If the subprobs are of a similar kind (but of smaller size) to the root prob, then the breakup and stitching will also be similar.

Shift Gears: Design of a Parity Detection Circuit (A Series of XORs)


[Figure (a): A linearly-connected circuit -- x(0), x(1), ..., x(15) XORed in a chain to produce f.]

[Figure (b): 16-bit parity tree.]

f = (((x(15) xor x(14)) xor (x(13) xor x(12))) xor ((x(11) xor x(10)) xor (x(9) xor x(8)))) xor (((x(7) xor x(6)) xor (x(5) xor x(4))) xor ((x(3) xor x(2)) xor (x(1) xor x(0))))

No concurrency in design (a) --- the actual problem has available concurrency, though, and it is not exploited well in the above linear design. Complete sequentialization leads to a delay that is linear in the # of bits n (delay = n*td, where td = delay of 1 gate).

All the available concurrency is exploited in design (b) --- a parity tree.

Question: When can we have a tree-structured circuit for an operation on multiple operands?
Answer: (1) When the operation makes sense for any # of operands. (2) When it can be broken down into operations w/ fewer operands. (3) When the operation is associative. An oper. x is said to be associative if: a x b x c = (a x b) x c = a x (b x c). Thus if we have 4 operands (& 3 operations) a x b x c x d, we can either perform this as a x (b x (c x d)) [getting a linear delay of 3 units] or as (a x b) x (c x d) [getting a logarithmic (base 2) delay of 2 units and exploiting the available concurrency due to the fact that x is associative]. We can extend this idea to n operands (& n-1 operations) to perform as many of the pairwise operations as possible in parallel (& do this recursively for every level of remaining operations), similar to design (b) for the parity detector [xor is an associative operation!], and thus get a Θ(log2 n) delay.

Delay = (# of levels in the XOR tree) * td = log2(n) * td. An example of simple designer ingenuity --- a bad design would have resulted in a linear delay that the VHDL code & the synthesis tool would have been at the mercy of.
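As a concrete illustration (not from the notes; the entity/signal names below are my own), design (b) can be described structurally in VHDL as a balanced XOR reduction with log2(16) = 4 levels of gates:

library ieee;
use ieee.std_logic_1164.all;

entity parity16 is
  port ( x : in  std_logic_vector(15 downto 0);
         f : out std_logic );
end entity parity16;

architecture tree of parity16 is
  signal l1 : std_logic_vector(7 downto 0);  -- level-1 XOR outputs
  signal l2 : std_logic_vector(3 downto 0);  -- level-2 XOR outputs
  signal l3 : std_logic_vector(1 downto 0);  -- level-3 XOR outputs
begin
  -- Level 1: 8 XORs on adjacent input pairs
  gen_l1 : for i in 0 to 7 generate
    l1(i) <= x(2*i+1) xor x(2*i);
  end generate;
  -- Level 2: 4 XORs
  gen_l2 : for i in 0 to 3 generate
    l2(i) <= l1(2*i+1) xor l1(2*i);
  end generate;
  -- Level 3: 2 XORs
  gen_l3 : for i in 0 to 1 generate
    l3(i) <= l2(2*i+1) xor l2(2*i);
  end generate;
  -- Level 4 (root): total delay is log2(16) = 4 XOR levels
  f <= l3(1) xor l3(0);
end architecture tree;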

D&C for Associative Operations


Let f(x(n-1), .., x(0)) be an associative function. What is the D&C principle involved in the design of an n-bit xor/parity function? Can it also lead automatically to a tree-based ckt?

[Figure: f(x(n-1), .., x(0)) is broken into f(x(n-1), .., x(n/2)) and f(x(n/2-1), .., x(0)), whose results a and b feed the stitch-up function f(a,b) --- the same as the original function for 2 inputs.]

Using the D&C approach for an associative operation results in the stitch-up function being the same as the original function (not the case for non-assoc. operations), but w/ a constant # of operands (2, if the orig problem is broken into 2 subproblems).

If the two sub-problems of the D&C approach are balanced (of the same size, or as close to it as possible), then unfolding the D&C results in a balanced operation tree of the type seen earlier for the xor/parity function.
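A minimal sketch of this D&C unfolding in VHDL, using xor as the associative operation and recursive instantiation to mirror the breakup (the entity name and generic N are my own choices, not from the notes); each level applies the 2-input stitch-up function to the results of the two halves:

library ieee;
use ieee.std_logic_1164.all;

entity xor_reduce is
  generic ( N : positive := 16 );
  port ( x : in  std_logic_vector(N-1 downto 0);
         f : out std_logic );
end entity xor_reduce;

architecture dandc of xor_reduce is
  signal f_ms, f_ls : std_logic;  -- results of the MS and LS halves
begin
  -- leaf subproblem: a single bit
  base : if N = 1 generate
    f <= x(0);
  end generate;

  -- breakup into two (nearly) balanced halves, solved recursively
  recurse : if N > 1 generate
    ms_half : entity work.xor_reduce
      generic map ( N => N - N/2 )
      port map ( x => x(N-1 downto N/2), f => f_ms );
    ls_half : entity work.xor_reduce
      generic map ( N => N/2 )
      port map ( x => x(N/2-1 downto 0), f => f_ls );
    -- stitch-up: the same operation applied to the two half-results
    f <= f_ms xor f_ls;
  end generate;
end architecture dandc;

For a power-of-2 N this elaborates into exactly the balanced tree of design (b).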

D&C Approach for Non-Associative Opers: n-bit Comparator


Useful property: At any level, the comparison of the MS (most significant) half determines the o/p if its result is > or <; else the comparison of the LS half determines the o/p. We can thus break up the problem at any level into MS and LS comparisons and, based on their results, determine which o/p to choose for the higher-level (parent) result.

Is comparison associative? Not clear. For a non-associative func, determine the properties that allow determining a break-up & a correct stitch-up function.

[Figure: D&C tree for the comparator. Root A: Comp A[7..0],B[7..0]; A1: Comp A[7..4],B[7..4]; A2: Comp A[3..0],B[3..0]; A1,1: Comp A[7..6],B[7..6]; A1,2: Comp A[5..4],B[5..4]; A1,1,1: Comp A[7],B[7]; A1,1,2: Comp A[6],B[6]. Stitch-up of solns to A1 and A2 forms the complete soln to A: if the A1 result is > or <, take the A1 result, else take the A2 result (similarly at every level, e.g., if the A1,1,1 result is > or <, take the A1,1,1 result, else take the A1,1,2 result). The leaf comparisons of single bit pairs are small enough to be designed using a TT.]

The leaf comparator (comparing 1 bit of A with 1 bit of B): its TT may be derived directly or by first thinking of and expressing its computation in a high-level programming language and then converting it to a TT:

If A[i] = B[i] then { f1(i) = 0; f2(i) = 1 }
  /* f2(i) o/p is an i/p to the stitch logic; f2(i) = 1 means the f1( ), f2( ) o/ps of the LS half of this subtree should be selected by the stitch logic as its o/ps */
else if A[i] < B[i] then { f1(i) = 0; f2(i) = 0 }
  /* f1(i) = 0 indicates <; f2(i) = 0 means this node's f1(i), f2(i) o/ps should be selected by the stitch logic as its o/ps */
else if A[i] > B[i] then { f1(i) = 1; f2(i) = 0 }  /* f1(i) = 1 indicates > */

  A[i] B[i] | f1(i) f2(i)
   0    0   |  0     1
   0    1   |  0     0
   1    0   |  1     0
   1    1   |  0     1

(2-bit 2-o/p comparator)
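A hedged VHDL sketch of this leaf comparator (entity/port names are mine; f1 and f2 follow the TT above, so f1 = A[i] AND (NOT B[i]) and f2 = 1 when A[i] = B[i]):

library ieee;
use ieee.std_logic_1164.all;

entity comp1 is
  port ( a, b : in  std_logic;
         f1   : out std_logic;    -- '1' => a > b
         f2   : out std_logic );  -- '1' => a = b, i.e., defer to the LS half
end entity comp1;

architecture rtl of comp1 is
begin
  f1 <= a and not b;
  f2 <= a xnor b;  -- equal when the two bits match
end architecture rtl;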

Comparator Circuit Design Using D&C (contd.)


Once the D&C tree is formulated, it is easy to get the low-level & stitch-up designs. The stitch-up design is shown here.
[Figure: The same D&C tree as before (Comp A[7..0],B[7..0] split into Comp A[7..4],B[7..4] and Comp A[3..0],B[3..0], down to the 1-bit comparisons), with the stitch-up rule at each level: if the MS-half result is > or <, take it, else take the LS-half result.]
Stitch-up logic details:
  If f2(i) = 0 then { my_op1 = f1(i); my_op2 = f2(i) }   /* select MS comp. o/ps */
  else { my_op1 = f1(i-1); my_op2 = f2(i-1) }            /* select LS comp. o/ps */

Compact TT for the stitch-up logic (X = don't care):

  f1(i) f2(i) f1(i-1) f2(i-1) | my_op1  my_op2
    X     0     X       X     | f1(i)   f2(i)
    X     1     X       X     | f1(i-1) f2(i-1)

Direct design: the stitch-up logic is simply a 2-bit-wide 2:1 Mux -- i/p I0 = (f1(i), f2(i)), i/p I1 = (f1(i-1), f2(i-1)), select = f2(i), o/p = (my_op1, my_op2). (The leaf comparator TT -- A[i], B[i] -> f1(i), f2(i) -- is the one given earlier.)
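The stitch-up logic is thus just the 2-bit-wide 2:1 mux above; a hedged VHDL sketch (names are mine, with bit 1 of each 2-bit bus carrying f1 and bit 0 carrying f2):

library ieee;
use ieee.std_logic_1164.all;

entity stitch is
  port ( f_ms  : in  std_logic_vector(1 downto 0);   -- (f1, f2) of the MS half
         f_ls  : in  std_logic_vector(1 downto 0);   -- (f1, f2) of the LS half
         my_op : out std_logic_vector(1 downto 0) ); -- (my_op1, my_op2)
end entity stitch;

architecture rtl of stitch is
begin
  -- f_ms(0) is f2(i): '0' => the MS half already decided (> or <), keep its result;
  -- '1' => the MS halves are equal, so the LS half's result decides
  my_op <= f_ms when f_ms(0) = '0' else f_ls;
end architecture rtl;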

Comparator Circuit Design Using D&C: Final Design


H/W_cost(8-bit comp.) = 7*(H/W_cost(2:1 Mux)) + 8*(H/W_cost(2-bit comp.))
H/W_cost(n-bit comp.) = (n-1)*(H/W_cost(2:1 Mux)) + n*(H/W_cost(2-bit comp.))

Delay(8-bit comp.) = 3*(delay of 2:1 Mux) + delay of 2-bit comp.
Delay(n-bit comp.) = (log2 n)*(delay of 2:1 Mux) + delay of 2-bit comp.

Note parallelism at work: multiple logic blocks are processing simultaneously.
[Figure: Final 8-bit comparator design. Eight 1-bit comparators (on A[7],B[7] through A[0],B[0]) produce 2-bit (f1, f2) pairs f(7)..f(0); these feed a tree of seven 2-bit-wide 2:1 Muxes arranged in log2 n = 3 levels, each mux selected by the f2 o/p of its MS-side i/p; the f1 bit of the root mux's o/p is F.]
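Tying the pieces together, a hedged sketch of the full n-bit comparator that recursively instantiates the comp1 and stitch sketches given earlier (all names are mine; F of the 8-bit comparator is my_op(1) of the top-level instance):

library ieee;
use ieee.std_logic_1164.all;

entity compn is
  generic ( N : positive := 8 );
  port ( a, b  : in  std_logic_vector(N-1 downto 0);
         my_op : out std_logic_vector(1 downto 0) );  -- (f1, f2) of this subtree
end entity compn;

architecture dandc of compn is
  signal f_ms, f_ls : std_logic_vector(1 downto 0);
begin
  -- leaf: the 1-bit (2-i/p, 2-o/p) comparator
  base : if N = 1 generate
    leaf : entity work.comp1
      port map ( a => a(0), b => b(0), f1 => my_op(1), f2 => my_op(0) );
  end generate;

  -- break into MS and LS halves and stitch their results with a 2-bit 2:1 mux
  recurse : if N > 1 generate
    ms : entity work.compn
      generic map ( N => N - N/2 )
      port map ( a => a(N-1 downto N/2), b => b(N-1 downto N/2), my_op => f_ms );
    ls : entity work.compn
      generic map ( N => N/2 )
      port map ( a => a(N/2-1 downto 0), b => b(N/2-1 downto 0), my_op => f_ls );
    su : entity work.stitch
      port map ( f_ms => f_ms, f_ls => f_ls, my_op => my_op );
  end generate;
end architecture dandc;

For N = 8 this unfolds into exactly the structure in the figure: 8 leaf comparators and 7 stitch-up muxes in 3 levels.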


D&C: Top-Down vs Bottom-Up: Mux Design


[Figure (a): Top-Down -- a 2^n:1 MUX is broken into two 2^(n-1):1 MUXes (each selected by S(n-2)..S0), one on the i/ps whose msb select value is a constant 0 and the other on those whose msb select value is a constant 1 (within each group all bits except the msb take on all different combinations; the msb value differs between the 2 groups). Their o/ps are stitched up by a 2:1 MUX selected by S(n-1).]

[Figure (b): Bottom-Up -- a level of 2^(n-1) 2:1 MUXes selected by S0 feeds a 2^(n-1):1 MUX selected by S(n-1)..S1.]
Generally better to try top-down first

An 8:1 MUX example (bottom-up)


[Figure: 8:1 MUX built bottom-up. I/p pairs (I0,I1), (I2,I3), (I4,I5), (I6,I7) each feed a 2:1 MUX selected by S0 (the even-indexed i/p is selected when S0 = 0, the odd-indexed one when S0 = 1); the four 2:1 MUX o/ps feed a 4:1 MUX selected by S2,S1, producing Z.]

The two i/ps at each 2:1 MUX should have different lsb (S0) values, since their selection is based on S0 (all other, i.e., unselected, select-bit values should be the same). Similarly for the other i/p pairs at the 2:1 MUXes at this level.
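A hedged VHDL sketch of this bottom-up 8:1 mux (entity/signal names are mine): a level of four 2:1 muxes selected by S0, stitched up by a 4:1 mux on S2,S1:

library ieee;
use ieee.std_logic_1164.all;

entity mux8 is
  port ( i : in  std_logic_vector(7 downto 0);
         s : in  std_logic_vector(2 downto 0);  -- s(2) is the msb select bit
         z : out std_logic );
end entity mux8;

architecture bottom_up of mux8 is
  signal m : std_logic_vector(3 downto 0);  -- outputs of the 2:1 mux level
begin
  -- level of 2:1 muxes: each i/p pair differs only in the S0 (lsb) position
  lvl0 : for k in 0 to 3 generate
    m(k) <= i(2*k) when s(0) = '0' else i(2*k+1);
  end generate;

  -- stitch-up 4:1 mux selected by S2,S1
  z <= m(0) when s(2 downto 1) = "00" else
       m(1) when s(2 downto 1) = "01" else
       m(2) when s(2 downto 1) = "10" else
       m(3);
end architecture bottom_up;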

Opening up the 8:1 MUX's hierarchical design and a top-down view


[Figure: The 8:1 MUX's hierarchical design opened up, viewed top-down. I/ps I0..I3 feed one 4:1 Mux (msb select value a constant 0) and I4..I7 feed the other 4:1 Mux (msb select value a constant 1); within each group all select bits except the msb take on different combinations, and the msb value differs between the 2 groups. The two 4:1-Mux o/ps are stitched up by a 2:1 MUX selected by S2. Deeper in the tree, an i/p pair sharing a 2:1 MUX is selected by the one select bit in which the two i/ps differ (e.g., a pair selected when S0 = 0, S1 = 1 should differ in S2).]

Adder Design using D&C


Example: Ripple-Carry Adder (RCA)
  Division: break "Add n-bit #s X, Y" into "Add MS n/2 bits of X,Y" and "Add LS n/2 bits of X,Y"
  Stitching up: the carry out of the LS n/2 bits is the carry-in of the MS n/2 bits, at each level of the D&C tree
  Leaf subproblem: Full Adder (FA)

[Figure (a): D&C tree for the Ripple-Carry Adder -- "Add n-bit #s X, Y" split recursively into MS-half and LS-half additions, down to FA leaves.]

Example: Carry-Lookahead Adder (CLA)
  Division: 4 subproblems per level ("Add ms n/4 bits", "Add 3rd n/4 bits", "Add 2nd n/4 bits", "Add ls n/4 bits")
  Stitching up: a more complex stitching-up process (generation of super P,G signals to connect up the subproblems)
  Leaf subproblem: 4-bit basic CLA with small p, g bits

[Figure (b): D&C tree for the Carry-Lookahead Adder -- "Add n-bit #s X, Y" split into four n/4-bit additions, down to 4-bit CLA leaves.]

More intricate techniques (like P,G generation in the CLA) for complex stitching-up in fast designs may need to be devised; these are not directly suggested by D&C, but D&C is a good starting point.
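A hedged VHDL sketch of the RCA breakup (names are mine): the carry out of the LS half is stitched into the carry-in of the MS half, recursively down to full-adder leaves (this is exactly the data dependency handled by the "wait" strategy discussed next):

library ieee;
use ieee.std_logic_1164.all;

entity rca is
  generic ( N : positive := 16 );
  port ( x, y : in  std_logic_vector(N-1 downto 0);
         cin  : in  std_logic;
         s    : out std_logic_vector(N-1 downto 0);
         cout : out std_logic );
end entity rca;

architecture dandc of rca is
  signal c_mid : std_logic;  -- carry from LS half into MS half (the stitch-up)
begin
  -- leaf: a single full adder
  base : if N = 1 generate
    s(0) <= x(0) xor y(0) xor cin;
    cout <= (x(0) and y(0)) or (cin and (x(0) xor y(0)));
  end generate;

  recurse : if N > 1 generate
    ls_half : entity work.rca
      generic map ( N => N/2 )
      port map ( x => x(N/2-1 downto 0), y => y(N/2-1 downto 0),
                 cin => cin, s => s(N/2-1 downto 0), cout => c_mid );
    ms_half : entity work.rca
      generic map ( N => N - N/2 )
      port map ( x => x(N-1 downto N/2), y => y(N-1 downto N/2),
                 cin => c_mid, s => s(N-1 downto N/2), cout => cout );
  end generate;
end architecture dandc;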

Dependency Resolution in D&C: (1) The Wait Strategy


So far we have seen D&C breakups in which there is no data dependency between the two (or more) subproblems of the breakup. Data dependency leads to increased delays. We now look at various ways of speeding up designs that have subproblem dependencies in their D&C breakups.
[Figure: Root problem A broken into Subprob. A1 and Subprob. A2, with a data-flow arrow from A1 to A2.]

Strategy 1: Wait for the required o/p of A1 and then perform A2, e.g., as in a ripple-carry adder: A = n-bit addition, A1 = (n/2)-bit addition of the LS n/2 bits, A2 = (n/2)-bit addition of the MS n/2 bits. No concurrency between A1 and A2: t(A) = t(A1) + t(A2) + t(stitch-up) = 2*t(A1) + t(stitch-up) if A1 and A2 are the same problem of the same size (w/ different i/ps).

Example of the Wait Strategy in Adder Design

Note: Gate delay is proportional to the # of inputs (since, generally, there is a series connection of transistors in either the pull-up or pull-down network equal in number to the # of inputs; the Rs of the transistors in series add up and are prop. to the # of inputs, and delay ~ RC (C is the capacitive load) is thus prop. to the # of inputs).
Assume each gate i/p contributes 2 ns of delay.
For a 16-bit adder the delay will be 160 ns.
For a 64-bit adder the delay will be 640 ns.

Dependency Resolution in D&C: (2) The Design-for-all-cases-&-select or Speculative Strategy

Strategy 2: For a k-bit i/p from A1 to A2, design 2^k copies of A2, each with a different hardwired k-bit i/p replacing the one from A1. Select the correct o/p from all the copies of A2 via a 2^k-to-1 Mux whose select i/p is the k-bit o/p of A1 when it becomes available (e.g., the carry-select adder).
t(A) = max(t(A1), t(A2)) + t(Mux) + t(stitch-up) = t(A1) + t(Mux) + t(stitch-up) if A1 and A2 are the same problem.
[Figure: Speculative breakup for k = 2: Subprob. A1 runs in parallel with four copies of Subprob. A2, hardwired with i/p values 00, 01, 10, 11 (I/p00, I/p01, I/p10, I/p11); a 4-to-1 Mux, with its select i/p driven by A1's 2-bit o/p, picks the correct copy's o/p.]

Other variation --- the Predict Strategy: Have a single copy of A2, but choose a highly likely value of the k-bit i/p and perform A1, A2 concurrently. If, once the k-bit i/p from A1 is available, the chosen value turns out to be incorrect, re-do A2 w/ the correct, now-available value.
t(A) = p(correct-choice)*max(t(A1), t(A2)) + (1 - p(correct-choice))*t(A2) + t(Mux) + t(stitch-up), where p(correct-choice) is the probability that our choice of the k-bit i/p for A2 is correct.
Need a completion signal to indicate when the final o/p of A is available; assuming worst-case time (when the choice is incorrect) is meaningless in such designs.

Example of the Speculative Strategy in Adder Design

For a 16-bit adder, the delay is (9*4 - 8)*2 = 56 ns (2 ns is the delay for a single gate i/p); a 65% improvement ((160-56)*100/160).
For a 64-bit adder, the delay is (9*8 - 8)*2 = 128 ns; an 80% improvement.
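A hedged sketch of one level of this speculative stitch-up in VHDL, i.e., a carry-select arrangement built from the rca sketch given earlier (the names and the even-N restriction are my simplifications):

library ieee;
use ieee.std_logic_1164.all;

entity csel_add is
  generic ( N : positive := 16 );  -- assumed even, for simplicity
  port ( x, y : in  std_logic_vector(N-1 downto 0);
         cin  : in  std_logic;
         s    : out std_logic_vector(N-1 downto 0);
         cout : out std_logic );
end entity csel_add;

architecture speculative of csel_add is
  signal c_ls         : std_logic;                         -- LS half's carry-out
  signal s_ms0, s_ms1 : std_logic_vector(N-1 downto N/2);  -- MS sums for cin = 0 / 1
  signal c_ms0, c_ms1 : std_logic;
begin
  ls_half : entity work.rca
    generic map ( N => N/2 )
    port map ( x => x(N/2-1 downto 0), y => y(N/2-1 downto 0),
               cin => cin, s => s(N/2-1 downto 0), cout => c_ls );

  -- speculative copies of the MS half, one per possible carry-in value
  ms_c0 : entity work.rca
    generic map ( N => N/2 )
    port map ( x => x(N-1 downto N/2), y => y(N-1 downto N/2),
               cin => '0', s => s_ms0, cout => c_ms0 );
  ms_c1 : entity work.rca
    generic map ( N => N/2 )
    port map ( x => x(N-1 downto N/2), y => y(N-1 downto N/2),
               cin => '1', s => s_ms1, cout => c_ms1 );

  -- mux selects the correct copy once the LS carry is available
  s(N-1 downto N/2) <= s_ms0 when c_ls = '0' else s_ms1;
  cout              <= c_ms0 when c_ls = '0' else c_ms1;
end architecture speculative;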

Dependency Resolution in D&C: (3) The Lookahead or Pre-Computation Strategy


Concept:

[Figure: Root problem A, with Subprob. A1 feeding data into Subprob. A2; A2 is split into A2_indep (or A2_lookahd), which needs no i/p from A1, and A2_dep, which does. For an example of unstructured logic for A2, the critical path after a1 is available is 8 units of delay; after restructuring into A2_indep followed by A2_dep, it is 4 units.]

Strategy 3: Redo the design of A2 so that it can do as much processing as possible that is independent of the i/p from A1 (A2_indep = A2_lookahd). This is the lookahead computation that prepares for the final computation of A2 (A2_dep), which can start once A2_indep and A1 are done.
t(A) = max(t(A1), t(A2_indep)) + t(A2_dep) + t(stitch-up)
E.g., the carry-lookahead adder does lookahead computation; the lookahead computation is also associative, so it is doable in Θ(log n) time, and the overall computation is also doable in Θ(log n) time.
A less structured example: Let a1 be the i/p from A1 to A2, and suppose A2 has the logic a2 = vx + uvx + wxy + wza1 + uxa1. If this were implemented using 2-i/p AND/OR gates, the delay would be 8 delay units (1 unit = delay for 1 i/p) after a1 is available. If the logic is re-structured as a2 = (vx + uvx + wxy) + (wz + ux)a1, and the logic in the 2 brackets is performed before a1 is available (these constitute A2_indep), then the delay is only 4 delay units after a1 is available.
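A hedged VHDL sketch of the restructured a2 example (entity/signal names are mine): the two bracketed terms form A2_indep and are computed without a1, so only a 2-i/p AND and a 2-i/p OR remain on a1's path (A2_dep):

library ieee;
use ieee.std_logic_1164.all;

entity a2_lookahead is
  port ( u, v, w, x, y, z, a1 : in  std_logic;
         a2                   : out std_logic );
end entity a2_lookahead;

architecture rtl of a2_lookahead is
  signal indep_sum  : std_logic;  -- vx + uvx + wxy  (independent of a1)
  signal indep_coef : std_logic;  -- wz + ux         (independent of a1)
begin
  -- A2_indep / lookahead part: ready before a1 is available
  indep_sum  <= (v and x) or (u and v and x) or (w and x and y);
  indep_coef <= (w and z) or (u and x);
  -- A2_dep part: short critical path (AND then OR) once a1 arrives
  a2 <= indep_sum or (indep_coef and a1);
end architecture rtl;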

D&C Summary
For complex digital design, we need to think of the computation underlying the design in an algorithmic manner --- are there properties of this computation that can be exploited for a faster, less expensive, modular design; is it amenable to the D&C approach? The design is then developed in an algorithmic manner & the corresponding circuit may be synthesized by hand or described compactly using an HDL.
For an operation/func x on n operands (a(n-1) x a(n-2) x ... x a(0)), if x is associative, the D&C approach gives an easy stitch-up function, which is x on 2 operands (the o/ps of applying x on each half). This means a tree-structured circuit with Θ(log n) delay, instead of a linearly-connected circuit with Θ(n) delay, can be synthesized.
If x is non-associative, more ingenuity and determination of the properties of x is needed to determine the stitch-up function. The resulting design may or may not be tree-structured.
D&C can be done top-down or bottom-up. Top-down is generally the better way to think for beginners.
If there is dependency between the 2 subproblems, then we saw strategies for addressing these dependencies:
  Wait (slowest, least hardware cost)
  Speculative (fastest, highest hardware cost)
  Lookahead (medium speed, medium hardware cost)

Strategy 2: A general view of speculative computations (w/ or w/o D&C)


If there is a data dependency between two or more portions of a computation (which may be obtained w/ or w/o using D&C), don't wait for the previous computation to finish before starting the next one.
Assume all possible input values for the next computation/stage B (e.g., if it has 2 inputs from the prev. stage there will be 4 possible input value combinations) and perform it using a copy of the design for each possible input value. All the different o/ps of the different copies of B are muxed using the prev. stage A's o/p.
E.g. design: the Carry-Select Adder (each stage performs two additions, one for a carry-in of 0 and another for a carry-in of 1 from the previous stage).
[Figure (a): Original design -- A's o/ps x, y feed B, which produces z; Time = T(A) + T(B).]

[Figure (b): Speculative computation -- four copies B(0,0), B(0,1), B(1,0), B(1,1), one per combination of B's 2-bit i/p from A, feed a 4:1 Mux selected by A's o/p; Time = max(T(A), T(B)) + T(Mux). Works well when T(A) approx. = T(B) and T(A) >> T(Mux).]

Strategy 3: Get the Best of Both Worlds (Average and Worst Case Delays)!
[Figure: Two division circuits fed from the same input registers and started together: a Unary Division Ckt (good ave case, bad worst case; completion signal done1) and a Non-Restoring Div. Ckt (bad ave case, good worst case; completion signal done2). An external FSM uses done1/done2 to drive the output select of a Mux, and the first available output is latched into the output register.]

Use 2 circuits with different worst-case and average-case behaviors. Use the first available output. Get the best of both (ave-case, worst-case) worlds. In the above schematic, we get the good ave-case performance of unary division (assuming uniformly distributed inputs) w/o the disadvantage of its bad worst-case performance.

Strategy 4: Pipeline It!


[Figure: The original ckt or datapath converted to a simple level-partitioned pipeline of Stage 1, Stage 2, ..., Stage k, with registers between stages (a level partition may not always be possible, but other pipelineable partitions may be).]

Throughput is defined as # of outputs / sec.
Non-pipelined throughput = 1/D, where D = delay of the original ckt's datapath.
Pipeline throughput = 1/(max stage delay + register delay).
Special case: If the original ckt's datapath is divided into n stages, each of equal delay, and dr is the delay of a register, then pipeline throughput = 1/((D/n) + dr). If dr is negligible compared to D/n, then pipeline throughput = n/D, n times that of the original ckt.
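A minimal, hedged VHDL sketch of the idea (the two stage functions below are arbitrary placeholders, not from the notes): a pipeline register is placed between two combinational stages, so the achievable clock period is set by the max stage delay plus the register delay:

library ieee;
use ieee.std_logic_1164.all;

entity pipe2 is
  port ( clk  : in  std_logic;
         din  : in  std_logic_vector(7 downto 0);
         dout : out std_logic_vector(7 downto 0) );
end entity pipe2;

architecture rtl of pipe2 is
  signal stage1_out : std_logic_vector(7 downto 0);  -- o/p of stage-1 logic
  signal pipe_reg   : std_logic_vector(7 downto 0);  -- pipeline register
begin
  -- stage-1 combinational logic (placeholder for the first level-partition)
  stage1_out <= not din;

  -- pipeline register between the two stages
  process (clk)
  begin
    if rising_edge(clk) then
      pipe_reg <= stage1_out;
    end if;
  end process;

  -- stage-2 combinational logic (placeholder for the second level-partition)
  dout <= pipe_reg(0) & pipe_reg(7 downto 1);
end architecture rtl;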
