Lecture Notes #9
Shantanu Dutt
Electrical & Computer Engineering, University of Illinois at Chicago
Outline
- Circuit design problem
- Solution approaches: Truth Table (TT) vs. Computational/Algorithmic (yes, hardware, just like software, can implement any algorithm!)
- Flat vs. Divide-&-Conquer
- Divide-&-Conquer: associative operations/functions; general operations/functions
- Summary
Drawbacks of the TT-based approach:
- Too cumbersome and time-consuming
- Fraught with the possibility of human error
- Difficult to formally prove correct (i.e., to prove correctness w/o exhaustive testing)
- Will generally have high hardware cost (including wiring) and delay
Circuit Design Problem: Strategy 1: Divide-&-Conquer
Approach 2(b): Structured algorithmic approach:
- Be more innovative: think of the structure/properties of the computational problem
- E.g., think whether the problem can be solved in a hierarchical or divide-&-conquer (D&C) manner:
[Figure: D&C hierarchy: root problem A broken into subproblems A1 and A2, each broken in turn into A1,1, A1,2 and A2,1, A2,2]
D&C approach: See if the problem can be broken up into 2 or more smaller subproblems whose solutions can be stitched up to give a solution to the parent problem. Do this recursively for each large subproblem until the subproblems are small enough for TT-based solutions. If the subproblems are of a similar kind (but of smaller size) to the root problem, then the breakup and stitching will also be similar at each level.
No concurrency in design (a): the actual problem has available concurrency, but it is not exploited in the linear design. Complete sequentialization leads to a delay that is linear in the # of bits n (delay = n*td, where td = delay of 1 gate). All the available concurrency is exploited in design (b), a parity tree.
Question: When can we have a tree-structured circuit for an operation on multiple operands?
Answer: (1) When the operation makes sense for any # of operands; (2) when it can be broken down into operations w/ fewer operands; and (3) when the operation is associative. An operation x is said to be associative if a x b x c = (a x b) x c = a x (b x c). Thus if we have 4 operands a x b x c x d, we can either perform this as a x (b x (c x d)) [getting a linear delay of 3 units] or as (a x b) x (c x d) [getting a logarithmic (base 2) delay of 2 units, exploiting the available concurrency due to the fact that x is associative]. We can extend this idea to n operands (& n-1 operations) to perform as many of the pairwise operations as possible in parallel (& do this recursively at every level of remaining operations), similar to design (b) for the parity detector [xor is an associative operation!], and thus get a Θ(log2 n) delay.
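Since xor is associative, the linear design (a) and the tree design (b) compute the same parity; the following Python sketch (illustrative code, not from the notes) contrasts the two groupings:

```python
def linear_xor(bits):
    """Design (a): fold left to right; n-1 sequential gate delays."""
    acc = bits[0]
    for b in bits[1:]:
        acc ^= b
    return acc

def tree_xor(bits):
    """Design (b): combine halves recursively; ceil(log2 n) levels of gates."""
    if len(bits) == 1:
        return bits[0]
    mid = len(bits) // 2
    return tree_xor(bits[:mid]) ^ tree_xor(bits[mid:])
```

Both return the same parity bit for any input; only the depth (and hence delay) of the implied gate network differs.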
Delay = (# of levels in AND-OR tree) * td = log2(n) * td. An example of simple designer ingenuity: a bad design would have resulted in a linear delay that the VHDL code & the synthesis tool would have been at the mercy of.
Using the D&C approach for an associative operation results in the stitch-up function being the same as the original function (not the case for non-associative operations), but w/ a constant # of operands (2, if the original problem is broken into 2 subproblems). If the two subproblems of the D&C approach are balanced (of the same size, or as close to it as possible), then unfolding the D&C results in a balanced operation tree of the type seen earlier for the xor/parity function.
The TT may be derived directly, or by first thinking of and expressing its computation in a high-level programming language and then converting it to a TT.
1-bit comparator computation:
if A[i] = B[i] then { f1(i) = 0; f2(i) = 1 }  /* f2(i) o/p is an i/p to the stitch logic; f2(i) = 1 means the f1( ), f2( ) o/ps of the LS half of this subtree should be selected by the stitch logic as its o/ps */
else if A[i] < B[i] then { f1(i) = 0; /* indicates < */ f2(i) = 0 }  /* f2(i) = 0 indicates this comparator's f1(i), f2(i) o/ps should be selected by the stitch logic as its o/ps */
else if A[i] > B[i] then { f1(i) = 1; /* indicates > */ f2(i) = 0 }

A[i] B[i] | f1(i) f2(i)
 0    0   |  0     1
 0    1   |  0     0
 1    0   |  1     0
 1    1   |  0     1
[Figure: 8-bit comparator D&C tree: A1,1 = Comp A[7..4],B[7..4]; below it Comp A[7..6],B[7..6]; at the leaves A1,1,1 = Comp A[7],B[7], A1,1,2 = Comp A[6],B[6], etc. Stitch-up at each node: if the A1,1 result is > or <, take the A1,1 result, else take the A1,2 result (similarly, if the A1,1,1 result is > or <, take it, else take the A1,1,2 result). Each node's o/ps are my_op1, my_op2]
Stitch-up logic details:
if f2(i) = 0 then { my_op1 = f1(i); my_op2 = f2(i) }  /* select MS comp. o/ps */
else { my_op1 = f1(i-1); my_op2 = f2(i-1) }  /* select LS comp. o/ps */
Stitch-up logic TT:
f1(i) f2(i) f1(i-1) f2(i-1) | my_op1  my_op2
  X    0      X       X     | f1(i)   f2(i)
  X    1      X       X     | f1(i-1) f2(i-1)
Delay(8-bit comp.) = 3*(delay of 2:1 Mux) + delay of 1-bit comp. Note parallelism at work: multiple logic blocks are processing simultaneously. Delay(n-bit comp.) = (log2 n)*(delay of 2:1 Mux) + delay of 1-bit comp.
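The D&C comparator can be modeled behaviorally in a few lines of Python (a sketch; the function names comp1, stitch, and comp are mine, and the (f1, f2) encoding follows the TTs above: (0,1) = equal, (1,0) = A > B, (0,0) = A < B):

```python
def comp1(a, b):
    """1-bit comparator leaf: returns (f1, f2) per the TT above."""
    if a == b:
        return (0, 1)   # equal: defer to the less-significant half
    return (1, 0) if a > b else (0, 0)

def stitch(ms, ls):
    """2:1-Mux stitch-up: keep the MS result unless it signals 'equal' (f2 = 1)."""
    return ms if ms[1] == 0 else ls

def comp(A, B):
    """D&C comparator over equal-length bit lists, MS bit first."""
    if len(A) == 1:
        return comp1(A[0], B[0])
    mid = len(A) // 2
    return stitch(comp(A[:mid], B[:mid]), comp(A[mid:], B[mid:]))
```

All the leaf comparators (and all stitch-up muxes at one level) could evaluate concurrently in hardware; the recursion here just mirrors the tree's structure.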
[Figure: 8-bit comparator circuit: eight 1-bit comparators on i/p pairs A[7],B[7] through A[0],B[0] produce 2-bit o/ps f(7), f(6), ..., a 3-level tree of 2-bit-wide 2:1 Muxes (i/ps I0, I1; intermediate o/ps my(0) ... my(4)) stitches them into the final 2-bit result]
Muxes via D&C: (a) Top-Down and (b) Bottom-Up constructions.

(a) Top-Down: a 2^n:1 MUX is split by the msb select bit: a 2:1 MUX controlled by S(n-1) chooses between the o/ps of two 2^(n-1):1 MUXes (select i/ps S(n-2)..S0), one fed by i/ps I0..I(2^(n-1)-1) and the other by I(2^(n-1))..I(2^n-1). Within each group, all select-address bits except the msb have different combinations while the msb is at a constant value (0 for the first group, 1 for the second); the msb value should differ among these 2 groups.

[Figure: 8:1 MUX Z built top-down: a 4:1 Mux on I0..I3 (selected when S2 = 0) and a 4:1 Mux on I4..I7 (selected when S2 = 1), selects S1 S0, feeding a final 2:1 MUX controlled by S2; e.g., I6 is selected when S0 = 0, S1 = 1, S2 = 1]

(b) Bottom-Up: an 8:1 MUX Z is built from a first level of 2:1 MUXes controlled by S0 on the i/p pairs (I0, I1), (I2, I3), (I4, I5), (I6, I7), feeding a 4:1 MUX with selects S2 S1. The i/ps of each first-level 2:1 MUX should have different lsb (S0) values, since their selection is based on S0 (all other, i.e., unselected, bit values should be the same); similarly for the other i/p pairs at the 2:1 Muxes at this level.
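The bottom-up 8:1 construction can be sanity-checked in software; this Python sketch (illustrative, with hypothetical function names) wires 2:1 muxes exactly as in the bottom-up description:

```python
def mux2(i0, i1, s):
    """2:1 mux: o/p = i0 when s = 0, i1 when s = 1."""
    return i1 if s else i0

def mux8_bottom_up(I, s2, s1, s0):
    """8:1 mux from a level of S0-controlled 2:1 muxes feeding a 4:1 mux
    (itself built from 2:1 muxes on S1, then S2)."""
    # level 1: i/p pairs differing only in the lsb of their select address
    a = mux2(I[0], I[1], s0)
    b = mux2(I[2], I[3], s0)
    c = mux2(I[4], I[5], s0)
    d = mux2(I[6], I[7], s0)
    # 4:1 mux on selects S2 S1, built from three 2:1 muxes
    return mux2(mux2(a, b, s1), mux2(c, d, s1), s2)
```

For every select combination (s2, s1, s0) this returns I[4*s2 + 2*s1 + s0], matching the direct 8:1 mux definition.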
[Figure: adding n-bit #s X, Y via D&C: add the ms n/4 bits, the 3rd n/4 bits, the 2nd n/4 bits, and the ls n/4 bits in four blocks, e.g., four 4-bit CLAs (FA = full adder)]
More intricate techniques (like P, G generation in the CLA) for the complex stitch-up needed in fast designs may have to be devised; these are not directly suggested by D&C, but D&C is a good starting point.
[Figure: data flow from Subprob. A1 to Subprob. A2]
Strategy 1: Wait for the required o/p of A1 and then perform A2, e.g., as in a ripple-carry adder: A = n-bit addition, A1 = (n/2)-bit addition of the L.S. n/2 bits, A2 = (n/2)-bit addition of the M.S. n/2 bits. No concurrency between A1 and A2: t(A) = t(A1) + t(A2) + t(stitch-up) = 2*t(A1) + t(stitch-up) if A1 and A2 are problems of the same kind and size (w/ different i/ps).
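Strategy 1 is what a ripple-carry adder does; a small Python behavioral model (illustrative, using ls-bit-first bit lists) shows the sequential carry dependence that forces the MS half to wait for the LS half:

```python
def full_adder(a, b, cin):
    """Single FA cell: sum and carry-out of three bits."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def ripple_carry_add(A, B):
    """Add two equal-length ls-bit-first bit lists; the carry ripples
    sequentially through every stage (no concurrency across stages)."""
    carry, out = 0, []
    for a, b in zip(A, B):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry
```

Each loop iteration depends on the previous one's carry, so the implied hardware delay is linear in the # of bits.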
Note: Gate delay is proportional to the # of inputs (since there is generally a series connection of transistors in either the pull-up or pull-down network, the # of transistor Rs in series adds up in proportion to the # of inputs, and delay ~ RC, where C is the capacitive load, is thus proportional to the # of inputs). Assume each gate i/p contributes 2 ns of delay. For a 16-bit ripple-carry adder the delay will be 160 ns; for a 64-bit adder the delay will be 640 ns.
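These delay figures follow from a simple per-stage model; the sketch below makes the assumed numbers explicit (the 5-gate-inputs-per-stage figure is inferred from the notes' 160 ns / 16-bit example, not stated in them):

```python
GATE_INPUT_DELAY_NS = 2   # given: each gate i/p contributes 2 ns
# Assumption made explicit: 160 ns at 16 bits implies 10 ns per bit stage,
# i.e. 5 gate-input delays of carry propagation per stage.
STAGE_GATE_INPUTS = 5

def ripple_adder_delay_ns(n_bits):
    """Linear ripple-carry delay: one full carry-propagation per bit."""
    return n_bits * STAGE_GATE_INPUTS * GATE_INPUT_DELAY_NS
```

This reproduces the 160 ns (16-bit) and 640 ns (64-bit) figures quoted above.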
[Figure: speculative computation: copies of Subprob. A2, one per possible value (00, 01, 10, 11) of the k-bit i/p from Subprob. A1, feed a 4-to-1 Mux whose select i/p is A1's actual o/p]
Strategy 2, Speculative: Have a copy of A2 for each possible value of its k-bit i/p from A1, and perform A1 and all the A2 copies concurrently; when A1's o/p is available, it selects the correct A2 o/p via the Mux.
Other variation, the Predict strategy: Have a single copy of A2, but choose a highly likely value of the k-bit i/p and perform A1 and A2 concurrently. If, after the k-bit i/p from A1 is available, the selection turns out to be incorrect, re-do A2 w/ the correct available value. t(A) = p(correct-choice)*max(t(A1), t(A2)) + (1 - p(correct-choice))*(t(A1) + t(A2)) + t(Mux) + t(stitch-up), where p(correct-choice) is the probability that our choice of the k-bit i/p for A2 is correct. We need a completion signal to indicate when the final o/p of A is available; assuming worst-case time (when the choice is incorrect) is meaningless in such designs.
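The expected-time expression for the Predict strategy can be captured directly (a toy model; the parameter names are mine):

```python
def predict_strategy_time(t_a1, t_a2, t_mux, t_stitch, p_correct):
    """Expected completion time of the Predict strategy: run A1 and one
    guessed-input copy of A2 concurrently; on a wrong guess, redo A2
    once A1's output is known."""
    correct = max(t_a1, t_a2)   # both ran in parallel, guess was right
    incorrect = t_a1 + t_a2     # wait for A1's o/p, then redo A2
    return (p_correct * correct
            + (1 - p_correct) * incorrect
            + t_mux + t_stitch)
```

With p_correct = 1 this degenerates to the fully speculative time max(t(A1), t(A2)) + t(Mux) + t(stitch-up); with p_correct = 0 it is worse than simply waiting.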
For a 16-bit adder, the delay is (9*4 - 8)*2 = 56 ns (2 ns is the delay for a single gate i/p); a 65% improvement ((160-56)*100/160). For a 64-bit adder, the delay is (9*8 - 8)*2 = 128 ns; an 80% improvement.
[Figure: lookahead computation: Subprob. A2 is split into A2_indep (= A2_lookahd), which needs no i/p from A1, and A2_dep, which does; the o/ps of A1 and A2_indep flow into A2_dep, which produces a2]
Strategy 3: Redo the design of A2 so that it can do as much processing as possible that is independent of the i/p from A1 (A2_indep = A2_lookahd). This is the lookahead computation that prepares for the final computation of A2 (A2_dep), which can start once A2_indep and A1 are done. t(A) = max(t(A1), t(A2_indep)) + t(A2_dep) + t(stitch-up). E.g., the carry-lookahead adder does lookahead computation; moreover, the lookahead computation is associative, so it is doable in Θ(log n) time, and the overall computation is also doable in Θ(log n) time. A less structured example: Let a1 be the i/p from A1 to A2, and suppose A2 has the logic a2 = vx + uvx + wxy + wz·a1 + ux·a1. If this were implemented using 2-i/p AND/OR gates, the delay would be 8 delay units (1 unit = delay for 1 gate i/p) after a1 is available. If the logic is restructured as a2 = (vx + uvx + wxy) + (wz + ux)·a1, and the logic in the 2 brackets is performed before a1 is available (it constitutes A2_indep), then the delay is only 4 delay units after a1 is available.
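The logic-restructuring example can be checked exhaustively; the Python sketch below verifies that the lookahead form equals the original over all 2^7 input combinations (& = AND, | = OR on 0/1 values):

```python
from itertools import product

def a2_original(u, v, w, x, y, z, a1):
    # a2 = vx + uvx + wxy + wz*a1 + ux*a1
    return (v & x) | (u & v & x) | (w & x & y) | (w & z & a1) | (u & x & a1)

def a2_restructured(u, v, w, x, y, z, a1):
    # lookahead part (A2_indep): computable before a1 arrives
    indep_sum = (v & x) | (u & v & x) | (w & x & y)
    indep_coeff = (w & z) | (u & x)
    # dependent part (A2_dep): only 2 gate levels after a1 is available
    return indep_sum | (indep_coeff & a1)

# exhaustive equivalence check over all 128 input combinations
assert all(a2_original(*bits) == a2_restructured(*bits)
           for bits in product([0, 1], repeat=7))
```

The restructuring is just factoring out a1; the software check mirrors what a formal equivalence checker would confirm for the two gate networks.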
D&C Summary
For complex digital design, we need to think of the computation underlying the design in an algorithmic manner: are there properties of this computation that can be exploited for a faster, less expensive, modular design; is it amenable to the D&C approach? The design is then developed in an algorithmic manner, & the corresponding circuit may be synthesized by hand or described compactly using an HDL.
For an operation/function x on n operands (a(n-1) x a(n-2) x ... x a0), if x is associative, the D&C approach gives an easy stitch-up function, which is x on 2 operands (the o/ps of applying x on each half). This means a tree-structured circuit with Θ(log n) delay can be synthesized instead of a linearly-connected circuit with Θ(n) delay. If x is non-associative, more ingenuity and determination of the properties of x is needed to determine the stitch-up function; the resulting design may or may not be tree-structured.
D&C can be done top-down or bottom-up; top-down is generally the better way to think for beginners.
If there is dependency between the 2 subproblems, then we saw strategies for addressing these dependencies:
- Wait (slowest, least hardware cost)
- Speculative (fastest, highest hardware cost)
- Lookahead (medium speed, medium hardware cost)
[Figure: (b) Speculative computation: blocks A and B computed concurrently, with a 2:1 Mux (select 0/1) choosing the o/p y. Time = max(T(A), T(B)) + T(Mux). Works well when T(A) approx. = T(B) and T(A) >> T(Mux)]
Strategy 3: Get the Best of Both Worlds (Average and Worst Case Delays)!
[Figure: inputs and a start signal are registered and fed to two parallel ckts: a Unary Division Ckt (good ave case, bad worst case) with completion signal done1, and a second division ckt with completion signal done2; an Ext. FSM uses done1/done2 to drive the output select of a Mux, whose o/p is stored in a Register]
Use 2 circuits with different worst-case and average-case behaviors, and use the first available output: you get the best of both (ave-case and worst-case) worlds. In the above schematic, we get the good ave-case performance of unary division (assuming uniformly distributed inputs) w/o the disadvantage of its bad worst-case performance.
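A toy timing model of the best-of-both scheme (the delay models below are assumptions for illustration, not from the notes): the external FSM effectively takes the minimum of the two circuits' completion times:

```python
def unary_div_delay(dividend, divisor):
    """Assumed model: unary division takes one cycle per subtraction,
    i.e. quotient + 1 cycles; great for small quotients, awful for large."""
    return dividend // divisor + 1

def fixed_div_delay(n_bits):
    """Assumed model: a conventional divider with a fixed n-cycle delay."""
    return n_bits

def best_of_both_delay(dividend, divisor, n_bits=16):
    """The FSM forwards whichever circuit asserts done first."""
    return min(unary_div_delay(dividend, divisor), fixed_div_delay(n_bits))
```

Small quotients finish at the unary circuit's speed; pathological inputs are capped at the fixed circuit's worst case.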
Conversion to a simple level-partitioned pipeline (a level partition may not always be possible, but other pipelineable partitions may be):
[Figure: pipeline stages (Stage 1, Stage 2, ..., Stage k) separated by registers]
Throughput is defined as # of outputs / sec. Non-pipelined throughput = 1/D, where D = delay of the original ckt's datapath. Pipeline throughput = 1/(max stage delay + register delay). Special case: If the original ckt's datapath is divided into n stages, each of equal delay, and dr is the delay of a register, then pipeline throughput = 1/((D/n) + dr). If dr is negligible compared to D/n, then pipeline throughput = n/D, i.e., n times that of the original ckt.
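These throughput formulas can be expressed directly (illustrative helper names; time units are arbitrary but consistent):

```python
def pipeline_throughput(total_delay, n_stages, reg_delay):
    """Outputs/sec for an ideal equal-delay n-stage pipeline:
    1 / (D/n + dr)."""
    return 1.0 / (total_delay / n_stages + reg_delay)

def speedup_over_unpipelined(total_delay, n_stages, reg_delay):
    """Throughput gain over the non-pipelined circuit (whose rate is 1/D)."""
    return pipeline_throughput(total_delay, n_stages, reg_delay) * total_delay
```

With negligible register delay the speedup approaches n; a nonzero dr shows why ever-deeper pipelining eventually stops paying off.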