
ECE 465 High Level Design Strategies

Lecture Notes # 9
Shantanu Dutt Electrical & Computer Engineering University of Illinois at Chicago

Outline
Circuit Design Problem
Solution Approaches: Truth Table (TT) vs. Computational/Algorithmic
  (Yes, hardware, just like software, can implement any algorithm!)
  Flat vs. Divide-&-Conquer
  Divide-&-Conquer:
    Associative operations/functions
    General operations/functions

Other Design Strategies for fast circuits:
  Speculative computation
  Best of both worlds (best average and best worst-case)
  Pipelining

Summary

Circuit Design Problem


Design an 8-bit comparator that compares two 8-bit #s available in two registers A[7..0] and B[7..0], and that o/ps F = 1 if A > B and F = 0 if A <= B.

Approach 1: The TT approach -- write down a 16-input TT, derive a logic expression from it, minimize it, obtain a gate-based realization, etc.!

  A        B        F
  00000000 00000000 0
  00000000 00000001 0
  ...
  00000001 00000000 1
  ...
  11111111 11111111 0

Too cumbersome and time-consuming
Fraught with the possibility of human error
Difficult to formally prove correctness (i.e., proof w/o exhaustive testing)
Will generally have high hardware cost (including wiring) and delay

Circuit Design Problem (contd.)


Approach 2: Think computationally/algorithmically about what the ckt is supposed to compute.

Approach 2(a): Flat algorithmic approach:
  Note: A TT can be expressed as a sequence of if-then-elses:
    If A = 00000000 and B = 00000000 then F = 0
    else if A = 00000000 and B = 00000001 then F = 0
    ...
    else if A = 00000001 and B = 00000000 then F = 1
    ...
  Essentially a re-hashing of the TT -- same problems as the TT approach.

Circuit Design Problem: Strategy 1: Divide-&-Conquer

Approach 2(b): Structured algorithmic approach:
  Be more innovative: think of the structure/properties of the computational problem.
  E.g., think about whether the problem can be solved in a hierarchical or divide-&-conquer (D&C) manner:
[Figure: D&C tree -- root problem A is broken into subproblems A1 and A2 (and these in turn into A1,1, A1,2, A2,1, A2,2, ...); the solutions to A1 and A2 are stitched up to form the complete soln to A. Recurse until the subprob size is s.t. a TT-based design is doable.]

D&C approach: See if the problem can be broken up into 2 or more smaller subproblems whose solutions can be stitched up to give a soln. to the parent prob. Do this recursively for each large subprob until the subprobs are small enough for TT-based solutions. If the subprobs are of a similar kind (but of smaller size) to the root prob, then the breakup and stitching will also be similar.

Shift Gears: Design of a Parity Detection Circuit (A Series of XORs)


[Figure (a): A linearly-connected circuit -- x(0), x(1), ..., x(15) XORed in a chain to produce f.]

[Figure (b): 16-bit parity tree.]

f = (((x(15) xor x(14)) xor (x(13) xor x(12))) xor ((x(11) xor x(10)) xor (x(9) xor x(8)))) xor (((x(7) xor x(6)) xor (x(5) xor x(4))) xor ((x(3) xor x(2)) xor (x(1) xor x(0))))

No concurrency in design (a) --- the actual problem has available concurrency, though, and it is not exploited well in the above linear design. Complete sequentialization leads to a delay that is linear in the # of bits n (delay = n*td, where td = delay of 1 gate).

All the available concurrency is exploited in design (b) --- a parity tree.

Question: When can we have a tree-structured circuit for an operation on multiple operands?
Answer: (1) When the operation makes sense for any # of operands. (2) When it can be broken down into operations w/ fewer operands. (3) When the operation is associative. An oper. x is said to be associative if: a x b x c = (a x b) x c = a x (b x c). Thus if we have 4 operands (& 3 operations) a x b x c x d, we can either perform this as a x (b x (c x d)) [getting a linear delay of 3 units] or as (a x b) x (c x d) [getting a logarithmic (base 2) delay of 2 units and exploiting the available concurrency due to the fact that x is associative]. We can extend this idea to n operands (& n-1 operations) to perform as many of the pairwise operations as possible in parallel (& do this recursively for every level of remaining operations), similar to design (b) for the parity detector [xor is an associative operation!], and thus get a Θ(log2 n) delay.

Delay = (# of levels in the XOR tree) * td = log2(n) * td. An example of simple designer ingenuity --- a bad design would have resulted in a linear delay that the VHDL code & the synthesis tool would have been at the mercy of.
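As a concrete illustration (not from the notes; the entity/signal names below are my own), design (b) can be described structurally in VHDL as a balanced XOR reduction with log2(16) = 4 levels of gates:

library ieee;
use ieee.std_logic_1164.all;

entity parity16 is
  port ( x : in  std_logic_vector(15 downto 0);
         f : out std_logic );
end entity parity16;

architecture tree of parity16 is
  signal l1 : std_logic_vector(7 downto 0);  -- level-1 XOR outputs
  signal l2 : std_logic_vector(3 downto 0);  -- level-2 XOR outputs
  signal l3 : std_logic_vector(1 downto 0);  -- level-3 XOR outputs
begin
  -- Level 1: 8 XORs on adjacent input pairs
  gen_l1 : for i in 0 to 7 generate
    l1(i) <= x(2*i+1) xor x(2*i);
  end generate;
  -- Level 2: 4 XORs
  gen_l2 : for i in 0 to 3 generate
    l2(i) <= l1(2*i+1) xor l1(2*i);
  end generate;
  -- Level 3: 2 XORs
  gen_l3 : for i in 0 to 1 generate
    l3(i) <= l2(2*i+1) xor l2(2*i);
  end generate;
  -- Level 4 (root): total delay is log2(16) = 4 XOR levels
  f <= l3(1) xor l3(0);
end architecture tree;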

D&C for Associative Operations


Let f(x(n-1), .., x(0)) be an associative function. What is the D&C principle involved in the design of an n-bit xor/parity function? Can it also lead automatically to a tree-based ckt?

[Figure: f(x(n-1), .., x(0)) is broken into f(x(n-1), .., x(n/2)) and f(x(n/2-1), .., x(0)), whose results a and b feed the stitch-up function f(a,b) --- the same as the original function for 2 inputs.]

Using the D&C approach for an associative operation results in the stitch-up function being the same as the original function (not the case for non-assoc. operations), but w/ a constant # of operands (2, if the orig problem is broken into 2 subproblems).

If the two sub-problems of the D&C approach are balanced (of the same size, or as close to it as possible), then unfolding the D&C results in a balanced operation tree of the type seen earlier for the xor/parity function.
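A minimal sketch of this D&C unfolding in VHDL, using xor as the associative operation and recursive instantiation to mirror the breakup (the entity name and generic N are my own choices, not from the notes); each level applies the 2-input stitch-up function to the results of the two halves:

library ieee;
use ieee.std_logic_1164.all;

entity xor_reduce is
  generic ( N : positive := 16 );
  port ( x : in  std_logic_vector(N-1 downto 0);
         f : out std_logic );
end entity xor_reduce;

architecture dandc of xor_reduce is
  signal f_ms, f_ls : std_logic;  -- results of the MS and LS halves
begin
  -- leaf subproblem: a single bit
  base : if N = 1 generate
    f <= x(0);
  end generate;

  -- breakup into two (nearly) balanced halves, solved recursively
  recurse : if N > 1 generate
    ms_half : entity work.xor_reduce
      generic map ( N => N - N/2 )
      port map ( x => x(N-1 downto N/2), f => f_ms );
    ls_half : entity work.xor_reduce
      generic map ( N => N/2 )
      port map ( x => x(N/2-1 downto 0), f => f_ls );
    -- stitch-up: the same operation applied to the two half-results
    f <= f_ms xor f_ls;
  end generate;
end architecture dandc;

For a power-of-2 N this elaborates into exactly the balanced tree of design (b).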

D&C Approach for Non-Associative Opers: n-bit Comparator


Useful property: At any level, the comparison of the MS (most significant) half determines the o/p if its result is > or <; else the comparison of the LS half determines the o/p. We can thus break up the problem at any level into MS and LS comparisons and, based on their results, determine which o/p to choose for the higher-level (parent) result.

Is comparison associative? Not clear. For a non-associative func, determine the properties that allow determining a break-up & a correct stitch-up function.

[Figure: D&C tree for the comparator. Root A: Comp A[7..0],B[7..0]; A1: Comp A[7..4],B[7..4]; A2: Comp A[3..0],B[3..0]; A1,1: Comp A[7..6],B[7..6]; A1,2: Comp A[5..4],B[5..4]; A1,1,1: Comp A[7],B[7]; A1,1,2: Comp A[6],B[6]. Stitch-up of solns to A1 and A2 forms the complete soln to A: if the A1 result is > or <, take the A1 result, else take the A2 result (similarly at every level, e.g., if the A1,1,1 result is > or <, take the A1,1,1 result, else take the A1,1,2 result). The leaf comparisons of single bit pairs are small enough to be designed using a TT.]

The leaf comparator (comparing 1 bit of A with 1 bit of B): its TT may be derived directly or by first thinking of and expressing its computation in a high-level programming language and then converting it to a TT:

If A[i] = B[i] then { f1(i) = 0; f2(i) = 1 }
  /* f2(i) o/p is an i/p to the stitch logic; f2(i) = 1 means the f1( ), f2( ) o/ps of the LS half of this subtree should be selected by the stitch logic as its o/ps */
else if A[i] < B[i] then { f1(i) = 0; f2(i) = 0 }
  /* f1(i) = 0 indicates <; f2(i) = 0 means this node's f1(i), f2(i) o/ps should be selected by the stitch logic as its o/ps */
else if A[i] > B[i] then { f1(i) = 1; f2(i) = 0 }  /* f1(i) = 1 indicates > */

  A[i] B[i] | f1(i) f2(i)
   0    0   |  0     1
   0    1   |  0     0
   1    0   |  1     0
   1    1   |  0     1

(2-bit 2-o/p comparator)
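A hedged VHDL sketch of this leaf comparator (entity/port names are mine; f1 and f2 follow the TT above, so f1 = A[i] AND (NOT B[i]) and f2 = 1 when A[i] = B[i]):

library ieee;
use ieee.std_logic_1164.all;

entity comp1 is
  port ( a, b : in  std_logic;
         f1   : out std_logic;    -- '1' => a > b
         f2   : out std_logic );  -- '1' => a = b, i.e., defer to the LS half
end entity comp1;

architecture rtl of comp1 is
begin
  f1 <= a and not b;
  f2 <= a xnor b;  -- equal when the two bits match
end architecture rtl;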

Comparator Circuit Design Using D&C (contd.)


Once the D&C tree is formulated, it is easy to get the low-level & stitch-up designs. The stitch-up design is shown here.
[Figure: The same D&C tree as before (Comp A[7..0],B[7..0] split into Comp A[7..4],B[7..4] and Comp A[3..0],B[3..0], down to the 1-bit comparisons), with the stitch-up rule at each level: if the MS-half result is > or <, take it, else take the LS-half result.]
Stitch-up logic details:
  If f2(i) = 0 then { my_op1 = f1(i); my_op2 = f2(i) }   /* select MS comp. o/ps */
  else { my_op1 = f1(i-1); my_op2 = f2(i-1) }            /* select LS comp. o/ps */

Compact TT for the stitch-up logic (X = don't care):

  f1(i) f2(i) f1(i-1) f2(i-1) | my_op1  my_op2
    X     0     X       X     | f1(i)   f2(i)
    X     1     X       X     | f1(i-1) f2(i-1)

Direct design: the stitch-up logic is simply a 2-bit-wide 2:1 Mux -- i/p I0 = (f1(i), f2(i)), i/p I1 = (f1(i-1), f2(i-1)), select = f2(i), o/p = (my_op1, my_op2). (The leaf comparator TT -- A[i], B[i] -> f1(i), f2(i) -- is the one given earlier.)
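The stitch-up logic is thus just the 2-bit-wide 2:1 mux above; a hedged VHDL sketch (names are mine, with bit 1 of each 2-bit bus carrying f1 and bit 0 carrying f2):

library ieee;
use ieee.std_logic_1164.all;

entity stitch is
  port ( f_ms  : in  std_logic_vector(1 downto 0);   -- (f1, f2) of the MS half
         f_ls  : in  std_logic_vector(1 downto 0);   -- (f1, f2) of the LS half
         my_op : out std_logic_vector(1 downto 0) ); -- (my_op1, my_op2)
end entity stitch;

architecture rtl of stitch is
begin
  -- f_ms(0) is f2(i): '0' => the MS half already decided (> or <), keep its result;
  -- '1' => the MS halves are equal, so the LS half's result decides
  my_op <= f_ms when f_ms(0) = '0' else f_ls;
end architecture rtl;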

Comparator Circuit Design Using D&C: Final Design


H/W_cost(8-bit comp.) = 7*(H/W_cost(2:1 Mux)) + 8*(H/W_cost(2-bit comp.))
H/W_cost(n-bit comp.) = (n-1)*(H/W_cost(2:1 Mux)) + n*(H/W_cost(2-bit comp.))

Delay(8-bit comp.) = 3*(delay of 2:1 Mux) + delay of 2-bit comp.
Delay(n-bit comp.) = (log2 n)*(delay of 2:1 Mux) + delay of 2-bit comp.

Note parallelism at work: multiple logic blocks are processing simultaneously.
[Figure: Final 8-bit comparator design. Eight 1-bit comparators (on A[7],B[7] through A[0],B[0]) produce 2-bit (f1, f2) pairs f(7)..f(0); these feed a tree of seven 2-bit-wide 2:1 Muxes arranged in log2 n = 3 levels, each mux selected by the f2 o/p of its MS-side i/p; the f1 bit of the root mux's o/p is F.]
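Tying the pieces together, a hedged sketch of the full n-bit comparator that recursively instantiates the comp1 and stitch sketches given earlier (all names are mine; F of the 8-bit comparator is my_op(1) of the top-level instance):

library ieee;
use ieee.std_logic_1164.all;

entity compn is
  generic ( N : positive := 8 );
  port ( a, b  : in  std_logic_vector(N-1 downto 0);
         my_op : out std_logic_vector(1 downto 0) );  -- (f1, f2) of this subtree
end entity compn;

architecture dandc of compn is
  signal f_ms, f_ls : std_logic_vector(1 downto 0);
begin
  -- leaf: the 1-bit (2-i/p, 2-o/p) comparator
  base : if N = 1 generate
    leaf : entity work.comp1
      port map ( a => a(0), b => b(0), f1 => my_op(1), f2 => my_op(0) );
  end generate;

  -- break into MS and LS halves and stitch their results with a 2-bit 2:1 mux
  recurse : if N > 1 generate
    ms : entity work.compn
      generic map ( N => N - N/2 )
      port map ( a => a(N-1 downto N/2), b => b(N-1 downto N/2), my_op => f_ms );
    ls : entity work.compn
      generic map ( N => N/2 )
      port map ( a => a(N/2-1 downto 0), b => b(N/2-1 downto 0), my_op => f_ls );
    su : entity work.stitch
      port map ( f_ms => f_ms, f_ls => f_ls, my_op => my_op );
  end generate;
end architecture dandc;

For N = 8 this unfolds into exactly the structure in the figure: 8 leaf comparators and 7 stitch-up muxes in 3 levels.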


D&C: Top-Down vs Bottom-Up: Mux Design


[Figure (a): Top-Down -- a 2^n:1 MUX is broken into two 2^(n-1):1 MUXes (each selected by S(n-2)..S0), one on the i/ps whose msb select value is a constant 0 and the other on those whose msb select value is a constant 1 (within each group all bits except the msb take on all different combinations; the msb value differs between the 2 groups). Their o/ps are stitched up by a 2:1 MUX selected by S(n-1).]

[Figure (b): Bottom-Up -- a level of 2^(n-1) 2:1 MUXes selected by S0 feeds a 2^(n-1):1 MUX selected by S(n-1)..S1.]
Generally better to try top-down first

An 8:1 MUX example (bottom-up)


[Figure: 8:1 MUX built bottom-up. I/p pairs (I0,I1), (I2,I3), (I4,I5), (I6,I7) each feed a 2:1 MUX selected by S0 (the even-indexed i/p is selected when S0 = 0, the odd-indexed one when S0 = 1); the four 2:1 MUX o/ps feed a 4:1 MUX selected by S2,S1, producing Z.]

The two i/ps at each 2:1 MUX should have different lsb (S0) values, since their selection is based on S0 (all other, i.e., unselected, select-bit values should be the same). Similarly for the other i/p pairs at the 2:1 MUXes at this level.
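A hedged VHDL sketch of this bottom-up 8:1 mux (entity/signal names are mine): a level of four 2:1 muxes selected by S0, stitched up by a 4:1 mux on S2,S1:

library ieee;
use ieee.std_logic_1164.all;

entity mux8 is
  port ( i : in  std_logic_vector(7 downto 0);
         s : in  std_logic_vector(2 downto 0);  -- s(2) is the msb select bit
         z : out std_logic );
end entity mux8;

architecture bottom_up of mux8 is
  signal m : std_logic_vector(3 downto 0);  -- outputs of the 2:1 mux level
begin
  -- level of 2:1 muxes: each i/p pair differs only in the S0 (lsb) position
  lvl0 : for k in 0 to 3 generate
    m(k) <= i(2*k) when s(0) = '0' else i(2*k+1);
  end generate;

  -- stitch-up 4:1 mux selected by S2,S1
  z <= m(0) when s(2 downto 1) = "00" else
       m(1) when s(2 downto 1) = "01" else
       m(2) when s(2 downto 1) = "10" else
       m(3);
end architecture bottom_up;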

Opening up the 8:1 MUX's hierarchical design and a top-down view


[Figure: The 8:1 MUX's hierarchical design opened up, viewed top-down. I/ps I0..I3 feed one 4:1 Mux (msb select value a constant 0) and I4..I7 feed the other 4:1 Mux (msb select value a constant 1); within each group all select bits except the msb take on different combinations, and the msb value differs between the 2 groups. The two 4:1-Mux o/ps are stitched up by a 2:1 MUX selected by S2. Deeper in the tree, an i/p pair sharing a 2:1 MUX is selected by the one select bit in which the two i/ps differ (e.g., a pair selected when S0 = 0, S1 = 1 should differ in S2).]

Adder Design using D&C


Example: Ripple-Carry Adder (RCA)
  Division: break "Add n-bit #s X, Y" into "Add MS n/2 bits of X,Y" and "Add LS n/2 bits of X,Y"
  Stitching up: the carry out of the LS n/2 bits is the carry-in of the MS n/2 bits, at each level of the D&C tree
  Leaf subproblem: Full Adder (FA)

[Figure (a): D&C tree for the Ripple-Carry Adder -- "Add n-bit #s X, Y" split recursively into MS-half and LS-half additions, down to FA leaves.]

Example: Carry-Lookahead Adder (CLA)
  Division: 4 subproblems per level ("Add ms n/4 bits", "Add 3rd n/4 bits", "Add 2nd n/4 bits", "Add ls n/4 bits")
  Stitching up: a more complex stitching-up process (generation of super P,G signals to connect up the subproblems)
  Leaf subproblem: 4-bit basic CLA with small p, g bits

[Figure (b): D&C tree for the Carry-Lookahead Adder -- "Add n-bit #s X, Y" split into four n/4-bit additions, down to 4-bit CLA leaves.]

More intricate techniques (like P,G generation in the CLA) for complex stitching-up in fast designs may need to be devised; these are not directly suggested by D&C, but D&C is a good starting point.
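A hedged VHDL sketch of the RCA breakup (names are mine): the carry out of the LS half is stitched into the carry-in of the MS half, recursively down to full-adder leaves (this is exactly the data dependency handled by the "wait" strategy discussed next):

library ieee;
use ieee.std_logic_1164.all;

entity rca is
  generic ( N : positive := 16 );
  port ( x, y : in  std_logic_vector(N-1 downto 0);
         cin  : in  std_logic;
         s    : out std_logic_vector(N-1 downto 0);
         cout : out std_logic );
end entity rca;

architecture dandc of rca is
  signal c_mid : std_logic;  -- carry from LS half into MS half (the stitch-up)
begin
  -- leaf: a single full adder
  base : if N = 1 generate
    s(0) <= x(0) xor y(0) xor cin;
    cout <= (x(0) and y(0)) or (cin and (x(0) xor y(0)));
  end generate;

  recurse : if N > 1 generate
    ls_half : entity work.rca
      generic map ( N => N/2 )
      port map ( x => x(N/2-1 downto 0), y => y(N/2-1 downto 0),
                 cin => cin, s => s(N/2-1 downto 0), cout => c_mid );
    ms_half : entity work.rca
      generic map ( N => N - N/2 )
      port map ( x => x(N-1 downto N/2), y => y(N-1 downto N/2),
                 cin => c_mid, s => s(N-1 downto N/2), cout => cout );
  end generate;
end architecture dandc;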

Dependency Resolution in D&C: (1) The Wait Strategy


So far we have seen D&C breakups in which there is no data dependency between the two (or more) subproblems of the breakup. Data dependency leads to increased delays. We now look at various ways of speeding up designs that have subproblem dependencies in their D&C breakups.
[Figure: Root problem A broken into Subprob. A1 and Subprob. A2, with a data-flow arrow from A1 to A2.]

Strategy 1: Wait for the required o/p of A1 and then perform A2, e.g., as in a ripple-carry adder: A = n-bit addition, A1 = (n/2)-bit addition of the LS n/2 bits, A2 = (n/2)-bit addition of the MS n/2 bits. No concurrency between A1 and A2: t(A) = t(A1) + t(A2) + t(stitch-up) = 2*t(A1) + t(stitch-up) if A1 and A2 are the same problem of the same size (w/ different i/ps).

Example of the Wait Strategy in Adder Design

Note: Gate delay is proportional to the # of inputs (since, generally, there is a series connection of transistors in either the pull-up or pull-down network equal in number to the # of inputs; the Rs of the transistors in series add up and are prop. to the # of inputs, and delay ~ RC (C is the capacitive load) is thus prop. to the # of inputs).
Assume each gate i/p contributes 2 ns of delay.
For a 16-bit adder the delay will be 160 ns.
For a 64-bit adder the delay will be 640 ns.

Dependency Resolution in D&C: (2) The Design-for-all-cases-&-select or Speculative Strategy

Strategy 2: For a k-bit i/p from A1 to A2, design 2^k copies of A2, each with a different hardwired k-bit i/p replacing the one from A1. Select the correct o/p from all the copies of A2 via a 2^k-to-1 Mux whose select i/p is the k-bit o/p of A1 when it becomes available (e.g., the carry-select adder).
t(A) = max(t(A1), t(A2)) + t(Mux) + t(stitch-up) = t(A1) + t(Mux) + t(stitch-up) if A1 and A2 are the same problem.
[Figure: Speculative breakup for k = 2: Subprob. A1 runs in parallel with four copies of Subprob. A2, hardwired with i/p values 00, 01, 10, 11 (I/p00, I/p01, I/p10, I/p11); a 4-to-1 Mux, with its select i/p driven by A1's 2-bit o/p, picks the correct copy's o/p.]

Other variation --- the Predict Strategy: Have a single copy of A2, but choose a highly likely value of the k-bit i/p and perform A1, A2 concurrently. If, once the k-bit i/p from A1 is available, the chosen value turns out to be incorrect, re-do A2 w/ the correct, now-available value.
t(A) = p(correct-choice)*max(t(A1), t(A2)) + (1 - p(correct-choice))*t(A2) + t(Mux) + t(stitch-up), where p(correct-choice) is the probability that our choice of the k-bit i/p for A2 is correct.
Need a completion signal to indicate when the final o/p of A is available; assuming worst-case time (when the choice is incorrect) is meaningless in such designs.

Example of the Speculative Strategy in Adder Design

For a 16-bit adder, the delay is (9*4 - 8)*2 = 56 ns (2 ns is the delay for a single gate i/p); a 65% improvement ((160-56)*100/160).
For a 64-bit adder, the delay is (9*8 - 8)*2 = 128 ns; an 80% improvement.
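A hedged sketch of one level of this speculative stitch-up in VHDL, i.e., a carry-select arrangement built from the rca sketch given earlier (the names and the even-N restriction are my simplifications):

library ieee;
use ieee.std_logic_1164.all;

entity csel_add is
  generic ( N : positive := 16 );  -- assumed even, for simplicity
  port ( x, y : in  std_logic_vector(N-1 downto 0);
         cin  : in  std_logic;
         s    : out std_logic_vector(N-1 downto 0);
         cout : out std_logic );
end entity csel_add;

architecture speculative of csel_add is
  signal c_ls         : std_logic;                         -- LS half's carry-out
  signal s_ms0, s_ms1 : std_logic_vector(N-1 downto N/2);  -- MS sums for cin = 0 / 1
  signal c_ms0, c_ms1 : std_logic;
begin
  ls_half : entity work.rca
    generic map ( N => N/2 )
    port map ( x => x(N/2-1 downto 0), y => y(N/2-1 downto 0),
               cin => cin, s => s(N/2-1 downto 0), cout => c_ls );

  -- speculative copies of the MS half, one per possible carry-in value
  ms_c0 : entity work.rca
    generic map ( N => N/2 )
    port map ( x => x(N-1 downto N/2), y => y(N-1 downto N/2),
               cin => '0', s => s_ms0, cout => c_ms0 );
  ms_c1 : entity work.rca
    generic map ( N => N/2 )
    port map ( x => x(N-1 downto N/2), y => y(N-1 downto N/2),
               cin => '1', s => s_ms1, cout => c_ms1 );

  -- mux selects the correct copy once the LS carry is available
  s(N-1 downto N/2) <= s_ms0 when c_ls = '0' else s_ms1;
  cout              <= c_ms0 when c_ls = '0' else c_ms1;
end architecture speculative;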

Dependency Resolution in D&C: (3) The Lookahead or Pre-Computation Strategy


Concept:

[Figure: Root problem A, with Subprob. A1 feeding data into Subprob. A2; A2 is split into A2_indep (or A2_lookahd), which needs no i/p from A1, and A2_dep, which does. For an example of unstructured logic for A2, the critical path after a1 is available is 8 units of delay; after restructuring into A2_indep followed by A2_dep, it is 4 units.]

Strategy 3: Redo the design of A2 so that it can do as much processing as possible that is independent of the i/p from A1 (A2_indep = A2_lookahd). This is the lookahead computation that prepares for the final computation of A2 (A2_dep), which can start once A2_indep and A1 are done.
t(A) = max(t(A1), t(A2_indep)) + t(A2_dep) + t(stitch-up)
E.g., the carry-lookahead adder does lookahead computation; the lookahead computation is also associative, so it is doable in Θ(log n) time, and the overall computation is also doable in Θ(log n) time.
A less structured example: Let a1 be the i/p from A1 to A2, and suppose A2 has the logic a2 = vx + uvx + wxy + wza1 + uxa1. If this were implemented using 2-i/p AND/OR gates, the delay would be 8 delay units (1 unit = delay for 1 i/p) after a1 is available. If the logic is re-structured as a2 = (vx + uvx + wxy) + (wz + ux)a1, and the logic in the 2 brackets is performed before a1 is available (these constitute A2_indep), then the delay is only 4 delay units after a1 is available.
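A hedged VHDL sketch of the restructured a2 example (entity/signal names are mine): the two bracketed terms form A2_indep and are computed without a1, so only a 2-i/p AND and a 2-i/p OR remain on a1's path (A2_dep):

library ieee;
use ieee.std_logic_1164.all;

entity a2_lookahead is
  port ( u, v, w, x, y, z, a1 : in  std_logic;
         a2                   : out std_logic );
end entity a2_lookahead;

architecture rtl of a2_lookahead is
  signal indep_sum  : std_logic;  -- vx + uvx + wxy  (independent of a1)
  signal indep_coef : std_logic;  -- wz + ux         (independent of a1)
begin
  -- A2_indep / lookahead part: ready before a1 is available
  indep_sum  <= (v and x) or (u and v and x) or (w and x and y);
  indep_coef <= (w and z) or (u and x);
  -- A2_dep part: short critical path (AND then OR) once a1 arrives
  a2 <= indep_sum or (indep_coef and a1);
end architecture rtl;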

D&C Summary
For complex digital design, we need to think of the computation underlying the design in an algorithmic manner --- are there properties of this computation that can be exploited for a faster, less expensive, modular design; is it amenable to the D&C approach? The design is then developed in an algorithmic manner & the corresponding circuit may be synthesized by hand or described compactly using an HDL.
For an operation/func x on n operands (a(n-1) x a(n-2) x ... x a(0)), if x is associative, the D&C approach gives an easy stitch-up function, which is x on 2 operands (the o/ps of applying x on each half). This means a tree-structured circuit with Θ(log n) delay, instead of a linearly-connected circuit with Θ(n) delay, can be synthesized.
If x is non-associative, more ingenuity and determination of the properties of x is needed to determine the stitch-up function. The resulting design may or may not be tree-structured.
D&C can be done top-down or bottom-up. Top-down is generally the better way to think for beginners.
If there is dependency between the 2 subproblems, then we saw strategies for addressing these dependencies:
  Wait (slowest, least hardware cost)
  Speculative (fastest, highest hardware cost)
  Lookahead (medium speed, medium hardware cost)

Strategy 2: A general view of speculative computations (w/ or w/o D&C)


If there is a data dependency between two or more portions of a computation (which may be obtained w/ or w/o using D&C), don't wait for the previous computation to finish before starting the next one.
Assume all possible input values for the next computation/stage B (e.g., if it has 2 inputs from the prev. stage there will be 4 possible input value combinations) and perform it using a copy of the design for each possible input value. All the different o/ps of the different copies of B are muxed using the prev. stage A's o/p.
E.g. design: the Carry-Select Adder (each stage performs two additions, one for a carry-in of 0 and another for a carry-in of 1 from the previous stage).
[Figure (a): Original design -- A's o/ps x, y feed B, which produces z; Time = T(A) + T(B).]

[Figure (b): Speculative computation -- four copies B(0,0), B(0,1), B(1,0), B(1,1), one per combination of B's 2-bit i/p from A, feed a 4:1 Mux selected by A's o/p; Time = max(T(A), T(B)) + T(Mux). Works well when T(A) approx. = T(B) and T(A) >> T(Mux).]

Strategy 3: Get the Best of Both Worlds (Average and Worst Case Delays)!
[Figure: Two division circuits fed from the same input registers and started together: a Unary Division Ckt (good ave case, bad worst case; completion signal done1) and a Non-Restoring Div. Ckt (bad ave case, good worst case; completion signal done2). An external FSM uses done1/done2 to drive the output select of a Mux, and the first available output is latched into the output register.]

Use 2 circuits with different worst-case and average-case behaviors. Use the first available output. Get the best of both (ave-case, worst-case) worlds. In the above schematic, we get the good ave-case performance of unary division (assuming uniformly distributed inputs) w/o the disadvantage of its bad worst-case performance.

Strategy 4: Pipeline It!


[Figure: The original ckt or datapath converted to a simple level-partitioned pipeline of Stage 1, Stage 2, ..., Stage k, with registers between stages (a level partition may not always be possible, but other pipelineable partitions may be).]

Throughput is defined as # of outputs / sec.
Non-pipelined throughput = 1/D, where D = delay of the original ckt's datapath.
Pipeline throughput = 1/(max stage delay + register delay).
Special case: If the original ckt's datapath is divided into n stages, each of equal delay, and dr is the delay of a register, then pipeline throughput = 1/((D/n) + dr). If dr is negligible compared to D/n, then pipeline throughput = n/D, n times that of the original ckt.
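A minimal, hedged VHDL sketch of the idea (the two stage functions below are arbitrary placeholders, not from the notes): a pipeline register is placed between two combinational stages, so the achievable clock period is set by the max stage delay plus the register delay:

library ieee;
use ieee.std_logic_1164.all;

entity pipe2 is
  port ( clk  : in  std_logic;
         din  : in  std_logic_vector(7 downto 0);
         dout : out std_logic_vector(7 downto 0) );
end entity pipe2;

architecture rtl of pipe2 is
  signal stage1_out : std_logic_vector(7 downto 0);  -- o/p of stage-1 logic
  signal pipe_reg   : std_logic_vector(7 downto 0);  -- pipeline register
begin
  -- stage-1 combinational logic (placeholder for the first level-partition)
  stage1_out <= not din;

  -- pipeline register between the two stages
  process (clk)
  begin
    if rising_edge(clk) then
      pipe_reg <= stage1_out;
    end if;
  end process;

  -- stage-2 combinational logic (placeholder for the second level-partition)
  dout <= pipe_reg(0) & pipe_reg(7 downto 1);
end architecture rtl;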
