
Analysis of a Fully-Scalable Digital Fractional Clock Divider

Thomas B. Preußer, Rainer G. Spallek
Technische Universität Dresden, Germany
{preusser,rgs}@ite.inf.tu-dresden.de

Abstract
It was previously shown [5] that the Bresenham algorithm [2] is well-suited for digital fractional clock generation. Specifically, it proved to be the optimal approximation of a desired clock in terms of the edges provided by the reference clock. Moreover, synthesis results for hardwired dividers on Altera FPGAs showed that this technique for clock division achieves a high performance, often at or close to the maximum frequency supported by the devices, for moderate bit widths of up to 16 bits. This paper extends the investigation of clock division by the Bresenham algorithm. It identifies the limits encountered by the existing implementation for both FPGA and VLSI realizations. A rather unconventional adoption of the carry-save representation combined with a soft-threshold comparison is proposed to circumvent these limitations. The resulting design is described and evaluated. Mathematically appealing results on the quality of the approximation achieved by this approach are presented.

Figure 1. Basic Bresenham Clock Division

1 Introduction

The Bresenham algorithm [2] is a long-known algorithm for plotting straight lines. The use of a hardware implementation of the Bresenham algorithm for fractional clock generation is mentioned in [3]. In [5], it is formally proven to be the optimal approximation of the desired clock in terms of the switching edges provided by an available reference clock. Although the latter work showed that a very straightforward implementation of a hardwired Bresenham clock divider performs well on current FPGA hardware for moderate bit widths of up to 16 bits, this paper extends that work by showing the performance bounds of the direct implementation and by proposing an approach that trades some approximation quality for ideal scalability. It is shown that the incurred quality loss is fairly small and is absent altogether for most fractions.

In the remainder of this paper, Sec. 2 reviews the Bresenham algorithm in the context of discrete fractional clock division. Sec. 3 proposes a design based on carry-save arithmetic and soft-threshold comparison. Its simulation and results on the quality of its generated clock are presented in Sec. 4. Sec. 5 concludes the paper and recapitulates the observations and open mathematical questions.

2 Review of Bresenham Clock Division

The hardware design proposed in [5] to implement the Bresenham clock division is reproduced in Fig. 1. The frequency of the generated clock is $f_{out} = \frac{P}{Q} f_{in}$. The output clock is free of any long-term phase drift. The initialization $e = Q - P$ was used in [5] to prove that the approximation of the ideal clock obtained by this design is optimal. All other initializations were shown to differ from this result merely in phase. In [5], a few synthesis results for hardwired FPGA implementations were presented. The design is, however, equally applicable to a programmable VLSI block simply by turning the constant inputs $2P$ and $2P - Q$ into configuration registers. For the further discussion, we will somewhat depart from the special case of clock division, which causes the appearance of $2P$ due to the need to generate two edges for each complete clock cycle of the output clock. In a more general

Application-specific Systems, Architectures and Processors (ASAP'06) 0-7695-2682-9/06 $20.00 2006

Figure 2. Design (a) with Alternate Range of e

3 A Fast, Hardly Approximating Design

Addition can be performed faster when a redundant representation of e is permissible. In fact, the coding of e is arbitrary as long as the modulo events, i.e. the additions of $p - q$ instead of $p$, can be identified. So far, firm thresholds ($q$, $2^{n-1}$ or $0$) were used to trigger these events but unfortunately:

Lemma. Let G be a totally ordered group and $+ : G \times G \to G$ the group operation, called addition. Then there is no representation of the members of G that allows both the addition and the comparison against some fixed group member to be implemented faster than $\Omega(\log \log |G|)$.

For a proof sketched after [6], see [4]. Applied to addition, this means that a fast addition and a fast comparison against a firm threshold are not both possible at the same time. In terms of the bit width n, $\Omega(\log n)$ is the best that can be achieved.

A way out of this dilemma is the soft-threshold comparison. Consider the design given in Fig. 3. It resembles the design from Fig. 2 except for the representation of e and the identification of the modulo event. e is encoded in carry-save representation, i.e. its numerical value is the sum of two pseudo-components: $e_s$, the pseudo-sum, and $e_c$, the pseudo-carry. The modulo event is triggered, depending on the configuration bit t, by a single bit or by two bits set in the MSB position. The performance of this design is no longer limited by a carry propagation path. Its critical combinatorial path is $O(1)$. Only the signal controlling the wide $p$ / $p - q$ MUX is loaded heavily and is thus likely to require some effort in a practical design.

The quality of the clock generated by this design is not clear at all. Due to the redundant nature of the carry-save representation, certain values of e may trigger the modulo event when represented one way and may not when represented another. Thus, the series of values in e is not predetermined by design and might depend not only on the choices of p, q and t but also on the initial value and representation of e. Yet the modulo classes with respect to q that are represented by the values of e succeed one another in a well-defined order, as the additions of $p$ and $p - q$ are equivalent in terms of these classes. As the number of states that e can assume is finite, a cycle must be entered eventually.

As equal states of e in successive iterations of this cycle represent the same modulo class, the period of any such cycle is a multiple of q (or of $q / \mathrm{gcf}(p, q)$ if p and q are not relatively prime). Since the additions of $p$ and $p - q$ also sum up to zero over a whole cycle, their overall ratio equals that of the original Bresenham design. Their distribution, however, may differ so that the generated modulo events no longer constitute the best approximation of the desired clock at the output. So the

discussion, p and q shall thus be used, which satisfy $p = 2P$ and $q = Q$ for the clock division. Recall the requirement $q \ge p$ that carries over from the original line drawing application.

The operation implemented by each iteration is the addition of p modulo q so that e iterates through the remainder classes of q using representatives from $[0, q)$. This choice certainly minimizes the bit width used in the design but is nevertheless arbitrary. In fact, a whole adder can be eliminated if the case in which $p - q$ instead of $p$ is to be added (call it the modulo event) is identified by the value of e rather than by the sign of a speculative add. This can be achieved either (a) by choosing representatives in the range $[2^{n-1} + p - q,\ 2^{n-1} + p)$, triggering the modulo event by the most significant bit (MSB) of e with value $2^{n-1}$, or, similarly, (b) by subtracting instead of adding $p$ and $p - q$, combined with the sign detection of the two's complement representation of e, again by its MSB. The latter approach gives e a range of $[-p,\ q - p)$. The bit width required for both of these approaches must satisfy $n \ge 1 + \lceil \mathrm{ld} \max\{q - p, p\} \rceil$. Therefore, neither of these cases will ever require a smaller bit width than the original implementation ($\lceil \mathrm{ld}\, q \rceil$). Nonetheless, they also require at most one bit more. Unless the carry propagation delay in the adder is absolutely tight, this is definitely a good tradeoff. The resulting design, here for case (a), is depicted in Fig. 2.

The performance-limiting factor of the designs seen so far is the carry propagation within the adder. Even when choosing fast binary adder implementations, the achievable combinatorial delay is still $\Omega(\log n)$ [6]. Optimizations for the original line drawing application cannot be transferred to the clock division as they rely on parallelization and/or the identification of identical line segments [1]. Faster implementations in the clock division domain are only possible by eliminating the carry propagation. If such implementations can be found, their adoption for line drawing is very well possible.
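The claim that design (a) produces the same modulo events as the plain modulo-q iteration can be checked with a small software model. The following Python sketch is only an illustration and not the hardware description from [5]; the function names `reference_events`, `min_width` and `msb_events`, as well as the offset used to align the two initial values, are assumptions made for this example.

```python
def reference_events(p, q, steps, e=0):
    """Add p modulo q with e kept in [0, q); one event flag per step."""
    events = []
    for _ in range(steps):
        wrap = e + p >= q              # the speculative add reaches q
        e = e + p - q if wrap else e + p
        events.append(wrap)
    return events


def min_width(p, q):
    """Smallest n with n >= 1 + ceil(ld(max(q - p, p)))."""
    return 1 + (max(q - p, p) - 1).bit_length()


def msb_events(p, q, steps, e_ref=0):
    """Design (a): representatives in [2^(n-1) + p - q, 2^(n-1) + p)."""
    n = min_width(p, q)
    half = 1 << (n - 1)
    e = e_ref + half + p - q           # shift the reference value into the new range
    events = []
    for _ in range(steps):
        assert 0 <= e < (1 << n)       # the value always fits into n bits
        msb = e >= half                # MSB set: modulo event
        e = e + p - q if msb else e + p
        events.append(msb)
    return events
```

For example, `msb_events(4, 7, 7)` and `reference_events(4, 7, 7)` return the same list of flags, because the MSB of the shifted value is set exactly when the reference value is at least $q - p$.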


Figure 3. Carry-Save Digital Clock Division Design

quality of the output of the proposed design will need to be evaluated.

For the design to work, it is absolutely vital that the additions do not produce an arithmetic overflow. For a width of n bits, an overflow would essentially cause a $\bmod\ 2^n$ operation, which is consistent with the calculation within the modulo classes of q only in the special case that $q = 2^{n-k}$. So it must be ensured that the addition of $p$ does not produce an outgoing carry and that the addition of $p - q$, which is actually a subtraction, does produce an outgoing carry. The first requirement implies that, if both MSBs of e are set, the modulo event must be triggered. The second requirement implies that the modulo event must not be triggered unless at least one of the MSBs of e is set. As these border cases are easily identified by the inspection of only two bits, they were chosen as the two options to investigate. In the design, they can be configured via t.

Observe that the use of the carry-save representation slightly differs from its application in common arithmetic settings. Here, the numeric value of e is strictly defined to be the positive sum of both unsigned pseudo-components without any modulo operation. Specifically, the n-bit-wide carry-save representation with $e_s = 2^n - 1$ and $e_c = 1$ is not another representation of the value 0 but represents $e = 2^n$.

Further, note that the application of the carry-save representation in the proposed design allows the formation of equivalence classes among the representations of an integer e. Interpreting the two hardware bits at each bit position as encoding one of the digits {0, 1, 2}, the two different encodings of the digit 1 are not distinguished by the carry-save adder and can thus be considered equivalent. So the two representations $9 + 3$ (bit vectors 1001 and 0011) and $11 + 1$ (bit vectors 1011 and 0001) of 12 are equivalent as both recode to $1012_2$. $7 + 5$ (bit vectors 0111 and 0101), on the other hand, recodes to $0212_2$ and is thus not equivalent to those representations.

Definition. A representation of a natural number in the redundant place-value system with the digits {0, 1, 2} and the base 2 is called additive carry-save representation. Because of its predominance in this paper, the attribute additive can and will usually be omitted.

The range of values traversed by e during a cycle is no longer confined to an interval of length q. The smallest possible value that e can assume within a cycle is reached after the smallest e that has a carry-save representation triggering the modulo event; the largest possible value of e within a cycle is reached after the greatest e that has a carry-save representation not triggering the modulo event. Thus, the cyclic values of e lie somewhere in:

$$E = \begin{cases} [\,2^{n-1} + p - q,\ 2 \cdot (2^{n-1} - 1) - 1 + p\,] & \text{if } t = 0 \\ [\,2^{n} + p - q,\ (2^{n} - 1) + (2^{n-1} - 1) - 1 + p\,] & \text{if } t = 1 \end{cases}$$

These intervals are called the cyclic range E of e.

For the determination of the minimum implementation bit width n, observe that the negative number $p - q$ must be representable as a two's complement of n bits, that p must be representable as an unsigned natural of n bits, that for t = 1 the MSB of p must not be set as this could provoke an outgoing carry from the CSA, and that the cyclic range of e must be representable in additive carry-save with components at most n bits wide. The intersection of these requirements yields:

$$n \ge \begin{cases} \max\{\lceil \mathrm{ld}(p + 1) \rceil,\ 1 + \lceil \mathrm{ld}(q - p) \rceil\} & \text{if } t = 0 \\ \max\{1 + \lceil \mathrm{ld}(p + 1) \rceil,\ 1 + \lceil \mathrm{ld}(q - p) \rceil\} & \text{if } t = 1 \end{cases}$$

These bounds are fairly similar to the one found for the design of Fig. 2. So the main investment into the carry-save implementation is the coding overhead for the representation of e, which doubles the register size. There is no significant gain or price paid in terms of bit width. Multiplexer and adder thus have essentially the same complexity.
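The recoding into the digits {0, 1, 2}, the bit-width bound and the cyclic range can be reproduced numerically. The Python sketch below is a hedged illustration: the helper names `recode`, `value`, `min_width` and `cyclic_range` are hypothetical, and the formulas are the ones stated in the text above, evaluated with integer arithmetic (`int.bit_length` stands in for the rounded-up binary logarithm).

```python
def recode(e_s, e_c, n):
    """Digit string over {0, 1, 2}, MSB first, for the pair (e_s, e_c)."""
    return "".join(
        str(((e_s >> i) & 1) + ((e_c >> i) & 1)) for i in reversed(range(n))
    )


def value(digits):
    """Numeric value of a base-2 digit string with the digits 0, 1, 2."""
    v = 0
    for d in digits:
        v = 2 * v + int(d)
    return v


def min_width(p, q, t):
    """Smallest n satisfying the bound for threshold option t (1 <= p < q)."""
    ld_p1 = p.bit_length()              # ceil(ld(p + 1)) for p >= 1
    ld_qp = (q - p - 1).bit_length()    # ceil(ld(q - p)) for q > p
    return max(ld_p1 + (1 if t else 0), 1 + ld_qp)


def cyclic_range(p, q, n, t):
    """Closed interval (lo, hi) of the cyclic range E."""
    half, full = 1 << (n - 1), 1 << n
    if t == 0:
        return half + p - q, 2 * (half - 1) - 1 + p
    return full + p - q, (full - 1) + (half - 1) - 1 + p
```

With these helpers, `recode(9, 3, 4)` and `recode(11, 1, 4)` yield the same digit string while `recode(7, 5, 4)` yields a different one, and all three strings have the value 12. For (p, q) = (4, 7) and t = 1, `min_width` gives 4 and `cyclic_range` gives the interval from 13 to 25.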


4 Simulation and Results


The goal of the simulation was to obtain some information about the cyclic behavior of the proposed design, specifically: the number of stable cycles for a fraction $\frac{p}{q}$, their periods, and their approximation quality as compared to the original Bresenham clock division. While the interest in the approximation quality is obvious, the other goals may require some elaboration: If a stable cycle turns out to be unique, the initialization requirements may be relaxed as any initial value will eventually lead into this cycle. The knowledge about the period of the cycles may be a first step toward a provable quality assurance as no multi-q-cycle can produce a better quality than the primitive Bresenham cycle with period q.

The simulation must be restricted to some attractive range. So the bit width used in the simulation was limited to its minimum n and to n + 1. Note that results found for some bit width can easily be mapped to wider setups by shifting left the involved constants $p$ and $p - q$ as well as the initialization of e. No output quality is lost by this scaling. The simulated quotients $\frac{p}{q}$ were restricted to reduced fractions. While this covers all fractions with at least one setup, it also eases the simulation effort significantly as all cycles must contain a representative of every modulo class with respect to q in these setups. This limits the initial states of e to consider to a single horizontal cut through the e-lattice and also simplifies cycle detection along this cut. As detailed in [4], the overall simulation effort for a fraction $\frac{p}{q}$ thus becomes $O(q^a)$ with $a = \mathrm{ld}(1 + \sqrt{5}) < 1.695$.

The quality of the generated clock output is measured by the mean squared distance of the integer times of the modulo events from their optimal occurrence on the continuous real timescale, where a cycle of the input clock serves as the time unit. As the evaluation is to reflect the edge jitter rather than the phase of the output clock, an arbitrary phase is allowed to minimize this deviation. The quality of the original Bresenham approximation is explicitly given by $\frac{1}{12}\left(1 - \frac{1}{p^2}\right)$ (see [4]) and provides a baseline by which the other results are normalized.

The behavior of the design was exhaustively simulated for all reduced fractions $\frac{p}{q}$ with $1 \le p < q \le 4096$. The results are visualized in Fig. 5, which also reveals an astonishing geometric regularity. The most interesting observations are that, for all examined fractions, the cycles of the setup with minimal bit width are unique and primitive, and that this setup seems to be superior for most fractions but not for all (smallest counterexample: $\frac{6}{17}$). Most interestingly, the inferior fractions seem to cluster in diamond-shaped areas around some $p = m q$, with $p = \frac{1}{2} q$ being the most dominant. The visual representation even suggests a fractal self-similarity, which cannot, however, apply in a strict sense as only reduced fractions are covered.
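The quality measure described above can be sketched in a few lines of Python. This is an illustrative reading of the metric, not the simulator used for the paper: the function names are hypothetical, the events of one period of a reduced fraction are compared against the ideal grid $k \cdot q / p$, and the free phase is chosen as the mean deviation, which makes the measure the variance of the deviations.

```python
from fractions import Fraction


def bresenham_event_times(p, q):
    """Integer times of the modulo events within one period of q steps."""
    e, times = 0, []
    for step in range(q):
        if e + p >= q:
            e += p - q
            times.append(step)
        else:
            e += p
    return times


def jitter(times, p, q):
    """Mean squared deviation from the ideal grid k*q/p, phase optimized.

    The best constant phase is the mean deviation, so the result is the
    variance of the deviations (in units of squared input clock cycles).
    """
    dev = [Fraction(t) - Fraction(k * q, p) for k, t in enumerate(times)]
    mean = sum(dev, Fraction(0)) / len(dev)
    return sum((d - mean) ** 2 for d in dev) / len(dev)


def baseline(p):
    """Closed form quoted in the text for the original design."""
    return Fraction(1, 12) * (1 - Fraction(1, p * p))
```

For a reduced fraction, `jitter(bresenham_event_times(p, q), p, q)` coincides with `baseline(p)`; dividing the jitter of another event sequence by `baseline(p)` gives a normalized figure in the sense used for Fig. 5.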


Figure 4. e-Lattice for (p, q; n, t) = (4, 7; 4, 1)

The members of the cyclic range E can be organized in a lattice-like graph. (Note that the term lattice is merely inspired by the shape and has no relation to the algebraic lattice established by certain posets.) An example of such a lattice for (p, q) = (4, 7) with n = 4 and t = 1 is given in Fig. 4. Each edge of this graph represents a transition from one value of e to another. An edge to the left implies that the value of the source node has a carry-save representation triggering the modulo event; an edge to the right analogously implies that it can be represented in a way that does not trigger the modulo event. Note that many values of e have carry-save representations of both kinds. Further, observe that the lattice has been drawn such that horizontally neighboring nodes are two values representing the same modulo class with respect to q.

As it turns out, the cycle for the case exemplified in Fig. 4 is unique and has a period of exactly q. For the argument of uniqueness, observe that the carry-save representations of 15 and 22 (as well as 14 and 23) reachable within a cycle are unique. As 15 and 22 are the only representatives of their modulo class inside the cyclic range E, any cycle must pass through one of them. As both are succeeded by the same representation of 19 (which is $2011_2$), a unique and primitive cycle of period q follows. Its simulation yields the path shown in bold in Fig. 4. Most importantly, the distribution of the modulo events on this path is equivalent to the one produced by the original design.

Unfortunately, this advantageous behavior does not generalize. Simulation shows, however, that it is astonishingly common. Moreover, the quotients $\frac{p}{q}$ that do not yield a full-quality Bresenham-like sequence of modulo events seem to cluster in well-defined, diamond-shaped areas on the p-q plane. These results are described in detail in the following section.
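The cycle of the example (p, q; n, t) = (4, 7; 4, 1) can be traced with a bit-level software model. The Python sketch below rests on assumptions that the text does not spell out: the carry vector is shifted left by one position with a zero entering at the LSB, the carry out of the MSB position is dropped, and $p - q$ is fed in as an n-bit two's complement. All function names are hypothetical, and initial states that would overflow are not guarded against.

```python
def csa_step(e_s, e_c, addend, n):
    """One carry-save addition; the carry out of bit n-1 is dropped."""
    mask = (1 << n) - 1
    total = e_s ^ e_c ^ addend
    carry = (e_s & e_c) | (e_s & addend) | (e_c & addend)
    return total & mask, (carry << 1) & mask


def triggers(e_s, e_c, n, t):
    """Soft threshold: t = 0 needs one MSB set, t = 1 needs both."""
    msb_s, msb_c = (e_s >> (n - 1)) & 1, (e_c >> (n - 1)) & 1
    return bool(msb_s and msb_c) if t else bool(msb_s or msb_c)


def find_cycle(p, q, n, t, e_s, e_c):
    """Iterate until a state repeats; return (values, events) of the cycle."""
    mask = (1 << n) - 1
    seen, trace = {}, []
    state = (e_s, e_c)
    while state not in seen:
        seen[state] = len(trace)
        event = triggers(*state, n, t)
        trace.append((sum(state), event))
        addend = (p - q) & mask if event else p   # two's complement of p - q
        state = csa_step(*state, addend, n)
    cycle = trace[seen[state]:]
    return [v for v, _ in cycle], [ev for _, ev in cycle]


def bresenham_events(p, q):
    """Event flags of the modulo-q reference over one period."""
    e, events = 0, []
    for _ in range(q):
        wrap = e + p >= q
        e = e + p - q if wrap else e + p
        events.append(wrap)
    return events


def is_rotation(a, b):
    """True if list b is a cyclic rotation of list a."""
    return len(a) == len(b) and any(a[i:] + a[:i] == b for i in range(len(a)))
```

Starting from $e_s = 15$, $e_c = 0$, `find_cycle(4, 7, 4, 1, 15, 0)` passes through the values 15, 19, 16, 20, 17, 14, 18 before the state repeats, and the resulting event flags are a cyclic rotation of `bresenham_events(4, 7)`, i.e. the two sequences differ only in phase under the stated assumptions.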


(a) Normalized Best Quality Achieved for n

(b) Normalized Best Quality Achieved for n + 1

(c) Normalized Overall Best Quality Achieved

(d) Multiple or Non-Primitive Cycles for n + 1

Figure 5. Visualized Simulation Results

5 Conclusions
Optimizations for the straightforward implementation of the Bresenham digital clock division have been discussed. As only little could be achieved for the original design, a further approximation step introducing a soft-threshold comparison to implement the modulo addition has been proposed. This approach enabled a highly scalable design with a critical combinatorial path independent of the bit width and only one heavily loaded logic signal. The simulation of this design for fractions $\frac{p}{q}$ with relatively small p and q suggested that the proposed design achieves a high-quality clock output, for most fractions even equivalent to the quality achieved by the original design. Due to the observed uniqueness of the cycles in the setup of minimal bit width, this setup may be used as the base of a programmable implementation without requiring extensive initialization logic to ensure that the correct high-quality cycle is entered.

The results presented are purely heuristic. Their provable generalization remains an open question. The nature of the diamond-shaped areas might be a key to some answers.

References
[1] E. Angel and D. Morrison. Speeding up Bresenham's algorithm. IEEE Computer Graphics and Applications, 11(6):16–17, Nov. 1991.
[2] J. E. Bresenham. Algorithm for computer control of a digital plotter. IBM Systems Journal, 4(1):25–30, 1965.
[3] Micrel, Inc., 1849 Fortune Drive, San Jose, CA 95131, USA. 3.3V AnyClock Fractional N Synthesizer. http://www.micrel.com/_PDF/HBW/sy87729l.pdf.
[4] T. B. Preußer. Background of the analysis of a fully-scalable digital fractional clock divider. Technical report, Fakultät Informatik, Technische Universität Dresden, 2006. To be published, see: ftp://ftp.inf.tu-dresden.de/pub/berichte/.
[5] T. B. Preußer and S. Köhler. Discrete fractional clock generation for systems-on-FPGA. Technical Report TUD-FI05-07, Fakultät Informatik, Technische Universität Dresden, June 2005. ftp://ftp.inf.tu-dresden.de/pub/berichte/tud05-07.pdf.
[6] S. Winograd. On the time required to perform addition. J. ACM, 12(2):277–285, 1965.
