J. T. Butler
Department of Electrical and Computer Engineering
Naval Postgraduate School
Monterey, CA U.S.A.

T. Sasao
Department of Computer Science and Electronics
Kyushu Institute of Technology
Iizuka, Fukuoka, JAPAN
Abstract: We show a high-speed hardware implementation of x mod z that can be pipelined in O(n − m) stages, where x is represented in n bits and z is represented in m bits. It is suitable for large x. We offer two versions. In the first, the value of z is fixed by the hardware. For example, using this circuit, we show a random number generator that produces more than 11 million random numbers per second on the SRC-6 reconfigurable computer. In the second, z is an independent input. This is suitable for RNS applications, for example. The second version can be pipelined in O(n) stages.

Keywords: x mod z computation, high-speed modulo reduction, mod z arithmetic

I. INTRODUCTION

The need for cryptographically-secure systems has inspired interest in the computation of x mod z, especially when x and z are large [6]. The x mod z function is also useful in producing (pseudo) random numbers. Random number generators based on linear recurrences use, for example, $x_{n+1} = (P_1 x_n + P_2) \bmod N$, where $x_0$ is the seed value [14]. For example, Lehmer's algorithm, where the i-th random number is $s_i = a s_{i-1} \bmod p$, is fast enough for many simulation applications [10]. The Blum-Blum-Shub algorithm, where a bit of the random number is chosen from $s_i = s_{i-1}^2 \bmod pq$, such that p and q are prime, is believed to be as secure as encryption methods based on factorization [5].

Another application is the residue number system (RNS), where addition, subtraction, and multiplication are done without carries [13]. In this application, the x mod z operation is used in the binary-to-RNS conversion, the RNS operations, and the RNS-to-binary conversion. In an RNS application, one seeks to compute simultaneously x mod z for one value of x and several values of z. Radix converters have been proposed that use an LUT cascade [8]. Another application [...] speed compact hardware realizations of x mod z suitable for implementation on an FPGA.

Surprisingly, there is little work on x mod z. One approach [9] uses the sign estimate technique to estimate when the sign of (x − qz) changes, where x = (qz + x) mod z. The authors show an algorithm for computing x mod z, but no experimental results are shown. [7] discusses a unified method for computing modular multiplication, but shows no experimental results. Neither addresses the issue of whether an independent z can be accommodated. We consider a circuit that computes x mod z given x and z, as n- and m-bit numbers, respectively. It can be pipelined, and each stage consists of adders and multiplexors, so the delay is small and the throughput is high. When multiple pipelines are used, as in the case of RNS applications, the pipelines can be designed to have the same length, so that the residues of each number arrive simultaneously. We show two architectures. In the first, z is fixed by the hardware. In the second, z can be changed at each clock period. Experimental results demonstrate the efficiency of our design.

II. BASIC IMPLEMENTATION

Our design benefits from the following viewpoint: We consider the computation of x mod z as a modulo reduction process, where, at each stage, the magnitude of x is reduced, but the residue remains the same. This continues until only the residue remains.

A. Computing x mod z for fixed z

Fig. 1 shows a circuit that realizes this process for the case where x has 8 bits and z = 3. First, 192 is subtracted, if possible. Next, 96 is subtracted, if possible, and so on. This circuit performs a sequence of subtract operations until only the residue OUT remains. That is, IN = x = OUT + 3Q, where OUT is a 2-bit remainder upon division of IN by 3 (OUT = 0, 1, or 2). Representing Q, the quotient, as a 7-bit binary number $q_6 2^6 + \dots + q_1 2^1 + q_0 2^0$ yields

$0 \le IN = x = OUT + 3 q_6 2^6 + \dots + 3 q_1 2^1 + 3 q_0 2^0 < 256$, (1)

where the limits $0 \le$ and $< 256$ are imposed by our specification that IN = x is represented as an 8-bit standard binary number.

Fig. 1. Computation of x mod z, where x has 8 bits and z = 3 (z is fixed by hardware).

The circuit shown in Fig. 1 is combinational. When n is small, such a circuit is satisfactory. However, when n is large, the delay will be too large, and it is necessary that it be pipelined. Table I shows the frequency, number of LUTs, and the number of register bits used in the Altera Stratix IV EP4SE530F43C3NES FPGA to realize x mod 3, for x, an n-bit number, where n = 8, 16, 32, 64, 128, and 256, such that a pipeline register exists at the output of each stage. The resulting circuit is compact and fast. For example, for n = 256
[Figure: seven stages, each with a subtractor (A-B) and a multiplexer (MUX), with inputs x and z.]

n     Freq. (MHz)   6-    5-    4-    3-       Packed ALMs    Registers
8     632.6         0     18    9     16       32 (0%)        56 (0%)
16    493.3         5     12    17    138      141 (0%)       240 (0%)
32    422.0         29    35    71    531      580 (1%)       992 (0%)
64    276.2         0     0     0     3,575    2,148 (1%)     2,800 (0%)
128   209.7         0     0     0     13,890   7,332 (3%)     9,028 (2%)
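The subtract-if-possible sequence described for Fig. 1 (192, 96, ..., 3 for an 8-bit x and z = 3) can be illustrated with a minimal software sketch. This is not the authors' hardware description: the names mod_fixed_z and stage_constants are hypothetical, and the sketch assumes 0 <= x < 2^n, z > 0, and one conditional subtraction of z * 2^k per stage, with k running from n - m down to 0, where m is the bit length of z.

```python
def stage_constants(n: int, z: int) -> list[int]:
    """Constants subtracted at each stage: z * 2**k for k = n-m, ..., 0.

    m is the bit length of z (an assumption of this sketch).
    """
    m = z.bit_length()
    return [z << k for k in range(n - m, -1, -1)]


def mod_fixed_z(x: int, n: int, z: int) -> int:
    """Reduce an n-bit x modulo z by a cascade of conditional subtractions.

    Each loop iteration models one stage: subtract the stage constant
    if the result would be non-negative, otherwise pass x through.
    The residue is unchanged because every constant is a multiple of z.
    """
    if z <= 0:
        raise ValueError("z must be positive")
    if not 0 <= x < (1 << n):
        raise ValueError("x must fit in n bits")
    for t in stage_constants(n, z):
        if x >= t:      # "subtract, if possible"
            x -= t
    return x


if __name__ == "__main__":
    print(stage_constants(8, 3))   # constants for the 8-bit, z = 3 case
    print(mod_fixed_z(200, 8, 3))  # residue of 200 modulo 3
```

For n = 8 and z = 3 this sketch uses seven stages, that is, n - m + 1 with m = 2, which matches the seven subtractors drawn in Fig. 1. In a pipelined version, a register would sit between consecutive loop iterations.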