
A Comparison of CPUs, GPUs, FPGAs and Massively Parallel Processor Arrays for Random Number Generation

David B. Thomas, Lee Howes, Wayne Luk
Presented by: M. Ameen Qureshi

Introduction
RNG Parallelism
High Performance Computing

Monte Carlo Simulations


Embarrassingly Parallel

Deterministic generators
Pseudo-Random Number Generation
Beneficiaries: Simulations, Cryptography, Genetic Algorithms, Climate modeling

RNG Evaluation Methodology


Period: 2^160
Stream-Splitting
Independent stream of random numbers for each node in parallel Monte Carlo applications
2^64 sub-sequences, each of length at least 2^64
Non-overlapping; probability 1/2^64
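
One way to picture stream-splitting is jump-ahead: every node starts at a different offset of the same sequence, so the sub-sequences cannot overlap. The sketch below is only an illustration under stated assumptions. It uses a small 32-bit linear congruential generator with the widely published Numerical Recipes constants, whose period of 2^32 is far shorter than the period asked for above, and the names `lcg_next` and `jump` are hypothetical.

```python
# Toy stream-splitting by jump-ahead (assumed example, not the slides' generator).
M = 2**32            # modulus
A = 1664525          # multiplier (Numerical Recipes constants)
C = 1013904223       # increment


def lcg_next(x):
    """Advance the generator by one step."""
    return (A * x + C) % M


def jump(x, n):
    """Advance x by n steps in O(log n) time.

    One step is the affine map x -> A*x + C. Composing affine maps gives
    another affine map, so n steps can be built by repeated squaring.
    """
    a_acc, c_acc = 1, 0          # identity map: x -> 1*x + 0
    a, c = A, C                  # current power-of-two map
    while n > 0:
        if n & 1:
            # apply the current map after the accumulated map
            a_acc, c_acc = (a * a_acc) % M, (a * c_acc + c) % M
        # square the current map: apply it twice
        a, c = (a * a) % M, (a * c + c) % M
        n >>= 1
    return (a_acc * x + c_acc) % M


# Give each of 4 hypothetical nodes its own block of 2^30 numbers.
seed = 12345
starts = [jump(seed, node * 2**30) for node in range(4)]
```

Each node then calls `lcg_next` from its own entry in `starts`; as long as a node draws no more than 2^30 numbers, it never reaches the next node's block.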

Empirical Statistical Quality


BigCrush test battery from the TestU01 library (University of Montreal)
106 tests on 2^38 samples of each RNG

Uniform Generation
P should be close to S
S is a vector of w bits
P close to 2^w
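
A generator whose state is w bits can visit at most 2^w distinct states, so its period cannot exceed 2^w; a linear generator must also avoid the all-zero state, which leaves at most 2^w - 1. The following toy sketch, with w = 4 and hypothetical function names, measures the period of a small shift-register generator by brute force. It is only meant to make the relationship between state size and period concrete.

```python
def lfsr4_step(s):
    """One step of a 4-bit linear feedback shift register.

    The feedback bit is the XOR of the two lowest state bits; the state is
    shifted right and the feedback bit enters at the top.
    """
    bit = (s ^ (s >> 1)) & 1
    return (s >> 1) | (bit << 3)


def period(step, start):
    """Count steps until the state returns to `start` (assumes it does)."""
    s, n = step(start), 1
    while s != start:
        s = step(s)
        n += 1
    return n


w = 4
p = period(lfsr4_step, 1)
print(p, 2**w - 1)   # both are 15: every non-zero state is visited
```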


Binary Linear Generator


Binary operations (AND, XOR) on vectors of individual bits
Bitwise masking, XOR and shifting
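
As a minimal sketch of a generator built only from shifts and XORs, here is the 32-bit XorShift step with Marsaglia's (13, 17, 5) shift triple. The choice of this particular triple and the function name are assumptions for illustration; the slides do not say which parameters were used.

```python
MASK32 = 0xFFFFFFFF  # keep results within 32 bits, as a hardware register would


def xorshift32(x):
    """One XorShift step on a non-zero 32-bit state; returns the new state."""
    x ^= (x << 13) & MASK32
    x ^= x >> 17
    x ^= (x << 5) & MASK32
    return x


state = 1
for _ in range(3):
    state = xorshift32(state)
    print(state)
```

The state must be non-zero: shifts and XORs map zero to zero, so a zero seed would produce zeros forever.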

Uniform Generation
Combined Tausworthe
XorShift
Mersenne Twister
SFMT: SIMD-oriented Fast Mersenne Twister
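
To make the first entry in this list concrete, the sketch below implements L'Ecuyer's three-component combined Tausworthe generator, commonly known as taus88. Choosing taus88 is an assumption: the slides do not say which combined Tausworthe variant was evaluated. The function name and the tuple-based state are also hypothetical.

```python
MASK32 = 0xFFFFFFFF


def taus88_step(state):
    """One step of taus88. `state` is (s1, s2, s3) with s1 > 1, s2 > 7, s3 > 15.

    Returns (new_state, output). Each component is a small Tausworthe
    generator; the output is the XOR of the three components.
    """
    s1, s2, s3 = state
    b = (((s1 << 13) & MASK32) ^ s1) >> 19
    s1 = (((s1 & 0xFFFFFFFE) << 12) & MASK32) ^ b
    b = (((s2 << 2) & MASK32) ^ s2) >> 25
    s2 = (((s2 & 0xFFFFFFF8) << 4) & MASK32) ^ b
    b = (((s3 << 3) & MASK32) ^ s3) >> 11
    s3 = (((s3 & 0xFFFFFFF0) << 17) & MASK32) ^ b
    return (s1, s2, s3), s1 ^ s2 ^ s3


state = (2, 8, 16)          # smallest seeds that satisfy the constraints
for _ in range(3):
    state, out = taus88_step(state)
    print(out)
```

Only shifts, masks and XORs appear, which is what makes this family attractive on hardware without a multiplier.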

Non-Uniform Generation
Inversion
Transformation
Box-Muller Method

Rejection
Box-Muller Method
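
The three approaches above can be sketched in a few lines each. This is an illustrative sketch, not the implementation evaluated in the slides: the exponential distribution for inversion, the polar variant of Box-Muller for rejection, and all function names are assumptions chosen for the example.

```python
import math
import random


def exponential_by_inversion(u, rate=1.0):
    """Inversion: apply the inverse CDF to a uniform u in [0, 1)."""
    return -math.log(1.0 - u) / rate


def box_muller(u1, u2):
    """Transformation: two uniforms (u1 in (0, 1]) -> two standard Gaussians."""
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)


def polar_box_muller(rng):
    """Rejection: draw points in the square, keep those inside the unit circle."""
    while True:
        x = 2.0 * rng.random() - 1.0
        y = 2.0 * rng.random() - 1.0
        s = x * x + y * y
        if 0.0 < s < 1.0:            # reject points outside the circle
            f = math.sqrt(-2.0 * math.log(s) / s)
            return x * f, y * f


rng = random.Random(2024)
print(exponential_by_inversion(rng.random()))
print(box_muller(1.0 - rng.random(), rng.random()))   # 1 - u avoids log(0)
print(polar_box_muller(rng))
```

Inversion and transformation consume a fixed number of uniforms per output, while rejection consumes a variable number, which matters on platforms where all threads in a batch execute in lockstep.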


Platforms
CPU
FPGA
GPU
Thread-level parallelism
More area given to ALUs; cache and scheduling logic removed
Each processor executes up to 1024 threads at once
Batches of 32 threads (warps)

MPPA
Hundreds of RISC CPUs (regular grid)
Small memories and 2D communication channels

336 processors to partition applications over
Two types of processors


SR: extremely simple operations, generating address streams and routing data
SRD: more complex, with a large register set and a multiply-accumulate unit
Throughput of 1 cycle
No stalls due to standard register usage
SR: generates a uniform random stream using binary linear operations
SRD: transforms this uniform stream into non-uniform random numbers
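
The division of labour between the two processor types can be pictured as a two-stage pipeline. The sketch below models it with Python generators and is purely illustrative: the XorShift step in the first stage, the exponential inversion in the second stage, and both function names are assumptions, not details taken from the slides.

```python
import math
from itertools import islice

MASK32 = 0xFFFFFFFF


def sr_uniform_words(seed):
    """Stand-in for the SR stage: a stream of 32-bit words from shifts and XORs."""
    x = seed & MASK32
    if x == 0:
        raise ValueError("seed must be non-zero")
    while True:
        x ^= (x << 13) & MASK32
        x ^= x >> 17
        x ^= (x << 5) & MASK32
        yield x


def srd_exponential(words, rate=1.0):
    """Stand-in for the SRD stage: turn uniform words into exponential variates."""
    for w in words:
        u = w / 2**32                      # uniform in [0, 1)
        yield -math.log(1.0 - u) / rate    # inversion of the exponential CDF


# Connect the two stages, as a channel between two processors would.
samples = list(islice(srd_exponential(sr_uniform_words(1)), 5))
print(samples)
```

The first stage needs no multiplier or floating-point unit, while all arithmetic beyond shifts and XORs sits in the second stage.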

MPPA Ambric AM2045

MPPA Uniform RNG


No multiplier in SR
Linear Congruential Generator and Multiple Recursive Generator are impossible

256-byte memory


Lagged Fibonacci Generator impossible

Linear Binary Operations


Combined Tausworthe and XorShift fail to fit in the 128-word instruction memory

MPPA Uniform RNG


Take a word from the state, right-shift it, and XOR it with a byte-permuted word from the state
6 byte permutations, 10 transformations
60^k brute force for k=4 (select best generator)
Linear generators: flaws due to linear complexity
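
The shift, byte-permute and XOR step can be sketched as follows. This is a toy under loud assumptions, not the generator selected in the slides: the state layout, which words are read, the shift amount and the permutation used here are all hypothetical, and no claim is made about the period or quality of this particular combination.

```python
MASK32 = 0xFFFFFFFF


def permute_bytes(word, perm):
    """Reorder the four bytes of a 32-bit word.

    Output byte i (0 = least significant) is input byte perm[i].
    """
    out = 0
    for i, src in enumerate(perm):
        out |= ((word >> (8 * src)) & 0xFF) << (8 * i)
    return out


def step(state, shift, perm):
    """One step on a list of 32-bit words (hypothetical layout).

    The oldest word is right-shifted and XORed with a byte-permuted copy of
    the newest word; the result is appended and the oldest word is dropped.
    """
    new = (state[0] >> shift) ^ permute_bytes(state[-1], perm)
    return state[1:] + [new & MASK32], new


state = [0x80000000, 0x11223344]
state, out = step(state, shift=4, perm=(3, 2, 1, 0))
print(hex(out))
```

Every operation here is linear over bits: running `step` on the XOR of two states gives the XOR of the two results. That is the property behind the linear-complexity flaws noted above.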