
A Comparison of CPUs, GPUs, FPGAs and Massively Parallel Processor Arrays for Random Number Generation

David B. Thomas, Lee Howes, Wayne Luk
Presented by: M. Ameen Qureshi

RNG Parallelism
High Performance Computing

Monte Carlo Simulations

Embarrassingly Parallel

Deterministic generators
Pseudo-Random Number Generation
Beneficiaries: Simulations, Cryptography, Genetic Algorithms, Climate modeling

RNG Evaluation Methodology

Period: 2^160
Stream-Splitting
Independent stream of random numbers for each node in parallel Monte Carlo applications
2^64 sub-sequences, each of length at least 2^64
Non-overlapping (overlap probability 1/2^64)

Empirical Statistical Quality

BigCrush test battery from the TestU01 library (University of Montreal)
106 tests on 2^38 samples of each RNG

Uniform Generation
The period P should be close to the number of possible states
The state S is a vector of w bits, so P should be close to 2^w
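
To make the period requirement concrete, here is a minimal sketch using a 4-bit linear feedback shift register. This toy generator is not from the slides; it is assumed here only because its state is small enough to enumerate. The helper names `lfsr4_step` and `period` are illustrative.

```python
def lfsr4_step(state):
    """Advance a 4-bit Fibonacci LFSR by one step (state must be in 1..15)."""
    feedback = (state ^ (state >> 1)) & 1   # XOR of the two lowest bits
    return (state >> 1) | (feedback << 3)   # shift right, insert feedback at the top


def period(step, seed):
    """Count the steps needed for the state to return to the seed."""
    state, count = step(seed), 1
    while state != seed:
        state = step(state)
        count += 1
    return count


if __name__ == "__main__":
    w = 4
    print(period(lfsr4_step, seed=1), "steps for a state of", w, "bits")
```

With w = 4 the generator visits every non-zero state, so its period is 2^w - 1 = 15. The all-zero state is a fixed point of any such linear recurrence, which is why the period can be close to 2^w but not equal to it.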


Binary Linear Generator

Binary operations (and, xor) on vectors of individual bits Bit wise masking, xor and shifting
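
As an illustration of a generator built only from shifts and XORs, the sketch below uses Marsaglia's 32-bit xorshift with the shift triple (13, 17, 5). This is assumed for illustration and is not necessarily the XorShift variant evaluated in the paper.

```python
MASK32 = 0xFFFFFFFF  # keep results within 32 bits, as a hardware register would


def xorshift32(x):
    """One step of a 32-bit xorshift generator; x must be non-zero."""
    x ^= (x << 13) & MASK32
    x ^= x >> 17
    x ^= (x << 5) & MASK32
    return x


if __name__ == "__main__":
    state = 1
    for _ in range(3):
        state = xorshift32(state)
        print(state)
```

Because every operation is linear over single bits, the update satisfies `xorshift32(a ^ b) == xorshift32(a) ^ xorshift32(b)`. That property is what "binary linear" refers to.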

Uniform Generation
Combined Tausworthe
XorShift
Mersenne Twister
SFMT: SIMD-oriented Fast Mersenne Twister

Non-Uniform Generation
Inversion Transformation
Box-Muller Method
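
The Box-Muller method turns two independent uniform samples into two independent standard normal samples. A minimal sketch of the basic (trigonometric) form follows. The function name `box_muller` and the use of Python's `random` module as the uniform source are assumptions made for illustration.

```python
import math
import random


def box_muller(u1, u2):
    """Map uniforms u1 in (0, 1] and u2 in [0, 1) to two standard normal samples."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


if __name__ == "__main__":
    rng = random.Random(1)
    u1 = 1.0 - rng.random()   # shift [0, 1) to (0, 1] so that log(u1) is defined
    u2 = rng.random()
    print(box_muller(u1, u2))
```

Each call needs a logarithm, a square root, a sine and a cosine. The cost of those operations differs widely across CPUs, GPUs, FPGAs and MPPAs.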


Thread-level parallelism
More area devoted to ALUs; cache and scheduling logic removed
Each processor executes up to 1024 threads at once
Batches of 32 threads (warps)

Hundreds of RISC CPUs (regular grid)
Small memories and 2D communication channels

336 processors to partition applications over
Two types of processors

SR: extremely simple operations, generating address streams and routing data
SRD: more complex, with a large register set and a multiply-accumulate unit
Throughput of 1 cycle
No stalls due to standard register usage
SR: generates a uniform random stream using binary linear operations
SRD: transforms this uniform stream into non-uniform random numbers

MPPA Ambric AM2045

MPPA Uniform RNG

No multiplier in SR
Linear Congruential Generator and Multiple Recursive Generator are impossible

256 byte memory

Lagged Fibonacci Generator impossible

Linear Binary Operations

Combined Tausworthe and XorShift fail to fit in the 128-word instruction memory

MPPA Uniform RNG

Take a word from the state, right-shift it, and XOR it with a byte-permuted word from the state
6 byte permutations, 10 transformations
60^k candidates: brute-force search for k = 4 (select best generator)
Linear generators: flaws due to linear complexity
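
The update rule described here (shift one state word, XOR it with a byte-permuted state word) can be sketched as below. The slides do not give the actual parameters, so the buffer layout, `lag`, `shift` and byte `order` are all hypothetical. The code shows the shape of the operation only and is not a statistically vetted generator.

```python
def permute_bytes(word, order):
    """Rearrange the four bytes of a 32-bit word.

    order[i] names the source byte placed at output byte i (0 = least significant).
    """
    src = [(word >> (8 * i)) & 0xFF for i in range(4)]
    out = 0
    for i, j in enumerate(order):
        out |= src[j] << (8 * i)
    return out


def step(state, pos, lag, shift, order):
    """Update one word of a circular state buffer in place.

    Returns the new word and the next buffer position.
    All parameters are hypothetical; the slides do not specify them.
    """
    n = len(state)
    shifted = state[pos] >> shift                            # right-shifted state word
    permuted = permute_bytes(state[(pos + lag) % n], order)  # byte-permuted state word
    state[pos] = shifted ^ permuted
    return state[pos], (pos + 1) % n


if __name__ == "__main__":
    state = [0x80000000, 0x11223344, 0xDEADBEEF, 0x01234567]
    pos = 0
    for _ in range(4):
        word, pos = step(state, pos, lag=1, shift=3, order=(1, 2, 3, 0))
        print(hex(word))
```

Only shifts, XORs and byte moves are used, so a rule of this shape needs no multiplier. Being linear, it also inherits the linear-complexity flaws noted above.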