
SIGCSE 2011 - The 42nd ACM Technical Symposium on Computer Science Education, March 9-12, 2011, Dallas, Texas, USA

Workshop 9: General purpose computing using GPUs: Developing a hands-on undergraduate course on CUDA programming

Monte Carlo Computation
Random Number Generation

B. Wilkinson, Feb 11, 2011

Preliminaries
Monte Carlo computations use random selections within calculations. There are many application areas: numerical integration, physical simulations, business models, finance, and others. Monte Carlo computations are embarrassingly parallel because each random selection and its subsequent calculation is independent of the other selections and calculations. They are very amenable to GPUs because the calculations using different random sequences can be done independently by different threads. The Monte Carlo computation considered here is to compute π. The Monte Carlo calculation is described in Appendix A.

One major issue is how to generate random numbers. A CUDA kernel cannot call rand() or any other C library function (except the math routines listed in the NVIDIA CUDA C programming guide).[1] NVIDIA provides the CURAND library for generating random numbers with various distributions.[2] Here we provide Monte Carlo code using this library, and also a version using a hand-coded (pseudo) random number generator based on the well-known generator

    x_{i+1} = (a * x_i + c) mod m

where a = 16807, c = 0, and m = 2^31 - 1 (a prime number). Selecting a starting value (seed) that creates a unique sequence for each thread is an issue for hand-coding. CURAND handles this aspect nicely in the API.[3]

Provided files: Each provided guest account has the following for Session 1b:

    Directory       Contents        Description
    WorkshopFiles   Pi.cu           Monte Carlo CUDA program
                    Makefile        Makefile for compiling and running CUDA program
                    PiMyRandom.cu   Monte Carlo CUDA program using hand-coded random number generator

[1] NVIDIA CUDA C Programming Guide, version 3.2, 11/09/2010, http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf
[2] NVIDIA CUDA CURAND Library, PG-05328-032_V01, August 2010, http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf
[3] Although not strictly Monte Carlo, the effect on the calculation is the same if one uses numbers in numeric sequence, so a simple counter may be possible with a large sample.

Task 1 Compiling and Executing Monte Carlo program


In this task, you will compile and execute a simple prewritten CUDA program that estimates π using the Monte Carlo method. The code is provided as Pi.cu and is listed below:
//Derived from code developed by Patrick Rogers, UNC-C, which used a hand-coded random number generator.
#include <stdlib.h>
#include <stdio.h>
#include <cuda.h>
#include <math.h>
#include <time.h>
#include <curand_kernel.h>

#define TRIALS_PER_THREAD 4096
#define BLOCKS 256
#define THREADS 256
#define PI 3.1415926535 // known value of pi

__global__ void gpu_monte_carlo(float *estimate, curandState *states) {
    unsigned int tid = threadIdx.x + blockDim.x * blockIdx.x;
    int points_in_circle = 0;
    float x, y;

    curand_init(1234, tid, 0, &states[tid]); // Initialize CURAND: same seed, one subsequence per thread

    for (int i = 0; i < TRIALS_PER_THREAD; i++) {
        x = curand_uniform(&states[tid]);
        y = curand_uniform(&states[tid]);
        points_in_circle += (x*x + y*y <= 1.0f); // count if (x, y) is in the circle
    }
    estimate[tid] = 4.0f * points_in_circle / (float) TRIALS_PER_THREAD; // this thread's estimate of pi
}

float host_monte_carlo(long trials) {
    float x, y;
    long points_in_circle = 0; // must start at zero before counting

    for (long i = 0; i < trials; i++) {
        x = rand() / (float) RAND_MAX;
        y = rand() / (float) RAND_MAX;
        points_in_circle += (x*x + y*y <= 1.0f);
    }
    return 4.0f * points_in_circle / trials;
}

int main(int argc, char *argv[]) {
    clock_t start, stop;
    float host[BLOCKS * THREADS];
    float *dev;
    curandState *devStates;

    printf("# of trials per thread = %d, # of blocks = %d, # of threads/block = %d.\n",
           TRIALS_PER_THREAD, BLOCKS, THREADS);

    start = clock();

    cudaMalloc((void **) &dev, BLOCKS * THREADS * sizeof(float));
    cudaMalloc((void **) &devStates, THREADS * BLOCKS * sizeof(curandState));

    gpu_monte_carlo<<<BLOCKS, THREADS>>>(dev, devStates);

    // cudaMemcpy waits for the kernel to finish before copying
    cudaMemcpy(host, dev, BLOCKS * THREADS * sizeof(float), cudaMemcpyDeviceToHost);

    float pi_gpu = 0.0f; // must start at zero before accumulating
    for (int i = 0; i < BLOCKS * THREADS; i++) {
        pi_gpu += host[i];
    }
    pi_gpu /= (BLOCKS * THREADS); // average of the per-thread estimates

    stop = clock();
    printf("GPU pi calculated in %f s.\n", (stop-start)/(float)CLOCKS_PER_SEC);

    start = clock();
    float pi_cpu = host_monte_carlo(BLOCKS * THREADS * TRIALS_PER_THREAD);
    stop = clock();
    printf("CPU pi calculated in %f s.\n", (stop-start)/(float)CLOCKS_PER_SEC);

    printf("CUDA estimate of PI = %f [error of %f]\n", pi_gpu, pi_gpu - PI);
    printf("CPU estimate of PI = %f [error of %f]\n", pi_cpu, pi_cpu - PI);

    cudaFree(dev);
    cudaFree(devStates);
    return 0;
}
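For reference, the factor of 4 used in both gpu_monte_carlo and host_monte_carlo follows from a short area argument. The sketch below assumes that x and y are independent and uniformly distributed on the unit interval; Appendix A gives the workshop's own description. The last line is the standard binomial error estimate for ideal random numbers, not a figure taken from this handout.

```latex
\begin{align*}
P\left(x^2 + y^2 \le 1\right)
  &= \frac{\text{area of quarter circle}}{\text{area of unit square}}
   = \frac{\pi}{4}, \\
\hat{\pi} &= 4\,\frac{N_{\text{in}}}{N}, \\
\sigma_{\hat{\pi}} &= \sqrt{\frac{\pi\,(4 - \pi)}{N}} \approx \frac{1.64}{\sqrt{N}}.
\end{align*}
```

With N = 256 x 256 x 4096 = 2^28 trials, the square root of N is 16384, so an error of roughly 1e-4 would be expected from sampling alone.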

Timing Execution
In Session 1a, we used CUDA events to time the execution, although because of the synchronous nature of cudaMemcpy, we could have used the Linux clock() or time() routines. In the code above, we use clock(), and we also time the same computation on the CPU alone for comparison.

Compiling: A makefile is given below:
NVCC = /usr/local/cuda/bin/nvcc
CUDAPATH = /usr/local/cuda
NVCCFLAGS = -I$(CUDAPATH)/include
LFLAGS = -L$(CUDAPATH)/lib64 -lcuda -lcudart -lm

Pi:
	$(NVCC) $(NVCCFLAGS) $(LFLAGS) -o Pi Pi.cu

Type make Pi to compile the program.

Executing Program
Type ./Pi to execute the compiled program. The program first computes π on the GPU and then on the CPU (which may take several seconds).

Task 2 Experiment with different CUDA grid/block structures


Experiment with different numbers of blocks and threads/block.
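One way to organize this experiment is to hold the total number of threads fixed, so that the total number of trials and the size of the host array stay the same, and to vary only how those threads are split into blocks. The C sketch below lists such splits. Both functions are hypothetical helpers, not part of the workshop files. The starting value of 32 is the warp size, and the threads-per-block limit is passed in as a parameter because it depends on the device (1024 for compute capability 2.0 and later, 512 for earlier devices).

```c
#include <stdio.h>

/* Hypothetical helper (not part of the workshop files).
 * Fills blocks_out/threads_out with every split of total_threads into
 * blocks * threads_per_block, where threads_per_block is a power of two
 * from 32 (one warp) up to max_threads_per_block.
 * Returns the number of configurations written (at most capacity). */
int list_launch_configs(long total_threads, int max_threads_per_block,
                        int blocks_out[], int threads_out[], int capacity)
{
    int count = 0;
    for (int threads = 32; threads <= max_threads_per_block; threads *= 2) {
        if (total_threads % threads != 0)
            continue;              /* skip splits that would change the total */
        if (count == capacity)
            break;                 /* output arrays are full */
        blocks_out[count] = (int)(total_threads / threads);
        threads_out[count] = threads;
        count++;
    }
    return count;
}

/* Prints one line per configuration, e.g. for editing BLOCKS and THREADS. */
void print_launch_configs(long total_threads, int max_threads_per_block)
{
    int blocks[16], threads[16];
    int n = list_launch_configs(total_threads, max_threads_per_block,
                                blocks, threads, 16);
    for (int i = 0; i < n; i++)
        printf("BLOCKS = %d, THREADS = %d\n", blocks[i], threads[i]);
}
```

Calling print_launch_configs(256 * 256, 1024) lists six pairs, from 2048 blocks of 32 threads to 64 blocks of 1024 threads. Each pair can be tried by editing the BLOCKS and THREADS definitions in Pi.cu and recompiling.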

Task 3 Hand-coded Random Number Generator


Compile and test the Monte Carlo π program, PiMyRandom.cu:
//Derived somewhat from code developed by Patrick Rogers, UNC-C
#include <stdlib.h>
#include <stdio.h>
#include <cuda.h>
#include <math.h>
#include <time.h>

#define TRIALS_PER_THREAD 4096
#define BLOCKS 256
#define THREADS 256
#define PI 3.1415926535 // known value of pi

__device__ float my_rand(unsigned int *seed) {
    // 64-bit arithmetic so that a * x cannot overflow before the modulo
    unsigned long long a = 16807;       // constants for random number generator
    unsigned long long m = 2147483647;  // 2^31 - 1
    unsigned long long x = (unsigned long long) *seed;

    x = (a * x) % m;
    *seed = (unsigned int) x;   // save state for the next call
    return ((float) x) / m;     // scale to the unit interval
}

__global__ void gpu_monte_carlo(float *estimate) {
    unsigned int tid = threadIdx.x + blockDim.x * blockIdx.x;
    int points_in_circle = 0;
    float x, y;
    unsigned int seed = tid + 1; // starting number in random sequence (must be nonzero)

    for (int i = 0; i < TRIALS_PER_THREAD; i++) {
        x = my_rand(&seed);
        y = my_rand(&seed);
        points_in_circle += (x*x + y*y <= 1.0f); // count if (x, y) is in the circle
    }
    estimate[tid] = 4.0f * points_in_circle / (float) TRIALS_PER_THREAD;
}

float host_monte_carlo(long trials) {
    float x, y;
    long points_in_circle = 0; // must start at zero before counting

    for (long i = 0; i < trials; i++) {
        x = rand() / (float) RAND_MAX;
        y = rand() / (float) RAND_MAX;
        points_in_circle += (x*x + y*y <= 1.0f);
    }
    return 4.0f * points_in_circle / trials;
}

int main(int argc, char *argv[]) {
    clock_t start, stop;
    float host[BLOCKS * THREADS];
    float *dev;

    printf("# of trials per thread = %d, # of blocks = %d, # of threads/block = %d.\n",
           TRIALS_PER_THREAD, BLOCKS, THREADS);

    start = clock();

    cudaMalloc((void **) &dev, BLOCKS * THREADS * sizeof(float));

    gpu_monte_carlo<<<BLOCKS, THREADS>>>(dev);

    // cudaMemcpy waits for the kernel to finish before copying
    cudaMemcpy(host, dev, BLOCKS * THREADS * sizeof(float), cudaMemcpyDeviceToHost);

    float pi_gpu = 0.0f; // must start at zero before accumulating
    for (int i = 0; i < BLOCKS * THREADS; i++) {
        pi_gpu += host[i];
    }
    pi_gpu /= (BLOCKS * THREADS); // average of the per-thread estimates

    stop = clock();
    printf("GPU pi calculated in %f s.\n", (stop-start)/(float)CLOCKS_PER_SEC);

    start = clock();
    float pi_cpu = host_monte_carlo(BLOCKS * THREADS * TRIALS_PER_THREAD);
    stop = clock();
    printf("CPU pi calculated in %f s.\n", (stop-start)/(float)CLOCKS_PER_SEC);

    printf("CUDA estimate of PI = %f [error of %f]\n", pi_gpu, pi_gpu - PI);
    printf("CPU estimate of PI = %f [error of %f]\n", pi_cpu, pi_cpu - PI);

    cudaFree(dev);
    return 0;
}

This version uses a hand-coded random number generator. You will need to modify the makefile to compile the program. Note the execution time.
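Before running on the GPU, the generator arithmetic can be checked on the host. The plain C sketch below reimplements the same recurrence with a = 16807, c = 0 and m = 2^31 - 1. The function names lcg_next, lcg_advance and lcg_to_unit are hypothetical and do not appear in PiMyRandom.cu. A commonly quoted check value for this generator is that, starting from a seed of 1, the 10000th update yields 1043618065.

```c
#include <stdint.h>

#define LCG_A 16807ULL        /* multiplier a */
#define LCG_M 2147483647ULL   /* modulus m = 2^31 - 1 */

/* One step of x_{i+1} = (a * x_i) mod m, using 64-bit arithmetic
 * so that the product cannot overflow before the modulo. */
uint32_t lcg_next(uint32_t x)
{
    return (uint32_t)((LCG_A * x) % LCG_M);
}

/* Applies lcg_next 'steps' times starting from 'seed'. */
uint32_t lcg_advance(uint32_t seed, long steps)
{
    uint32_t x = seed;
    for (long i = 0; i < steps; i++)
        x = lcg_next(x);
    return x;
}

/* Scales a generator state to a float in the unit interval,
 * mirroring the division at the end of my_rand. */
float lcg_to_unit(uint32_t x)
{
    return (float)x / (float)LCG_M;
}
```

The sketch also makes the seeding issue visible: seeds 1, 2 and 3 produce first values 16807, 33614 and 50421, so the first draw of neighbouring threads seeded with tid + 1 is closely related rather than independent.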
