Performance Improvements
These notes will introduce:
Review of basic matrix multiplication on a 2-D grid/block
Limitations
Block matrix multiplication
Memory access patterns in matrix multiplication
Coalescing memory accesses
Transposing an array
Using shared memory tiles
ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 6, 2011
MatrixMult.ppt
Matrix Multiplication
Matrix multiplication is an important operation in HPC and appears in many applications:
C = A * B
where A, B, and C are matrices (two-dimensional arrays).
A restricted case is when B has only one column -- matrix-vector multiplication, which appears in the representation of linear equations and partial differential equations.
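For reference, the standard element-wise definition behind all the kernels that follow, for N x N matrices:

\[
  c_{ij} \;=\; \sum_{k=0}^{N-1} a_{ik}\, b_{kj}, \qquad 0 \le i, j < N .
\]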
Matrix multiplication, C = A x B
CUDA kernel for multiplying two arrays:
__global__ void gpu_matrixmult(int *a, int *b, int *c, int N) {
    int k, sum = 0;
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    if (col < N && row < N) {
        for (k = 0; k < N; k++)
            sum += a[row * N + k] * b[k * N + col];
        c[row * N + col] = sum;
    }
}
(Figure: grid and block structure mapped onto the array; threadIdx.x and threadIdx.y identify a thread within a block, each thread handling one element of the array A[row][].)
Basically, the array is divided into tiles and one tile is mapped onto one block.
Complete Program (several slides)
int main(int argc, char *argv[]) {
    int i, j;                                // loop counters
    int Grid_Dim_x = 1, Grid_Dim_y = 1;      // Grid structure values
    int Block_Dim_x = 1, Block_Dim_y = 1;    // Block structure values
    int noThreads_x, noThreads_y;            // number of threads available in device, each dimension
    int noThreads_block;                     // number of threads in a block
    int N = 10;                              // size of array in each dimension
    int *a, *b, *c, *d;
    int *dev_a, *dev_b, *dev_c;
    int size;                                // number of bytes in arrays
    cudaEvent_t start, stop;                 // using CUDA events to measure time,
    float elapsed_time_ms;                   // which is applicable for asynchronous code also

    // keyboard input

    dim3 Grid(Grid_Dim_x, Grid_Dim_y);       // Grid structure
    dim3 Block(Block_Dim_x, Block_Dim_y);    // Block structure, threads/block limited by specific device

    size = N * N * sizeof(int);              // number of bytes in total in arrays

    a = (int*) malloc(size);
    b = (int*) malloc(size);
    c = (int*) malloc(size);
    d = (int*) malloc(size);
Where you measure time will make a big difference.
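The device allocations, host-to-device copies and the start-event record appear on slides not reproduced here; the following is a minimal sketch of what plausibly precedes the launch (standard CUDA calls; placing cudaEventRecord(start, 0) after the input copies matches footnote * of the results table below, and is exactly the kind of choice the note above refers to):

    cudaMalloc((void**)&dev_a, size);                    // allocate device copies of a, b and c
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, size);

    cudaEventCreate(&start);                             // create the timing events
    cudaEventCreate(&stop);

    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);  // copy input arrays to device
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

    cudaEventRecord(start, 0);                           // start timing: excludes the copies above,
    cudaEventSynchronize(start);                         // but includes the copy back after the kernel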
    gpu_matrixmult<<<Grid, Block>>>(dev_a, dev_b, dev_c, N);

    cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

    cudaEventRecord(stop, 0);                            // measure end time
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_ms, start, stop);
    printf("Time to calculate results on GPU: %f ms.\n", elapsed_time_ms);

    cudaEventRecord(start, 0);                           // restart timing for the host calculation
    cudaEventSynchronize(start);

    cpu_matrixmult(a, b, d, N);                          // do calculation on host

    cudaEventRecord(stop, 0);                            // measure end time
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_ms, start, stop);
    printf("Time to calculate results on CPU: %f ms.\n", elapsed_time_ms);   // execution time

    /* ------------------- check device creates correct results ----------------- */
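Neither cpu_matrixmult nor the result check is shown on these slides; a minimal sketch of what they presumably look like (names a, b, c, d and N as in the program above):

void cpu_matrixmult(int *a, int *b, int *d, int N) {     // straightforward triple loop on the host
    for (int row = 0; row < N; row++)
        for (int col = 0; col < N; col++) {
            int sum = 0;
            for (int k = 0; k < N; k++)
                sum += a[row * N + k] * b[k * N + col];
            d[row * N + col] = sum;
        }
}

    /* compare GPU result c with CPU result d, element by element */
    int errors = 0;
    for (i = 0; i < N * N; i++)
        if (c[i] != d[i]) errors++;
    printf("%s\n", errors == 0 ? "Device produced correct results" : "RESULTS DIFFER");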
Some Preliminaries
Effects of First Launch
The program is written so that it can be repeated with different parameters without stopping, to eliminate the effect of the first kernel launch.
It might also take advantage of caching, but this seems less significant than the first-launch effect.
Some results: 32 x 32 array of random numbers 0-9, 1 block of 32 x 32 threads.
Speedup = 1.65 (first run).
Answer check: CPU and GPU give the same answers.
Some results: 32 x 32 array, 1 block of 32 x 32 threads.
Speedup = 2.12 (second run).
Some results: 32 x 32 array, 1 block of 32 x 32 threads.
Speedup = 2.16 (third run); subsequently varies between 2.12 and 2.18.
Some results: 256 x 256 array, 8 x 8 blocks of 32 x 32 threads.
Speedup = 151.86.
Some results: 1024 x 1024 array, 32 x 32 blocks of 32 x 32 threads.
Speedup = 860.9.
Some results: 2048 x 2048 array, 64 x 64 blocks of 32 x 32 threads.
Speedup = ?? The GPU appears to freeze. Why?
Array size      Speedup*    Speedup**
32 x 32         2.18        1.26
256 x 256       151.86      110.63
1024 x 1024     860.9       733.44
2048 x 2048     ??
4096 x 4096

* These results include the time of the memory copy back from the device but not the memory copy to the device.
** These results include the time of the memory copy back from the device and the memory copy to the device.
Block size 32 x 32; number of blocks to suit the array size.
GPU Limitations
The previous program has limitations:
Number of threads must be >= number of array elements (the code will not work if the number of threads < the number of array elements).
Device limits also apply, e.g. a maximum of 512 threads in each block dimension (1024 on later devices) and a maximum grid dimension of 65535.
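A common way to satisfy the "threads >= elements" requirement for an arbitrary N, while staying within these device limits, is to round the grid size up and rely on the kernel's boundary test; a minimal sketch (not from the original slides; the sub-matrix approach that follows instead uses fewer threads than elements):

    int Block_Dim = 32;                                  // threads per block in each dimension
    int Grid_Dim  = (N + Block_Dim - 1) / Block_Dim;     // ceiling division: enough blocks to cover N
    dim3 Grid(Grid_Dim, Grid_Dim);
    dim3 Block(Block_Dim, Block_Dim);
    gpu_matrixmult<<<Grid, Block>>>(dev_a, dev_b, dev_c, N);
    // The kernel's (col < N && row < N) test makes the surplus threads do nothing.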
Sub-matrix approach (block matrix multiplication): not in textbooks (not their tiling).
* Would you want to use fewer threads than elements?
Each thread does one value of i and j (S threads), threads 0 ... i ... N-1 each handling part of the array.
Not tested yet. Mistakes?
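The sub-matrix kernel itself did not survive on these slides; below is a minimal sketch of one way to have each thread compute several C elements (striding by the total thread count in each dimension), offered as an illustration rather than the slide's original code:

__global__ void gpu_matrixmult_sub(int *a, int *b, int *c, int N) {
    int col0 = threadIdx.x + blockDim.x * blockIdx.x;    // first column this thread handles
    int row0 = threadIdx.y + blockDim.y * blockIdx.y;    // first row this thread handles
    int strideX = blockDim.x * gridDim.x;                // total threads in x
    int strideY = blockDim.y * gridDim.y;                // total threads in y
    for (int row = row0; row < N; row += strideY)        // each thread covers several (row, col) pairs
        for (int col = col0; col < N; col += strideX) {
            int sum = 0;
            for (int k = 0; k < N; k++)
                sum += a[row * N + k] * b[k * N + col];
            c[row * N + col] = sum;
        }
}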
// transpose array
cudaEventRecord(start, 0);
cudaEventSynchronize(start);
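The transpose code is not reproduced on these slides; the sketch below shows the idea under the assumption that B is transposed on the host into bT and the kernel then walks bT row-wise (transposeB, gpu_matrixmult_T and bT are names introduced here, not from the original):

void transposeB(int *b, int *bT, int N) {                // host-side transpose of B
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            bT[j * N + i] = b[i * N + j];
}

__global__ void gpu_matrixmult_T(int *a, int *bT, int *c, int N) {
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    if (col < N && row < N) {
        int sum = 0;
        for (int k = 0; k < N; k++)
            sum += a[row * N + k] * bT[col * N + k];     // both operands now read along rows
        c[row * N + col] = sum;
    }
}

Note that with the transposed layout, adjacent threads (consecutive col values) no longer read consecutive addresses of B, so those accesses are no longer coalesced; this may explain why the speedup drops below 1 for the larger arrays in the results that follow.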
Some results: 8 x 8 array, 1 block of 8 x 8 threads.
Speedup = 1.62 over not transposing the array.
Some results: 32 x 32 array, 1 block of 32 x 32 threads.
Speedup = 1.17 over not transposing the array.
Some results: 256 x 256 array, 8 x 8 blocks of 32 x 32 threads.
Speedup = 0.89!! over not transposing the array.
Some results: 1024 x 1024 array, 32 x 32 blocks of 32 x 32 threads.
Speedup = 0.93!! over not transposing the array.
(Figure: tiles As, Bs and Cs of the arrays A, B and C.)
Developing Code
1. Declaring shared memory:
__global__ void gpu_matrixmult(int* Md, int* Nd, int* Pd, int Width) {
    __shared__ int Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ int Nds[TILE_WIDTH][TILE_WIDTH];
    int tx = threadIdx.x;
    int ty = threadIdx.y;                    // thread indices
To access (later): Mds[ty][tx] = ..., Nds[ty][tx] = ...
2. Block indices (used in the next step and in the full kernel below):
    int bx = blockIdx.x;
    int by = blockIdx.y;                     // block indices
3. Global address:
For convenience, declare row and column:
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;          // global indices
Note: same as the usual global thread ID and result index in normal matrix multiplication:
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    int col = threadIdx.x + blockDim.x * blockIdx.x;
(Figure: cooperative tile loading; all elements on the row are transferred, one per thread, and all elements on the column are transferred, one per thread, identified by their thread IDs.)
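In code, this cooperative load is the following pair of assignments (they reappear in the full kernel below; indexing follows the PMPP tiled kernel):

    Mds[ty][tx] = Md[Row * Width + (m * TILE_WIDTH + tx)];   // one element of the Md (A) tile per thread
    Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];   // one element of the Nd (B) tile per thread
    __syncthreads();                                         // wait until the whole tile is loaded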
Example (3 x 3 tiles):
Row = by * TILE_WIDTH + ty
Col = bx * TILE_WIDTH + tx
(Figure: global arrays A and B divided into tiles, indexed by block indices by/bx and tile counter m; based upon Programming Massively Parallel Processors: A Hands-on Approach, Fig. 6.10, page 109.)
Multiply the tiles, accumulating the values, then copy the result back to global memory.
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) {
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;               // thread ID
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;
    float Pvalue = 0;
    for (int m = 0; m < Width/TILE_WIDTH; m++) {              // loop over Md, Nd tiles to compute Pd element
        Mds[ty][tx] = Md[Row * Width + (m * TILE_WIDTH + tx)];    // load one element of each tile
        Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];
        __syncthreads();                                      // wait for the whole tile to be loaded
        for (int k = 0; k < TILE_WIDTH; k++)
            Pvalue += Mds[ty][k] * Nds[k][tx];                // multiply the tiles, accumulating values
        __syncthreads();                                      // wait before the tiles are overwritten
    }
    Pd[Row * Width + Col] = Pvalue;                           // copy result back to global memory
}
Loading shared memory is a collaborative effort between threads, each thread loading one element of each array.
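A possible launch of this kernel, assuming Width is a multiple of TILE_WIDTH (names match the kernel above; the specific values are illustrative):

#define TILE_WIDTH 32

    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);    // one block per output tile
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                   // one thread per element of a tile
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);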
Some results: 32 x 32 array, 1 block of 32 x 32 threads.
Speedup, GPU to CPU: 2.12
Speedup, GPU shared memory to CPU: 2.73
Speedup, GPU shared memory to GPU without shared memory: 1.28
Some results: 256 x 256 array, 8 x 8 blocks of 32 x 32 threads.
Speedup, GPU to CPU: 153
Speedup, GPU shared memory to CPU: 217
Speedup, GPU shared memory to GPU without shared memory: 1.41
Some results: 1024 x 1024 array, 32 x 32 blocks of 32 x 32 threads.
Speedup, GPU to CPU: 864
Speedup, GPU shared memory to CPU: 2214 !!!
Speedup, GPU shared memory to GPU without shared memory: 2.56
Some results: 2048 x 2048 array, 64 x 64 blocks of 32 x 32 threads.
Speedup, GPU to CPU: 989
Speedup, GPU shared memory to CPU: 2962 !!
Speedup, GPU shared memory to GPU without shared memory: 2.99
Array size      GPU to CPU    GPU sh. mem. to CPU    GPU sh. mem. to GPU without sh. mem.
32 x 32         2.12          2.73                   1.28
256 x 256       153           217                    1.41
1024 x 1024     864           2214                   2.56
2048 x 2048     989           2962                   2.99
4096 x 4096
Bandwidth improvements by using shared memory
Using 32 x 32 tiles reduces the number of global memory accesses by a factor of 32 (two transfers per tile instead of 2 x 32 transfers).
According to the PMPP book, page 90, using 16 x 16 tiles allows the 86.4 GB/s global bandwidth to serve a much larger floating-point computation rate: it can now support 86.4/4 x 16 = 345.6 gigaflops, very close to the peak floating-point performance of the G80, effectively removing global memory bandwidth as the major limiting factor of matrix multiplication performance.
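The same argument in general form: each 4-byte element fetched from global memory is reused TILE_WIDTH times, so the compute rate that a given global bandwidth can sustain is roughly

\[
  \text{GFLOPS} \;\approx\; \frac{\text{bandwidth (GB/s)}}{4\ \text{bytes}} \times \text{TILE\_WIDTH},
  \qquad \text{e.g. } \frac{86.4}{4} \times 16 = 345.6 .
\]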
Conclusions
Using the shared memory algorithm can make a significant difference: up to about 3 times as fast on the GPU as not using it, in the presented tests.
Speedup of almost 3000 over the CPU!
(Note, though, that the CPU is an old processor.)
Further Reading
Programming Massively Parallel Processors: A Hands-on Approach, David B. Kirk and Wen-mei W. Hwu, Morgan Kaufmann, 2010.
Despite its title, this book covers only NVIDIA GPUs and CUDA programming.
Questions