
Matrix Multiplication

Performance Improvements
These notes will introduce:
Basic matrix multiplication on a 2-D grid/block review
Limitations
Block matrix multiplication
Memory access patterns in matrix multiplication
Coalescing memory accesses
Transpose array
Using shared memory tiles
ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 6, 2011
MatrixMult.ppt

Matrix Multiplication
Matrix multiplication is an important operation in HPC and appears in many applications.

C = A * B

where A, B, and C are matrices (two-dimensional arrays).

A restricted case is when B has only one column -- matrix-vector multiplication, which appears in the representation of linear equations and partial differential equations.
2

Matrix multiplication, C = A x B

Implementing Matrix Multiplication


Sequential Code
Assume the matrices are square (N x N matrices).

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        c[i][j] = 0;
        for (k = 0; k < N; k++)
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
    }

Requires n³ multiplications and n³ additions.
Sequential time complexity of O(n³).
Very easy to parallelize.
4

CUDA Kernel
for multiplying two arrays
__global__ void gpu_matrixmult(int *gpu_a, int *gpu_b, int *gpu_c, int N) {
    int k, sum = 0;
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    if (col < N && row < N) {
        for (k = 0; k < N; k++)
            sum += gpu_a[row * N + k] * gpu_b[k * N + col];
        gpu_c[row * N + col] = sum;
    }
}

In this example, one thread computes one element of C, so the number of threads must be equal to or greater than the number of elements.
5
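A minimal sketch of how the host might size the launch so that the thread count covers the array (the ceiling-division form here is an assumption; the full program later reads the grid and block dimensions from the keyboard):

dim3 Block(16, 16);                          // 256 threads per block
dim3 Grid((N + Block.x - 1) / Block.x,       // ceiling division so that
          (N + Block.y - 1) / Block.y);      // Grid * Block >= N in each dimension
gpu_matrixmult<<<Grid, Block>>>(dev_a, dev_b, dev_c, N);
// extra threads beyond N are masked off by the "if (col < N && row < N)" test in the kernel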

Sequential version with flattened arrays, for comparison
void cpu_matrixmult(int *cpu_a, int *cpu_b, int *cpu_c, int N) {
    int row, col, k, sum;
    for (row = 0; row < N; row++)            // row of a
        for (col = 0; col < N; col++) {      // column of b
            sum = 0;
            for (k = 0; k < N; k++)
                sum += cpu_a[row * N + k] * cpu_b[k * N + col];
            cpu_c[row * N + col] = sum;
        }
}
6

Matrix mapped onto a 2-D grid and 2-D blocks

[Figure: the array mapped onto the grid/block structure, one element per thread. The global row index of element A[row][column] is blockIdx.y * blockDim.y + threadIdx.y and the global column index is blockIdx.x * blockDim.x + threadIdx.x; within a block a thread is identified by threadIdx.x and threadIdx.y.]

Basically the array is divided into tiles and one tile is mapped onto one block.

Complete Program
(several slides)

// Matrix multiplication program MatrixMult.cu, Barry Wilkinson, Dec. 28, 2010.
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>

__global__ void gpu_matrixmult(int *gpu_a, int *gpu_b, int *gpu_c, int N) {
    ...                                      // kernel body as given on slide 5
}

void cpu_matrixmult(int *cpu_a, int *cpu_b, int *cpu_c, int N) {
    ...                                      // CPU version as given on slide 6
}
int main(int argc, char *argv[]) {
    int i, j;                                // loop counters
    int Grid_Dim_x = 1, Grid_Dim_y = 1;      // grid structure values
    int Block_Dim_x = 1, Block_Dim_y = 1;    // block structure values
    int noThreads_x, noThreads_y;            // number of threads available in device, each dimension
    int noThreads_block;                     // number of threads in a block
    int N = 10;                              // size of array in each dimension
    int *a, *b, *c, *d;
    int *dev_a, *dev_b, *dev_c;
    int size;                                // number of bytes in arrays
    cudaEvent_t start, stop;                 // using cuda events to measure time,
    float elapsed_time_ms;                   // which is applicable for asynchronous code also

/* -------------------- ENTER INPUT PARAMETERS AND ALLOCATE DATA -------------------- */

    // keyboard input

    dim3 Grid(Grid_Dim_x, Grid_Dim_y);       // grid structure
    dim3 Block(Block_Dim_x, Block_Dim_y);    // block structure, threads/block limited by specific device

    size = N * N * sizeof(int);              // number of bytes in total in arrays

    a = (int*) malloc(size);                 // dynamically allocated memory for arrays on host
    b = (int*) malloc(size);
    c = (int*) malloc(size);                 // results from GPU
    d = (int*) malloc(size);                 // results from CPU

    // load arrays with some numbers

/* ------------- COMPUTATION DONE ON GPU ----------------------------*/


cudaMalloc((void**)&dev_a, size);
cudaMalloc((void**)&dev_b, size);
cudaMalloc((void**)&dev_c, size);

// allocate memory on device

cudaMemcpy(dev_a, a , size ,cudaMemcpyHostToDevice);


cudaMemcpy(dev_b, b , size ,cudaMemcpyHostToDevice);
cudaEventRecord(start, 0);

Where you
measure time
will make a big
difference

// here start time, after memcpy

gpu_matrixmult<<<Grid,Block>>>(dev_a,dev_b,dev_c,N);
cudaMemcpy(c, dev_c, size , cudaMemcpyDeviceToHost);
cudaEventRecord(stop, 0);
// measuse end time
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time_ms, start, stop );
printf("Time to calculate results on GPU: %f ms.\n", elapsed_time_ms);

/* ------------------------ COMPUTATION DONE ON HOST CPU ------------------------ */

    cudaEventRecord(start, 0);               // use same timing*

    cpu_matrixmult(a, b, d, N);              // do calculation on host

    cudaEventRecord(stop, 0);                // measure end time
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_ms, start, stop);
    printf("Time to calculate results on CPU: %f ms.\n", elapsed_time_ms);  // exe. time

/* ------------------- check device creates correct results ------------------- */

/* --------------------------- repeat program --------------------------------- */

    // while loop to repeat calc with different parameters

/* ------------------------------- clean up ----------------------------------- */

    free(a); free(b); free(c); free(d);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}

I have found problems if I do the CPU timing before the GPU timing. Anyone else?
10
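The slides elide the keyboard input, the array initialization, and the results check; a minimal sketch of the last two (the rand() % 10 initialization matches the "random numbers 0-9" mentioned in the result slides, but is otherwise an assumption):

// load arrays with some numbers (values 0-9)
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        a[i * N + j] = rand() % 10;
        b[i * N + j] = rand() % 10;
    }

/* ... after both the GPU and CPU computations ... */

// check device creates correct results: compare GPU result c[] with CPU result d[]
int errors = 0;
for (i = 0; i < N * N; i++)
    if (c[i] != d[i]) errors++;
if (errors == 0) printf("Results match.\n");
else printf("Results DIFFER in %d elements!\n", errors);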

Some Preliminaries
Effects of First Launch
The program is written so that it can repeat with different parameters without stopping the program, to eliminate the effect of the first kernel launch.
It might also take advantage of caching, though this seems not as significant as the first-launch effect.

11

Some results
32 x 32 array
Random numbers 0-9
1 block of 32 x 32 threads
Speedup = 1.65, first time

Answer check: both CPU and GPU give the same answers.
12

Some results
32 x 32 array
1 block of 32 x 32 threads
Speedup = 2.12, second time

13

Some results
32 x 32 array
1 block of 32 x 32 threads
Speedup = 2.16, third time
Subsequently can vary 2.12 - 2.18

14

Some results
256 x 256 array
8 x 8 blocks of 32 x 32 threads
Speedup = 151.86
15

Some results
1024 x 1024 array
32 x 32 blocks of 32 x 32 threads
Speedup = 860.9

16

Some results
2048 x 2048 array
64 x 64 blocks of 32 x 32 threads
Speedup = ??
GPU appears to freeze. Why?

2¹¹ x 2¹¹ threads = 2²² threads
Memory needed = 2²² integers = 2²⁴ bytes = 16 Mbytes per array
Max number of threads on GPU appears to be 2¹⁶ x 2¹⁶ = 2³² threads
Server has 4 Gbytes of main memory
17

Some results
2048 x 2048 array
64 x 64 blocks of 32 x 32 threads
Speedup = ??

18

Different Array Sizes

Array size        Speedup*      Speedup**
32 x 32           2.18          1.26
256 x 256         151.86        110.63
1024 x 1024       860.9         733.44
2048 x 2048       ??
4096 x 4096

* These results include the time of the memory copy back from the device but not the memory copy to the device.
** These results include the time of the memory copy back from the device and the memory copy to the device.
Block size 32 x 32. Number of blocks to suit array size.

19

GPU Limitations
The previous program has limitations:
The number of threads must be equal to or greater than the number of array elements (the code will not work if the number of threads < number of array elements).
The number of threads/block and blocks/grid has GPU limitations, which will limit the size of arrays that can be processed.
Keyboard input must be checked for invalid grid and block values.
There are memory bandwidth issues.
20
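One way to guard against invalid grid/block values (a sketch, not part of the original program; the field names come from cudaDeviceProp) is to query the device limits at run time:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

// Hypothetical helper: abort if the requested grid/block shape exceeds the device limits.
void check_launch_dims(int grid_x, int grid_y, int block_x, int block_y) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                  // properties of device 0
    if (block_x * block_y > prop.maxThreadsPerBlock ||
        block_x > prop.maxThreadsDim[0] || block_y > prop.maxThreadsDim[1] ||
        grid_x  > prop.maxGridSize[0]   || grid_y  > prop.maxGridSize[1]) {
        fprintf(stderr, "Error: grid/block dimensions exceed device limits\n");
        exit(1);
    }
}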

Compute capability 1.x

Maximum number of threads per block                   = 512
Maximum size of x- and y-dimensions of a thread block = 512
Maximum size of each dimension of grid                = 65535

A maximum of 512 threads per block means a square 2-D block cannot be greater than 16 x 16 (256 threads).
So the maximum square array size is 16 x 65535 (2⁴ x 2¹⁶ = 2²⁰), i.e. an array of 2²⁰ x 2²⁰.
Is this a problem?
21

Compute capability 2.x (coit-grid06.uncc.edu)

Maximum number of threads per block                   = 1024
Maximum size of x- and y-dimensions of a thread block = 1024
Maximum size of each dimension of grid                = 65535

A maximum of 1024 threads per block means a square 2-D block cannot be greater than 32 x 32. Now all 1024 threads are allocated.
So the maximum square array size is 32 x 65535 (2⁵ x 2¹⁶ = 2²¹), i.e. an array of 2²¹ x 2²¹.
Is this a problem?

22

Increasing the size of arrays beyond the thread limitation of the GPU
-- using fewer threads than array elements*

Actually this one is easy and can draw upon the regular technique of sub-matrix multiplication:

[Figure: the matrix divided into sub-matrices]

Not in textbooks (not their tiling).

* Would you want to use fewer threads than elements?
23

To demonstrate that sub-matrix multiplication will produce the correct final answer, consider simple 2 x 2 sub-matrices:
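For example, with the matrices partitioned into a 2 x 2 arrangement of sub-matrices (s = 2), the sub-matrix form of C = A * B is

    C0,0 = A0,0*B0,0 + A0,1*B1,0        C0,1 = A0,0*B0,1 + A0,1*B1,1
    C1,0 = A1,0*B0,0 + A1,1*B1,0        C1,1 = A1,0*B0,1 + A1,1*B1,1

Expanding any one of these sub-matrix products element by element gives exactly the same sums of products as the ordinary row-times-column definition, so accumulating sub-matrix products produces the correct final C.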

24

Pseudo Code for Block Multiplication

Suppose N x N matrices, divided into s² sub-matrices. Each sub-matrix has N/s x N/s elements.

for (p = 0; p < s; p++)
    for (q = 0; q < s; q++) {
        Cp,q = 0;                        // clear elements of submatrix
        for (r = 0; r < s; r++)
            Cp,q = Cp,q + Ap,r * Br,q;   // submatrix multiplication, add to accum. submatrix
    }

Cp,q means the submatrix at row p and column q of matrix C.
Cp,q = Cp,q + Ap,r * Br,q; means multiply submatrices Ap,r and Br,q using matrix multiplication and add to submatrix Cp,q using matrix addition.
25

Code for Block Multiplication

N = ... ;                              // no of elements in cols/rows of matrices
s = ... ;                              // no of sub-matrices in each dimension, N/s an integer
ss = N/s;                              // no of elements in sub-matrix cols/rows

for (i = 0; i < N; i += ss)            // go thro sub-matrices of A
    for (j = 0; j < N; j += ss) {      // and sub-matrices of B

        for (p = i; p < i + ss; p++)           // clear elements of sub-matrix, Cp,q = 0
            for (q = j; q < j + ss; q++)
                C[p][q] = 0;

        for (r = 0; r < N; r += ss)            // step through the s pairs of sub-matrices Ap,r and Br,q
            for (p = i; p < i + ss; p++)       // submatrix multiplication
                for (q = j; q < j + ss; q++)
                    for (k = r; k < r + ss; k++)
                        C[p][q] += A[p][k] * B[k][q];   // add to accum. submatrix
    }

Each thread does one value of i and j (s² threads).
Not tested yet. Mistakes?
26

Effects of memory access in matrix multiplication

One thread is responsible for computing one result Cij and needs to access a row of A and a column of B:

[Figure: one thread reading a row of A and a column of B]

Each thread accesses one row of A and one column of B.
N² row/column combinations, N² threads.
27

Seen another way, in the first time period each thread accesses the first element in a row of A:

[Figure: Thread 0, Thread i, ..., Thread N-1 each accessing the first element of a different row of A]

Consider those threads that access different rows. Given the row-major order in which A is stored, those threads access locations that are not consecutive.

Bad -- cannot do memory coalescing.

Question: how many threads access the same location?
28

Next, each thread accesses the first element in a column of B:

[Figure: Thread 0, Thread i, ..., Thread N-1 each accessing the first element of a different column of B]

Consider those threads that access different columns. Given the row-major order in which B is stored, those threads access consecutive locations.

Good! Can do memory coalescing.

Question: how many threads access the same location?
29

How can we get better memory accesses and memory coalescing?

1. Transpose one array
Copy all rows of A to columns and all columns of A to rows before accessing A, and modify the program accordingly.
(Not mentioned in the course textbook or the other NVIDIA book, although it appears an obvious way -- see the next slides for whether it works!)
30

Sequential code for a transpose using the same array:

for (i = 0; i < N; i++)
    for (j = 0; j < i; j++) {
        temp = B[i][j];
        B[i][j] = B[j][i];
        B[j][i] = temp;
    }

(In my code, I use separate arrays.)

Could be done on the host prior to copying to the device.
What would the code look like if done on the device?
31
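A minimal sketch of what a device version might look like (an assumption -- the kernel name is hypothetical, and an out-of-place transpose into a separate array is used to avoid the read/write hazards of transposing in place with many threads):

__global__ void gpu_transpose(int *in, int *out, int N) {
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    if (col < N && row < N)
        out[col * N + row] = in[row * N + col];   // out is the transpose of in
}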

/* -------- COMPUTATION DONE ON GPU USING A TRANSPOSED ARRAY -------- */

    transposeArray(a, a_T, N);               // transpose array on host

    cudaEventRecord(start, 0);               // here time measured before host-device copy,
                                             // but not the transpose
    cudaEventSynchronize(start);             // Needed?

    cudaMemcpy(dev_a, a_T, size, cudaMemcpyHostToDevice);   // copy transposed A
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);     // copy B

    gpu_matrixmult_T<<<Grid,Block>>>(dev_a, dev_b, dev_c, N);

    cudaMemcpy(c_T, dev_c, size, cudaMemcpyDeviceToHost);

    cudaEventRecord(stop, 0);                // measure end time
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_ms2, start, stop);
    printf("Time to calculate results on GPU with transposed array: %f ms.\n",
           elapsed_time_ms2);                // print out execution time
32
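Neither transposeArray() nor gpu_matrixmult_T() is shown on the slides; a plausible sketch of both, assuming A is passed to the kernel already transposed so that element (row, k) of A is read as element (k, row) of a_T:

// host-side transpose into a separate array: a_T[j][i] = a[i][j]
void transposeArray(int *a, int *a_T, int N) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a_T[j * N + i] = a[i * N + j];
}

// same dot product as before, but A is accessed through its transpose
__global__ void gpu_matrixmult_T(int *a_T, int *b, int *c, int N) {
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    if (col < N && row < N) {
        int sum = 0;
        for (int k = 0; k < N; k++)
            sum += a_T[k * N + row] * b[k * N + col];   // a_T[k][row] == a[row][k]
        c[row * N + col] = sum;
    }
}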

Some results
8 x 8 array
1 block of 8 x 8 threads
Speedup = 1.62 over not transposing array
33

Some results
32 x 32 array
1 block of 32 x 32 threads
Speedup = 1.17 over not transposing array
34

Some results
256 x 256 array
8 x 8 blocks of 32 x 32 threads
Speedup = 0.89!! over not transposing array
35

Some results
1024 x 1024 array
32 x 32 blocks of 32 x 32 threads
Speedup = 0.93!! over not transposing array
36

2. Using shared memory with tiling

Copies of tiles are made in shared memory.

[Figure: tiles of the input matrices copied into shared-memory arrays As and Bs, used to compute the tile Cs]

Note: this is not using block matrix multiplication. The fundamental algorithm is just divided into phases.
Programming Massively Parallel Processors: A Hands-on Approach, Fig. 6.10, page 109.
37

Developing Code

1. Declaring shared memory:

__global__ void gpu_matrixmult(int* Md, int* Nd, int* Pd, int Width) {
    __shared__ int Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ int Nds[TILE_WIDTH][TILE_WIDTH];

This needs static memory allocation, so the size of the tile has to be fixed.
Convenient to choose the same as the kernel block size, say 32 x 32.
38
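A minimal sketch of the host-side setup this implies (choosing 32, and assuming N is a multiple of TILE_WIDTH; the names follow the kernel above):

#define TILE_WIDTH 32                        // tile size fixed at compile time

dim3 Block(TILE_WIDTH, TILE_WIDTH);          // block matches the tile
dim3 Grid(N / TILE_WIDTH, N / TILE_WIDTH);   // grid covers the N x N result
                                             // (assumes N is a multiple of TILE_WIDTH)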

2. Index into shared memory:

For convenience, declare tile (block) and thread indices, and indices into the final C array:

    int bx = blockIdx.x;
    int by = blockIdx.y;     // tile (block) indices

    int tx = threadIdx.x;
    int ty = threadIdx.y;    // thread indices

To access (later):

    Mds[ty][tx] = ...
    Nds[ty][tx] = ...        // element associated with thread
39

3. Global address:

For convenience, declare row and column:

    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;    // global indices

Note: same as the usual global thread ID and result index in normal matrix multiplication:

    int row = threadIdx.y + blockDim.y * blockIdx.y;
    int col = threadIdx.x + blockDim.x * blockIdx.x;
40

4. Loading shared memory:

Done using SIMT thread collaboration (very tricky):

    for (int m = 0; m < N/TILE_WIDTH; m++) {              // for each tile in the row or column

        Mds[ty][tx] = Md[Row*N + (m*TILE_WIDTH + tx)];    // all elements of the tile row transferred,
                                                          // one per thread (tx, ty from the thread ID)
        Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*N + Col];    // all elements of the tile column transferred,
                                                          // one per thread
        __syncthreads();                                  // wait for all threads in block

        // do matrix multiplication operations on pair of tiles

The book says this achieves memory coalescing, although it does not look like it does that in both cases.
41

Example (3 x 3 tiles)

Row = by * TILE_WIDTH + ty
Col = bx * TILE_WIDTH + tx
m = tile number: 0, 1, and 2

Global memory address into A (Md):  Row*N + (m*TILE_WIDTH + tx)
Global memory address into B (Nd):  (m*TILE_WIDTH + ty)*N + Col

[Figure: global arrays A and B showing the tiles addressed for each by/bx and m]

Based upon Programming Massively Parallel Processors: A Hands-on Approach, Fig. 6.10, page 109.
42
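As a worked example with hypothetical numbers: take TILE_WIDTH = 3 and N = 9. The thread with by = 1, bx = 0, ty = 2, tx = 1 has Row = 1*3 + 2 = 5 and Col = 0*3 + 1 = 1. For tile m = 1 it loads Md[Row*N + (m*TILE_WIDTH + tx)] = Md[5*9 + 4] = Md[49] and Nd[(m*TILE_WIDTH + ty)*N + Col] = Nd[5*9 + 1] = Nd[46].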

5. Matrix multiplication in shared memory:

    int Pvalue = 0;
    for (int m = 0; m < N/TILE_WIDTH; m++) {
        // copy tiles to shared memory (step 4)
        ...
        for (int k = 0; k < TILE_WIDTH; k++)     // multiply tiles, accumulating values
            Pvalue += Mds[ty][k] * Nds[k][tx];
    }
    Pd[Row * N + Col] = Pvalue;                  // copy back to global memory
}
43

Code given in the book: Programming Massively Parallel Processors: A Hands-on Approach, Fig. 6.11, page 110.

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) {

    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;          // thread ID

    // Identify row, column of Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;

    for (int m = 0; m < Width/TILE_WIDTH; m++) {         // loop over Md, Nd tiles to compute Pd element

        // load Md, Nd tiles into shared memory.
        // Note copying to shared memory is a collaboration between threads,
        // each thread doing one element of each array.
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; k++)
            Pvalue += Mds[ty][k] * Nds[k][tx];
    }

    Pd[Row][Col] = Pvalue;    // in the wrong place on page 110, although it does not affect the final answer!
}

A mistake in the book, as Pd is a pointer and the size of the array is unknown -- Pd[Row][Col] will not compile; it should be Pd[Row*Width + Col] = Pvalue;

Some results
32 x 32 array
1 block of 32 x 32 threads
Speedup:
GPU to CPU: 2.12
GPU shared memory to CPU: 2.73
GPU shared memory to GPU without shared memory: 1.28
45

Some results
256 x 256 array
8 x 8 blocks of 32 x 32 threads
Speedup:
GPU to CPU: 153
GPU shared memory to CPU: 217
GPU shared memory to GPU without shared memory: 1.41

46

Some results
1024 x 1024 array
32 x 32 blocks of 32 x 32 threads
Speedup:
GPU to CPU: 864
GPU shared memory to CPU: 2214 !!!
GPU shared memory to GPU without shared memory: 2.56

47

Some results
2048 x 2048 array
64 x 64 blocks of 32 x 32 threads
Speedup:
GPU to CPU: 989
GPU shared memory to CPU: 2962 !!
GPU shared memory to GPU without shared memory: 2.99

48

Different Array Sizes

                  Speedup
Array size        GPU to CPU    GPU using shared     GPU using shared memory to
                                memory to CPU        GPU not using shared memory
32 x 32           2.12          2.73                 1.28
256 x 256         153           217                  1.41
1024 x 1024       864           2214                 2.56
2048 x 2048       989           2962                 2.99
4096 x 4096

Block size 32 x 32. Number of blocks to suit array size.

49

Bandwidth improvements by
using shared memory
Using 32 x 32 tiles reduces the number of global memory accesses by a factor of 32 (two transfers per thread per tile phase instead of 2 x 32 transfers).
According to the PMPP book, page 90, using 16 x 16 tiles: it allows the 86.4 GB/s global bandwidth to serve a much larger floating-point computation rate ... can now support 86.4/4 x 16 = 345.6 gigaflops, very close to the peak floating-point performance of the G80 ... effectively removes global memory bandwidth as the major limiting factor of matrix multiplication performance.
50

Conclusions
Using the shared memory algorithm can make a significant difference: up to 3 times as fast on the GPU compared to not using this algorithm in the presented tests.
Speedup of almost 3000 over the CPU!
(Note though that the CPU is an old processor.)

51

This topic may be explored further in an assignment.
Need better tools.
NVIDIA offers a debugging tool called Parallel Nsight for Visual Studio/Windows:
http://parallelnsight.nvidia.com/
52

Further Reading
Programming Massively
Parallel Processors
A hands-on Approach
David B. Kirk and
Wen-mei W. Hwu
Morgan Kaufmann, 2010
This book is only on NVIDIA
GPUs and CUDA programming
despite its title.
53

Questions

Things in PMPP book (Ch 6) not covered yet:


Dynamic partitioning of SM resources
Data Prefetching
Instruction usage
Thread granularity
Also note page 108 says memory coalescing not
needed for shared memory!

55
