
Matrix Multiplication

Performance Improvements
These notes will introduce:
Basic matrix multiplication on a 2-D grid/block review
Limitations
Block matrix multiplication
Memory access patterns in matrix multiplication
Coalescing memory accesses
Transpose array
Using shared memory tiles
ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 6, 2011
MatrixMult.ppt

Matrix Multiplication
Matrix multiplication is an important operation in HPC and appears in many applications.

C = A * B

where A, B, and C are matrices (two-dimensional arrays).

A restricted case is when B has only one column -- matrix-vector multiplication, which appears in the representation of linear equations and partial differential equations.
2

Matrix multiplication, C = A x B

Implementing Matrix Multiplication


Sequential Code
Assume the matrices are square (N x N matrices).

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        c[i][j] = 0;
        for (k = 0; k < N; k++)
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
    }

Requires n³ multiplications and n³ additions.
Sequential time complexity of O(n³).
Very easy to parallelize.
4

CUDA Kernel
for multiplying two arrays
__global__ void gpu_matrixmult(int *gpu_a, int *gpu_b, int *gpu_c, int N) {
    int k, sum = 0;
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    if (col < N && row < N) {
        for (k = 0; k < N; k++)
            sum += gpu_a[row * N + k] * gpu_b[k * N + col];
        gpu_c[row * N + col] = sum;
    }
}

In this example, one thread computes one element of C, so the number of threads must be equal to or greater than the number of elements.
5
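A minimal sketch of how the host might size the launch so that the thread count covers the array (the ceiling-division form here is an assumption; the full program later reads the grid and block dimensions from the keyboard):

dim3 Block(16, 16);                          // 256 threads per block
dim3 Grid((N + Block.x - 1) / Block.x,       // ceiling division so that
          (N + Block.y - 1) / Block.y);      // Grid * Block >= N in each dimension
gpu_matrixmult<<<Grid, Block>>>(dev_a, dev_b, dev_c, N);
// extra threads beyond N are masked off by the "if (col < N && row < N)" test in the kernel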

Sequential version with flattened arrays, for comparison
void cpu_matrixmult(int *cpu_a, int *cpu_b, int *cpu_c, int N) {
    int row, col, k, sum;
    for (row = 0; row < N; row++)            // row of a
        for (col = 0; col < N; col++) {      // column of b
            sum = 0;
            for (k = 0; k < N; k++)
                sum += cpu_a[row * N + k] * cpu_b[k * N + col];
            cpu_c[row * N + col] = sum;
        }
}
6

Matrix mapped onto a 2-D grid and 2-D blocks

[Figure: the array mapped onto the grid/block structure, one element per thread. The global row index of element A[row][column] is blockIdx.y * blockDim.y + threadIdx.y and the global column index is blockIdx.x * blockDim.x + threadIdx.x; within a block a thread is identified by threadIdx.x and threadIdx.y.]

Basically the array is divided into tiles and one tile is mapped onto one block.

Complete Program
(several slides)

// Matrix multiplication program MatrixMult.cu, Barry Wilkinson, Dec. 28, 2010.
#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>

__global__ void gpu_matrixmult(int *gpu_a, int *gpu_b, int *gpu_c, int N) {
    ...                                      // kernel body as given on slide 5
}

void cpu_matrixmult(int *cpu_a, int *cpu_b, int *cpu_c, int N) {
    ...                                      // CPU version as given on slide 6
}
int main(int argc, char *argv[]) {
    int i, j;                                // loop counters
    int Grid_Dim_x = 1, Grid_Dim_y = 1;      // grid structure values
    int Block_Dim_x = 1, Block_Dim_y = 1;    // block structure values
    int noThreads_x, noThreads_y;            // number of threads available in device, each dimension
    int noThreads_block;                     // number of threads in a block
    int N = 10;                              // size of array in each dimension
    int *a, *b, *c, *d;
    int *dev_a, *dev_b, *dev_c;
    int size;                                // number of bytes in arrays
    cudaEvent_t start, stop;                 // using cuda events to measure time,
    float elapsed_time_ms;                   // which is applicable for asynchronous code also

/* -------------------- ENTER INPUT PARAMETERS AND ALLOCATE DATA -------------------- */

    // keyboard input

    dim3 Grid(Grid_Dim_x, Grid_Dim_y);       // grid structure
    dim3 Block(Block_Dim_x, Block_Dim_y);    // block structure, threads/block limited by specific device

    size = N * N * sizeof(int);              // number of bytes in total in arrays

    a = (int*) malloc(size);                 // dynamically allocated memory for arrays on host
    b = (int*) malloc(size);
    c = (int*) malloc(size);                 // results from GPU
    d = (int*) malloc(size);                 // results from CPU

    // load arrays with some numbers

/* ------------- COMPUTATION DONE ON GPU ----------------------------*/


cudaMalloc((void**)&dev_a, size);
cudaMalloc((void**)&dev_b, size);
cudaMalloc((void**)&dev_c, size);

// allocate memory on device

cudaMemcpy(dev_a, a , size ,cudaMemcpyHostToDevice);


cudaMemcpy(dev_b, b , size ,cudaMemcpyHostToDevice);
cudaEventRecord(start, 0);

Where you
measure time
will make a big
difference

// here start time, after memcpy

gpu_matrixmult<<<Grid,Block>>>(dev_a,dev_b,dev_c,N);
cudaMemcpy(c, dev_c, size , cudaMemcpyDeviceToHost);
cudaEventRecord(stop, 0);
// measuse end time
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time_ms, start, stop );
printf("Time to calculate results on GPU: %f ms.\n", elapsed_time_ms);

/* ------------------------ COMPUTATION DONE ON HOST CPU ------------------------ */

    cudaEventRecord(start, 0);               // use same timing*

    cpu_matrixmult(a, b, d, N);              // do calculation on host

    cudaEventRecord(stop, 0);                // measure end time
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_ms, start, stop);
    printf("Time to calculate results on CPU: %f ms.\n", elapsed_time_ms);  // exe. time

/* ------------------- check device creates correct results ------------------- */

/* --------------------------- repeat program --------------------------------- */

    // while loop to repeat calc with different parameters

/* ------------------------------- clean up ----------------------------------- */

    free(a); free(b); free(c); free(d);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}

I have found problems if I do the CPU timing before the GPU timing. Anyone else?
10
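The slides elide the keyboard input, the array initialization, and the results check; a minimal sketch of the last two (the rand() % 10 initialization matches the "random numbers 0-9" mentioned in the result slides, but is otherwise an assumption):

// load arrays with some numbers (values 0-9)
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        a[i * N + j] = rand() % 10;
        b[i * N + j] = rand() % 10;
    }

/* ... after both the GPU and CPU computations ... */

// check device creates correct results: compare GPU result c[] with CPU result d[]
int errors = 0;
for (i = 0; i < N * N; i++)
    if (c[i] != d[i]) errors++;
if (errors == 0) printf("Results match.\n");
else printf("Results DIFFER in %d elements!\n", errors);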

Some Preliminaries
Effects of First Launch
The program is written so that it can repeat with different parameters without stopping the program, to eliminate the effect of the first kernel launch.
It might also take advantage of caching, though this seems not as significant as the first-launch effect.

11

Some results
32 x 32 array
Random numbers 0-9
1 block of 32 x 32 threads
Speedup = 1.65, first time

Answer check: both CPU and GPU give the same answers.
12

Some results
32 x 32 array
1 block of 32 x 32 threads
Speedup = 2.12, second time

13

Some results
32 x 32 array
1 block of 32 x 32 threads
Speedup = 2.16, third time
Subsequently can vary 2.12 - 2.18

14

Some results
256 x 256 array
8 x 8 blocks of 32 x 32 threads
Speedup = 151.86
15

Some results
1024 x 1024 array
32 x 32 blocks of 32 x 32 threads
Speedup = 860.9

16

Some results
2048 x 2048 array
64 x 64 blocks of 32 x 32 threads
Speedup = ??
GPU appears to freeze. Why?

2¹¹ x 2¹¹ threads = 2²² threads
Memory needed = 2²² integers = 2²⁴ bytes = 16 Mbytes per array
Max number of threads on GPU appears to be 2¹⁶ x 2¹⁶ = 2³² threads
Server has 4 Gbytes of main memory
17

Some results
2048 x 2048 array
64 x 64 blocks of 32 x 32 threads
Speedup = ??

18

Different Array Sizes

Array size        Speedup*      Speedup**
32 x 32           2.18          1.26
256 x 256         151.86        110.63
1024 x 1024       860.9         733.44
2048 x 2048       ??
4096 x 4096

* These results include the time of the memory copy back from the device but not the memory copy to the device.
** These results include the time of the memory copy back from the device and the memory copy to the device.
Block size 32 x 32. Number of blocks to suit array size.

19

GPU Limitations
The previous program has limitations:
The number of threads must be equal to or greater than the number of array elements (the code will not work if the number of threads < number of array elements).
The number of threads/block and blocks/grid has GPU limitations, which will limit the size of arrays that can be processed.
Keyboard input must be checked for invalid grid and block values.
There are memory bandwidth issues.
20
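One way to guard against invalid grid/block values (a sketch, not part of the original program; the field names come from cudaDeviceProp) is to query the device limits at run time:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

// Hypothetical helper: abort if the requested grid/block shape exceeds the device limits.
void check_launch_dims(int grid_x, int grid_y, int block_x, int block_y) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                  // properties of device 0
    if (block_x * block_y > prop.maxThreadsPerBlock ||
        block_x > prop.maxThreadsDim[0] || block_y > prop.maxThreadsDim[1] ||
        grid_x  > prop.maxGridSize[0]   || grid_y  > prop.maxGridSize[1]) {
        fprintf(stderr, "Error: grid/block dimensions exceed device limits\n");
        exit(1);
    }
}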

Compute capability 1.x

Maximum number of threads per block                   = 512
Maximum size of x- and y-dimensions of a thread block = 512
Maximum size of each dimension of grid                = 65535

A maximum of 512 threads per block means a square 2-D block cannot be greater than 16 x 16 (256 threads).
So the maximum square array size is 16 x 65535 (2⁴ x 2¹⁶ = 2²⁰), i.e. an array of 2²⁰ x 2²⁰.
Is this a problem?
21

Compute capability 2.x (coit-grid06.uncc.edu)

Maximum number of threads per block                   = 1024
Maximum size of x- and y-dimensions of a thread block = 1024
Maximum size of each dimension of grid                = 65535

A maximum of 1024 threads per block means a square 2-D block cannot be greater than 32 x 32. Now all 1024 threads are allocated.
So the maximum square array size is 32 x 65535 (2⁵ x 2¹⁶ = 2²¹), i.e. an array of 2²¹ x 2²¹.
Is this a problem?

22

Increasing the size of arrays beyond the thread limitation of the GPU
-- using fewer threads than array elements*

Actually this one is easy and can draw upon the regular technique of sub-matrix multiplication:

[Figure: the matrix divided into sub-matrices]

Not in textbooks (not their tiling).

* Would you want to use fewer threads than elements?
23

To demonstrate that sub-matrix multiplication will produce the correct final answer, consider simple 2 x 2 sub-matrices:
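For example, with the matrices partitioned into a 2 x 2 arrangement of sub-matrices (s = 2), the sub-matrix form of C = A * B is

    C0,0 = A0,0*B0,0 + A0,1*B1,0        C0,1 = A0,0*B0,1 + A0,1*B1,1
    C1,0 = A1,0*B0,0 + A1,1*B1,0        C1,1 = A1,0*B0,1 + A1,1*B1,1

Expanding any one of these sub-matrix products element by element gives exactly the same sums of products as the ordinary row-times-column definition, so accumulating sub-matrix products produces the correct final C.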

24

Pseudo Code for Block Multiplication

Suppose N x N matrices, divided into s² sub-matrices. Each sub-matrix has N/s x N/s elements.

for (p = 0; p < s; p++)
    for (q = 0; q < s; q++) {
        Cp,q = 0;                        // clear elements of submatrix
        for (r = 0; r < s; r++)
            Cp,q = Cp,q + Ap,r * Br,q;   // submatrix multiplication, add to accum. submatrix
    }

Cp,q means the submatrix at row p and column q of matrix C.
Cp,q = Cp,q + Ap,r * Br,q; means multiply submatrices Ap,r and Br,q using matrix multiplication and add to submatrix Cp,q using matrix addition.
25

Code for Block Multiplication

N = ... ;                              // no of elements in cols/rows of matrices
s = ... ;                              // no of sub-matrices in each dimension, N/s an integer
ss = N/s;                              // no of elements in sub-matrix cols/rows

for (i = 0; i < N; i += ss)            // go thro sub-matrices of A
    for (j = 0; j < N; j += ss) {      // and sub-matrices of B

        for (p = i; p < i + ss; p++)           // clear elements of sub-matrix, Cp,q = 0
            for (q = j; q < j + ss; q++)
                C[p][q] = 0;

        for (r = 0; r < N; r += ss)            // step through the s pairs of sub-matrices Ap,r and Br,q
            for (p = i; p < i + ss; p++)       // submatrix multiplication
                for (q = j; q < j + ss; q++)
                    for (k = r; k < r + ss; k++)
                        C[p][q] += A[p][k] * B[k][q];   // add to accum. submatrix
    }

Each thread does one value of i and j (s² threads).
Not tested yet. Mistakes?
26

Effects of memory access in matrix multiplication

One thread is responsible for computing one result Cij and needs to access a row of A and a column of B:

[Figure: one thread reading a row of A and a column of B]

Each thread accesses one row of A and one column of B.
N² row/column combinations, N² threads.
27

Seen another way, in the first time period each thread accesses the first element in a row of A:

[Figure: Thread 0, Thread i, ..., Thread N-1 each accessing the first element of a different row of A]

Consider those threads that access different rows. Given the row-major order in which A is stored, those threads access locations that are not consecutive.

Bad -- cannot do memory coalescing.

Question: how many threads access the same location?
28

Next, each thread accesses the first element in a column of B:

[Figure: Thread 0, Thread i, ..., Thread N-1 each accessing the first element of a different column of B]

Consider those threads that access different columns. Given the row-major order in which B is stored, those threads access consecutive locations.

Good! Can do memory coalescing.

Question: how many threads access the same location?
29

How can we get better memory accesses and memory coalescing?

1. Transpose one array
Copy all rows of A to columns and all columns of A to rows before accessing A, and modify the program accordingly.
(Not mentioned in the course textbook or the other NVIDIA book, although it appears an obvious way -- see the next slides for whether it works!)
30

Sequential code for a transpose using the same array:

for (i = 0; i < N; i++)
    for (j = 0; j < i; j++) {
        temp = B[i][j];
        B[i][j] = B[j][i];
        B[j][i] = temp;
    }

(In my code, I use separate arrays.)

Could be done on the host prior to copying to the device.
What would the code look like if done on the device?
31
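A minimal sketch of what a device version might look like (an assumption -- the kernel name is hypothetical, and an out-of-place transpose into a separate array is used to avoid the read/write hazards of transposing in place with many threads):

__global__ void gpu_transpose(int *in, int *out, int N) {
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    if (col < N && row < N)
        out[col * N + row] = in[row * N + col];   // out is the transpose of in
}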

/* -------- COMPUTATION DONE ON GPU USING A TRANSPOSED ARRAY -------- */

    transposeArray(a, a_T, N);               // transpose array on host

    cudaEventRecord(start, 0);               // here time measured before host-device copy,
                                             // but not the transpose
    cudaEventSynchronize(start);             // Needed?

    cudaMemcpy(dev_a, a_T, size, cudaMemcpyHostToDevice);   // copy transposed A
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);     // copy B

    gpu_matrixmult_T<<<Grid,Block>>>(dev_a, dev_b, dev_c, N);

    cudaMemcpy(c_T, dev_c, size, cudaMemcpyDeviceToHost);

    cudaEventRecord(stop, 0);                // measure end time
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_ms2, start, stop);
    printf("Time to calculate results on GPU with transposed array: %f ms.\n",
           elapsed_time_ms2);                // print out execution time
32
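Neither transposeArray() nor gpu_matrixmult_T() is shown on the slides; a plausible sketch of both, assuming A is passed to the kernel already transposed so that element (row, k) of A is read as element (k, row) of a_T:

// host-side transpose into a separate array: a_T[j][i] = a[i][j]
void transposeArray(int *a, int *a_T, int N) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a_T[j * N + i] = a[i * N + j];
}

// same dot product as before, but A is accessed through its transpose
__global__ void gpu_matrixmult_T(int *a_T, int *b, int *c, int N) {
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    if (col < N && row < N) {
        int sum = 0;
        for (int k = 0; k < N; k++)
            sum += a_T[k * N + row] * b[k * N + col];   // a_T[k][row] == a[row][k]
        c[row * N + col] = sum;
    }
}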

Some results
8 x 8 array
1 block of 8 x 8 threads
Speedup = 1.62 over not transposing array
33

Some results
32 x 32 array
1 block of 32 x 32 threads
Speedup = 1.17 over not transposing array
34

Some results
256 x 256 array
8 x 8 blocks of 32 x 32 threads
Speedup = 0.89!! over not transposing array
35

Some results
1024 x 1024 array
32 x 32 blocks of 32 x 32 threads
Speedup = 0.93!! over not transposing array
36

2. Using shared memory with tiling

Copies of tiles are made in shared memory.

[Figure: tiles of the input matrices copied into shared-memory arrays As and Bs, used to compute the tile Cs]

Note: this is not using block matrix multiplication. The fundamental algorithm is just divided into phases.
Programming Massively Parallel Processors: A Hands-on Approach, Fig. 6.10, page 109.
37

Developing Code

1. Declaring shared memory:

__global__ void gpu_matrixmult(int* Md, int* Nd, int* Pd, int Width) {
    __shared__ int Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ int Nds[TILE_WIDTH][TILE_WIDTH];

This needs static memory allocation, so the size of the tile has to be fixed.
Convenient to choose the same as the kernel block size, say 32 x 32.
38
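A minimal sketch of the host-side setup this implies (choosing 32, and assuming N is a multiple of TILE_WIDTH; the names follow the kernel above):

#define TILE_WIDTH 32                        // tile size fixed at compile time

dim3 Block(TILE_WIDTH, TILE_WIDTH);          // block matches the tile
dim3 Grid(N / TILE_WIDTH, N / TILE_WIDTH);   // grid covers the N x N result
                                             // (assumes N is a multiple of TILE_WIDTH)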

2. Index into shared memory:

For convenience, declare tile (block) and thread indices, and indices into the final C array:

    int bx = blockIdx.x;
    int by = blockIdx.y;     // tile (block) indices

    int tx = threadIdx.x;
    int ty = threadIdx.y;    // thread indices

To access (later):

    Mds[ty][tx] = ...
    Nds[ty][tx] = ...        // element associated with thread
39

3. Global address:

For convenience, declare row and column:

    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;    // global indices

Note: same as the usual global thread ID and result index in normal matrix multiplication:

    int row = threadIdx.y + blockDim.y * blockIdx.y;
    int col = threadIdx.x + blockDim.x * blockIdx.x;
40

4. Loading shared memory:

Done using SIMT thread collaboration (very tricky):

    for (int m = 0; m < N/TILE_WIDTH; m++) {              // for each tile in the row or column

        Mds[ty][tx] = Md[Row*N + (m*TILE_WIDTH + tx)];    // all elements of the tile row transferred,
                                                          // one per thread (tx, ty from the thread ID)
        Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*N + Col];    // all elements of the tile column transferred,
                                                          // one per thread
        __syncthreads();                                  // wait for all threads in block

        // do matrix multiplication operations on pair of tiles

The book says this achieves memory coalescing, although it does not look like it does that in both cases.
41

Example (3 x 3 tiles)

Row = by * TILE_WIDTH + ty
Col = bx * TILE_WIDTH + tx
m = tile number: 0, 1, and 2

Global memory address into A (Md):  Row*N + (m*TILE_WIDTH + tx)
Global memory address into B (Nd):  (m*TILE_WIDTH + ty)*N + Col

[Figure: global arrays A and B showing the tiles addressed for each by/bx and m]

Based upon Programming Massively Parallel Processors: A Hands-on Approach, Fig. 6.10, page 109.
42
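As a worked example with hypothetical numbers: take TILE_WIDTH = 3 and N = 9. The thread with by = 1, bx = 0, ty = 2, tx = 1 has Row = 1*3 + 2 = 5 and Col = 0*3 + 1 = 1. For tile m = 1 it loads Md[Row*N + (m*TILE_WIDTH + tx)] = Md[5*9 + 4] = Md[49] and Nd[(m*TILE_WIDTH + ty)*N + Col] = Nd[5*9 + 1] = Nd[46].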

5. Matrix multiplication in shared memory:

    int Pvalue = 0;
    for (int m = 0; m < N/TILE_WIDTH; m++) {
        // copy tiles to shared memory (step 4)
        ...
        for (int k = 0; k < TILE_WIDTH; k++)     // multiply tiles, accumulating values
            Pvalue += Mds[ty][k] * Nds[k][tx];
    }
    Pd[Row * N + Col] = Pvalue;                  // copy back to global memory
}
43

Code given in the book: Programming Massively Parallel Processors: A Hands-on Approach, Fig. 6.11, page 110.

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) {

    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;          // thread ID

    // Identify row, column of Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;

    for (int m = 0; m < Width/TILE_WIDTH; m++) {         // loop over Md, Nd tiles to compute Pd element

        // load Md, Nd tiles into shared memory.
        // Note copying to shared memory is a collaboration between threads,
        // each thread doing one element of each array.
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; k++)
            Pvalue += Mds[ty][k] * Nds[k][tx];
    }

    Pd[Row][Col] = Pvalue;    // in the wrong place on page 110, although it does not affect the final answer!
}

A mistake in the book, as Pd is a pointer and the size of the array is unknown -- Pd[Row][Col] will not compile; it should be Pd[Row*Width + Col] = Pvalue;

Some results
32 x 32 array
1 block of 32 x 32 threads
Speedup:
GPU to CPU: 2.12
GPU shared memory to CPU: 2.73
GPU shared memory to GPU without shared memory: 1.28
45

Some results
256 x 256 array
8 x 8 blocks of 32 x 32 threads
Speedup:
GPU to CPU: 153
GPU shared memory to CPU: 217
GPU shared memory to GPU without shared memory: 1.41

46

Some results
1024 x 1024 array
32 x 32 blocks of 32 x 32 threads
Speedup:
GPU to CPU: 864
GPU shared memory to CPU: 2214 !!!
GPU shared memory to GPU without shared memory: 2.56

47

Some results
2048 x 2048 array
64 x 64 blocks of 32 x 32 threads
Speedup:
GPU to CPU: 989
GPU shared memory to CPU: 2962 !!
GPU shared memory to GPU without shared memory: 2.99

48

Different Array Sizes

                  Speedup
Array size        GPU to CPU    GPU using shared     GPU using shared memory to
                                memory to CPU        GPU not using shared memory
32 x 32           2.12          2.73                 1.28
256 x 256         153           217                  1.41
1024 x 1024       864           2214                 2.56
2048 x 2048       989           2962                 2.99
4096 x 4096

Block size 32 x 32. Number of blocks to suit array size.

49

Bandwidth improvements by
using shared memory
Using 32 x 32 tiles reduces the number of global memory accesses by a factor of 32 (two transfers per thread per tile phase instead of 2 x 32 transfers).
According to the PMPP book, page 90, using 16 x 16 tiles: it allows the 86.4 GB/s global bandwidth to serve a much larger floating-point computation rate ... can now support 86.4/4 x 16 = 345.6 gigaflops, very close to the peak floating-point performance of the G80 ... effectively removes global memory bandwidth as the major limiting factor of matrix multiplication performance.
50

Conclusions
Using the shared memory algorithm can make a significant difference: up to 3 times as fast on the GPU compared to not using this algorithm in the presented tests.
Speedup of almost 3000 over the CPU!
(Note though that the CPU is an old processor.)

51

This topic may be explored further in an assignment.
Need better tools.
NVIDIA offers a debugging tool called Parallel Nsight for Visual Studio/Windows:
http://parallelnsight.nvidia.com/
52

Further Reading
Programming Massively
Parallel Processors
A hands-on Approach
David B. Kirk and
Wen-mei W. Hwu
Morgan Kaufmann, 2010
This book is only on NVIDIA
GPUs and CUDA programming
despite its title.
53

Questions

Things in PMPP book (Ch 6) not covered yet:


Dynamic partitioning of SM resources
Data Prefetching
Instruction usage
Thread granularity
Also note page 108 says memory coalescing not
needed for shared memory!

55
