
CUDA Threads

David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE498AL, University of Illinois, Urbana-Champaign

CUDA Thread Block


All threads in a block execute the same kernel program (SPMD). Programmer declares block:
- Block size: 1 to 512 concurrent threads
- Block shape: 1D, 2D, or 3D
- Block dimensions in threads

Thread program uses thread id to select work and address shared data

CUDA Thread Block


Threads have thread id numbers (0, 1, 2, 3, …, m) within their block

[Figure: a thread block with Thread IDs 0 … m, each thread running the same thread program]
- Threads in the same block share data and synchronize while doing their share of the work
- Threads in different blocks cannot cooperate
- Each block can execute in any order relative to other blocks!

Courtesy: John Nickolls, NVIDIA



CUDA Thread Organization Implementation


blockIdx & threadIdx: unique coordinates that distinguish threads and identify for each thread the appropriate portion of the data to process:
- Assigned to threads by the CUDA runtime system
- Appear as built-in variables that are initialized by the runtime system and accessed within kernel functions
- References to the blockIdx and threadIdx variables return the appropriate values that form the coordinates of the thread

gridDim & blockDim: built-in variables which provide the dimensions of the grid and block

Example Thread Organization


M = 8 threads/block (1D organization): threadIdx.x = 0, 1, 2, …, 7

N thread blocks (1D organization): blockIdx.x = 0, 1, …, N-1; 8*N threads/grid: blockDim.x = 8; gridDim.x = N

In the code of the figure:

threadID = blockIdx.x*blockDim.x + threadIdx.x

For Thread 3 of Block 5: threadID = 5*8 + 3 = 43
[Figure: Thread Block 0, Thread Block 1, …, Thread Block N-1, each with threadID 0 1 2 3 4 5 6 7, and each running the same code:]

float x = input[threadID];
float y = func(x);
output[threadID] = y;

General Thread Organization


Grid: 2D array of blocks; Block: 3D array of threads. The Execution Configuration determines the exact organization. In the example of the previous slide, if N=128 & M=32, the following Execution Configuration is used:

dim3 dimGrid(128, 1, 1);  // dimGrid.x, dimGrid.y take values 1 to 65,535
dim3 dimBlock(32, 1, 1);  // size of block limited to 512 threads
KernelFunction<<<dimGrid, dimBlock>>>(...);

blockIdx.x ranges between 0 and dimGrid.x-1


Another Example

dim3 dimGrid(2, 2, 1);
dim3 dimBlock(4, 2, 2);
KernelFunction<<<dimGrid, dimBlock>>>(...);

Block(1,0) has blockIdx.x=1 & blockIdx.y=0

[Figure: the Host launches Kernel 1 on Grid 1, a 2x2 arrangement of Blocks (0,0), (0,1), (1,0), (1,1), and then Kernel 2 on Grid 2. Block (1,1) of Grid 2 is expanded to show its 4x2x2 threads, e.g. Thread (0,0,0) … Thread (3,1,0) and Thread (0,0,1) … Thread (3,0,1).]
Thread(2,1,0) has threadIdx.x=2, threadIdx.y=1, threadIdx.z=0



Matrix Multiplication Using Multiple Blocks


Break up Pd into tiles. Each thread block calculates one tile; block size equals tile size. Each thread calculates one Pd element, identified using:
- blockIdx.x & blockIdx.y to identify the tile
- threadIdx.x, threadIdx.y to identify the thread within the tile
[Figure: Md, Nd, and Pd, each WIDTH x WIDTH. Pd is divided into TILE_WIDTH x TILE_WIDTH tiles; bx = blockIdx.x and by = blockIdx.y select the tile Pdsub, while tx = threadIdx.x (0 … TILE_WIDTH-1) and ty = threadIdx.y select the thread within the tile.]
Revised Matrix Multiplication using Multiple Blocks (cont.)


- The x index of the Pd element computed by a thread is x = bx*TILE_WIDTH + tx
- The y index of the Pd element computed by a thread is y = by*TILE_WIDTH + ty
- Thread (tx,ty) in block (bx,by) uses row y of Md and column x of Nd to update element Pd[y*Width+x]
A Small Example: P(4,4)


- For Block(0,0): blockIdx.x = 0 & blockIdx.y = 0
- For Block(1,0): blockIdx.x = 1 & blockIdx.y = 0

TILE_WIDTH = 2

P0,0 P1,0 P2,0 P3,0
P0,1 P1,1 P2,1 P3,1
P0,2 P1,2 P2,2 P3,2
P0,3 P1,3 P2,3 P3,3


- For Block(0,1): blockIdx.x = 0 & blockIdx.y = 1
- For Block(1,1): blockIdx.x = 1 & blockIdx.y = 1

A Small Example: Multiplication


- Thread(0,0) of Block(0,0) calculates Pd0,0
- Thread(0,0) of Block(0,1) calculates Pd0,2
- Thread(1,1) of Block(0,0) calculates Pd1,1
- Thread(1,1) of Block(0,1) calculates Pd1,3
Md: Md0,0 Md1,0 Md2,0 Md3,0
    Md0,1 Md1,1 Md2,1 Md3,1

Nd: Nd0,0 Nd1,0
    Nd0,1 Nd1,1
    Nd0,2 Nd1,2
    Nd0,3 Nd1,3

Pd: Pd0,0 Pd1,0 Pd2,0 Pd3,0
    Pd0,1 Pd1,1 Pd2,1 Pd3,1
    Pd0,2 Pd1,2 Pd2,2 Pd3,2
    Pd0,3 Pd1,3 Pd2,3 Pd3,3


Revised Host Code for Launching the Revised Kernel


// Setup the execution configuration
dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH, 1);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);

// Launch the device computation threads
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

With TILE_WIDTH = 16, the kernel can handle arrays of dimensions up to 1,048,560 x 1,048,560 (16 x 65,535 = 1,048,560), using 65,535 x 65,535 = 4,294,836,225 blocks, each with 256 threads. Total number of parallel threads = 1,099,478,073,600 > 1 Tera (10^12) threads.

Revised Matrix Multiplication Kernel using Multiple Blocks


__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and Md
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and Nd
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];
    Pd[Row*Width+Col] = Pvalue;
}

Transparent Scalability
- The CUDA runtime system can execute blocks in any order
- A kernel scales across any number of execution resources

Transparent scalability: the ability to execute the same application on hardware with a different number of execution resources.

[Figure: a kernel grid of Blocks 0-7. A device with few resources executes two blocks at a time (Blocks 0-1, then 2-3, …), while a device with more resources executes four at a time (Blocks 0-3, then 4-7), finishing in less time.]

Each block can execute in any order relative to other blocks.



Thread Block Assignment to Streaming Multiprocessors

[Figure: two Streaming Multiprocessors (SM 0 and SM 1), each with an MT IU, SPs, and Shared Memory; queues of Blocks, each holding threads t0 t1 t2 … tm, are assigned to the SMs]

Upon kernel launch, threads are assigned to Streaming Multiprocessors (SMs) at block granularity:
- Up to 8 blocks are assigned to each SM, as resources allow
- An SM in GT200 can take up to 1024 threads: could be 256 threads/block * 4 blocks, or 128 threads/block * 8 blocks, etc.

Thread Block Assignment to Streaming Multiprocessors (cont.)

GT200 has 30 SMs:
- Up to 240 (30*8) blocks execute simultaneously
- Up to 30,720 (30*1024) concurrent threads can be resident in SMs for execution
- The SM maintains thread/block id numbers
- The SM manages/schedules thread execution


GT200 Example: Thread Scheduling


Each block is divided into 32-thread warps for scheduling purposes:
- The size of a warp is implementation dependent and not part of the CUDA programming model
- The warp is the unit of thread scheduling in an SM

[Figure: warps of Block 1 and Block 2 (each warp holding threads t0 t1 t2 … t31) feeding a Streaming Multiprocessor with an Instruction L1 cache, Instruction Fetch/Dispatch, Shared Memory, SPs, and SFUs]
If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
- Each block is divided into 256/32 = 8 warps
- There are 8 * 3 = 24 warps in the SM


GT200 Example: Thread Scheduling (Cont.)


The SM implements zero-overhead warp scheduling:
- At any time, only one of the warps is executed by the SM
- Warps whose next instruction has its operands ready for consumption are eligible for execution
- Eligible warps are selected for execution by a prioritized scheduling policy
- All threads in a warp execute the same instruction when it is selected (SIMD execution)
[Figure: scheduling timeline. TB1 W1 issues instructions until it stalls, then the scheduler switches to TB2 W1, then TB3 W1 and TB3 W2 as earlier warps stall, eventually returning to TB1 W1, TB1 W2, TB1 W3, and TB3 W2. TB = Thread Block, W = Warp.]


GT200 Block Granularity Considerations


For matrix multiplication using multiple blocks, should I use 8x8, 16x16, or 32x32 blocks?
- For 8x8 blocks, we have 64 threads per block. Since each SM can take up to 1024 threads, that would be 16 blocks (1024/64). However, each SM can only take up to 8 blocks, so only 512 (64*8) threads will go into each SM! SM execution resources are underutilized; there are fewer warps to schedule around long-latency operations.
- For 16x16 blocks, we have 256 threads per block. Since each SM can take up to 1024 threads, it can take up to 4 blocks (1024/256): full thread capacity in each SM, and the maximal number of warps for scheduling around long-latency operations (1024/32 = 32 warps).
- For 32x32 blocks, we have 1024 threads per block. Not even one fits into an SM! (512 threads/block limitation)

Some Additional API Features


Application Programming Interface


The API is an extension to the C programming language. It consists of:
- Language extensions, to target portions of the code for execution on the device
- A runtime library split into:
  1. A host component to control and access one or more devices from the host
  2. A device component providing device-specific functions
  3. A common component providing built-in vector types and a subset of the C runtime library in both host and device code

Language Extensions: Built-in Variables


dim3 gridDim;
Dimensions of the grid in blocks (gridDim.z unused)

dim3 blockDim;
Dimensions of the block in threads

dim3 blockIdx;
Block index within the grid

dim3 threadIdx;
Thread index within the block

Common Runtime Component: Mathematical Functions


pow, sqrt, cbrt, hypot; exp, exp2, expm1; log, log2, log10, log1p; sin, cos, tan, asin, acos, atan, atan2; sinh, cosh, tanh, asinh, acosh, atanh; ceil, floor, trunc, round; etc.
- When executed on the host, a given function uses the C runtime implementation if available
- These functions are only supported for scalar types, not vector types

Device Runtime Component: Mathematical Functions


Some mathematical functions (e.g. sin(x)) have a less accurate but faster device-only version (e.g. __sinf(x)):

__powf; __logf, __log2f, __log10f; __expf; __sinf, __cosf, __tanf


Host Runtime Component


Provides functions to deal with:
- Device management (including multi-device systems)
- Memory management
- Error handling

Initializes the first time a runtime function is called.

A host thread can invoke device code on only one device; multiple host threads are required to run on multiple devices.

Device Runtime Component: Synchronization Function


void __syncthreads();
- Synchronizes all threads in a block
- Once all threads have reached this point, execution resumes normally
- Used to avoid RAW / WAR / WAW hazards when accessing shared or global memory
- Allowed in conditional constructs only if the conditional is uniform across the entire thread block