
CUDA Threads

David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE498AL, University of Illinois, Urbana-Champaign

CUDA Thread Block


All threads in a block execute the same kernel program (SPMD). Programmer declares block:
- Block size: 1 to 512 concurrent threads
- Block shape: 1D, 2D, or 3D
- Block dimensions in threads

Thread program uses thread id to select work and address shared data

CUDA Thread Block


Threads have thread id numbers (0, 1, 2, 3, …, m) within their block

[Figure: a thread block with Thread IDs 0 … m, each thread running the same thread program]
- Threads in the same block share data and synchronize while doing their share of the work
- Threads in different blocks cannot cooperate
- Each block can execute in any order relative to other blocks!

Courtesy: John Nickolls, NVIDIA



CUDA Thread Organization Implementation


blockIdx & threadIdx: unique coordinates that distinguish threads and identify for each thread the appropriate portion of the data to process:
- Assigned to threads by the CUDA runtime system
- Appear as built-in variables that are initialized by the runtime system and accessed within kernel functions
- References to the blockIdx and threadIdx variables return the appropriate values that form the coordinates of the thread

gridDim & blockDim: built-in variables which provide the dimensions of the grid and block

Example Thread Organization


M = 8 threads/block (1D organization): threadIdx.x = 0, 1, 2, …, 7

N thread blocks (1D organization): blockIdx.x = 0, 1, …, N-1; 8*N threads/grid: blockDim.x = 8; gridDim.x = N

In the code of the figure:

threadID = blockIdx.x*blockDim.x + threadIdx.x

For Thread 3 of Block 5: threadID = 5*8 + 3 = 43
[Figure: Thread Block 0, Thread Block 1, …, Thread Block N-1, each with threadID 0 1 2 3 4 5 6 7, and each running the same code:]

float x = input[threadID];
float y = func(x);
output[threadID] = y;

General Thread Organization


Grid: 2D array of blocks; Block: 3D array of threads. The Execution Configuration determines the exact organization. In the example of the previous slide, if N=128 & M=32, the following Execution Configuration is used:

dim3 dimGrid(128, 1, 1);  // dimGrid.x, dimGrid.y take values 1 to 65,535
dim3 dimBlock(32, 1, 1);  // size of block limited to 512 threads
KernelFunction<<<dimGrid, dimBlock>>>(...);

blockIdx.x ranges between 0 and dimGrid.x-1


Another Example

dim3 dimGrid(2, 2, 1);
dim3 dimBlock(4, 2, 2);
KernelFunction<<<dimGrid, dimBlock>>>(...);

Block(1,0) has blockIdx.x=1 & blockIdx.y=0

[Figure: the Host launches Kernel 1 on Grid 1, a 2x2 arrangement of Blocks (0,0), (0,1), (1,0), (1,1), and then Kernel 2 on Grid 2. Block (1,1) of Grid 2 is expanded to show its 4x2x2 threads, e.g. Thread (0,0,0) … Thread (3,1,0) and Thread (0,0,1) … Thread (3,0,1).]
Thread(2,1,0) has threadIdx.x=2, threadIdx.y=1, threadIdx.z=0



Matrix Multiplication Using Multiple Blocks


Break up Pd into tiles. Each thread block calculates one tile; block size equals tile size. Each thread calculates one Pd element, identified using:
- blockIdx.x & blockIdx.y to identify the tile
- threadIdx.x, threadIdx.y to identify the thread within the tile
[Figure: Md, Nd, and Pd, each WIDTH x WIDTH. Pd is divided into TILE_WIDTH x TILE_WIDTH tiles; bx = blockIdx.x and by = blockIdx.y select the tile Pdsub, while tx = threadIdx.x (0 … TILE_WIDTH-1) and ty = threadIdx.y select the thread within the tile.]
Revised Matrix Multiplication using Multiple Blocks (cont.)


- The x index of the Pd element computed by a thread is x = bx*TILE_WIDTH + tx
- The y index of the Pd element computed by a thread is y = by*TILE_WIDTH + ty
- Thread (tx,ty) in block (bx,by) uses row y of Md and column x of Nd to update element Pd[y*Width+x]
A Small Example: P(4,4)


- For Block(0,0): blockIdx.x = 0 & blockIdx.y = 0
- For Block(1,0): blockIdx.x = 1 & blockIdx.y = 0

TILE_WIDTH = 2

P0,0 P1,0 P2,0 P3,0
P0,1 P1,1 P2,1 P3,1
P0,2 P1,2 P2,2 P3,2
P0,3 P1,3 P2,3 P3,3


- For Block(0,1): blockIdx.x = 0 & blockIdx.y = 1
- For Block(1,1): blockIdx.x = 1 & blockIdx.y = 1

A Small Example: Multiplication


- Thread(0,0) of Block(0,0) calculates Pd0,0
- Thread(0,0) of Block(0,1) calculates Pd0,2
- Thread(1,1) of Block(0,0) calculates Pd1,1
- Thread(1,1) of Block(0,1) calculates Pd1,3
Md: Md0,0 Md1,0 Md2,0 Md3,0
    Md0,1 Md1,1 Md2,1 Md3,1

Nd: Nd0,0 Nd1,0
    Nd0,1 Nd1,1
    Nd0,2 Nd1,2
    Nd0,3 Nd1,3

Pd: Pd0,0 Pd1,0 Pd2,0 Pd3,0
    Pd0,1 Pd1,1 Pd2,1 Pd3,1
    Pd0,2 Pd1,2 Pd2,2 Pd3,2
    Pd0,3 Pd1,3 Pd2,3 Pd3,3


Revised Host Code for Launching the Revised Kernel


// Setup the execution configuration
dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH, 1);
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);

// Launch the device computation threads
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

With TILE_WIDTH = 16, the kernel can handle arrays of dimensions up to 1,048,560 x 1,048,560 (16 x 65,535 = 1,048,560), using 65,535 x 65,535 = 4,294,836,225 blocks, each with 256 threads. Total number of parallel threads = 1,099,478,073,600 > 1 Tera (10^12) threads.

Revised Matrix Multiplication Kernel using Multiple Blocks


__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Calculate the row index of the Pd element and Md
    int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
    // Calculate the column index of Pd and Nd
    int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

    float Pvalue = 0;
    // Each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];
    Pd[Row*Width+Col] = Pvalue;
}

Transparent Scalability
- The CUDA runtime system can execute blocks in any order
- A kernel scales across any number of execution resources

Transparent scalability: the ability to execute the same application on hardware with a different number of execution resources.

[Figure: a kernel grid of Blocks 0-7. A device with few resources executes two blocks at a time (Blocks 0-1, then 2-3, …), while a device with more resources executes four at a time (Blocks 0-3, then 4-7), finishing in less time.]

Each block can execute in any order relative to other blocks.



Thread Block Assignment to Streaming Multiprocessors

[Figure: two Streaming Multiprocessors (SM 0 and SM 1), each with an MT IU, SPs, and Shared Memory; queues of Blocks, each holding threads t0 t1 t2 … tm, are assigned to the SMs]

Upon kernel launch, threads are assigned to Streaming Multiprocessors (SMs) at block granularity:
- Up to 8 blocks are assigned to each SM, as resources allow
- An SM in GT200 can take up to 1024 threads: could be 256 threads/block * 4 blocks, or 128 threads/block * 8 blocks, etc.

Thread Block Assignment to Streaming Multiprocessors (cont.)

GT200 has 30 SMs:
- Up to 240 (30*8) blocks execute simultaneously
- Up to 30,720 (30*1024) concurrent threads can be resident in SMs for execution
- The SM maintains thread/block id numbers
- The SM manages/schedules thread execution


GT200 Example: Thread Scheduling


Each block is divided into 32-thread warps for scheduling purposes:
- The size of a warp is implementation dependent and not part of the CUDA programming model
- The warp is the unit of thread scheduling in an SM

[Figure: warps of Block 1 and Block 2 (each warp holding threads t0 t1 t2 … t31) feeding a Streaming Multiprocessor with an Instruction L1 cache, Instruction Fetch/Dispatch, Shared Memory, SPs, and SFUs]
If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
- Each block is divided into 256/32 = 8 warps
- There are 8 * 3 = 24 warps in the SM


GT200 Example: Thread Scheduling (Cont.)


The SM implements zero-overhead warp scheduling:
- At any time, only one of the warps is executed by the SM
- Warps whose next instruction has its operands ready for consumption are eligible for execution
- Eligible warps are selected for execution by a prioritized scheduling policy
- All threads in a warp execute the same instruction when it is selected (SIMD execution)
[Figure: scheduling timeline. TB1 W1 issues instructions until it stalls, then the scheduler switches to TB2 W1, then TB3 W1 and TB3 W2 as earlier warps stall, eventually returning to TB1 W1, TB1 W2, TB1 W3, and TB3 W2. TB = Thread Block, W = Warp.]


GT200 Block Granularity Considerations


For matrix multiplication using multiple blocks, should I use 8x8, 16x16, or 32x32 blocks?
- For 8x8 blocks, we have 64 threads per block. Since each SM can take up to 1024 threads, that would be 16 blocks (1024/64). However, each SM can only take up to 8 blocks, so only 512 (64*8) threads will go into each SM! SM execution resources are underutilized; there are fewer warps to schedule around long-latency operations.
- For 16x16 blocks, we have 256 threads per block. Since each SM can take up to 1024 threads, it can take up to 4 blocks (1024/256): full thread capacity in each SM, and the maximal number of warps for scheduling around long-latency operations (1024/32 = 32 warps).
- For 32x32 blocks, we have 1024 threads per block. Not even one fits into an SM! (512 threads/block limitation)

Some Additional API Features


Application Programming Interface


The API is an extension to the C programming language. It consists of:
- Language extensions, to target portions of the code for execution on the device
- A runtime library split into:
  1. A host component to control and access one or more devices from the host
  2. A device component providing device-specific functions
  3. A common component providing built-in vector types and a subset of the C runtime library in both host and device code

Language Extensions: Built-in Variables


dim3 gridDim;
Dimensions of the grid in blocks (gridDim.z unused)

dim3 blockDim;
Dimensions of the block in threads

dim3 blockIdx;
Block index within the grid

dim3 threadIdx;
Thread index within the block

Common Runtime Component: Mathematical Functions


pow, sqrt, cbrt, hypot; exp, exp2, expm1; log, log2, log10, log1p; sin, cos, tan, asin, acos, atan, atan2; sinh, cosh, tanh, asinh, acosh, atanh; ceil, floor, trunc, round; etc.
- When executed on the host, a given function uses the C runtime implementation if available
- These functions are only supported for scalar types, not vector types

Device Runtime Component: Mathematical Functions


Some mathematical functions (e.g. sin(x)) have a less accurate but faster device-only version (e.g. __sinf(x)):

__powf; __logf, __log2f, __log10f; __expf; __sinf, __cosf, __tanf


Host Runtime Component


Provides functions to deal with:
- Device management (including multi-device systems)
- Memory management
- Error handling

Initializes the first time a runtime function is called.

A host thread can invoke device code on only one device; multiple host threads are required to run on multiple devices.

Device Runtime Component: Synchronization Function


void __syncthreads();
- Synchronizes all threads in a block
- Once all threads have reached this point, execution resumes normally
- Used to avoid RAW / WAR / WAW hazards when accessing shared or global memory
- Allowed in conditional constructs only if the conditional is uniform across the entire thread block