David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE498AL, University of Illinois, Urbana-Champaign
Thread program
Threads in the same block can share data and synchronize while doing their share of the work. Threads in different blocks cannot cooperate: each block can execute in any order relative to the other blocks!
gridDim & blockDim: built-in variables that provide the dimensions of the grid and of the block, respectively
N thread blocks (1D organization): blockIdx.x = 0, 1, …, N-1. With blockDim.x = 8 and gridDim.x = N, the grid has 8*N threads.
[Figure: thread blocks 0 through N-1, each containing threads 0-7.]
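A minimal sketch of how each thread derives its unique global index under this 1-D layout (the kernel name and array are illustrative, not from the slides):

```cuda
// Each of the 8*N threads computes one global index:
//   i = blockIdx.x * blockDim.x + threadIdx.x
__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard against a partial last block
        data[i] = 2.0f * data[i];
}

int main()
{
    const int N = 4;            // gridDim.x
    const int n = 8 * N;        // total threads, with blockDim.x = 8
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    scaleKernel<<<N, 8>>>(d_data, n);   // N blocks of 8 threads each
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```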
Another Example
dim3 dimGrid(2,2,1);
dim3 dimBlock(4,2,2);
KernelFunction<<<dimGrid, dimBlock>>>(...);
Block(1,0) has blockIdx.x=1 & blockIdx.y=0
[Figure: the host launches Kernel 1 on the device as Grid 1; each block holds a 4×2×2 arrangement of threads, Thread (0,0,0) through Thread (3,1,0) in the first two rows.]
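A hedged sketch making the launch configuration above self-contained (the kernel body and the output array are illustrative, not from the slides):

```cuda
// Hypothetical completion: each thread records which block it belongs to,
// so the host could inspect the 2x2x1 grid of 4x2x2 blocks (64 threads).
__global__ void KernelFunction(int *out)
{
    int inBlock = threadIdx.z * blockDim.y * blockDim.x
                + threadIdx.y * blockDim.x + threadIdx.x;
    int block   = blockIdx.y * gridDim.x + blockIdx.x;  // flattened block id
    out[block * (blockDim.x * blockDim.y * blockDim.z) + inBlock] = block;
}

int main()
{
    dim3 dimGrid(2, 2, 1);   // 4 blocks; Block(1,0) has blockIdx.x=1, blockIdx.y=0
    dim3 dimBlock(4, 2, 2);  // 16 threads per block
    int *d_out;
    cudaMalloc(&d_out, 64 * sizeof(int));
    KernelFunction<<<dimGrid, dimBlock>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```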
[Figure: matrix Pd (WIDTH × WIDTH) partitioned into TILE_WIDTH × TILE_WIDTH sub-matrices Pdsub. Block indices bx = blockIdx.x and by = blockIdx.y (0, 1, 2, …) select the tile of Pd; thread indices tx = threadIdx.x and ty = threadIdx.y (0 … TILE_WIDTH-1) select the element within the tile, computed from a row of Md and a column of Nd.]
[Figure: worked example with TILE_WIDTH = 2, showing result elements P0,0 … P3,2 and the Nd elements (Nd0,1 … Nd1,3) that each tile reads.]
float Pvalue = 0;
// each thread computes one element of the block sub-matrix
for (int k = 0; k < Width; ++k)
    Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];
Pd[Row*Width+Col] = Pvalue;
}
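In context, the fragment above is the body of the matrix-multiplication kernel. A hedged reconstruction of the surrounding kernel (the signature and the Row/Col index computations are filled in here, following the Md/Nd/Pd naming used by the slides, and are not quoted from them):

```cuda
// Sketch of the full kernel the fragment belongs to.
__global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width)
{
    int Row = blockIdx.y * blockDim.y + threadIdx.y;  // row of Pd
    int Col = blockIdx.x * blockDim.x + threadIdx.x;  // column of Pd

    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];
    Pd[Row * Width + Col] = Pvalue;
}
```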
Transparent Scalability
The CUDA runtime system can execute blocks in any order, so a kernel scales across any number of execution resources.
Transparent scalability: the ability to execute the same application on hardware with different numbers of execution resources.
[Figure: the same kernel grid of Blocks 0-7 runs on a small device two blocks at a time (Blocks 0-1, 2-3, 4-5, 6-7 in sequence) and on a larger device four blocks at a time (Blocks 0-3, then Blocks 4-7); execution with more resources finishes sooner.]
[Figure: two streaming multiprocessors (SM 0, SM 1), each with an MT IU (multithreaded instruction unit), SPs, shared memory, and resident blocks of threads t0, t1, t2, …, tm.]
Upon kernel launch, threads are assigned to streaming multiprocessors (SMs) at block granularity. Up to 8 blocks are assigned to each SM, as resources allow; an SM in GT200 can take up to 1024 threads.
This could be 256 threads/block * 4 blocks, or 128 threads/block * 8 blocks, etc.
Across GT200's 30 SMs, up to 240 (30 × 8) blocks can be resident simultaneously, and up to 30,720 (30 × 1024) concurrent threads can be residing in the SMs for execution.
Each SM maintains thread/block IDs and manages/schedules thread execution.
Streaming Multiprocessor
[Figure: SM internals, an instruction L1 cache and instruction fetch/dispatch unit feeding 8 SPs and 2 SFUs, plus shared memory.]
If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
Each block is divided into 256/32 = 8 warps, so there are 8 × 3 = 24 warps in the SM.
[Figure: warp instructions issued over time.]
For 16×16 blocks, we have 256 threads per block. Since each SM can take up to 1024 threads, it can take up to 4 blocks (1024/256). This reaches the full thread capacity of each SM and the maximal number of warps for scheduling around long-latency operations (1024/32 = 32 warps).
For 32×32 blocks, we have 1024 threads per block. Not even one block can fit into an SM, because of the 512-threads-per-block limitation!
dim3 blockDim;
Dimensions of the block in threads
dim3 blockIdx;
Block index within the grid
dim3 threadIdx;
Thread index within the block