Murphy Stein
New York University
Overview
What is CUDA?
CUDA stands for:
Compute Unified Device Architecture
It is 2 things:
1. Device Architecture Specification
2. A small extension to C
   = New Syntax + Builtin Variables + Restrictions + Libraries
Device Architecture: Streaming Multiprocessor (SM)
1 SM contains 8 scalar cores
Up to 8 cores can run simultaneously
Each core executes the identical instruction set, or sleeps
The SM schedules instructions across its cores with 0 overhead
Up to 32 threads may be scheduled at a time, called a warp, but at most 24 warps can be active in 1 SM
Thread-level memory sharing is supported via Shared Memory
Register memory is local to each thread, and divided amongst all blocks on the SM
[Diagram: one SM, containing Instruction Fetch/Dispatch, Shared Memory (16 KB), Registers (8 KB), and Streaming Cores #1 through #8]
Transparent Scalability
Hardware is free to assign blocks to any processor at any time
A kernel scales across any number of parallel processors
[Diagram: the same kernel grid of Blocks 0-7 run on two devices; over time, a 2-SM device executes four blocks per SM while a 4-SM device executes two blocks per SM]
Each block can execute in any order relative to other blocks.
SM Warp Scheduling
SM hardware implements zero-overhead warp scheduling
Device Architecture
[Diagram: the Host feeds a Block Execution Manager on 1 GPU; device memory comprises Constant Memory (64 KB), Texture Memory, and Global Memory (768 MB - 4 GB)]
C Extension
Consists of: New Syntax + Builtin Variables + Restrictions + Libraries
New Syntax: e.g. the __global__ function qualifier and the <<<grid, block>>> kernel-launch notation
C Extension: Builtin Variables
Builtin Variables:
dim3 gridDim;
  Dimensions of the grid in blocks (gridDim.z unused)
dim3 blockDim;
  Dimensions of the block in threads
uint3 blockIdx;
  Block index within the grid
uint3 threadIdx;
  Thread index within the block
C Extension: Restrictions
New Restrictions:
No recursion in device code
No function pointers in device code
CUDA API
Compiling a CUDA Program
[Diagram: NVCC splits a C/C++ CUDA application into plain C/C++ code, which gcc compiles to CPU instructions, and virtual PTX code, which a PTX-to-target compiler turns into GPU instructions]
Matrix Transpose
M(i,j) → M(j,i)
Matrix Transpose
[Diagram: the 2x2 block matrix (A B / C D) transposes to (A C / B D)]
Matrix Transpose: First idea
#include <stdio.h>
#include <stdlib.h>

// Each thread copies one element to its transposed position.
__global__ void transpose(float *in, float *out, unsigned int width) {
    unsigned int tx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int ty = blockIdx.y * blockDim.y + threadIdx.y;
    out[tx * width + ty] = in[ty * width + tx];
}

int main(int argc, char **argv) {
    const int HEIGHT = 1024;
    const int WIDTH = 1024;
    const int SIZE = WIDTH * HEIGHT * sizeof(float);

    // One 16x16 thread block per 16x16 tile of the matrix
    dim3 bDim(16, 16);
    dim3 gDim(WIDTH / bDim.x, HEIGHT / bDim.y);

    float *M = (float *)malloc(SIZE);
    for (int i = 0; i < HEIGHT * WIDTH; i++) { M[i] = i; }

    // Allocate device input and copy the matrix up
    float *Md = NULL;
    cudaMalloc((void **)&Md, SIZE);
    cudaMemcpy(Md, M, SIZE, cudaMemcpyHostToDevice);

    // Allocate device output
    float *Bd = NULL;
    cudaMalloc((void **)&Bd, SIZE);

    transpose<<<gDim, bDim>>>(Md, Bd, WIDTH);

    // Copy the transposed matrix back
    cudaMemcpy(M, Bd, SIZE, cudaMemcpyDeviceToHost);

    cudaFree(Md);
    cudaFree(Bd);
    free(M);
    return 0;
}
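The listing above never checks return codes, and kernel launches in particular fail silently. A minimal error-checking sketch (assuming the standard CUDA runtime API; the CUDA_CHECK macro name is ours):

```cuda
#include <stdio.h>
#include <stdlib.h>

// Abort with a message if a CUDA runtime call fails.
// (Macro name is ours; the runtime calls are standard.)
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error: %s\n",                   \
                    cudaGetErrorString(err));                     \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Usage around the calls in main:
//   CUDA_CHECK(cudaMalloc((void **)&Md, SIZE));
//   CUDA_CHECK(cudaMemcpy(Md, M, SIZE, cudaMemcpyHostToDevice));
//   transpose<<<gDim, bDim>>>(Md, Bd, WIDTH);
//   CUDA_CHECK(cudaGetLastError());       // catches launch-time errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during execution
```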
Further Reading
Online Course:
UIUC NVIDIA Programming Course by David Kirk and Wen-Mei W. Hwu
http://courses.ece.illinois.edu/ece498/al/Syllabus.html
CUDA@MIT '09
http://sites.google.com/site/cudaiap2009/materials1/lectures
Great Memory Latency Study:
LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs by Volkov & Demmel
Book of advanced examples:
GPU Gems 3, edited by Hubert Nguyen
CUDA SDK
Tons of source code examples available for download from NVIDIA's website