Thread Hierarchy
Kernels execute over a grid of thread blocks; the launch parameters (see Execution Configuration) are:
Dg.x*Dg.y = number of blocks; grids are 1D or 2D, so Dg.z = 1.
Db.x*Db.y*Db.z = number of threads per block.
Ns = dynamically allocated shared memory per block; optional, default = 0.
S  = associated stream; optional, default = 0.

Device Memory

Error Handling
cudaError_t cudaGetLastError( void )
const char * cudaGetErrorString( cudaError_t error )

Linear Memory
cudaMalloc( void ** devptr, size_t size )
cudaFree( void * dptr )
cudaMemcpy( void *dst, const void *src, size_t size, enum cudaMemcpyKind kind )
  kind = cudaMemcpyHostToHost or cudaMemcpyHostToDevice or
         cudaMemcpyDeviceToHost or cudaMemcpyDeviceToDevice

CUDA Arrays
See the Programming Guide for a description of CUDA arrays and texture references.

Pagelocked Host Memory
cudaMallocHost( void ** ptr, size_t size )
cudaFreeHost( void * ptr )

CUDA Compilation
nvcc flags file.cu
A few common flags:
-o              output file name
-g              host debugging information
-G              device debugging
-deviceemu      emulate on host
-use_fast_math  use fast math library
-arch           compile for specific GPU architecture
-X              pass option to host compiler
#pragma unroll n  unroll loop n times.
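The Linear Memory and Error Handling entries above combine into the standard allocate / copy / copy-back pattern. A minimal host-side sketch, assuming nothing beyond the calls listed on this card (array size and contents are arbitrary):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main( void )
{
    const size_t n = 256;
    float host[n], result[n];
    for ( size_t i = 0; i < n; ++i ) host[i] = (float)i;

    float *dev = 0;
    cudaMalloc( (void **)&dev, n * sizeof(float) );   // device allocation
    cudaMemcpy( dev, host, n * sizeof(float),
                cudaMemcpyHostToDevice );             // host -> device
    cudaMemcpy( result, dev, n * sizeof(float),
                cudaMemcpyDeviceToHost );             // device -> host
    cudaFree( dev );                                  // release device memory

    cudaError_t err = cudaGetLastError();             // Error Handling
    if ( err != cudaSuccess )
        printf( "CUDA error: %s\n", cudaGetErrorString( err ) );
    return 0;
}
```

Compile with, e.g., `nvcc -o copy copy.cu` using the flags listed under CUDA Compilation.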
Language Extensions

Function Qualifiers
__global__    called from host, executes on device.
__device__    called from device, executes on device.
__host__      called from host, executes on host (default).
__noinline__  do not inline, if possible.
__host__ and __device__ may be combined to generate code for both host and device.

Variable Qualifiers
__device__    variable on device
__constant__  variable in constant memory
__shared__    variable in shared memory

Builtin Variables
dim3 gridDim    size of grid (1D, 2D).
dim3 blockDim   size of block (1D, 2D, 3D).
dim3 blockIdx   location in grid.
dim3 threadIdx  location in block.
int warpSize    threads in warp.

Atomic Operations
atomicAdd(), atomicSub(), atomicExch(), atomicMin(), atomicMax(),
atomicInc(), atomicDec(), atomicCAS(), atomicAnd(), atomicOr(), atomicXor().

Warp Voting Functions
int __all( int predicate )
int __any( int predicate )

Memory Fence Functions
__threadfence(), __threadfence_block()

Synchronisation Function
__syncthreads()
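Most of the qualifiers, builtin variables, and intrinsics above meet in a typical reduction kernel. A sketch, with the kernel name and reduction scheme assumed for illustration (not from the card):

```cuda
// Sum an input array, one partial sum per block. Assumes blockDim.x == 256
// and a power-of-two block size.
__global__ void blockSum( const int *in, int *total )
{
    __shared__ int partial[256];                    // __shared__: per-block memory

    int i = blockIdx.x * blockDim.x + threadIdx.x;  // builtin variables
    partial[threadIdx.x] = in[i];
    __syncthreads();                                // all threads reach this point

    // Tree reduction within the block.
    for ( int s = blockDim.x / 2; s > 0; s /= 2 ) {
        if ( threadIdx.x < s )
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    if ( threadIdx.x == 0 )
        atomicAdd( total, partial[0] );             // one atomic update per block
}
```

The `__syncthreads()` inside the loop is required: every thread must see its neighbours' partial sums before the next halving step reads them.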
Vector Types
[u]char1, [u]char2, [u]char3, [u]char4
[u]short1, [u]short2, [u]short3, [u]short4
[u]int1, [u]int2, [u]int3, [u]int4
[u]long1, [u]long2, [u]long3, [u]long4
longlong1, longlong2
float1, float2, float3, float4
double1, double2

Fast Mathematical Functions
__fdividef(x,y), __sinf(x), __cosf(x), __tanf(x),
__sincosf(x,sinptr,cosptr), __logf(x), __log2f(x),
__log10f(x), __expf(x), __exp10f(x), __powf(x,y)

Texture Functions
tex1Dfetch(), tex1D(), tex2D(), tex3D()
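The vector types and fast math intrinsics are used in device code; a small illustrative helper (the function name and rotation example are assumptions, not card material):

```cuda
// Rotate a 2D point by angle theta using the fast sin/cos intrinsic.
// float2 is one of the builtin vector types; make_float2 constructs one.
__device__ float2 rotate2d( float2 p, float theta )
{
    float s, c;
    __sincosf( theta, &s, &c );          // fast, reduced-precision sin+cos
    return make_float2( c * p.x - s * p.y,
                        s * p.x + c * p.y );
}
```

The `__`-prefixed intrinsics trade accuracy for speed; `-use_fast_math` (listed under CUDA Compilation) maps the standard math calls onto them globally.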
Execution Configuration
kernel <<< dim3 Dg, dim3 Db, size_t Ns, cudaStream_t S >>> ( arguments )

Timing
clock_t clock( void )
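Ns and S are optional (default 0), so the common launch form passes only Dg and Db. A sketch, with the kernel name and sizes chosen for illustration:

```cuda
__global__ void scale( float *data, float factor )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= factor;
}

void launch( float *devPtr, cudaStream_t stream )
{
    dim3 Dg( 16 );   // 16 blocks   (Dg.x*Dg.y blocks; Dg.z = 1)
    dim3 Db( 256 );  // 256 threads (Db.x*Db.y*Db.z threads per block)

    scale<<< Dg, Db >>>( devPtr, 2.0f );             // Ns, S take defaults
    scale<<< Dg, Db, 0, stream >>>( devPtr, 2.0f );  // full form with a stream
}
```

This launch covers 16 * 256 = 4096 elements; devPtr is assumed to point at device memory of at least that size.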