Parallel Programming Models For Real-Time Graphics: Aaron Lefohn

Parallel Programming Models for Real-Time Graphics
Aaron Lefohn
Intel Corporation
Beyond Programmable Shading Course, ACM SIGGRAPH 2011
Hardware Resources
Core
Execution context
SIMD functional units
On-chip memory
CPU-GPU System-on-a-Chip
Abstraction
Abstraction enables portability and system optimization
E.g., dynamic load balancing, SIMD utilization, producer-consumer
Lack of abstraction enables architecture-specific programmer optimization
E.g., multiple execution contexts jointly building an on-chip data structure
When a parallel programming model abstracts a HW resource, code written in that programming model scales across architectures with varying amounts of that resource
Execution Definitions
Execution context
The state required to execute an instruction stream: instruction pointer, registers, etc.
(aka thread)
Work
A logically related set of instructions executed in a single execution context
(aka shader, instance of a kernel, task)
Concurrent execution
Multiple units of work that may execute simultaneously
(because they are logically independent)
Parallel execution
Multiple units of work whose execution contexts are guaranteed to be live simultaneously
(because you want them to be, for locality, synchronization, etc.)
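The difference between the two definitions can be sketched in C++. This is a minimal illustration, not part of the original slides: the function names `sum_of_squares_concurrent` and `exchange_parallel` are hypothetical, and `exchange_parallel` assumes non-negative inputs because it uses -1 as a "not yet published" marker. The concurrent units stay correct even when run one at a time, while the parallel units wait on each other and so need both execution contexts live at once.

```cpp
#include <atomic>
#include <future>
#include <thread>
#include <utility>
#include <vector>

// Concurrent execution: logically independent units of work.
// std::launch::deferred runs each unit lazily on the calling thread,
// showing that a fully serial schedule is still a correct one.
int sum_of_squares_concurrent(int n) {
    std::vector<std::future<int>> units;
    for (int i = 0; i < n; ++i)
        units.push_back(std::async(std::launch::deferred, [i] { return i * i; }));
    int total = 0;
    for (auto& u : units) total += u.get();  // each unit executes here, one at a time
    return total;
}

// Parallel execution: each unit publishes a value, then waits for the other's.
// Running the two units back to back on one context would never finish.
std::pair<int, int> exchange_parallel(int a, int b) {
    std::atomic<int> slot_a{-1}, slot_b{-1};  // -1 means "not published yet"
    int got_a = 0, got_b = 0;
    std::thread ta([&] {
        slot_a.store(a);
        while (slot_b.load() < 0) std::this_thread::yield();
        got_a = slot_b.load();
    });
    std::thread tb([&] {
        slot_b.store(b);
        while (slot_a.load() < 0) std::this_thread::yield();
        got_b = slot_a.load();
    });
    ta.join();
    tb.join();
    return {got_a, got_b};
}
```

Calling `exchange_parallel(3, 7)` hands each thread the other's value; the busy-wait is only there to make the liveness requirement visible.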
Synchronization
Synchronization between execution contexts
Enables inter-context communication
Restricts when work is permitted to execute
Granularity of permitted synchronization determines at which granularity the system allows the programmer to control scheduling
Vertex Shaders: Pure Data Parallelism

Execution
Concurrent execution of identical per-vertex work
What is abstracted?
Cores, execution contexts, SIMD functional units, memory hierarchy
What synchronization is allowed?

Between draw calls
Pure Data-parallel Pseudocode
concurrent_for( i = 1 to numVertices) {
    // Execute vertex shader
}
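One way to read the pseudocode above in C++ is sketched below. Everything here is an assumption made for illustration: `concurrent_for`, `Vertex`, and `shade_vertices` are hypothetical names, the "vertex shader" is a plain uniform scale, and the loop is 0-based rather than 1-based. The body gets no ordering guarantee and no way to synchronize between iterations; the only sync point is the join at the end, which plays the role of the draw-call boundary.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct Vertex { float x, y, z; };

// Hypothetical helper: runs body(i) for every i in [0, n) in no particular order.
// The caller never sees how many threads are used.
void concurrent_for(std::size_t n, const std::function<void(std::size_t)>& body) {
    std::size_t workers = std::max<std::size_t>(1, std::thread::hardware_concurrency());
    workers = std::min(workers, std::max<std::size_t>(n, 1));
    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([=, &body] {
            for (std::size_t i = w; i < n; i += workers) body(i);  // strided share of the range
        });
    }
    for (auto& t : pool) t.join();  // the only synchronization point
}

// Stand-in vertex shader: identical work applied to every vertex independently.
std::vector<Vertex> shade_vertices(const std::vector<Vertex>& in, float scale) {
    std::vector<Vertex> out(in.size());
    concurrent_for(in.size(), [&](std::size_t i) {
        out[i] = {in[i].x * scale, in[i].y * scale, in[i].z * scale};
    });
    return out;
}
```

Because each iteration writes only its own output element, the same code is correct on one core or many, which is the scaling property the slide attributes to abstraction.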
Conventional Thread Parallelism

Execution
Parallel execution of N different units of work on N execution contexts
Parallel execution of M identical units of work on an M-wide SIMD functional unit
What is abstracted?
Nothing (ignoring preemption)
Where is synchronization allowed?

Between any execution contexts, at various granularities
Conventional Thread Parallelism
CPU
Launch a pthread per hardware execution context
GPU
Persistent threads
Launch a work-group per hardware execution context, sized to the HW SIMD width
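The CPU case above can be sketched with `std::thread` instead of raw pthreads. This is an illustrative assumption, not code from the talk: `run_persistent_threads` is a hypothetical name, and doubling an index stands in for real work. One long-lived thread is launched per hardware execution context reported by `std::thread::hardware_concurrency()`, and each thread keeps pulling items from a shared counter until none remain, which is also the basic shape of the persistent-threads pattern on GPUs.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Nothing is abstracted: the program decides how many threads exist
// and how work is distributed among them.
std::vector<int> run_persistent_threads(std::size_t num_items) {
    std::vector<int> result(num_items, 0);
    std::atomic<std::size_t> next{0};  // shared work counter
    // hardware_concurrency() may return 0 when unknown, so fall back to 1.
    unsigned contexts = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> threads;
    for (unsigned c = 0; c < contexts; ++c) {
        threads.emplace_back([&] {
            // Each thread stays alive and grabs the next unclaimed item.
            for (std::size_t i = next.fetch_add(1); i < num_items; i = next.fetch_add(1))
                result[i] = static_cast<int>(i) * 2;  // stand-in for real work
        });
    }
    for (auto& t : threads) t.join();
    return result;
}
```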
D3D/OpenGL Rendering Pipeline

Execution

Concurrent execution of identical work within each shading stage
Concurrent execution of different shading stages
Each stage spawns work to the next stage
No parallelism exposed to the user
What is abstracted?
Cores, execution contexts, SIMD functional units, memory hierarchy, and fixed-function graphics units (tessellator, rasterizer, ROPs, etc.)
Where is synchronization allowed?
Between draw calls
Abstracting SIMD ALUs
Explicit SIMD Programming

float16 a = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
float16 b = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
float16 c = a + b;
Mechanisms
Intrinsics
Assembly
Wide vector types
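The wide-vector-type mechanism can be imitated in portable C++ as sketched below. The `float16` struct here is a hypothetical stand-in for the slide's type, not a real built-in, and `iota16` is a helper invented for the example. A compiler may turn the element-wise loop into SIMD instructions, but nothing guarantees it; real explicit SIMD code would use intrinsics or a compiler vector extension instead.

```cpp
#include <array>
#include <cstddef>

// Hypothetical 16-wide vector type: the SIMD width is baked into the program.
struct float16 {
    std::array<float, 16> lane;
};

// Element-wise add over all 16 lanes.
float16 operator+(const float16& a, const float16& b) {
    float16 c{};
    for (std::size_t i = 0; i < 16; ++i) c.lane[i] = a.lane[i] + b.lane[i];
    return c;
}

// Builds {0, 1, 2, ..., 15}, matching the initializers on the slide.
float16 iota16() {
    float16 v{};
    for (std::size_t i = 0; i < 16; ++i) v.lane[i] = static_cast<float>(i);
    return v;
}
```

With `a` and `b` both set to `iota16()`, `a + b` holds the even numbers 0 through 30. Since the width 16 appears in the type itself, the code does not adapt to hardware with a different SIMD width.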

SPMD/Implicit SIMD Programming

parallel_for( i = 1 to SIMD_width) {
    // Per-lane code goes here
}
concurrent_for( i = 1 to someBigNumber) {
    // Per-lane code goes here
}
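A rough C++ picture of the SPMD style follows. It is a sketch under assumptions: `kSimdWidth`, `lane_program`, and `run_spmd` are hypothetical names, the gang size of 8 is arbitrary, and plain loops stand in for what an SPMD compiler would map onto SIMD lanes. The programmer writes only the scalar per-lane function; grouping lanes into gangs and masking off the unused tail lanes is left to the system.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kSimdWidth = 8;  // assumed gang size; the per-lane code never refers to it

// Per-lane program: ordinary scalar code for one element, including a branch.
inline float lane_program(float x) {
    return x < 0.0f ? 0.0f : x * x;
}

// What an SPMD system does on the programmer's behalf (emulated with loops).
std::vector<float> run_spmd(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    // Outer loop: independent gangs (the concurrent_for over a big range).
    for (std::size_t base = 0; base < in.size(); base += kSimdWidth) {
        // Inner loop: the lanes of one gang (the parallel_for over the SIMD width).
        for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {
            std::size_t i = base + lane;
            if (i < in.size()) out[i] = lane_program(in[i]);  // mask off tail lanes
        }
    }
    return out;
}
```

Changing `kSimdWidth` does not require touching `lane_program`, which is the sense in which SPMD abstracts the SIMD ALUs.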
SPMD/Implicit SIMD Programming

GPU
Current GPU programming models are always SPMD
CPU
Intel SPMD Program Compiler (ISPC)
SPMD combined with other abstractions
OpenCL (some implementations)
Intel Array Building Blocks

Abstracting Cores and Execution Contexts
Task Systems (Cilk, TBB, ConcRT, GCD, ...)
Execution
Concurrent execution of many (likely different) units of work
Work runs in a single execution context
What is abstracted?
Cores and execution contexts
Not abstracted: SIMD functional units or memory hierarchy
Where is synchronization allowed?
Between tasks
Task Pseudo Code

void myTask(some arguments) {
}
void main() {
    for( i = 0 to NumTasks - 1 ) {
        spawn myTask();
    }
    sync;
    // More work
}
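The spawn/sync pattern above maps loosely onto `std::async` and futures in standard C++. This is only an approximation of what Cilk or TBB provide, and the names `my_task` and `spawn_and_sync` are hypothetical. With the default launch policy the runtime decides whether a task gets its own thread or runs lazily at the point of `get()`, so the program never states how many cores or execution contexts it uses.

```cpp
#include <future>
#include <vector>

// Stand-in unit of work; runs in a single execution context.
int my_task(int i) {
    return i * i;
}

int spawn_and_sync(int num_tasks) {
    std::vector<std::future<int>> pending;
    for (int i = 0; i < num_tasks; ++i)
        pending.push_back(std::async(my_task, i));  // "spawn": placement is up to the runtime
    int total = 0;
    for (auto& f : pending) total += f.get();       // "sync": wait for every spawned task
    return total;                                    // more work could follow here
}
```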
Nested Task Pseudo Code

void barTask(some parameters) {
}
void fooTask(some parameters) {
    if (someCondition) {
        spawn barTask();
    } else {
        spawn fooTask();
    }
}
void main() {
    concurrent_for( i = 0 to NumTasks - 1 ) {
        fooTask();
    }
    sync;
    // More code
}
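Nested spawning, where a task creates further tasks, can be sketched with a recursive sum. The function `nested_sum`, the leaf size of 1024, and the depth cap of 4 are all assumptions chosen for the example. Plain `std::async` has no work-stealing scheduler, so the cap keeps the number of spawned tasks small, whereas a real task system would handle much deeper nesting.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Sums data[lo, hi). Each call may spawn a child task for the left half,
// and that child may spawn children of its own.
long long nested_sum(const std::vector<int>& data, std::size_t lo, std::size_t hi,
                     int depth = 0) {
    if (hi - lo <= 1024 || depth >= 4)  // small range or deep enough: do the work directly
        return std::accumulate(data.begin() + lo, data.begin() + hi, 0LL);
    std::size_t mid = lo + (hi - lo) / 2;
    // "spawn" the left half as a nested task
    auto left = std::async([&data, lo, mid, depth] {
        return nested_sum(data, lo, mid, depth + 1);
    });
    long long right = nested_sum(data, mid, hi, depth + 1);  // continue in this task
    return left.get() + right;                               // "sync" with the child
}
```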
GPU Compute Pseudo Code

void myWorkGroup() {
    parallel_for(i = 0 to NumWorkItems - 1) {
        // GPU Kernel Code (This is where you write GPU compute code)
    }
}
void main() {
    concurrent_for( i = 0 to NumWorkGroups - 1) {
        myWorkGroup();
    }
    sync;
}
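The two-level structure above can be emulated on a CPU, as in the following sketch. It makes several assumptions: `Barrier`, `work_group_sum`, and `run_kernel` are hypothetical names, one `std::thread` stands in for each work-item, a `std::vector` stands in for on-chip shared memory, and the group size must be a power of two that divides the input length. Work-items inside a group synchronize at a barrier and so must all be live together, while the groups themselves are independent and are run here one after another to show that any schedule is valid for them.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Minimal reusable barrier (C++20 has std::barrier; this avoids depending on it).
class Barrier {
public:
    explicit Barrier(std::size_t count) : count_(count), waiting_(0), generation_(0) {}
    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        std::size_t gen = generation_;
        if (++waiting_ == count_) {  // last arrival releases everyone
            waiting_ = 0;
            ++generation_;
            cv_.notify_all();
        } else {
            cv_.wait(lock, [&] { return gen != generation_; });
        }
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::size_t count_, waiting_, generation_;
};

// Lower level: one work-group. Work-items run in parallel and share local memory.
int work_group_sum(const int* in, std::size_t num_items) {
    std::vector<int> local(num_items);  // stand-in for on-chip shared memory
    Barrier barrier(num_items);
    std::vector<std::thread> items;
    for (std::size_t i = 0; i < num_items; ++i) {
        items.emplace_back([&, i] {
            local[i] = in[i];
            barrier.wait();  // all loads done before anyone reads a neighbor
            for (std::size_t stride = num_items / 2; stride > 0; stride /= 2) {
                if (i < stride) local[i] += local[i + stride];  // tree reduction step
                barrier.wait();
            }
        });
    }
    for (auto& t : items) t.join();
    return local[0];
}

// Upper level: work-groups are only concurrent, so a serial loop is a legal schedule.
std::vector<int> run_kernel(const std::vector<int>& data, std::size_t group_size) {
    std::vector<int> partial(data.size() / group_size);
    for (std::size_t g = 0; g < partial.size(); ++g)
        partial[g] = work_group_sum(data.data() + g * group_size, group_size);
    return partial;  // returning plays the role of the final sync
}
```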
GPU Compute Languages

Execution
Lower level is parallel execution of identical work (work-items) within a work-group
Upper level is concurrent execution of identical work-groups
What is abstracted?
Work-group abstracts a core's execution contexts, SIMD functional units, memory
Where is synchronization allowed?
Between work-items in a work-group
Between passes (set of work-groups)

Summary of Concepts
Abstraction
When a parallel programming model abstracts a HW resource, code written in that programming model scales across architectures with varying amounts of that resource
Concurrency versus parallelism
Concurrency provides scalability and portability
Parallel execution permits explicit communication and capturing locality
Synchronization
Where is the user allowed to control scheduling?
Conclusions
Current real-time rendering programming uses a mix of data-, task-, and pipeline-parallel programming (and conventional threads as a means to an end)
Future SOC (CPU + GPU) programming model directions
Tasks are an effective way to abstract execution contexts and cores
SPMD is an effective way to abstract over SIMD ALUs
Many open questions
Look for uses of these different models throughout the rest of the course
Acknowledgements

Tim Foley and Matt Pharr at Intel
Mike Houston at AMD
Kayvon Fatahalian at CMU
The Advanced Rendering Technology research team, Pete Baker, Aaron Coday, and Elliot Garbus at Intel
References

GPU-inspired compute languages
DX11 DirectCompute, OpenCL (CPU+GPU+), CUDA
The Fusion APU Architecture: A Programmers Perspective (Ben Gaster) http://developer.amd.com/afds/assets/presentations/2901_final.pdf
Task systems (CPU and CPU+GPU+)
Cilk, Threading Building Blocks (TBB), Grand Central Dispatch (GCD), ConcRT, Task Parallel Library
Conventional CPU thread programming
Pthreads
GPU task systems and persistent threads (i.e., conventional thread programming on GPU)
Aila et al., Understanding the Efficiency of Ray Traversal on GPUs, High Performance Graphics 2009
Tzeng et al., Task Management for Irregular-Parallel Workloads on the GPU, High Performance Graphics 2010
Parker et al., OptiX: A General Purpose Ray Tracing Engine, SIGGRAPH 2010
Additional input (concepts, terminology, patterns, etc.)
Foley, Parallel Programming for Graphics, Beyond Programmable Shading SIGGRAPH 2009
Beyond Programmable Shading CS448s Stanford course
Fatahalian, Running Code at a Teraflop: How a GPU Shader Core Works, Beyond Programmable Shading SIGGRAPH 2009-2010
Keutzer et al., A Design Pattern Language for Engineering (Parallel) Software: Merging the PLPP and OPL projects, ParaPLoP 2010
