Zhen Lin
North Carolina State University
Based on GPGPU-Sim Tutorial and Manual by UBC
Outline
GPGPU-Sim Overview
Demo1: Setup & Configuration
GPGPU-Sim Internals
Demo2: Scheduling Study
GPGPU-Sim in a Nutshell
Microarchitecture timing model of contemporary GPUs
Runs unmodified CUDA/OpenCL applications
Functional model
PTX
A low-level, data-parallel virtual machine and instruction set architecture (ISA)
Between CUDA and hardware ISA (SASS)
Stable ISA that spans multiple GPU generations
SASS/PTXPLUS
Hardware native ISA
PTX -> Translate + Optimize -> SASS
More accurate, but not well supported
Scalar ISA
SSA representation: register allocation not done in PTX
Compilation Path
Demo1
Setup
Stats
Configuration
GPGPU-Sim Internals
Fetch
Decode
Issue
Read operand
Execution
Writeback
Fetch + Decode
Arbitrate the I-cache among warps
Cache miss handled by fetching again later
Fetched instruction is decoded and then stored in the I-Buffer
1 or more entries / warp
Only warps with vacant entries are considered in fetch
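The fetch rules above can be sketched in a few lines. This is a minimal illustration, not GPGPU-Sim code: the `Warp` class, the `fetch_cycle` function, the set-based I-cache, and the round-robin arbitration order are all assumptions made for the example.

```python
from collections import deque


class Warp:
    """Hypothetical per-warp state: a PC and a small instruction buffer."""

    def __init__(self, wid, ibuffer_size=2):
        self.wid = wid
        self.pc = 0
        self.ibuffer = deque()
        self.ibuffer_size = ibuffer_size

    def has_vacant_entry(self):
        return len(self.ibuffer) < self.ibuffer_size


def fetch_cycle(warps, icache, last_fetched):
    """Run one fetch cycle and return the id of the warp that used the I-cache.

    Warps are scanned round-robin (an assumed policy) starting after
    `last_fetched`. Only a warp with a vacant I-Buffer entry is eligible.
    """
    n = len(warps)
    for offset in range(1, n + 1):
        warp = warps[(last_fetched + offset) % n]
        if not warp.has_vacant_entry():
            continue  # full I-Buffer: not considered in fetch
        if warp.pc in icache:
            # Hit: "decode" (here just a tuple) and store in the I-Buffer.
            warp.ibuffer.append(("decoded", warp.pc))
            warp.pc += 1
        else:
            # Miss: fill the line; the warp fetches again in a later cycle.
            icache.add(warp.pc)
        return warp.wid
    return None  # every I-Buffer is full, nothing to fetch
```

Calling `fetch_cycle` repeatedly, passing the previously returned warp id as `last_fetched`, shows a miss on the first access to a PC and a hit when the same PC is fetched again.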
Issue
Selects a warp with a ready instruction
Acquires the active mask from the top of the SIMT stack (TOS)
Invalidates the issued I-Buffer entry
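A rough sketch of the issue step follows. The dictionary layout of a warp, the `issue` function, and the round-robin search order are hypothetical choices for illustration; the slides do not specify the selection policy at this point.

```python
def issue(warps, last_issued):
    """Pick the next warp whose oldest valid I-Buffer entry is ready.

    Each warp is a dict with hypothetical keys:
      'ibuffer'    : list of {'inst': str, 'ready': bool, 'valid': bool}
      'simt_stack' : list of active masks (ints), top of stack last
    Returns (warp index, instruction, active mask), or None if nothing issues.
    """
    n = len(warps)
    for offset in range(1, n + 1):
        idx = (last_issued + offset) % n
        warp = warps[idx]
        for entry in warp["ibuffer"]:
            if not entry["valid"]:
                continue  # already issued
            if entry["ready"]:
                active_mask = warp["simt_stack"][-1]  # TOS of the SIMT stack
                entry["valid"] = False  # invalidate the I-Buffer entry
                return idx, entry["inst"], active_mask
            break  # oldest valid entry is not ready: move to the next warp
    return None
```

The returned active mask tells the later stages which lanes of the warp actually execute the instruction.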
Scoreboard
Checks for RAW and WAW dependency hazards
Flag instructions with hazards as not ready in I-Buffer
(masking them out from the scheduler)
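The hazard check can be illustrated with a small per-warp scoreboard. This is a sketch under assumptions: the `Scoreboard` class and its method names are invented for the example and do not mirror the simulator's actual interface.

```python
class Scoreboard:
    """Tracks, per warp, the registers that have a write still in flight."""

    def __init__(self):
        self.pending = {}  # warp id -> set of register names

    def has_hazard(self, wid, srcs, dst):
        busy = self.pending.get(wid, set())
        raw = any(r in busy for r in srcs)        # read of a pending write
        waw = dst is not None and dst in busy     # write over a pending write
        return raw or waw

    def reserve(self, wid, dst):
        """Called at issue: mark the destination register as pending."""
        if dst is not None:
            self.pending.setdefault(wid, set()).add(dst)

    def release(self, wid, dst):
        """Called at writeback: the register can be used again."""
        self.pending.get(wid, set()).discard(dst)
```

An instruction for which `has_hazard` returns `True` would be flagged as not ready in the I-Buffer, so the scheduler skips it until `release` clears the pending register.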
December 2012
Read Operand
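Register file banks (registers interleaved across four banks):
Bank 0: R0, R4, R8
Bank 1: R1, R5, R9
Bank 2: R2, R6, R10
Bank 3: R3, R7, R11
Source operands in different banks: no conflict
Two source operands in bank 0: conflict at bank 0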
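A small sketch of the bank-conflict check, assuming registers are interleaved across four banks (R0 in bank 0, R1 in bank 1, and so on, wrapping at R4). The helper names `bank_of` and `conflicting_banks` are hypothetical, and the simulator's real register-to-bank mapping may differ.

```python
from collections import Counter

NUM_BANKS = 4  # four banks, as in the figure


def bank_of(reg):
    """Map a register index to a bank, assuming simple interleaving."""
    return reg % NUM_BANKS


def conflicting_banks(src_regs):
    """Return the sorted banks that two or more distinct source registers share."""
    counts = Counter(bank_of(r) for r in set(src_regs))
    return sorted(bank for bank, count in counts.items() if count > 1)
```

Under this mapping, reading R1 and R2 touches banks 1 and 2 and causes no conflict, while reading R0 and R4 hits bank 0 twice and must be serialized.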
Operand Collector
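[Figure: operand collector. Instructions arrive from the instruction issue stage and are dispatched to the execution units once their operands have been collected.]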
Execution
ALU
Stream processor (SP)
Special function unit (SFU)
MEM
Shared memory
Local memory
Global memory
Texture memory
Constant memory
ALU Pipelines
SIMD Execution Unit
Fully Pipelined
Each pipe may execute a subset of instructions
Configurable bandwidth and latency (depending on the instruction)
Default: SP + SFU pipes
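The pipeline properties above can be sketched as follows. The `Pipeline` class, the `dispatch` helper, the opcode sets, and any latency values passed in are illustrative assumptions, not the simulator's defaults.

```python
class Pipeline:
    """Fully pipelined unit: accepts one instruction per cycle,
    each completing `latency` cycles after it was issued."""

    def __init__(self, name, opcodes, latency):
        self.name = name
        self.opcodes = set(opcodes)  # subset of instructions this pipe executes
        self.latency = latency
        self.in_flight = []  # list of (finish_cycle, instruction)

    def can_execute(self, opcode):
        return opcode in self.opcodes

    def issue(self, cycle, inst):
        self.in_flight.append((cycle + self.latency, inst))

    def retire(self, cycle):
        """Return the instructions that finish at or before `cycle`."""
        done = [inst for t, inst in self.in_flight if t <= cycle]
        self.in_flight = [(t, inst) for t, inst in self.in_flight if t > cycle]
        return done


def dispatch(pipes, cycle, opcode):
    """Send `opcode` to the first pipe that supports it; return the pipe name."""
    for pipe in pipes:
        if pipe.can_execute(opcode):
            pipe.issue(cycle, opcode)
            return pipe.name
    raise ValueError(f"no pipeline executes {opcode!r}")
```

For example, with an SP pipe of latency 4 and an SFU pipe of latency 8 (made-up numbers), an `add` issued at cycle 0 retires at cycle 4 while a `sin` issued at cycle 1 retires at cycle 9, and both pipes keep accepting new instructions every cycle in between.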
Memory Unit
[Figure: memory unit, showing the AGU, bank-conflict check, shared memory, access coalescing, data cache, constant cache, texture cache, MSHR, and memory port]
Writeback
Write result to register file
Scoreboard updates the r-bit
Demo2
Software framework overview
Monitor the warp scheduling order
Compare different scheduling policies
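To give a feel for what such a comparison shows, the sketch below computes the issue order under two policies. The slides do not name the policies studied; loose round-robin (LRR) and greedy-then-oldest (GTO) are assumed here as two common examples, and "oldest" is approximated by the lowest warp id. The function names and the per-cycle ready sets are hypothetical.

```python
def lrr_order(ready_per_cycle, num_warps):
    """Loose round-robin: each cycle, search from the warp after the last issued one."""
    order, last = [], num_warps - 1
    for ready in ready_per_cycle:
        for offset in range(1, num_warps + 1):
            w = (last + offset) % num_warps
            if w in ready:
                order.append(w)
                last = w
                break
        else:
            order.append(None)  # no warp was ready this cycle
    return order


def gto_order(ready_per_cycle):
    """Greedy-then-oldest: keep issuing from the same warp while it is ready,
    otherwise fall back to the oldest ready warp (lowest id in this sketch)."""
    order, current = [], None
    for ready in ready_per_cycle:
        if current not in ready:
            current = min(ready) if ready else None
        order.append(current)
    return order


# Warps ready in each of four cycles (warp 0 stalls in the third cycle).
ready = [{0, 1, 2}, {0, 1, 2}, {1, 2}, {0, 1, 2}]
print(lrr_order(ready, 3))  # [0, 1, 2, 0]
print(gto_order(ready))     # [0, 0, 1, 1]
```

LRR rotates through the warps, while GTO stays on warp 0 until it stalls and then sticks with warp 1; logging the issued warp id each cycle in the simulator would expose the same kind of difference.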