005 Superscalar

Superscalar Processors
Superscalar Execution
How it can help
Issues:
  Maintaining Sequential Semantics
  Scheduling
  Scoreboard
Superscalar vs. Pipelining
Example: Alpha 21164 and 21064
A. Moshovos
ECE1773 - Fall 07 ECE Toronto
Sequential Semantics - Review

Instructions appear as if they executed:
  In the order they appear in the program
  One after the other
Pipelining: Partial Overlap of Instructions
  Initiate one instruction per cycle
  Subsequent instructions overlap partially
  Commit one instruction per cycle
Superscalar - In-order
Two or more consecutive instructions in the original program order can execute in parallel
  This is the dynamic execution order
N-way Superscalar
  Can issue up to N instructions per cycle
  2-way, 3-way, ...
Superscalar vs. Pipelining

loop: ld  r2, 10(r1)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

sum += a[i--]

[Figure: timing diagrams (fetch, decode over time) for the loop body ld, add, sub, bne. Pipelining: instructions enter one per cycle. Superscalar: more than one instruction is fetched and decoded in the same cycle.]
Superscalar Performance
Performance Spectrum?
What if all instructions were dependent?
  Speedup = 1 (no improvement): superscalar buys us nothing
What if all instructions were independent?
  Speedup = N, where N = superscalarity
Again, the key is typical program behavior
  Some parallelism exists
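
The two extremes can be checked with a toy model. This is only a sketch under stated assumptions: a hypothetical in-order issue model with unit latency, where `cycles_in_order`, `speedup`, `deps` and `width` are illustrative names, not anything from the lecture.

```python
def cycles_in_order(deps, width):
    """Cycles needed to issue every instruction, in program order.

    deps[i] holds the indices of earlier instructions that instruction i
    reads from. Assumed model: unit latency, so a consumer issues at least
    one cycle after its producer, and at most `width` issues per cycle.
    """
    issue_cycle = []
    cycle, used = 0, 0
    for producers in deps:
        ready = max((issue_cycle[p] + 1 for p in producers), default=0)
        if used == width:      # issue slots for this cycle are exhausted
            cycle, used = cycle + 1, 0
        if ready > cycle:      # RAW stall: wait for the producer
            cycle, used = ready, 0
        issue_cycle.append(cycle)
        used += 1
    return cycle + 1 if deps else 0


def speedup(deps, width):
    """Speedup of a `width`-way machine over a scalar (1-way) one."""
    return cycles_in_order(deps, 1) / cycles_in_order(deps, width)


independent = [set() for _ in range(8)]            # no dependences at all
chain = [set()] + [{i - 1} for i in range(1, 8)]   # each reads the previous
print(speedup(independent, 4))  # 4.0
print(speedup(chain, 4))        # 1.0
```

Real programs fall between the two cases, which is why typical program behavior decides how much a wider machine helps.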

Real-Life Performance
OLTP = Online Transaction Processing
SOURCE: Partha Ranganathan, Kourosh Gharachorloo, Sarita Adve, Luiz André Barroso, "Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors," ASPLOS '98
Real Life Performance

SPEC CPU 2000: Simplescalar sim: 32K I$ and D$, 8K bpred
[Figure: IPC per SPEC CPU 2000 benchmark; series labeled 1, 2, 4, 8, 16]
Superscalar Issue
An instruction at decode can execute if:
  Dependences
    RAW: Input operand availability
    WAR and WAW
Must check against instructions:
  Simultaneously decoded
  In-progress in the pipeline (i.e., previously issued)
    Recall the register vector from pipelining
Increasingly complex with degree of superscalarity
  2-way, 3-way, ..., n-way

Issue Rules
Stall at decode if:
  RAW dependence and no data available
    Source registers against previous targets
  WAR or WAW dependence
    Target register against previous targets + sources
  No resource available
This check is done in program order
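
The register checks above can be sketched as follows. This is an illustration, not the actual hardware logic: `Instr`, `issue_count`, `pending_writes` and `pending_reads` are hypothetical names, and the resource check is left out.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    tgt: str           # target register, e.g. "r3"
    srcs: tuple = ()   # source registers (at most two assumed)


def issue_count(group, pending_writes, pending_reads=frozenset()):
    """How many leading instructions of `group` may issue this cycle.

    The group is scanned in program order; the first instruction that
    fails a check stalls, and so does everything after it.
    """
    earlier_tgts = set(pending_writes)   # targets of previous instructions
    earlier_srcs = set(pending_reads)    # sources of previous instructions
    issued = 0
    for ins in group:
        raw = any(s in earlier_tgts for s in ins.srcs)  # sources vs. targets
        waw = ins.tgt in earlier_tgts                   # target vs. targets
        war = ins.tgt in earlier_srcs                   # target vs. sources
        if raw or waw or war:
            break
        earlier_tgts.add(ins.tgt)
        earlier_srcs.update(ins.srcs)
        issued += 1
    return issued


ld = Instr("r2", ("r1",))
add = Instr("r3", ("r3", "r2"))
print(issue_count([ld, add], pending_writes=set()))  # 1: add has a RAW on r2
```

Because the scan stops at the first failing instruction, issue stays in program order, which is what keeps the later commit logic simple.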
Issue Mechanism: A Group of Instructions at Decode

[Figure: instructions in program order, each with tgt and src register fields, compared against the earlier instructions in the group]
simplifications may be possible
resource checking not shown
Assume 2 source & 1 target max per instr.
comparators for 2-way:
  3 for tgt and 2 for src (tgt: WAW + WAR, src: RAW)
comparators for 4-way:
  2nd instr: 3 tgt and 2 src
  3rd instr: 6 tgt and 4 src
  4th instr: 9 tgt and 6 src
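
The comparator counts follow a simple pattern: each instruction is compared against every earlier instruction in the group. A small sketch (the function names are illustrative) reproduces the numbers above and shows the quadratic growth:

```python
def comparators(width, srcs_per_instr=2, tgts_per_instr=1):
    """Per-instruction (tgt, src) comparator counts for one decode group."""
    ops = srcs_per_instr + tgts_per_instr
    counts = []
    for earlier in range(width):
        # WAW + WAR: the target against earlier targets and sources
        tgt_cmp = tgts_per_instr * ops * earlier
        # RAW: each source against earlier targets
        src_cmp = srcs_per_instr * tgts_per_instr * earlier
        counts.append((tgt_cmp, src_cmp))
    return counts


def total_comparators(width):
    return sum(t + s for t, s in comparators(width))


print(comparators(4))        # [(0, 0), (3, 2), (6, 4), (9, 6)]
print(total_comparators(2))  # 5
print(total_comparators(4))  # 30
```

Doubling the width from 2 to 4 raises the count from 5 to 30, before any checks against in-flight instructions are added.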
Issue Checking for Dependences with In-Flight instructions

Naive implementation:
  Compare registers with all outstanding registers
  RAW, WAR and WAW
How many comparators do we need?
  Stages x Superscalarity x Ops per Instruction
  Priority enforcers?
But we need some of this for bypassing (RAW)
Issue Checking for Dependences with In-Flight instructions

Scoreboard:
  Pending Write per register, one bit
    Set at decode / Reset at writeback
  Pending Read?
    Not needed if all reads are done in order
    WAR and WAW not possible
Can handle structural hazards
  Busy indicators per resource
Can handle bypass
  Where a register value is produced
  R0 busy, in ALU0, at time +3
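
A minimal sketch of the one-bit-per-register scoreboard, assuming integer register numbers. The class and method names are hypothetical, and checking the target's pending-write bit is a conservative assumption made here rather than something the slide specifies.

```python
class Scoreboard:
    """One pending-write bit per register."""

    def __init__(self, num_regs=32):
        self.pending_write = [False] * num_regs

    def can_issue(self, srcs, tgt):
        # RAW: a source is still waiting for its producer to write back.
        raw = any(self.pending_write[s] for s in srcs)
        # Conservative assumption: also stall on a pending write to tgt.
        waw = self.pending_write[tgt]
        return not (raw or waw)

    def decode(self, tgt):
        self.pending_write[tgt] = True    # set at decode

    def writeback(self, tgt):
        self.pending_write[tgt] = False   # reset at writeback


sb = Scoreboard()
sb.decode(2)                    # ld r2, 10(r1) is issued
print(sb.can_issue([3, 2], 3))  # False: add r3, r3, r2 must wait for r2
sb.writeback(2)
print(sb.can_issue([3, 2], 3))  # True
```

Compared with the comparator approach, the lookup cost depends on the number of operands per instruction rather than on how many instructions are in flight.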
Implications
Need to multiport some structures
  Register File
    Multiple reads and writes per cycle
  Register Availability Vector (scoreboard)
    Multiple reads and writes per cycle
    From Decode and Commit
    Also need to worry about WAR and WAW
Resource tracking
  Additional issue conditions
Many superscalars had additional restrictions
  E.g., execute one integer and one floating-point op, one branch, or one store/load
Preserving Sequential Semantics

In principle not much different than pipelining
Program order is preserved in the pipeline
  Some instructions proceed in parallel
  But order is clearly defined
Defer interrupts to commit stage (i.e., writeback)
  Flush all subsequent instructions
    may include instructions committing simultaneously
  Allow all preceding instructions to commit
Recall comparisons are done in program order
Must have sufficient time in clock cycle to handle these
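
The commit-stage rule can be sketched as a split of the commit group at the excepting instruction. `commit_group` is an illustrative name, and squashing the excepting instruction itself (so it can be restarted after the handler) is an assumption of this sketch.

```python
def commit_group(group, faulting_index=None):
    """Split a commit group, given in program order, into (committed, flushed).

    Instructions before the excepting one commit. The excepting instruction
    and everything after it are flushed, including instructions that would
    have committed in the same cycle.
    """
    if faulting_index is None:
        return list(group), []
    return list(group[:faulting_index]), list(group[faulting_index:])


print(commit_group(["ld", "add", "div", "bne"], faulting_index=2))
# (['ld', 'add'], ['div', 'bne'])
```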
Interrupts Example
[Figure: two timing diagrams of ld, add, div, bne in a superscalar pipeline, marking where an exception is raised and where it is taken]
Superscalar and Pipelining

In principle they are orthogonal
  Superscalar non-pipelined machine
  Pipelined non-superscalar
  Superscalar and Pipelined (common)
Additional functionality needed by Superscalar:
  Another bound on clock cycle
  At some point it limits the number of pipeline stages
Superscalar vs. Superpipelining

Superpipelining:
  Vaguely defined as deep pipelining, i.e., lots of stages
Superscalar issue complexity: limits super-pipelining
How do they compare?
  2-way Superscalar vs. Twice the stages
  Not much difference.
[Figure: timing diagrams of a 2-way superscalar pipeline (fetch, decode, inst) and a superpipelined one with twice the stages (F1, F2, D1, D2, E1, E2)]

[Figure: the superscalar (fetch, decode, inst) and superpipelined (F1, F2, D1, D2, E1, E2) timing diagrams with more instructions in flight]

WANT 2X PERFORMANCE:
[Figure: the superscalar and superpipelined timing diagrams with RAW dependences between instructions]
Superscalar vs. Superpipelining: Another View

Source: Lipasti, Shen, Wood, Hill, Sohi, Smith (CMU/Wisconsin)
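Amdahl's Law
[Figure: No. of processors (1 to N) vs. time; the work performed splits into a fraction 1-f on 1 processor and a fraction f on N processors]
f = fraction that is vectorizable (parallelism)
v = speedup for f
Overall speedup:
$$\text{Speedup} = \frac{1}{(1 - f) + \dfrac{f}{v}}$$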
Amdahl's Law: Sequential Part Limits Performance

Parallelism can't help if there isn't any
  Even if v is infinite
  Performance limited by nonvectorizable portion (1-f)
$$\lim_{v \to \infty} \frac{1}{(1 - f) + \dfrac{f}{v}} = \frac{1}{1 - f}$$
[Figure: No. of processors (1 to N) vs. time, with fractions f and 1-f]
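
A short numeric check of Amdahl's Law, Speedup = 1 / ((1 - f) + f / v), and of its limit 1 / (1 - f) as v grows. The function names are illustrative.

```python
def amdahl_speedup(f, v):
    """Overall speedup when a fraction f of the work is sped up by v."""
    return 1.0 / ((1.0 - f) + f / v)


def amdahl_limit(f):
    """Upper bound on the speedup as v goes to infinity."""
    return 1.0 / (1.0 - f)


for v in (2, 10, 100, 1_000_000):
    print(v, round(amdahl_speedup(0.9, v), 3))
print("limit", round(amdahl_limit(0.9), 3))
```

With f = 0.9, a 100x speedup of the parallel part gives an overall speedup of less than 10x, because the remaining 10% of the work still takes its full time.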
Pipeline Performance
[Figure: pipeline depth (1 to N) vs. time; the pipeline is filled for a fraction g and not filled for 1-g]
g = fraction of time pipeline is filled
1-g = fraction of time pipeline is not filled (stalled)
1-g: performance suffers
Case Study: Alpha 21164
21164: Int. Pipe
21164: Memory Pipeline
21164: Floating-Point Pipe
Performance Comparison
Source:
CPI Comparison
Compiler Impact
Optimized Performance Base
Issue Cycle Distribution - 21164
Issue Cycle Distribution - 21064
Stall Cycles - 21164

Data Dependences/Data Stalls No instructions
Stall Cycles Distribution

Model:
  When no instruction is committing
  Does not capture overlapping factors:
    Stall due to dependence while committing
    Stall due to cache miss while committing
Replay Traps
Tried to do something and couldn't
  Store and write-buffer is full
    Can't complete instruction
  Load and miss-address-file full
    Can't complete instruction
  Assumed cache hit and was miss
    Dependent instructions executed
    Must re-execute dependent instructions
Re-execute the instruction and everything that follows
Replay Traps Explained

[Figure: pipeline timing (stages F, D, E, M, W) of ld r1 followed by the dependent add _, r1, shown for the cache-hit case and for the cache-miss case]
Optimistic Scheduling
[Figure: pipeline timing of ld r1 and the dependent add _, r1 on a cache hit, annotated:]
  Hit/miss known here
  add should start execution here
  Must decide that add should execute
  Start making scheduling decisions
Optimistic Scheduling #2
[Figure: pipeline timing of ld r1 and the dependent add _, r1 on a cache hit, annotated:]
  Guess hit/miss
  Hit/miss known here
  add should start execution here
  Must decide that add should execute
  Start making scheduling decisions
Stall Distribution
21164 Microarchitecture
Instruction Fetch/Decode + Branch Units
Integer Execution Unit
Floating-Point Execution Unit
Memory Address Translation Unit
Cache Control and Bus Interface
Data Cache
Instruction Cache
Second-Level Cache
Instruction Decode/Issue
Up to four insts/cycle
Naturally aligned groups
  Must start at 16-byte boundary (INT16)
  Simplifies fetch path (explained shortly)
All of group must issue before next group gets in
  Simplifies scheduling
  No need for reshuffling

Fetching Four Instructions

Where do instructions come from?
  I-Cache:
  CPU needs:
Software must guarantee alignment at 16-byte boundaries
  Lots of NOPs
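
A sketch of why alignment matters, assuming fixed 4-byte Alpha instructions and the 16-byte groups described above. `fetch_group` and the constants are illustrative names, not part of any real tool.

```python
INSTR_BYTES = 4    # Alpha instructions are a fixed 4 bytes
GROUP_BYTES = 16   # one naturally aligned group holds four instructions


def fetch_group(pc):
    """Return (group base address, first slot used, useful instructions)."""
    base = pc & ~(GROUP_BYTES - 1)             # round down to 16 bytes
    first_slot = (pc - base) // INSTR_BYTES
    useful = GROUP_BYTES // INSTR_BYTES - first_slot
    return base, first_slot, useful


print(fetch_group(0x1000))  # (4096, 0, 4): aligned target, all four slots used
print(fetch_group(0x1008))  # (4096, 2, 2): target in mid-group, two slots wasted
```

A branch target that lands in the middle of a group wastes the leading slots, which is why compilers pad targets to 16-byte boundaries with NOPs.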

Instruction Buffer and Prefetch

I-buffer feeding issue
4-entry, 8-instruction prefetch buffer (PB)
Check I-Cache and PB in parallel
PB hit: Fill cache, feed pipeline
PB miss: Prefetch four lines
Branch Execution
One cycle delay
  Calc. target PC
Naive implementation:
  Can fetch every other cycle
Branch prediction to avoid the delay
Up to four pending branches
  In stage 2: Assignment to Functional Units
  One at stage 3: Instruction Scheduling/Issue
  One at stage 4: Instruction Execution
Flush and execute from right PC
Return Address Stack

Returns
  Target address changes
  Conventional branch prediction can't handle
Predictable change
  Return address = call site return point
Detect calls
  Push return address onto hardware stack
  Return pops address
Speculative
12-entry stack
  Circular queue: overflow/underflow messes it up
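
A minimal sketch of a return address stack kept as a circular queue. The class and its methods are hypothetical and not the 21164's actual design; the example uses a 2-entry stack so that the overflow is easy to see.

```python
class ReturnAddressStack:
    """Fixed-size hardware-style stack that wraps around instead of growing."""

    def __init__(self, entries=12):
        self.slots = [0] * entries
        self.top = 0   # index of the next free slot

    def push(self, return_addr):
        # A call pushes its return point; on overflow the oldest entry is lost.
        self.slots[self.top] = return_addr
        self.top = (self.top + 1) % len(self.slots)

    def pop(self):
        # A return pops the predicted target; on underflow it reads stale data.
        self.top = (self.top - 1) % len(self.slots)
        return self.slots[self.top]


ras = ReturnAddressStack(entries=2)
for addr in (0x10, 0x20, 0x30):   # three nested calls, only two entries
    ras.push(addr)
print([hex(ras.pop()) for _ in range(3)])  # ['0x30', '0x20', '0x30']
```

The third pop should have produced 0x10, but that entry was overwritten, so the outermost return is mispredicted. Since the stack only supplies a prediction, this costs performance, not correctness.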
Instruction Translation Buffer

Translate virtual addresses to physical
48-entry fully-associative
Pages 8KB to 4MB
Not-last-used/Not-MRU replacement
128 address space identifiers
Integer Execution Unit

Two of:
  Adder
  Logic
One of:
  Barrel shifter
  Byte-manipulation
  Multiply
Asymmetric unit configurations are common
Tradeoff between:
  Flexibility/Performance
  Area/Cost/Complexity
How to decide?
  Common application behavior

Integer Register File

32+8 registers
  8 are legacy DEC
Four read ports, two write ports
  Support for up to two integer ops
Floating-Point Unit
FPU Add
FPU Multiply
2 ops per cycle
Divides take multiple cycles
32 registers, five reads, four writes
  Four reads and two writes for FP pipe
  One read for stores (handled by integer pipe)
  One write for loads (handled by integer pipe)
Memory Unit
Up to two accesses
Data translation buffer
  512-entries, not-MRU
  Loads access in parallel with D-cache
Miss Address File
  Pending misses
  Six data loads
  Four instruction reads
  Merges loads to same block
Store/Load Conflicts
Load immediately after a store
  Can't see the data
  Detect and replay
    Flush pipe and re-execute
Compiler can help
  Schedule load three cycles after store
  Two cycles stalls the load at issue/address generation
Write Buffer
Six 32-byte entries
Defer stores until there is a port available
Loads can read from write buffer
Pipeline Processing Front-End
Integer Add
Floating-Point Add
Load Hit
Load Miss
Store Hit
80486 Pipeline
Fetch
  Load 16 bytes into prefetch buffer
Decode 1
  Determine instruction length and type
Decode 2
  Compute memory address
  Generate immediate operands
Execute
  Register read
  ALU
  Memory read/write
Write-back
  Update register file
(source: CS740 CMU, 97, all slides on 486)
80486 Pipeline detail

Fetch
  Moves 16 bytes of instruction stream into code queue
  Not required every time
    About 5 instructions fetched at once (avg. length 2.5 bytes)
    Only useful if we don't branch
  Avoids need for separate instruction cache
D1
  Determine total instruction length
  Signals code queue aligner where next instruction begins
  May require two cycles
    When multiple operands must be decoded
    About 6% of typical DOS program
80486 Pipeline
D2
  Extract memory displacements and immediate operands
  Compute memory addresses
    Add base register, and possibly scaled index register
  May require two cycles
    If index register involved, or both address & immediate operand
    Approx. 5% of executed instructions
EX
  Read register operands
  Compute ALU function
  Read or write memory (data cache)
WB
  Update register result
