08-Speculation - 2 (Compatibility Mode) PDF

Dynamic ILP
Speculation
Outline
Speculation
Re-order buffers
Limits to ILP
Speculation
Branch Prediction + Out-of-Order Execution
Control Dependence Ignored

If the CPU stalls on branches, how much would CPI increase?
Control dependence need not be preserved throughout the whole execution
We are willing to execute instructions that should not have been executed, thereby violating control dependences, if we can do so without affecting the correctness of the program
Two properties critical to program correctness are:
data flow
exception behavior
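A rough way to approach the CPI question above is to add the average stall per instruction to a base CPI. The sketch below assumes every branch stalls for a fixed number of cycles; the base CPI, branch fraction, and stall length are illustrative assumptions, not figures from these slides.

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, stall_cycles):
    """Effective CPI if every branch stalls the pipeline for stall_cycles."""
    return base_cpi + branch_fraction * stall_cycles

# Assumed, illustrative values (not from the slides).
cpi = cpi_with_branch_stalls(base_cpi=1.0, branch_fraction=0.2, stall_cycles=3)
print(f"effective CPI = {cpi:.2f}")
```

Under these assumed numbers, stalling adds 0.2 x 3 = 0.6 cycles per instruction, which is why stalling on every branch is unattractive.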
Branch Prediction and Speculative Execution
Speculation means running instructions based on predictions; predictions could be wrong.
Example:
for (i=0; i<1000; i++)
    C[i] = A[i]+B[i];
Branch prediction: cannot be avoided, but could be very accurate
Branch prediction: predict the execution as accurately as possible (frequent cases)
Speculative execution recovery: if the prediction is wrong, roll the execution back
Misprediction is a less frequent event, but can we ignore it?
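To get a feel for how accurate prediction can be on the loop example above, here is a minimal sketch of a 2-bit saturating counter predictor. The slides do not name a predictor, so the 2-bit scheme is an assumption, as is the branch pattern (the loop-closing branch taken 999 times, then falling through once).

```python
def simulate_two_bit_predictor(outcomes, state=0):
    """Count correct predictions of a 2-bit saturating counter.

    States 0-1 predict not taken, states 2-3 predict taken.
    """
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken == taken:
            correct += 1
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct

# Assumption: the loop-closing branch is taken 999 times, then falls through once.
outcomes = [True] * 999 + [False]
correct = simulate_two_bit_predictor(outcomes)
accuracy = correct / len(outcomes)
print(f"{correct}/{len(outcomes)} correct ({accuracy:.1%})")
```

Starting from the "strongly not taken" state, the sketch mispredicts the first two iterations and the loop exit. Those few mispredictions still need a recovery mechanism, which is the point of the last question on the slide.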
Exception Behavior
Preserving exception behavior -- exceptions must be
raised exactly as in sequential execution
Same sequence as sequential
No extra exceptions
Example:
      DADDU R2,R3,R4
      BEQZ  R2,L1
      LW    R1,0(R2)
L1:
Problem with moving LW before BEQZ?
Again, a dynamic execution must look like a sequential execution at any time when it is stopped
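The problem with hoisting the load can be sketched with a toy model of the three instructions above. The memory model (only one mapped address, with a fault on anything else) and the register values are hypothetical, chosen so that the branch is taken and the load should never run.

```python
class MemoryFault(Exception):
    """Raised when a load touches an unmapped address (hypothetical model)."""

def load_word(memory, address):
    if address not in memory:
        raise MemoryFault(f"invalid address {address:#x}")
    return memory[address]

def run_in_program_order(regs, memory):
    regs["R2"] = regs["R3"] + regs["R4"]              # DADDU R2,R3,R4
    if regs["R2"] != 0:                               # BEQZ  R2,L1 skips LW when R2 == 0
        regs["R1"] = load_word(memory, regs["R2"])    # LW    R1,0(R2)
    return regs

def run_with_load_hoisted(regs, memory):
    regs["R2"] = regs["R3"] + regs["R4"]              # DADDU R2,R3,R4
    loaded = load_word(memory, regs["R2"])            # LW moved above BEQZ
    if regs["R2"] != 0:
        regs["R1"] = loaded
    return regs

memory = {0x100: 42}   # assumption: only address 0x100 is mapped
initial = {"R1": 7, "R2": 0, "R3": 0, "R4": 0}

in_order = run_in_program_order(dict(initial), memory)
try:
    run_with_load_hoisted(dict(initial), memory)
    hoisted_faulted = False
except MemoryFault:
    hoisted_faulted = True

print(in_order["R1"], hoisted_faulted)
```

In program order the load is skipped and R1 keeps its old value. With the load hoisted, the same inputs raise a fault that sequential execution would never have raised, which is exactly the "extra exception" the slide rules out.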
Exceptions in Order
Solutions:
Early detection of FP exceptions
The use of software mechanisms to restore a precise exception state before resuming execution
Delaying instruction completion until we know an
exception is impossible
Precise Interrupts
An interrupt is precise if the saved process
state corresponds with a sequential model of
program execution where one instruction
completes before the next begins.
Tomasulo had:
In-order issue, out-of-order execution, and
out-of-order completion
Need to fix the out-of-order completion aspect so that we can find a precise breakpoint in the instruction stream.
Short Seminar: Precise Exceptions
1. 01277582(Implementation of precise exception in a 5-stage pipeline embedded processor CNF03).pdf
2. 01354393(A 0.18-µm CMOS implementation of an area efficient precise exception handling unit for processing-in-memory systems - CNF04).pdf
3. 00004607(Implementing precise interrupts in pipelined processors - JNL88).pdf
Branch Prediction vs. Precise Interrupt
A misprediction is an "exception" on the branch instruction
Execution branches out on exceptions
Every instruction is predicted not to take the branch to the interrupt handler
Same technique for handling both issues: in-order completion, or commit: change registers/memory only in (sequential) program order
How does it ensure correctness?
HW Support for More ILP

Speculation: allow an instruction that depends on a branch predicted taken to issue, with no consequences (including exceptions) if the branch is not actually taken ("HW undo")
Combine branch prediction with dynamic scheduling to execute before branches are resolved
Separate speculative bypassing of results from real bypassing of results
When an instruction is no longer speculative, write boosted results (instruction commit) or discard boosted results
Execute out of order but commit in order, to prevent any irrevocable action (state update or exception) until the instruction commits
HW support for More ILP

Need a HW buffer for results of uncommitted instructions: the reorder buffer
4 fields: instr, destination, value, ready
Reorder buffer can be an operand source => more registers, like RS
Use the reorder buffer number instead of the reservation station number as the tag when execution completes
Supplies operands between execution complete & commit
Once an instruction commits, its result is put into the register
Instructions commit in order
As a result, it's easy to undo speculated instructions on mispredicted branches or on exceptions
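The four fields and the in-order commit rule above can be sketched as a small data structure. This is a minimal software model under stated assumptions, not the hardware design: the class and method names are hypothetical, and `flush` simply discards every uncommitted entry.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class RobEntry:
    instr: str                    # instruction type or text
    dest: Optional[str]           # destination register, if any
    value: Optional[int] = None   # result, meaningful only when ready is True
    ready: bool = False           # set when execution completes

class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()    # head (oldest instruction) is on the left

    def allocate(self, instr, dest):
        if len(self.entries) == self.size:
            return None           # buffer full: issue must stall
        entry = RobEntry(instr, dest)
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        entry.value, entry.ready = value, True

    def commit(self, regfile):
        """Retire ready instructions from the head, in program order."""
        committed = []
        while self.entries and self.entries[0].ready:
            entry = self.entries.popleft()
            if entry.dest is not None:
                regfile[entry.dest] = entry.value
            committed.append(entry.instr)
        return committed

    def flush(self):
        """Discard all uncommitted (speculative) results."""
        self.entries.clear()

regfile = {"R1": 0, "R2": 0}
rob = ReorderBuffer(size=4)
first = rob.allocate("ADD", "R1")
second = rob.allocate("MUL", "R2")
rob.complete(second, 20)        # the younger instruction finishes first
early = rob.commit(regfile)     # nothing commits: the head is not ready
rob.complete(first, 10)
late = rob.commit(regfile)      # both commit, in program order
print(early, late, regfile)
```

Because the register file is written only in `commit`, undoing speculation in this model amounts to dropping buffer entries; no architectural state has to be repaired.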
Reorder Buffer Implementation
Result Shift Register

The "Result Shift Register" is used to control the result bus
N is the length of the longest functional unit pipeline
An instruction that takes i clock periods reserves stage i
If the stage already contains valid control information, then issue is held until the next clock period
An issuing instruction places control information in the result shift register:
the functional unit that will be supplying the result
the destination register
This control information is also marked "valid"
Each clock period, the control information is shifted down one stage toward stage one.
When it reaches stage one, it is used during the next clock period to control the result bus
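The reservation and shifting rules above can be sketched as follows. This is a simplified software model: a slot holding `None` stands for "not valid", and the functional unit names and latencies in the usage lines are made up for illustration.

```python
class ResultShiftRegister:
    def __init__(self, n_stages):
        # stages[0] is stage one; a slot holds (functional_unit, dest_reg) or None.
        self.stages = [None] * n_stages

    def try_issue(self, latency, functional_unit, dest_reg):
        """Reserve stage `latency`; return False if issue must be held."""
        if self.stages[latency - 1] is not None:
            return False
        self.stages[latency - 1] = (functional_unit, dest_reg)
        return True

    def clock(self):
        """Advance one clock: stage one drives the result bus, the rest shift down."""
        on_bus = self.stages[0]
        self.stages = self.stages[1:] + [None]
        return on_bus

rsr = ResultShiftRegister(n_stages=4)
issued_add = rsr.try_issue(3, "FP-ADD", "F2")   # reserves stage 3
bus = [rsr.clock()]
held = rsr.try_issue(2, "INT", "R5")            # stage 2 is now occupied: held
bus.append(rsr.clock())
retried = rsr.try_issue(2, "INT", "R5")         # stage 2 is free again
bus.append(rsr.clock())
bus.append(rsr.clock())
print(issued_add, held, retried, bus)
```

The held issue is the case where two instructions would otherwise need the result bus in the same clock period; delaying one of them by a cycle resolves the conflict.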
The Hardware: Reorder Buffer
If instructions write results in program order, registers/memory always get the correct values
The reorder buffer (ROB) reorders out-of-order instructions into program order at the time of writing registers/memory (commit)
If some instruction goes wrong, handle it at the time of commit: just flush the instructions after it
Instructions cannot write registers/memory immediately after execution, so the ROB also buffers the results
There is no such place in the original Tomasulo design
[Figure: block diagram with IM, Fetch Unit, Decode, Rename, Regfile, Reorder Buffer, S-buf, L-buf, RS, RS, FU1, FU2, DM]
Four Steps of Speculative Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
If a reservation station and a reorder buffer slot are free, issue the instr & send operands & the reorder buffer no. for the destination (this stage is sometimes called "dispatch")
2. Execution: operate on operands (EX)
When both operands are ready, execute; if not ready, watch the CDB for the result; when both are in the reservation station, execute; checks RAW (sometimes called "issue")
3. Write result: finish execution (WB)
Write on the Common Data Bus to all awaiting FUs & the reorder buffer; mark the reservation station available.
4. Commit: update register with reorder result
When the instr. is at the head of the reorder buffer & its result is present, update the register with the result (or store to memory) and remove the instr from the reorder buffer. A mispredicted branch flushes the reorder buffer (sometimes called "graduation")
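The four steps above can be sketched as a tiny cycle-by-cycle model. It is a sketch under several assumptions rather than the actual hardware: only two-operand ALU operations, no branches or stores, an unbounded number of reservation stations, one CDB broadcast per cycle, and made-up latencies and register values. Each loop iteration is one clock; the steps are processed from commit back to issue so that an instruction performs at most one step per cycle.

```python
OPS = {"ADD": lambda a, b: a + b, "MUL": lambda a, b: a * b}

def run(program, regs, rob_size=4):
    """program: list of (op, dest, src1, src2, latency) tuples."""
    rob, stations, rename = [], [], {}   # the ROB list is kept in program order
    pc, write_order, commit_order = 0, [], []
    while pc < len(program) or rob:
        # 4. Commit: only the ROB head may update the register file.
        if rob and rob[0]["ready"]:
            head = rob.pop(0)
            regs[head["dest"]] = head["value"]
            if rename.get(head["dest"]) is head:
                del rename[head["dest"]]
            commit_order.append(head["name"])
        # 3. Write result: one finished station broadcasts on the CDB.
        done = next((s for s in stations if s["left"] == 0), None)
        if done is not None:
            entry = done["rob"]
            entry["value"], entry["ready"] = OPS[done["op"]](*done["vals"]), True
            for s in stations:   # waiting stations capture the broadcast value
                s["vals"] = [entry["value"] if v is entry else v for v in s["vals"]]
            stations.remove(done)
            write_order.append(entry["name"])
        # 2. Execute: stations whose operands are all values make progress.
        for s in stations:
            if s["left"] > 0 and not any(isinstance(v, dict) for v in s["vals"]):
                s["left"] -= 1
        # 1. Issue: needs a free ROB slot (stations are unbounded in this sketch).
        if pc < len(program) and len(rob) < rob_size:
            op, dest, src1, src2, latency = program[pc]
            vals = []
            for src in (src1, src2):
                producer = rename.get(src)
                if producer is None:
                    vals.append(regs[src])           # read the register file
                elif producer["ready"]:
                    vals.append(producer["value"])   # read the ROB
                else:
                    vals.append(producer)            # wait on the ROB tag
            entry = {"name": f"{op} {dest}", "dest": dest, "value": None, "ready": False}
            rob.append(entry)
            rename[dest] = entry
            stations.append({"op": op, "vals": vals, "left": latency, "rob": entry})
            pc += 1
    return write_order, commit_order

regs = {"R1": 0, "R2": 2, "R3": 3, "R4": 0, "R5": 5, "R6": 0}
program = [("MUL", "R1", "R2", "R3", 4),   # long-latency producer
           ("ADD", "R4", "R1", "R5", 1),   # RAW dependence on R1
           ("ADD", "R6", "R2", "R5", 1)]   # independent of the MUL
write_order, commit_order = run(program, regs)
print(write_order, commit_order, regs)
```

In this run the independent ADD writes its result before the MUL does, yet the commit order matches program order, which is the behavior the Commit step is meant to guarantee.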
Reorder Buffer Details
Fields in a ROB entry (figure labels): Branch or L/W?, Dest reg, Result, Exceptions?, Ready?, Program Counter
Holds instruction type: branch, store, ALU register operation
Holds dest, result and PC
Write results to dest at the time of commit
Which PC to hold?
Holds branch valid and exception bits
Flush pipeline when any bit is set
A ready bit indicates if the instruction has completed execution and the value is ready
Supplies operands between execution complete and commit
ROB replaces the Store Buffer also
Speculative Execution Recovery
Flush the pipeline on misprediction
MIPS 5-stage pipeline used flushing on taken branches
Where is the flush signal from?
When to flush?
[Figure: block diagram with IM, Fetch Unit, Decode, Rename, Regfile, Reorder Buffer, S-buf, L-buf, RS, RS, FU1, FU2, DM]
Changes to Other Components

Use ROB index as tag
Why not RS index any more?
Why is ROB index a valid choice?
Renaming table maps architectural registers to ROB indices if the register is renamed
Reservation stations now use the ROB index for tracking dependences and for wakeup
Again, the tag (now the ROB index) and data are broadcast on the CDB at writeback
An instruction may receive values from reg/mem, data broadcasting, or the ROB
Complexity of ROB
Assume a dual-issue superscalar
Load/store machine with three-operand instructions
64 registers
16-entry circular buffer
Hardware support needed for ROB
Two write ports
Four read ports (two source operands for each of two instructions)
Four 6-bit comparators for associative lookup
Limited capacity of ROB is a structural hazard
Repeated writes to the same register actually happen
This is not the case in classical Tomasulo
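The circular buffer and its associative lookup can be sketched in software as below. Because repeated writes to the same register can be in flight, the lookup has to return the youngest matching entry. The class and method names are hypothetical, and the loop stands in for what hardware would do with parallel comparators on the register number (6 bits for 64 registers).

```python
class CircularRob:
    def __init__(self, size=16):
        self.size = size
        self.dest = [None] * size   # destination register number per entry
        self.head = 0               # index of the oldest entry
        self.count = 0              # number of valid entries

    def allocate(self, dest_reg):
        """Append at the tail; return the ROB index, or None when full."""
        if self.count == self.size:
            return None             # structural hazard: issue stalls
        index = (self.head + self.count) % self.size
        self.dest[index] = dest_reg
        self.count += 1
        return index

    def retire(self):
        """Free the head entry at commit."""
        if self.count == 0:
            return
        self.dest[self.head] = None
        self.head = (self.head + 1) % self.size
        self.count -= 1

    def lookup(self, src_reg):
        """Return the index of the youngest in-flight writer of src_reg, or None."""
        for age in range(self.count - 1, -1, -1):   # scan youngest first
            index = (self.head + age) % self.size
            if self.dest[index] == src_reg:         # register-number comparison
                return index
        return None

rob = CircularRob(size=4)
rob.allocate(5)                   # ROB index 0 writes register 5
rob.allocate(7)                   # ROB index 1 writes register 7
rob.allocate(5)                   # ROB index 2 writes register 5 again
youngest = rob.lookup(5)          # must pick index 2, not index 0
rob.allocate(1)
full = rob.allocate(2)            # buffer is full: None
rob.retire()                      # commit frees index 0
wrapped = rob.allocate(2)         # the tail wraps around to index 0
print(youngest, full, wrapped)
```

A source register with no match in the buffer (lookup returns `None`) would be read from the register file instead.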
Code Example
Loop: LD R2, 0(R1)
DADDIU R2, R2, #1
SD R2, 0(R1)
DADDIU R1, R1, #4
BNE R2, R3, Loop
How would this code be executed?
Inst | Issue | Exec | Memory read | Write results | Commit
LD   |       |      |             |               |
Summary
Reservation stations: implicit register renaming to a larger set of registers + buffering of source operands
Prevents registers from becoming a bottleneck
Avoids WAR, WAW hazards of Scoreboard
Not limited to basic blocks when compared to static scheduling (integer unit gets ahead, beyond branches)
Today, helps with cache misses as well
Don't stall for an L1 data cache miss
Can support memory-level parallelism
Lasting Contributions
Dynamic scheduling
Register renaming
Load/store disambiguation
360/91 descendants are Pentium III, PowerPC 604, MIPS R10000, HP-PA 8000, Alpha 21264
Dynamic Scheduling: The Only Choice?
Most high-performance processors today are dynamically scheduled superscalar processors
With deeper and n-way issue pipelines
Other alternatives to exploit instruction-level parallelism
Statically scheduled superscalar
VLIW
Mixed effort: EPIC (Explicit Parallel Instruction Computing)
Example: Intel Itanium processors
Why is dynamic scheduling so popular today?
Technology trends: increasing transistor budget, deeper pipelines, wider issue
Advantages of HW (Tomasulo)
vs. SW (VLIW) Speculation
HW determines address conflicts
HW is better at branch prediction
HW maintains precise exception model
Works across multiple implementations
SW speculation is much easier for HW design