
Dynamic ILP

Speculation

Outline
Speculation
Re-order buffers

Limits to ILP

Speculation
Branch Prediction and Out-of-Order Execution

Control Dependence Ignored


If the CPU stalls on every branch, how much would
CPI increase?
Control dependence need not be preserved in
the whole execution:
we are willing to execute instructions that should not have been
executed, thereby violating the control dependences, if we
can do so without affecting the correctness of the program
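A rough answer to the CPI question, as a minimal sketch assuming the usual pipeline model in which each stalled branch adds a fixed number of cycles on top of the base CPI. The parameters (base CPI of 1, 20% branches, 3 stall cycles per branch, 5% misprediction rate) are illustrative assumptions, not measurements.

```python
def cpi_always_stall(base_cpi, branch_fraction, stall_cycles):
    """CPI when every branch stalls the pipeline for stall_cycles cycles."""
    return base_cpi + branch_fraction * stall_cycles


def cpi_with_prediction(base_cpi, branch_fraction, mispredict_rate, penalty):
    """CPI when only mispredicted branches pay the penalty."""
    return base_cpi + branch_fraction * mispredict_rate * penalty


# Illustrative (assumed) parameters, not measured values.
stall_cpi = cpi_always_stall(1.0, 0.20, 3)
predicted_cpi = cpi_with_prediction(1.0, 0.20, 0.05, 3)
print(f"always stall: CPI = {stall_cpi:.2f}")          # prints "always stall: CPI = 1.60"
print(f"predict, 5% miss: CPI = {predicted_cpi:.2f}")  # prints "predict, 5% miss: CPI = 1.03"
```

With these assumed numbers, stalling on every branch raises CPI by 60%, while accurate prediction keeps the increase to about 3%.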

Two properties critical to program correctness are:
data flow
exception behavior

Branch Prediction and Speculative Execution
Speculation is running instructions
based on predictions, and predictions
could be wrong.

Example:
for (i=0; i<1000; i++)
    C[i] = A[i]+B[i];

Branch prediction:
branches cannot be avoided,
but prediction can be very accurate

Branch prediction:
predict the execution path as
accurately as possible
(optimize the frequent case)
Speculative execution
recovery: if the prediction is
wrong, roll the execution
back

Misprediction is a less
frequent event, but can
we ignore it?
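To see why prediction can be very accurate for the loop example above, here is a minimal simulation of a 2-bit saturating-counter predictor. The slides do not name a predictor; the 2-bit counter, the initial counter state (strongly not taken), and the assumption that the loop compiles to a bottom-of-loop branch taken 999 times and then not taken once are all assumptions of this sketch.

```python
def simulate_two_bit(outcomes, initial_state=0):
    """Run a 2-bit saturating counter over branch outcomes; return mispredictions.

    States 0-1 predict not taken, states 2-3 predict taken.
    """
    state = initial_state
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


# Assumed branch behavior for one execution of the loop.
one_run = [True] * 999 + [False]
first_run_misses = simulate_two_bit(one_run)
two_runs_misses = simulate_two_bit(one_run * 2)
print(first_run_misses, two_runs_misses)   # 3 4
```

After warm-up, the predictor misses only once per execution of the loop (on the final fall-through), so mispredictions are rare but not absent.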

Exception Behavior
Preserving exception behavior: exceptions must be
raised exactly as in sequential execution
In the same sequence as in sequential execution
No extra exceptions

Example:
        DADDU  R2,R3,R4
        BEQZ   R2,L1
        LW     R1,0(R2)
L1:

Problem with moving LW before BEQZ?
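A minimal Python sketch of the hazard, under assumptions that are not in the slides: registers and memory are dicts, and reading an address missing from memory (here address 0) raises a protection fault. If R2 is 0, BEQZ skips the load, so sequential execution never faults. Hoisting LW above BEQZ and letting it fault immediately raises an extra exception; recording the fault and raising it only if the load turns out to be on the real path preserves exception behavior.

```python
class MemoryFault(Exception):
    """Raised when a load touches an unmapped address."""


def load(memory, address):
    # Assumption: any address missing from `memory` is unmapped (e.g. address 0).
    if address not in memory:
        raise MemoryFault(f"load from unmapped address {address}")
    return memory[address]


def sequential(regs, memory):
    """DADDU R2,R3,R4 / BEQZ R2,L1 / LW R1,0(R2) / L1: in program order."""
    regs["R2"] = regs["R3"] + regs["R4"]
    if regs["R2"] != 0:                       # BEQZ not taken: LW is on the real path
        regs["R1"] = load(memory, regs["R2"])
    return regs


def hoisted_eager(regs, memory):
    """LW moved above BEQZ and allowed to raise immediately (incorrect)."""
    regs["R2"] = regs["R3"] + regs["R4"]
    value = load(memory, regs["R2"])          # may fault even if BEQZ would skip LW
    if regs["R2"] != 0:
        regs["R1"] = value
    return regs


def hoisted_deferred(regs, memory):
    """LW moved above BEQZ; a fault is recorded and raised only if LW is really executed."""
    regs["R2"] = regs["R3"] + regs["R4"]
    value, fault = None, None
    try:
        value = load(memory, regs["R2"])
    except MemoryFault as exc:
        fault = exc                           # keep the fault with the speculative result
    if regs["R2"] != 0:                       # branch resolved: LW is on the real path
        if fault is not None:
            raise fault
        regs["R1"] = value
    return regs                               # otherwise value and fault are discarded


memory = {100: 42}                            # address 0 deliberately unmapped
branch_taken = {"R1": 0, "R2": 0, "R3": 0, "R4": 0}   # R3 + R4 == 0, so BEQZ jumps to L1
print(sequential(dict(branch_taken), memory)["R1"])        # 0: LW never ran
print(hoisted_deferred(dict(branch_taken), memory)["R1"])  # 0: fault discarded
try:
    hoisted_eager(dict(branch_taken), memory)
except MemoryFault as exc:
    print("extra exception:", exc)
```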

Again, a dynamic execution must look like a sequential
execution whenever it is stopped

Exceptions in Order
Solutions:
Early detection of FP exceptions
Software mechanisms that restore a precise
exception state before resuming execution
Delaying instruction completion until we know an
exception is impossible

Precise Interrupts
An interrupt is precise if the saved process
state corresponds with a sequential model of
program execution where one instruction
completes before the next begins.
Tomasulo had:
In-order issue, out-of-order execution, and
out-of-order completion
We need to fix the out-of-order completion
aspect so that we can find a precise breakpoint
in the instruction stream.
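The definition of a precise interrupt can be checked mechanically. This is a sketch under hypothetical timing: instruction 1 is a long-latency divide that faults when it finishes, and the finish cycles are made up. With out-of-order completion, younger instructions have already written results when the fault is detected; with in-order commit (one per cycle here, an assumption), nothing younger than the faulting instruction has changed the state.

```python
def is_precise(completed, faulting_index):
    """Precise: every instruction older than the faulting one has completed,
    and nothing at or after it has changed the state."""
    return set(completed) == set(range(faulting_index))


def out_of_order_completed(finish_cycle, faulting_index):
    """Original Tomasulo: each result is written as soon as it finishes.
    Returns the instructions that have completed when the fault is detected."""
    t = finish_cycle[faulting_index]
    return {i for i, f in enumerate(finish_cycle) if f < t}


def in_order_committed(finish_cycle, faulting_index):
    """In-order commit, one instruction per cycle, no earlier than the cycle after
    it finishes. Returns what has committed before the faulting one would."""
    commit, last = [], 0
    for f in finish_cycle:
        last = max(f + 1, last + 1)
        commit.append(last)
    t = commit[faulting_index]
    return {i for i, c in enumerate(commit) if c < t}


finish = [2, 20, 5, 6]   # hypothetical finish cycles; instruction 1 faults
print(is_precise(out_of_order_completed(finish, 1), 1))  # False: 2 and 3 already wrote results
print(is_precise(in_order_committed(finish, 1), 1))      # True: only instruction 0 committed
```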

Short Seminar: Precise Exceptions
1. 01277582 (Implementation of precise exception in a 5-stage pipeline embedded processor - CNF03).pdf
2. 01354393 (A 0.18-μm CMOS implementation of an area efficient precise exception handling unit for processing-in-memory systems - CNF04).pdf
3. 00004607 (Implementing precise interrupts in pipelined processors - JNL88).pdf

Branch Prediction vs. Precise Interrupt
A misprediction is an exception on the branch
instruction
Execution branches out on exceptions:
every instruction is predicted not to take the "branch"
to the interrupt handler

The same technique handles both issues:
in-order completion, or commit: change
register/memory only in program order
(sequential)
How does it ensure correctness?

HW Support for More ILP


Speculation: allow an instruction that is dependent on
a branch predicted to be taken to issue, with no
consequences (including exceptions) if the branch is not
actually taken (HW undo)
Combine branch prediction with dynamic scheduling
to execute before branches are resolved
Separate speculative bypassing of results from real
bypassing of results
When an instruction is no longer speculative,
write its boosted results (instruction commit)
or discard the boosted results
Execute out of order but commit in order
to prevent any irrevocable action (state update or exception)
until the instruction commits


HW support for More ILP


Need a HW buffer for the results of
uncommitted instructions: the reorder
buffer
4 fields: instr, destination, value, ready
The reorder buffer can be an operand source =>
more registers, like RS
Use the reorder buffer number instead of the
reservation station number when execution
completes
Supplies operands between execution
complete & commit
Once an instruction commits,
its result is put into the register
Instructions commit in order
As a result, it's easy to undo speculated
instructions
on mispredicted branches
or on exceptions
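A minimal sketch of a reorder buffer with the four fields listed (instr, destination, value, ready), showing why in-order commit makes undo easy. The `mispredicted` flag, the instruction strings, and the register names are assumptions for illustration; fetch redirection and exceptions are not modeled.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class ROBEntry:
    instr: str                       # instruction
    dest: Optional[str]              # destination register (None for branches)
    value: Optional[float] = None    # result, held here until commit
    ready: bool = False              # has execution completed?
    mispredicted: bool = False       # extra flag (assumption): branch resolved against its prediction


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()       # program order: leftmost entry is the oldest

    def issue(self, instr, dest):
        entry = ROBEntry(instr, dest)
        self.entries.append(entry)
        return entry

    def commit(self, regfile):
        """Retire ready entries from the head only, so registers change in program order."""
        retired = []
        while self.entries and self.entries[0].ready:
            head = self.entries.popleft()
            retired.append(head.instr)
            if head.mispredicted:
                self.entries.clear()     # squash every younger (speculative) entry
                break
            if head.dest is not None:
                regfile[head.dest] = head.value
        return retired


regs = {"F0": 0.0, "F6": 0.0}
rob = ReorderBuffer()
mul = rob.issue("MUL.D F0,F2,F4", "F0")
bne = rob.issue("BNE (predicted taken)", None)
add = rob.issue("ADD.D F6,F8,F2", "F6")    # issued speculatively past the branch

add.value, add.ready = 5.0, True           # finishes first, but is not at the head
print(rob.commit(regs), regs)              # nothing retires; registers unchanged

mul.value, mul.ready = 2.0, True
bne.ready, bne.mispredicted = True, True   # branch resolves the wrong way
print(rob.commit(regs), regs)              # MUL.D and BNE retire; ADD.D is squashed
```

ADD.D's result only ever lived in its ROB entry, so discarding the entry is the whole undo.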


Reorder Buffer Implementation


Result Shift Register


A "result shift register" is used to control
the result bus
It has N stages, where N is the length of the longest functional
unit pipeline
An instruction that takes i clock
periods reserves stage i when it issues
If the stage already contains valid
control information, then issue is held
until the next clock period
The issuing instruction places control
information in the result shift register:
the functional unit that will be supplying the
result
the destination register
This control information is also marked
"valid"
Each clock period, the control
information is shifted down one stage
toward stage one.
When it reaches stage one, it is used
during the next clock period to control
the result bus
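The result shift register can be sketched as a list of stages that shifts once per clock. This is a minimal model, assuming a 4-stage register and hypothetical units ("FP add" with 3 cycles, "int mul" with 2 cycles); `None` plays the role of an invalid stage. The second instruction is held for one cycle because its result would otherwise collide with the FP add on the result bus.

```python
class ResultShiftRegister:
    """Stages 1..N; the entry in stage 1 controls the result bus in the next clock period."""

    def __init__(self, n_stages):
        self.stages = [None] * (n_stages + 1)     # index 0 unused

    def try_issue(self, latency, unit, dest):
        """Reserve stage `latency`; return False (hold issue) if that stage is already valid."""
        if self.stages[latency] is not None:
            return False
        self.stages[latency] = (unit, dest)       # a non-None entry means "valid"
        return True

    def clock(self):
        """Shift every entry one stage toward stage 1; return the entry leaving stage 1,
        which controls the result bus during the next clock period."""
        bus_owner = self.stages[1]
        self.stages[1:-1] = self.stages[2:]
        self.stages[-1] = None
        return bus_owner


rsr = ResultShiftRegister(n_stages=4)
print(rsr.try_issue(3, "FP add", "F2"))    # cycle 0: True, reserves stage 3
rsr.clock()
print(rsr.try_issue(2, "int mul", "R5"))   # cycle 1: False, stage 2 holds the FP add
rsr.clock()
print(rsr.try_issue(2, "int mul", "R5"))   # cycle 2: True, stage 2 is free again
for cycle in (3, 4):
    print(cycle, rsr.clock())              # entry controlling the result bus in that cycle
```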

The Hardware: Reorder Buffer

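If instructions write results in program order,
registers/memory always get the correct
values
The reorder buffer (ROB) reorders out-of-order instructions into program order at the time of
writing registers/memory (commit)
If some instruction goes wrong, handle it at the
time of commit: just flush the instructions
after it
Instructions cannot write registers/memory
immediately after execution, so the ROB also
buffers the results
There is no such place in the original Tomasulo algorithm
[Figure: pipeline with instruction memory (IM), fetch unit, decode, rename, register file, reorder buffer, S-buf (store buffer), L-buf (load buffer), reservation stations (RS), functional units FU1 and FU2, and data memory (DM)]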


Four Steps of Speculative Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
If a reservation station and a reorder buffer slot are free, issue the instr & send
operands & the reorder buffer no. for the destination (this stage is sometimes
called dispatch)

2. Execution: operate on operands (EX)
When both operands are ready, execute; if not ready, watch the CDB for
the result; when both are in the reservation station, execute; checks RAW
(sometimes called issue)

3. Write result: finish execution (WB)
Write on the Common Data Bus to all awaiting FUs
& the reorder buffer; mark the reservation station available.

4. Commit: update the register with the reorder result
When the instr. is at the head of the reorder buffer & its result is present, update the register
with the result (or store to memory) and remove the instr from the reorder buffer.
A mispredicted branch flushes the reorder buffer (this step is sometimes called
graduation)
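Steps 2 and 3 meet on the CDB: a waiting reservation station captures a broadcast value whose tag (a ROB entry number) matches the operand it is missing. A minimal sketch, assuming the field names Vj/Vk/Qj/Qk of the classic Tomasulo tables, integer values, and hypothetical ROB numbers 3 to 5.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RSEntry:
    """Reservation-station entry; qj/qk name the ROB entry that will produce a missing operand."""
    op: str
    rob_dest: int                 # ROB entry that receives this instruction's result
    vj: Optional[int] = None
    vk: Optional[int] = None
    qj: Optional[int] = None
    qk: Optional[int] = None

    def operands_ready(self) -> bool:
        return self.qj is None and self.qk is None


def write_result(tag: int, value: int, stations: List[RSEntry], rob_values: dict) -> List[RSEntry]:
    """Step 3: broadcast (tag, value) on the CDB to the ROB and all waiting stations."""
    rob_values[tag] = value       # the ROB entry holds the value until commit
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    return [rs for rs in stations if rs.operands_ready()]   # may now start step 2 (execute)


add = RSEntry("ADD", rob_dest=4, vk=10, qj=3)   # waits on ROB#3, e.g. a pending load
mul = RSEntry("MUL", rob_dest=5, qj=3, qk=4)    # waits on ROB#3 and on the ADD (ROB#4)
stations, rob_values = [add, mul], {}

woken = write_result(3, 7, stations, rob_values)         # load result arrives
print([rs.op for rs in woken])                           # ['ADD']
stations.remove(add)                                     # ADD executes and frees its station
woken = write_result(add.rob_dest, add.vj + add.vk, stations, rob_values)
print([rs.op for rs in woken], rob_values)               # ['MUL'] {3: 7, 4: 17}
```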


Reorder Buffer Details
Each ROB entry holds:
Instruction type: branch, store, or ALU/register operation (Branch or L/W?)
Destination register, result, and PC (Which PC to hold?)
A ready bit, which indicates whether the instruction has completed
execution and the value is ready
Branch-valid and exception bits (Exceptions?); flush the pipeline when any bit is set
Results are written to the destination at the time of commit
The reorder buffer supplies operands between execution
complete and commit
The ROB also replaces the store buffer

Speculative Execution Recovery
Flush the pipeline on misprediction
The MIPS 5-stage pipeline used flushing on taken
branches
Where does the flush signal come from?
When to flush?
[Figure: pipeline with IM, fetch unit, decode, rename, register file, reorder buffer, S-buf, L-buf, reservation stations (RS), FU1, FU2, and DM]


Changes to Other Components


Use the ROB index as the tag
Why not the RS index any more?
Why is the ROB index a valid choice?
The renaming table maps architectural registers
to ROB indices if the register is renamed
Reservation stations now use the ROB index for
tracking dependences and for wakeup
Again, the tag (now the ROB index) and data are
broadcast on the CDB at writeback
An instruction may receive operand values from registers/memory, the CDB
broadcast, or the ROB
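The three operand sources can be sketched around a renaming table that maps an architectural register to the ROB index of its latest in-flight writer. This is a minimal model under assumptions not in the slides: ROB entries live in a dict keyed by index, tags are handed out sequentially, and `commit` does not enforce program order (the caller must). The check in `commit` keeps a later writer of the same register (a WAW pair) from losing its mapping.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ROBSlot:
    dest: str
    value: Optional[int] = None
    ready: bool = False


@dataclass
class Core:
    regfile: Dict[str, int]
    rename: Dict[str, int] = field(default_factory=dict)    # arch register -> ROB index
    rob: Dict[int, ROBSlot] = field(default_factory=dict)
    next_tag: int = 0

    def read_operand(self, reg):
        """Return (value, None) if the value is available, else (None, tag) to wait on."""
        tag = self.rename.get(reg)
        if tag is None:
            return self.regfile[reg], None      # not renamed: architectural register file
        slot = self.rob[tag]
        if slot.ready:
            return slot.value, None             # completed but uncommitted: read the ROB
        return None, tag                        # still executing: wait for the CDB broadcast

    def issue(self, dest):
        """Allocate a ROB entry for a new destination and point the renaming table at it."""
        tag = self.next_tag
        self.next_tag += 1
        self.rob[tag] = ROBSlot(dest)
        self.rename[dest] = tag
        return tag

    def commit(self, tag):
        """Copy the result to the register file; drop the mapping only if it is still ours."""
        slot = self.rob.pop(tag)
        self.regfile[slot.dest] = slot.value
        if self.rename.get(slot.dest) == tag:
            del self.rename[slot.dest]


core = Core(regfile={"R1": 100, "R2": 5})
t0 = core.issue("R2")                   # e.g. LD R2,... in flight as ROB#0
print(core.read_operand("R2"))          # (None, 0): wait for ROB#0 on the CDB
print(core.read_operand("R1"))          # (100, None): read the register file
core.rob[t0].value, core.rob[t0].ready = 42, True
print(core.read_operand("R2"))          # (42, None): read the ROB
core.commit(t0)
print(core.regfile["R2"], core.rename)  # 42 {}
```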


Complexity of ROB
Assume a dual-issue superscalar
Load/store machine with three-operand instructions
64 registers
16-entry circular buffer

Hardware support needed for the ROB:
Two write ports
Four read ports (two source operands of two instructions)
Four 6-bit comparators for associative lookup

The limited capacity of the ROB is a structural hazard

Repeated writes to the same register actually happen:
every instruction writes its destination register at commit
This is not the case in classical Tomasulo, where only the
latest pending writer updates a register
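A 16-entry ROB is naturally a circular buffer whose slot number doubles as the tag; when it is full, issue must stall. A minimal sketch, assuming one allocation or retirement per call and plain strings as instructions; the bit widths follow from the 16 entries and 64 registers assumed on the slide.

```python
class CircularROB:
    """Reorder buffer as a circular buffer; the slot index doubles as the tag."""

    def __init__(self, size=16):
        self.size = size
        self.slots = [None] * size
        self.head = 0        # oldest entry, next to commit
        self.tail = 0        # next slot to allocate
        self.count = 0

    def allocate(self, instr):
        """Return the ROB index for a newly issued instruction, or None if issue must stall."""
        if self.count == self.size:
            return None      # full: a structural hazard, so issue stalls
        index = self.tail
        self.slots[index] = instr
        self.tail = (self.tail + 1) % self.size
        self.count += 1
        return index

    def retire(self):
        """Free the head entry (in-order commit) and return its instruction."""
        if self.count == 0:
            raise IndexError("retire from an empty ROB")
        instr = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % self.size
        self.count -= 1
        return instr


TAG_BITS = (16 - 1).bit_length()   # 4 bits name one of 16 ROB entries
REG_BITS = (64 - 1).bit_length()   # 6 bits name one of 64 registers (the comparator width)

rob = CircularROB(16)
tags = [rob.allocate(f"i{n}") for n in range(17)]
print(tags[-1])                            # None: the 17th instruction stalls
print(rob.retire(), rob.allocate("i16"))   # i0 0: the freed slot is reused (wrap-around)
print(TAG_BITS, REG_BITS)                  # 4 6
```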


Code Example
Loop: LD R2, 0(R1)
DADDIU R2, R2, #1
SD R2, 0(R1)
DADDIU R1, R1, #4
BNE R2, R3, Loop
How would this code be executed?
Inst | Issue | Exec | Memory read | Write results | Commit
LD   |       |      |             |               |
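One possible way to fill in the table, as a sketch under simplifying assumptions that the slides do not state: one issue per cycle in program order; unlimited reservation stations, ROB entries, and CDBs; every operation executes in one cycle; a consumer executes the cycle after its producer writes the CDB; loads read memory the cycle after address calculation; stores and branches write nothing on the CDB; one in-order commit per cycle; and the branch is predicted correctly. Different assumptions give different cycle numbers.

```python
# Each entry: (text, kind, destination register, source registers).
LOOP = [
    ("LD R2,0(R1)",     "load",   "R2", ["R1"]),
    ("DADDIU R2,R2,#1", "alu",    "R2", ["R2"]),
    ("SD R2,0(R1)",     "store",  None, ["R2", "R1"]),
    ("DADDIU R1,R1,#4", "alu",    "R1", ["R1"]),
    ("BNE R2,R3,Loop",  "branch", None, ["R2", "R3"]),
]


def schedule(program):
    """Return (text, issue, exec, mem_read, write, commit) per instruction; None = no such step."""
    cdb_cycle = {}                 # register -> CDB cycle of its latest in-flight producer
    rows, last_commit = [], 0
    for issue, (text, kind, dest, srcs) in enumerate(program, start=1):
        operands = max((cdb_cycle.get(r, 0) for r in srcs), default=0)
        execute = max(issue, operands) + 1          # after issue and after the CDB write
        mem = execute + 1 if kind == "load" else None
        if kind in ("load", "alu"):
            write = (mem or execute) + 1
            cdb_cycle[dest] = write                 # renaming: later readers wait on this writer
            done = write
        else:                                       # stores and branches: no CDB write
            write, done = None, execute
        last_commit = max(done, last_commit) + 1    # in order, one per cycle
        rows.append((text, issue, execute, mem, write, last_commit))
    return rows


header = ("Issue", "Exec", "Mem", "Write", "Commit")
print(f"{'Inst':18}" + "".join(f"{h:>7}" for h in header))
for text, *cycles in schedule(LOOP * 2):            # two iterations, branch predicted taken
    print(f"{text:18}" + "".join(f"{('-' if c is None else c):>7}" for c in cycles))
```

Under these assumptions, DADDIU R1 executes in cycle 5, before the SD that precedes it, yet commits after it; the SD still reads the old R1 because renaming gave the new R1 its own producer.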


Summary
Reservation stations: implicit register renaming to a
larger set of registers + buffering of source operands
Prevents registers from becoming a bottleneck
Avoids the WAR and WAW hazards of the scoreboard

Not limited to basic blocks, compared with static
scheduling (the integer unit gets ahead, beyond
branches)
Today, it helps with cache misses as well
Don't stall on an L1 data cache miss
Can support memory-level parallelism

Lasting Contributions
Dynamic scheduling
Register renaming
Load/store disambiguation
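Load/store disambiguation can be sketched as a check a load performs against older, uncommitted stores before reading memory. This is a conservative scheme under stated assumptions: older stores are given in program order as (address, data) pairs, `None` marks an address or data value not yet computed, and forwarding a matching store's data to the load is an optional refinement assumed here, not something the slide specifies.

```python
def can_load_proceed(load_addr, older_stores):
    """Decide how a load may proceed past older, uncommitted stores.

    older_stores: (address or None, data or None) pairs in program order.
    Returns ("memory", None), ("forward", data), or ("wait", None).
    """
    for addr, data in reversed(older_stores):     # nearest older store first
        if addr is None:
            return ("wait", None)                  # unknown address: might conflict
        if addr == load_addr:
            if data is None:
                return ("wait", None)              # same address, data not produced yet
            return ("forward", data)               # take the value from the store
    return ("memory", None)                        # no older store to this address


print(can_load_proceed(100, [(200, 1)]))               # ('memory', None)
print(can_load_proceed(100, [(100, 7), (200, 1)]))     # ('forward', 7)
print(can_load_proceed(100, [(100, 7), (None, None)])) # ('wait', None)
```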

360/91 descendants are Pentium III; PowerPC 604;
MIPS R10000; HP PA-8000; Alpha 21264

Dynamic Scheduling: The Only Choice?
Most high-performance processors today are dynamically
scheduled superscalar processors
with deeper pipelines and n-way issue

Other alternatives to exploit instruction-level parallelism:
Statically scheduled superscalar
VLIW

Mixed effort: EPIC (Explicitly Parallel Instruction Computing)
Example: Intel Itanium processors

Why is dynamic scheduling so popular today?
Technology trends: increasing transistor budget, deeper pipelines, wide
issue


Advantages of HW (Tomasulo)
vs. SW (VLIW) Speculation
HW determines address conflicts
HW has better branch prediction
HW maintains a precise exception model
HW speculation works across multiple implementations
SW speculation makes HW design much easier

