Branch Prediction - 1: Computer Architecture: A Constructive Approach

Computer Architecture: A Constructive Approach
Branch Prediction - 1
Arvind
Computer Science & Artificial Intelligence Lab.
Massachusetts Institute of Technology
April 9, 2012 http://csg.csail.mit.edu/6.S078 L16-1

Control Flow Penalty
Modern processors may have > 10 pipeline stages between next PC calculation and branch resolution!
How much work is lost if the pipeline doesn't follow the correct instruction flow?
~ Loop length x pipeline width
[Figure: pipeline stages PC -> Fetch (I-cache) -> Fetch Buffer -> Decode -> Issue Buffer -> Execute (Func. Units) -> Result Buffer -> Commit -> Arch. State. "Next fetch started" is marked at the PC and "Branch executed" at the Execute stage.]

Average Run-Length between Branches
Average dynamic instruction mix from SPEC92:

            SPECint92   SPECfp92
ALU           39 %        13 %
FPU Add        -          20 %
FPU Mult       -          13 %
load          26 %        23 %
store          9 %         9 %
branch        16 %         8 %
other         10 %        12 %

SPECint92: compress, eqntott, espresso, gcc, li
SPECfp92: doduc, ear, hydro2d, mdljdp2, su2cor
What is the average run-length between branches?
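
One way to answer the question is to invert the branch fraction from the instruction mix: if a fraction f of dynamic instructions are branches, a branch occurs on average once every 1/f instructions. The short Python sketch below does this arithmetic; the function name `average_run_length` is made up for illustration.

```python
# Branch fractions taken from the SPEC92 instruction-mix table.
BRANCH_FRACTION = {"SPECint92": 0.16, "SPECfp92": 0.08}


def average_run_length(branch_fraction: float) -> float:
    """Average number of instructions per branch, counting the branch itself."""
    return 1.0 / branch_fraction


for suite, fraction in BRANCH_FRACTION.items():
    length = average_run_length(fraction)
    print(f"{suite}: one branch every {length:.2f} instructions")
```

So integer code sees a branch roughly every 6 instructions and floating-point code roughly every 12 or 13.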

MIPS Branches and Jumps
Each instruction fetch depends on one or two
pieces of information from the preceding
instruction:
1. Is the preceding instruction a taken branch?
2. If so, what is the target address?

Instruction    Taken known?          Target known?
J              After Inst. Decode    After Inst. Decode
JR             After Inst. Decode    After Reg. Fetch
BEQZ/BNEZ      After Exec            After Inst. Decode

Currently our simple pipelined
architecture does very simple
branch prediction
What is it?
Branch is predicted not taken: pc, pc+4, pc+8, ...
Can we do better?

Branch Prediction Bits
Assume 2 BP bits per instruction
Use saturating counter
On taken: move up one state; on not-taken: move down one state
1 1 Strongly taken
1 0 Weakly taken
0 1 Weakly not-taken
0 0 Strongly not-taken
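
The 2-bit saturating counter can be sketched in a few lines of Python. This is a behavioral model only, not the hardware description; the names `update`, `predict_taken`, and `count_mispredictions` are illustrative, and the initial state is an assumption.

```python
STATE_NAMES = {
    0b11: "strongly taken",
    0b10: "weakly taken",
    0b01: "weakly not-taken",
    0b00: "strongly not-taken",
}


def update(counter: int, taken: bool) -> int:
    """Move one state toward the outcome, saturating at 0b00 and 0b11."""
    if taken:
        return min(counter + 1, 0b11)
    return max(counter - 1, 0b00)


def predict_taken(counter: int) -> bool:
    """The high bit of the counter is the prediction."""
    return counter >= 0b10


def count_mispredictions(outcomes, counter: int = 0b11) -> int:
    """Replay a sequence of branch outcomes (assumed start: strongly taken)."""
    wrong = 0
    for taken in outcomes:
        if predict_taken(counter) != taken:
            wrong += 1
        counter = update(counter, taken)
    return wrong


# A loop branch: taken three times, then falls through, repeated twice.
loop_outcomes = [True, True, True, False] * 2
print(count_mispredictions(loop_outcomes))
```

With two bits, the single not-taken outcome at each loop exit only moves the counter from strongly taken to weakly taken, so the branch is still predicted taken when the loop is re-entered.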

Branch History Table (BHT)
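[Figure: k bits of the fetch PC (above the two low-order 00 bits) form the BHT index into a 2^k-entry BHT with 2 bits/entry, while the I-Cache supplies the instruction. The opcode answers "Branch?", PC + offset gives "Target PC", and the BHT entry answers "Taken/Not taken?".]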

4K-entry BHT, 2 bits/entry, ~80-90% correct predictions
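
A BHT of this shape can be modeled as an array of 2-bit counters indexed by k bits of the PC. The Python sketch below is an illustrative model, not the lecture's implementation; the class name, the word-aligned index function, and the initial counter value are assumptions.

```python
class BranchHistoryTable:
    """2^k two-bit saturating counters, indexed by PC bits above the low 00."""

    def __init__(self, index_bits: int = 12):  # 12 bits -> 4K entries
        self.mask = (1 << index_bits) - 1
        # Assumption: every counter starts at 0b01 (weakly not-taken).
        self.counters = [0b01] * (1 << index_bits)

    def _index(self, pc: int) -> int:
        return (pc >> 2) & self.mask  # drop the two always-zero low bits

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 0b10

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, 0b11)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0b00)


bht = BranchHistoryTable(index_bits=4)
for _ in range(3):
    print(bht.predict(0x40))  # False on the first pass, True afterwards
    bht.update(0x40, taken=True)
```

Because there are no tags, two branches whose index bits match (here 0x40 and 0x80 with a 4-bit index) share one counter and can disturb each other's predictions.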
Where does BHT fit in the
processor pipeline?
BHT can only be used after instruction decode
What should we do at the fetch stage?
Need a mechanism to update the BHT
Where does the update information come from?

Overview of branch prediction
Best predictors reflect program behavior
[Figure: pipeline PC -> Next Addr Pred -> Decode -> Reg Read -> Execute, with feedback paths labeled BP, JMP, Ret.]

Stage            Information available                            Feedback
Next Addr Pred   Need next PC immediately                         Tight loop
Decode           Instr type, PC relative targets available        Loose loop
Reg Read         Simple conditions, register targets available    Loose loop
Execute          Complex conditions available                     Loose loop

Next Address Predictor (NAP)
first attempt
[Figure: k bits of the PC index a Branch Target Buffer (2^k entries) in parallel with iMem; each entry holds a predicted target and BP bits (BPb).]
BP bits are stored with the predicted target address.

IF stage: nPC = if (BP == taken) then target else pc+4
later: check prediction; if wrong, kill the instruction and update BTB & BPb, else update BPb

Address Collisions
Assume a 128-entry NAP
Instruction memory:
  132   Jump 100
  1028  Add .....
NAP entry: target = 236, BPb = take
What will be fetched after the instruction at 1028?
NAP prediction = 236
Correct target = 1032
=> kill PC=236 and fetch PC=1032
Is this a common occurrence?
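
The collision can be reproduced with a small untagged table. The slide does not state how the NAP is indexed; this Python sketch assumes the index is simply pc mod 128, under which 132 and 1028 both map to entry 4. The class and method names are illustrative.

```python
class UntaggedNAP:
    """Next-address predictor with no tags: any PC that maps to an entry uses it."""

    def __init__(self, entries: int = 128):
        self.entries = entries
        self.target = [None] * entries
        self.taken = [False] * entries

    def _index(self, pc: int) -> int:
        return pc % self.entries  # assumption: index = pc mod table size

    def predict(self, pc: int) -> int:
        i = self._index(pc)
        if self.taken[i] and self.target[i] is not None:
            return self.target[i]
        return pc + 4

    def update(self, pc: int, target: int, taken: bool) -> None:
        i = self._index(pc)
        self.target[i] = target
        self.taken[i] = taken


nap = UntaggedNAP(entries=128)
nap.update(132, 236, taken=True)  # "132: Jump 100" trains entry 4
print(nap.predict(1028))          # the Add at 1028 also hits entry 4
```

The Add at 1028 inherits the jump's prediction (236) even though its real successor is 1032, so the wrongly fetched instruction has to be killed.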

Can we avoid these bubbles?
Use NAP for Control Instructions only
NAP contains useful information for branch and
jump instructions only
Do not update it for other instructions
For all other instructions the next PC is (PC)+4 !
How to achieve this effect without decoding the instruction?

Branch Target Buffer (BTB)
a special form of NAP
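[Figure: the PC goes to the I-Cache and, in parallel, indexes a 2^k-entry direct-mapped BTB. Each entry holds an entry PC, a valid bit, and a predicted target PC. The entry PC is compared (=) with the fetch PC to produce match; valid and target are read out alongside.]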
Keep the (pc, predicted pc) in the BTB
pc+4 is predicted if no pc match is found
BTB is updated only for branches and jumps
Permits nextPC to be determined before instruction is decoded

Consulting BTB Before Decoding
Instruction memory:
  132   Jump 100
  1028  Add .....
BTB entry: entry PC = 132, target = 236, BPb = take
The match for pc = 1028 fails and 1028+4 is fetched
eliminates false predictions after ALU instructions
BTB contains entries only for control transfer instructions
more room to store branch targets
Even very small BTBs are very effective
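
To see why the tag comparison removes the false prediction, here is a minimal Python model of a direct-mapped BTB. It is a sketch under stated assumptions: the index is the word address modulo the table size, and a 32-entry table is chosen so that 132 and 1028 still share an entry. All names are illustrative.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: each entry stores the full PC as a tag plus a target."""

    def __init__(self, entries: int = 32):
        self.entries = entries
        self.tag = [None] * entries  # None plays the role of an invalid entry
        self.target = [0] * entries

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.entries  # assumption: word-address index

    def predict(self, pc: int) -> int:
        i = self._index(pc)
        # Use the stored target only when the entry belongs to this PC.
        return self.target[i] if self.tag[i] == pc else pc + 4

    def update(self, pc: int, target: int) -> None:
        i = self._index(pc)
        self.tag[i] = pc
        self.target[i] = target


btb = BranchTargetBuffer(entries=32)
btb.update(132, 236)      # only control instructions are entered
print(btb.predict(132))   # tag matches: predicted target
print(btb.predict(1028))  # same entry, tag mismatch: falls back to pc + 4
```

Both addresses map to entry 1 here, but the lookup for 1028 fails the tag match, so the ALU instruction is followed by 1032 without any bubble.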

Observations
There is a plethora of branch prediction
schemes; their importance grows with the
depth of the processor pipeline
Processors often use more than one prediction
scheme
It is usually easy to understand the data
structures required to implement a particular
scheme
It takes considerably more effort to understand
how a particular scheme with its lookup and
updates is integrated in the pipeline and how
various schemes interact with each other

Plan
We will begin with a very simple 2-stage pipeline
and integrate a simple BTB scheme in it
We will extend the design to a multistage
pipeline and integrate at least one more
predictor, say BHT, in the pipeline (next lecture)

First, revisit the simple two-stage pipeline without
branch prediction

Decoupled Fetch and Execute
[Figure: Fetch -> ir -> Execute carries <instructions, pc, epoch>; Execute -> nextPC -> Fetch carries <updated pc>.]
Fetch sends instructions to Execute along with
pc and other control information
Execute sends information about the target pc
to Fetch, which updates pc and other control
registers whenever it looks at the nextPC fifo

A solution using epoch
Add fEpoch and eEpoch registers to the
processor state; initialize them to the same
value
The epoch changes whenever Execute
determines that the pc prediction is wrong.
This change is reflected immediately in eEpoch
and eventually in fEpoch via nextPC FIFO
Associate the fEpoch with every instruction
when it is fetched
In the execute stage, reject, i.e., kill, the
instruction if its epoch does not match eEpoch
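
The epoch mechanism can be traced with a small cycle-by-cycle Python model. This is an illustrative sketch, not the lecture's code: the program is reduced to a map from a jump's pc to its target, every listed jump is treated as taken, and the nextPC FIFO is assumed to behave like a bypass FIFO, so Fetch sees a redirect in the same cycle Execute produces it.

```python
from collections import deque


def simulate(jumps, start_pc, cycles):
    """Return (executed_pcs, killed_pcs) for a two-stage epoch-based pipeline.

    jumps maps the pc of a taken jump to its target; all other pcs fall through.
    """
    pc, f_epoch, e_epoch = start_pc, False, False
    ir = deque()       # Fetch -> Execute: (pc, epoch)
    next_pc = deque()  # Execute -> Fetch: redirect target
    executed, killed = [], []
    for _ in range(cycles):
        # Execute stage: handle the oldest fetched instruction, if any.
        if ir:
            ipc, epoch = ir.popleft()
            if epoch == e_epoch:
                executed.append(ipc)
                target = jumps.get(ipc)
                if target is not None:     # pc+4 guess was wrong
                    next_pc.append(target)
                    e_epoch = not e_epoch  # changes immediately
            else:
                killed.append(ipc)         # wrong-path instruction
        # Fetch stage: tag the instruction with fEpoch, then pick the next pc.
        ir.append((pc, f_epoch))
        if next_pc:
            pc = next_pc.popleft()
            f_epoch = not f_epoch          # changes when the redirect arrives
        else:
            pc += 4
    return executed, killed


executed, killed = simulate({8: 100}, start_pc=0, cycles=6)
print(executed, killed)
```

In this trace the instruction at pc 12 is fetched with the old epoch in the same cycle the jump at pc 8 resolves, so Execute discards it; instructions fetched from pc 100 onward carry the new epoch and are accepted.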

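Two-Stage pipeline
A robust two-rule solution
[Figure: PC (+4) -> Inst Memory -> ir (Pipeline FIFO) -> Decode -> Execute, with the Register File and Data Memory attached to Decode/Execute. Execute sends redirects back to the PC through nextPC (Bypass FIFO). fEpoch sits on the fetch side and eEpoch on the execute side.]
Either fifo can be a normal (>1 element) fifo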

Two-stage pipeline
Decoupled
module mkProc(Proc);
  Reg#(Addr) pc <- mkRegU;
  RFile rf <- mkRFile;
  IMemory iMem <- mkIMemory;
  DMemory dMem <- mkDMemory;
  PipeReg#(TypeFetch2Decode) ir <- mkPipeReg;
  Reg#(Bool) fEpoch <- mkReg(False);
  Reg#(Bool) eEpoch <- mkReg(False);
  FIFOF#(Addr) nextPC <- mkBypassFIFOF;

  rule doFetch (ir.notFull); // explicit guard
    let inst = iMem(pc);
    ir.enq(TypeFetch2Decode
      {pc:pc, epoch:fEpoch, inst:inst});
    if (nextPC.notEmpty) begin
      // Execute reported a redirect: take it and switch epoch
      pc <= nextPC.first; fEpoch <= !fEpoch; nextPC.deq;
    end
    else pc <= pc + 4; // simple branch prediction
  endrule

Two-stage pipeline
Decoupled (cont.)
  rule doExecute (ir.notEmpty);
    let irpc = ir.first.pc; let inst = ir.first.inst;
    if (ir.first.epoch==eEpoch) begin
      let eInst = decodeExecute(irpc, inst, rf);
      let memData <- dMemAction(eInst, dMem);
      regUpdate(eInst, memData, rf);
      if (eInst.brTaken) begin
        nextPC.enq(eInst.addr);
        eEpoch <= !eEpoch;
      end
    end
    ir.deq;
  endrule
endmodule

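Two-Stage pipeline with a Branch Predictor
[Figure: the two-stage pipeline with a Branch Predictor at the PC. The predicted pc (ppc) travels with the instruction through ir to Decode/Execute; Execute sends corrections back through nextPC. fEpoch, eEpoch, Register File, Inst Memory, and Data Memory are as before.]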

Branch Predictor Interface
interface NextAddressPredictor;
  method Addr prediction(Addr pc);
  method Action update(Addr pc, Addr target);
endinterface

Null Branch Prediction
module mkNeverTaken(NextAddressPredictor);
  method Addr prediction(Addr pc);
    return pc+4;
  endmethod
  method Action update(Addr pc, Addr target);
    noAction;
  endmethod
endmodule
Replaces PC+4 with

Already implemented in the pipeline
Right most of the time
Why?

Branch Target Prediction (BTB)
module mkBTB(NextAddressPredictor);
  RegFile#(LineIdx, Addr) tagArr <- mkRegFileFull;
  RegFile#(LineIdx, Addr) targetArr <- mkRegFileFull;
  method Addr prediction(Addr pc);
    LineIdx index = truncate(pc >> 2);
    let tag = tagArr.sub(index);
    let target = targetArr.sub(index);
    // use the stored target only if the entry belongs to this pc
    if (tag==pc) return target; else return (pc+4);
  endmethod
  method Action update(Addr pc, Addr target);
    LineIdx index = truncate(pc >> 2);
    tagArr.upd(index, pc);
    targetArr.upd(index, target);
  endmethod
endmodule

Two-stage pipeline + BP
module mkProc(Proc);
  Reg#(Addr) pc <- mkRegU;
  RFile rf <- mkRFile;
  IMemory iMem <- mkIMemory;
  DMemory dMem <- mkDMemory;
  PipeReg#(TypeFetch2Decode) ir <- mkPipeReg;
  Reg#(Bool) fEpoch <- mkReg(False);
  Reg#(Bool) eEpoch <- mkReg(False);
  FIFOF#(Tuple2#(Addr,Addr)) nextPC <- mkBypassFIFOF;
  NextAddressPredictor bpred <- mkNeverTaken; // some target predictor

The definition of TypeFetch2Decode is changed to include the predicted pc:
typedef struct {
  Addr pc; Addr ppc; Bool epoch; Data inst;
} TypeFetch2Decode deriving (Bits, Eq);

Fetch rule
  rule doFetch (ir.notFull);
    let ppc = bpred.prediction(pc);
    let inst = iMem(pc);
    ir.enq(TypeFetch2Decode
      {pc:pc, ppc:ppc, epoch:fEpoch, inst:inst});
    if (nextPC.notEmpty) begin
      match {.ipc, .ippc} = nextPC.first;
      pc <= ippc; fEpoch <= !fEpoch; nextPC.deq;
      bpred.update(ipc, ippc);
    end
    else pc <= ppc;
  endrule

Execute rule
  rule doExecute (ir.notEmpty);
    let irpc = ir.first.pc; let inst = ir.first.inst;
    let irppc = ir.first.ppc;
    if (ir.first.epoch==eEpoch) begin
      let eInst = decodeExecute(irpc, irppc, inst, rf);
      let memData <- dMemAction(eInst, dMem);
      regUpdate(eInst, memData, rf);
      if (eInst.missPrediction) begin
        nextPC.enq(tuple2(irpc,
          eInst.brTaken ? eInst.addr : irpc+4));
        eEpoch <= !eEpoch;
      end
    end
    ir.deq;
  endrule
endmodule

Execute Function
function ExecInst exec(DecodedInst dInst, Data rVal1,
                       Data rVal2, Addr pc, Addr ppc);
  ExecInst einst = ?;
  let aluVal2 = (dInst.immValid) ? dInst.imm : rVal2;
  let aluRes = alu(rVal1, aluVal2, dInst.aluFunc);
  let brAddr = brAddrCal(pc, rVal1, dInst.iType,
                         dInst.imm);
  einst.itype = dInst.iType;
  einst.addr = memType(dInst.iType) ? aluRes : brAddr;
  einst.data = dInst.iType==St ? rVal2 : aluRes;
  einst.brTaken = aluBr(rVal1, aluVal2, dInst.brComp);
  // mispredicted if the predicted pc differs from the actual next pc
  einst.missPrediction = einst.brTaken ? brAddr!=ppc :
                                         (pc+4)!=ppc;
  einst.rDst = dInst.rDst;
  return einst;
endfunction
April 7, 2012 http://csg.csail.mit.edu/6.s078Rev L7-29
