
Computer Architecture: A Constructive Approach

Branch Prediction - 1

Arvind
Computer Science & Artificial Intelligence Lab.
Massachusetts Institute of Technology

April 9, 2012 http://csg.csail.mit.edu/6.S078 L16-1


Control Flow Penalty
Modern processors may have more than 10 pipeline stages between next-PC calculation and branch resolution!
How much work is lost if the pipeline doesn't follow the correct instruction flow?
~ Loop length x pipeline width
[Figure: pipeline from "Next fetch started" to "Branch executed": PC, Fetch, I-cache, Fetch Buffer, Decode, Issue Buffer, Execute, Func. Units, Result Buffer, Commit, Arch. State]
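
To make the "loop length x pipeline width" estimate concrete, here is a minimal sketch. The function name and the sample machine (10 stages in the loop, 4 instructions per cycle) are assumptions chosen for illustration, not figures from the lecture.

```python
def lost_slots(loop_length: int, pipeline_width: int) -> int:
    """Rough estimate of instruction slots discarded by one misprediction."""
    return loop_length * pipeline_width


# Hypothetical machine: 10 stages between next-PC calculation and
# branch resolution, 4 instructions fetched per cycle.
slots = lost_slots(10, 4)
print(f"about {slots} instruction slots lost per misprediction")
```

The estimate is an upper bound: it assumes every stage in the loop is full of wrong-path instructions when the branch resolves.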

Average Run-Length between Branches
Average dynamic instruction mix from SPEC92:

            SPECint92   SPECfp92
ALU           39 %        13 %
FPU Add        -          20 %
FPU Mult       -          13 %
load          26 %        23 %
store          9 %         9 %
branch        16 %         8 %
other         10 %        12 %

SPECint92: compress, eqntott, espresso, gcc, li
SPECfp92: doduc, ear, hydro2d, mdljdp2, su2cor

What is the average run-length between branches?
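
One way to answer the question is to invert the branch fraction from the table: if a fraction f of dynamic instructions are branches, a branch occurs on average once every 1/f instructions. The sketch below does that arithmetic; the function name is an assumption, and the result counts the branch itself as part of the run.

```python
# Branch fractions of the dynamic instruction mix, from the SPEC92 table.
branch_fraction = {"SPECint92": 0.16, "SPECfp92": 0.08}


def run_length(fraction: float) -> float:
    """Average number of instructions per branch, the branch included."""
    return 1.0 / fraction


for suite, frac in branch_fraction.items():
    print(f"{suite}: one branch every {run_length(frac):.2f} instructions")
```

This gives roughly 6.25 instructions per branch for SPECint92 and 12.5 for SPECfp92.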



MIPS Branches and Jumps
Each instruction fetch depends on one or two pieces of information from the preceding instruction:
1. Is the preceding instruction a taken branch?
2. If so, what is the target address?

Instruction   Taken known?         Target known?
J             After Inst. Decode   After Inst. Decode
JR            After Inst. Decode   After Reg. Fetch
BEQZ/BNEZ     After Exec           After Inst. Decode



Currently our simple pipelined architecture does very simple branch prediction
What is it?
Branch is predicted not taken: pc, pc+4, pc+8, ...

Can we do better?



Branch Prediction Bits
Assume 2 BP bits per instruction
Use a saturating counter: move up one state on taken, down one state on not taken

1 1   Strongly taken
1 0   Weakly taken
0 1   Weakly not taken
0 0   Strongly not taken
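
The four states can be modelled as a 2-bit saturating counter. This is a minimal behavioral sketch, not the lecture's hardware: the class name, the method names, and the choice to start in the strongly-taken state are assumptions.

```python
class TwoBitCounter:
    """2-bit saturating counter: 0b00/0b01 predict not taken, 0b10/0b11 predict taken."""

    def __init__(self, state: int = 0b11):
        self.state = state

    def predict_taken(self) -> bool:
        # The high bit of the counter is the prediction.
        return self.state >= 0b10

    def update(self, taken: bool) -> None:
        # Increment on taken, decrement on not taken, saturating at 0b00 and 0b11.
        if taken:
            self.state = min(0b11, self.state + 1)
        else:
            self.state = max(0b00, self.state - 1)


counter = TwoBitCounter(0b11)
outcomes = [True, True, False, True, True]  # hypothetical branch history
predictions = []
for taken in outcomes:
    predictions.append(counter.predict_taken())
    counter.update(taken)

mispredictions = sum(p != t for p, t in zip(predictions, outcomes))
print(f"{mispredictions} misprediction(s) in {len(outcomes)} branches")
```

A single not-taken outcome only moves the counter from strongly taken to weakly taken, so the prediction stays "taken" for the next branch. That hysteresis is the point of using two bits instead of one.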



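Branch History Table (BHT)
[Figure: the fetch PC addresses the I-Cache; k bits of the PC above the two low-order "00" bits form the BHT index into a 2^k-entry BHT with 2 bits/entry. The instruction's opcode answers "Branch?", the PC plus offset gives the target PC, and the BHT entry answers "Taken/not taken?"]

4K-entry BHT, 2 bits/entry, ~80-90% correct predictions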
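
A BHT can be sketched as an array of 2-bit saturating counters indexed by PC bits. The model below is an illustration under assumptions: the class and helper names, the initial counter value (weakly not taken), and the sample loop branch are all hypothetical. The index uses the k PC bits above the two low-order zero bits.

```python
class BranchHistoryTable:
    """2^k two-bit saturating counters, indexed by the PC bits above the low '00'."""

    def __init__(self, k: int = 12, initial: int = 0b01):
        self.mask = (1 << k) - 1
        self.counters = [initial] * (1 << k)

    def _index(self, pc: int) -> int:
        return (pc >> 2) & self.mask

    def predict_taken(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 0b10

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(0b11, self.counters[i] + 1)
        else:
            self.counters[i] = max(0b00, self.counters[i] - 1)


def count_correct(bht: BranchHistoryTable, pc: int, outcomes: list) -> int:
    """Predict, compare with the real outcome, then train the table."""
    correct = 0
    for taken in outcomes:
        if bht.predict_taken(pc) == taken:
            correct += 1
        bht.update(pc, taken)
    return correct


# Hypothetical loop branch at pc 0x40: taken 9 times, then not taken; loop run 3 times.
outcomes = ([True] * 9 + [False]) * 3
bht = BranchHistoryTable(k=12)  # 4K entries
correct = count_correct(bht, 0x40, outcomes)
print(f"{correct} of {len(outcomes)} predictions correct")
```

On this made-up trace the table mispredicts the first iteration (while it warms up) and each loop exit, which is the typical behavior of a 2-bit scheme on loop branches.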

Where does BHT fit in the processor pipeline?
BHT can only be used after instruction decode
What should we do at the fetch stage?
Need a mechanism to update the BHT
Where does the update information come from?



Overview of branch prediction
Best predictors reflect program behavior
[Figure: pipeline PC -> Next Addr Pred -> Decode -> Reg Read -> Execute, with BP, JMP, and Ret feedback paths to the PC]
Next Addr Pred: need next PC immediately (tight loop)
Decode: instr type, PC relative targets available (loose loop)
Reg Read: simple conditions, register targets available (loose loop)
Execute: complex conditions available (loose loop)



Next Address Predictor (NAP)
first attempt
[Figure: k bits of the PC index a Branch Target Buffer (2^k entries) alongside iMem; each entry holds a predicted target and BP bits (BPb)]

BP bits are stored with the predicted target address.

IF stage: nPC = if (BP = taken) then target else pc+4
later: check prediction; if wrong, kill the instruction and update BTB & BPb, else update BPb

Address Collisions
Assume a 128-entry NAP

Instruction memory:
 132   Jump +100
1028   Add .....

NAP entry: target = 236, BPb = take

What will be fetched after the instruction at 1028?
NAP prediction = 236
Correct target = 1032
=> kill PC=236 and fetch PC=1032

Is this a common occurrence?
Can we avoid these bubbles?
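
The collision can be reproduced with a small untagged model. The slide does not say how the NAP index is formed; the sketch assumes it is the low 7 bits of the byte address (pc mod 128), an assumption picked because it makes 132 and 1028 share an entry (1028 - 132 = 896 = 7 x 128). The class and method names are hypothetical.

```python
ENTRIES = 128


class UntaggedNAP:
    """Next-address predictor without tags: any pc that maps to an entry uses it."""

    def __init__(self):
        self.target = [0] * ENTRIES
        self.taken = [False] * ENTRIES

    def index(self, pc: int) -> int:
        # Assumption: index with the low 7 bits of the byte address.
        return pc % ENTRIES

    def update(self, pc: int, target: int, taken: bool) -> None:
        i = self.index(pc)
        self.target[i] = target
        self.taken[i] = taken

    def predict(self, pc: int) -> int:
        i = self.index(pc)
        return self.target[i] if self.taken[i] else pc + 4


nap = UntaggedNAP()
nap.update(132, 236, True)     # the jump at 132 trains its entry
predicted = nap.predict(1028)  # the Add at 1028 aliases to the same entry
print(f"predicted next pc after 1028: {predicted}, correct next pc: {1028 + 4}")
```

Because the entry carries no record of which pc wrote it, the Add at 1028 inherits the jump's prediction and the fetch of 236 has to be killed.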

Use NAP for Control Instructions only
NAP contains useful information for branch and jump instructions only
Do not update it for other instructions
For all other instructions the next PC is (PC)+4 !
How to achieve this effect without decoding the instruction?



Branch Target Buffer (BTB)
a special form of NAP
[Figure: the PC indexes the I-Cache and a 2^k-entry direct-mapped BTB; each BTB entry holds Entry PC, Valid, and predicted target PC; the entry PC is compared (=) with the fetch PC to produce match, valid, and target]

Keep the (pc, predicted pc) in the BTB
pc+4 is predicted if no pc match is found
BTB is updated only for branches and jumps
Permits nextPC to be determined before the instruction is decoded

Consulting BTB Before Decoding
Instruction memory:
 132   Jump +100
1028   Add .....

BTB entry: entry PC = 132, target = 236, BPb = take

The match for pc = 1028 fails and 1028+4 is fetched
  eliminates false predictions after ALU instructions
BTB contains entries only for control transfer instructions
  more room to store branch targets

Even very small BTBs are very effective
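
The effect of the entry-PC match can be sketched as a direct-mapped table that stores the full pc as a tag. This is an illustrative model only: the class name, the 128-entry size, and the indexing by pc mod 128 are assumptions, and the BP bits are left out to keep the example short.

```python
class BTB:
    """Direct-mapped branch target buffer with a valid bit and a full-pc tag."""

    def __init__(self, entries: int = 128):
        self.entries = entries
        self.valid = [False] * entries
        self.tag = [0] * entries
        self.target = [0] * entries

    def _index(self, pc: int) -> int:
        # Assumption: index with the low bits of the byte address.
        return pc % self.entries

    def predict(self, pc: int) -> int:
        i = self._index(pc)
        if self.valid[i] and self.tag[i] == pc:
            return self.target[i]  # hit: this pc is a known control instruction
        return pc + 4              # miss: predict fall-through

    def update(self, pc: int, target: int) -> None:
        # Called only for branches and jumps.
        i = self._index(pc)
        self.valid[i] = True
        self.tag[i] = pc
        self.target[i] = target


btb = BTB()
btb.update(132, 236)  # the jump at 132
print(btb.predict(132), btb.predict(1028))
```

Both addresses still map to the same entry, but the tag comparison fails for 1028, so the predictor falls back to 1028+4 instead of reusing the jump's target.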



Observations
There is a plethora of branch prediction schemes; their importance grows with the depth of the processor pipeline
Processors often use more than one prediction scheme
It is usually easy to understand the data structures required to implement a particular scheme
It takes considerably more effort to understand how a particular scheme, with its lookup and updates, is integrated in the pipeline and how various schemes interact with each other



Plan
We will begin with a very simple 2-stage pipeline and integrate a simple BTB scheme in it
We will extend the design to a multistage pipeline and integrate at least one more predictor, say BHT, in the pipeline (next lecture)

Revisit the simple two-stage pipeline without branch prediction



Decoupled Fetch and Execute
[Figure: Fetch sends <instructions, pc, epoch> to Execute through ir; Execute sends <updated pc> back to Fetch through nextPC]

Fetch sends instructions to Execute along with pc and other control information
Execute sends information about the target pc to Fetch, which updates pc and other control registers whenever it looks at the nextPC fifo



A solution using epoch
Add fEpoch and eEpoch registers to the processor state; initialize them to the same value
The epoch changes whenever Execute determines that the pc prediction is wrong. This change is reflected immediately in eEpoch and eventually in fEpoch via the nextPC FIFO
Associate the fEpoch with every instruction when it is fetched
In the execute stage, reject, i.e., kill, the instruction if its epoch does not match eEpoch
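
The epoch scheme can be traced with a small cycle-by-cycle model. This is a behavioral sketch under assumptions, not the lecture's design: the toy program, the instruction encoding as tuples, and the choice to run Execute before Fetch within a cycle (so a redirect is seen by Fetch in the same cycle) are all illustrative. Fetch always predicts pc+4.

```python
from collections import deque


def run(program: dict, start_pc: int, max_cycles: int = 50):
    """Simulate Fetch/Execute with epochs; return (executed pcs, killed pcs)."""
    pc = start_pc
    f_epoch = False
    e_epoch = False
    ir = deque()       # Fetch -> Execute: (pc, epoch, instruction)
    next_pc = deque()  # Execute -> Fetch: redirect targets
    executed, killed = [], []

    for _ in range(max_cycles):
        # Execute stage: kill instructions fetched under a stale epoch.
        if ir:
            ipc, epoch, inst = ir.popleft()
            if epoch == e_epoch:
                executed.append(ipc)
                if inst[0] == "halt":
                    break
                if inst[0] == "jump":
                    next_pc.append(inst[1])  # prediction (pc+4) was wrong
                    e_epoch = not e_epoch    # epoch changes immediately in Execute
            else:
                killed.append(ipc)

        # Fetch stage: tag the instruction with fEpoch, then pick the next pc.
        ir.append((pc, f_epoch, program.get(pc, ("nop",))))
        if next_pc:
            pc = next_pc.popleft()
            f_epoch = not f_epoch            # Fetch catches up with the new epoch
        else:
            pc = pc + 4

    return executed, killed


# Hypothetical program: the jump at 4 skips the instructions at 8 and 12.
program = {
    0: ("add",),
    4: ("jump", 16),
    8: ("add",),
    12: ("add",),
    16: ("add",),
    20: ("halt",),
}
executed, killed = run(program, 0)
print("executed:", executed)
print("killed:  ", killed)
```

In this trace the instruction at 8 is fetched before the redirect reaches Fetch, so it carries the old epoch and Execute discards it; the instruction at 16 is fetched under the new epoch and executes normally.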



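Two-Stage pipeline
A robust two-rule solution
[Figure: PC (+4) and Inst Memory feed ir, a pipeline FIFO, into Decode/Execute, which use the Register File and Data Memory; Execute sends redirects back to Fetch through nextPC, a bypass FIFO; fEpoch sits with Fetch and eEpoch with Execute]

Either fifo can be a normal (>1 element) fifo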



Two-stage pipeline
Decoupled
import FIFOF::*;
import SpecialFIFOs::*;  // provides mkBypassFIFOF

module mkProc(Proc);
  Reg#(Addr)  pc      <- mkRegU;
  RFile       rf      <- mkRFile;
  IMemory     iMem    <- mkIMemory;
  DMemory     dMem    <- mkDMemory;
  PipeReg#(TypeFetch2Decode) ir <- mkPipeReg;
  Reg#(Bool)  fEpoch  <- mkReg(False);
  Reg#(Bool)  eEpoch  <- mkReg(False);
  FIFOF#(Addr) nextPC <- mkBypassFIFOF;

  rule doFetch (ir.notFull);   // explicit guard
    let inst = iMem(pc);
    ir.enq(TypeFetch2Decode{pc:pc, epoch:fEpoch, inst:inst});
    if (nextPC.notEmpty) begin
      // Execute sent a redirect: take its pc and switch to the new epoch
      pc <= nextPC.first; fEpoch <= !fEpoch; nextPC.deq;
    end
    else pc <= pc + 4;         // simple branch prediction
  endrule

Two-stage pipeline
Decoupled cont
  rule doExecute (ir.notEmpty);
    let irpc = ir.first.pc; let inst = ir.first.inst;
    // execute only instructions fetched in the current epoch
    if (ir.first.epoch==eEpoch) begin
      let eInst = decodeExecute(irpc, inst, rf);
      let memData <- dMemAction(eInst, dMem);
      regUpdate(eInst, memData, rf);
      if (eInst.brTaken) begin
        // the pc+4 prediction was wrong: redirect Fetch and change epoch
        nextPC.enq(eInst.addr);
        eEpoch <= !eEpoch;
      end
    end
    ir.deq;   // wrong-epoch instructions are simply dropped
  endrule
endmodule



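Two-Stage pipeline with a Branch Predictor
[Figure: two-stage pipeline (PC, Inst Memory, ir, Decode, Execute, Register File, Data Memory, fEpoch, eEpoch, nextPC) with a Branch Predictor next to the PC; the predicted pc (ppc) travels with the instruction through ir to Execute]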



Branch Predictor Interface
interface NextAddressPredictor;
  method Addr prediction(Addr pc);
  method Action update(Addr pc, Addr target);
endinterface



Null Branch Prediction
module mkNeverTaken(NextAddressPredictor);
  method Addr prediction(Addr pc);
    return pc+4;
  endmethod

  method Action update(Addr pc, Addr target);
    noAction;
  endmethod
endmodule

Replaces PC+4 with


Already implemented in the pipeline
Right most of the time
Why?



Branch Target Prediction (BTB)
import RegFile::*;  // provides RegFile and mkRegFileFull

module mkBTB(NextAddressPredictor);
  RegFile#(LineIdx, Addr) tagArr    <- mkRegFileFull;
  RegFile#(LineIdx, Addr) targetArr <- mkRegFileFull;

  method Addr prediction(Addr pc);
    LineIdx index = truncate(pc >> 2);  // drop the byte offset, keep the low index bits
    let tag    = tagArr.sub(index);
    let target = targetArr.sub(index);
    // hit: the stored pc matches; miss: fall back to pc+4
    if (tag==pc) return target; else return (pc+4);
  endmethod

  method Action update(Addr pc, Addr target);
    LineIdx index = truncate(pc >> 2);
    tagArr.upd(index, pc);
    targetArr.upd(index, target);
  endmethod
endmodule

Two-stage pipeline + BP
module mkProc(Proc);
  Reg#(Addr)  pc     <- mkRegU;
  RFile       rf     <- mkRFile;
  IMemory     iMem   <- mkIMemory;
  DMemory     dMem   <- mkDMemory;
  PipeReg#(TypeFetch2Decode) ir <- mkPipeReg;
  Reg#(Bool)  fEpoch <- mkReg(False);
  Reg#(Bool)  eEpoch <- mkReg(False);
  FIFOF#(Tuple2#(Addr,Addr)) nextPC <- mkBypassFIFOF;
  NextAddressPredictor bpred <- mkNeverTaken;  // some target predictor

The definition of TypeFetch2Decode is changed to include predicted pc
typedef struct {
  Addr pc; Addr ppc; Bool epoch; Data inst;
} TypeFetch2Decode deriving (Bits, Eq);



Two-stage pipeline + BP
Fetch rule
  rule doFetch (ir.notFull);
    let ppc = bpred.prediction(pc);   // predicted next pc
    let inst = iMem(pc);
    ir.enq(TypeFetch2Decode{pc:pc, ppc:ppc, epoch:fEpoch, inst:inst});
    if (nextPC.notEmpty) begin
      // ipc: pc of the mispredicted instruction; ippc: its correct next pc
      match {.ipc, .ippc} = nextPC.first;
      pc <= ippc; fEpoch <= !fEpoch; nextPC.deq;
      bpred.update(ipc, ippc);        // train the predictor
    end
    else pc <= ppc;
  endrule



Two-stage pipeline + BP
Execute rule
  rule doExecute (ir.notEmpty);
    let irpc = ir.first.pc; let inst = ir.first.inst;
    let irppc = ir.first.ppc;
    if (ir.first.epoch==eEpoch) begin
      let eInst = decodeExecute(irpc, irppc, inst, rf);
      let memData <- dMemAction(eInst, dMem);
      regUpdate(eInst, memData, rf);
      if (eInst.missPrediction) begin
        // send (pc, correct next pc) to Fetch and change epoch
        nextPC.enq(tuple2(irpc,
          eInst.brTaken ? eInst.addr : irpc+4));
        eEpoch <= !eEpoch;
      end
    end
    ir.deq;
  endrule
endmodule

Execute Function
function ExecInst exec(DecodedInst dInst, Data rVal1,
                       Data rVal2, Addr pc, Addr ppc);
  ExecInst einst = ?;
  // second ALU operand: the immediate if present, else the register value
  let aluVal2 = (dInst.immValid) ? dInst.imm : rVal2;
  let aluRes = alu(rVal1, aluVal2, dInst.aluFunc);
  let brAddr = brAddrCal(pc, rVal1, dInst.iType, dInst.imm);
  einst.itype = dInst.iType;
  einst.addr = memType(dInst.iType) ? aluRes : brAddr;
  einst.data = dInst.iType==St ? rVal2 : aluRes;
  einst.brTaken = aluBr(rVal1, aluVal2, dInst.brComp);
  // mispredicted if the actual next pc differs from the predicted pc (ppc)
  einst.missPrediction = einst.brTaken ? brAddr!=ppc : (pc+4)!=ppc;
  einst.rDst = dInst.rDst;
  return einst;
endfunction
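
The misprediction test in the execute function can be checked in isolation. The sketch below is a behavioral restatement in Python, not the lecture's code: the function and parameter names are hypothetical stand-ins for einst.brTaken, brAddr, pc, and ppc.

```python
def actual_next_pc(pc: int, br_taken: bool, br_addr: int) -> int:
    """Next pc the instruction really produces: branch target or fall-through."""
    return br_addr if br_taken else pc + 4


def miss_prediction(pc: int, ppc: int, br_taken: bool, br_addr: int) -> bool:
    """True when the predicted pc (ppc) differs from the actual next pc."""
    return actual_next_pc(pc, br_taken, br_addr) != ppc


# Hypothetical cases for an instruction at pc = 132 whose branch target is 236.
cases = [
    # (ppc, br_taken) -> is it a misprediction?
    (136, False),  # predicted fall-through, not taken
    (136, True),   # predicted fall-through, but taken
    (236, True),   # predicted the target, and taken
    (236, False),  # predicted the target, but not taken
]
results = [miss_prediction(132, ppc, taken, 236) for ppc, taken in cases]
print(results)
```

On a misprediction, the value Execute enqueues for Fetch is the same actual next pc computed here, which is why the Execute rule sends `eInst.brTaken ? eInst.addr : irpc+4`.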
