ARM Pipelining

Lecture 5 - ARM Organization and Implementation - ICE 1222/2342
Fall, 2008
Daeyoung Kim kimd@icu.ac.kr http://resl.icu.ac.kr/~kimd
Contents
3-stage pipeline ARM organization & implementation
5-stage pipeline ARM organization & implementation
3-stage pipeline ARM Organization

ARM processors up to ARM7
[Figure: 3-stage pipeline ARM organization. Blocks: address register, incrementer, register bank (including the PC), multiply register, barrel shifter, ALU, data out register, data in register, instruction decode & control. Buses: A[31:0] (address), D[31:0] (data), A bus, B bus, ALU bus.]
3-stage pipeline
Fetch: the instruction is fetched from memory and placed in the instruction pipeline.
Decode: the instruction is decoded and the datapath control signals are prepared for the next cycle. The instruction owns the decode logic but not the datapath.
Execute: the instruction owns the datapath. The register bank is read, an operand is shifted, and the ALU result is generated and written back into a destination register.
ARM single-cycle instruction 3-stage pipeline operation

instruction                                       time ->
1             fetch    decode   execute
2                      fetch    decode   execute
3                               fetch    decode   execute
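The overlap of single-cycle instructions can be reproduced with a small model. This is only a sketch, assuming an ideal pipeline in which every instruction spends exactly one cycle in each stage; the names `pipeline_schedule` and `total_cycles` are illustrative and do not come from the lecture.

```python
STAGES = ("fetch", "decode", "execute")


def pipeline_schedule(n_instructions):
    """Return {cycle: {stage: instruction}} for an ideal 3-stage pipeline.

    Assumption: instruction i (1-based) enters fetch in cycle i and moves
    one stage per cycle, so it is in stage s (0-based) during cycle i + s.
    """
    schedule = {}
    for instr in range(1, n_instructions + 1):
        for s, stage in enumerate(STAGES):
            schedule.setdefault(instr + s, {})[stage] = instr
    return schedule


def total_cycles(n_instructions):
    # Pipeline fill (len(STAGES) - 1 cycles) plus one cycle per instruction.
    return n_instructions + len(STAGES) - 1


if __name__ == "__main__":
    for cycle, busy in sorted(pipeline_schedule(3).items()):
        print("cycle", cycle, busy)
```

Once the pipeline is full, one instruction completes per cycle, which is why three instructions finish in five cycles rather than nine.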
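ARM multi-cycle instruction 3-stage pipeline operation

[Figure: pipeline timing for the instruction sequence ADD, STR, ADD, ADD, ADD (instructions 1 to 5) against time. Each ADD passes through fetch, decode and execute. The STR passes through fetch, decode, calc. addr. and data xfer, so it occupies the datapath for two cycles and delays the instructions that follow it.]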
To achieve higher performance

Tprog = (Ninst x CPI) / fclk
where Tprog is the time to execute a program, Ninst is the number of instructions executed, CPI is the average number of clock cycles per instruction, and fclk is the clock frequency.
Increase the clock rate, fclk
    The logic in each pipeline stage must be simplified and, therefore, the number of pipeline stages increased.
Reduce the average number of clock cycles per instruction, CPI
    Instructions that occupy more than one pipeline slot are reimplemented to occupy fewer slots.
    Pipeline stalls caused by dependencies between instructions are reduced.
Memory bottleneck (von Neumann bottleneck)
    Deliver more than 32 bits per access.
    Separate instruction and data memory.
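To see how the two levers (fclk and CPI) trade off, the formula can be evaluated directly. This is a minimal sketch; the function name `program_time` and all the numbers below are made-up illustrations, not measurements of any ARM core.

```python
def program_time(n_inst, cpi, f_clk_hz):
    """Tprog = Ninst x CPI / fclk, in seconds."""
    return n_inst * cpi / f_clk_hz


if __name__ == "__main__":
    n_inst = 1_000_000  # hypothetical instruction count

    # Hypothetical design points: (label, CPI, clock frequency in Hz).
    designs = [
        ("baseline", 2.0, 25e6),
        ("faster clock only", 2.0, 50e6),
        ("faster clock and lower CPI", 1.5, 50e6),
    ]
    for label, cpi, f_clk in designs:
        t_ms = program_time(n_inst, cpi, f_clk) * 1e3
        print(f"{label}: {t_ms:.1f} ms")
```

Doubling the clock rate halves Tprog only if CPI does not get worse, which is why a deeper pipeline also needs measures that keep stalls down.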
ARM9TDMI 5-stage pipeline organization

Fetch: the instruction is fetched from memory and placed in the instruction pipeline.
Decode: the instruction is decoded and register operands are read.
Execute: an operand is shifted and the ALU result is generated. For a load or store, the memory address is calculated in the ALU.
Buffer/Data: data memory is accessed if required. Otherwise the ALU result is simply buffered.
Write-back: the result is written back to the register file.
[Figure: ARM9TDMI 5-stage pipeline organization. Fetch: next pc, +4 incrementer, I-cache, pc + 4. Decode: instruction decode, immediate fields, register read (r15 = pc + 8), mul, LDM/STM. Execute: shift, reg shift, ALU, pre-index, post-index, execute mux, forwarding paths; pc updates from B, BL, MOV pc and SUBS pc. Buffer/data: byte repl., load/store address, D-cache, rot/sgn ex; pc update from LDR pc. Write-back: register write.]
Data Forwarding
A major source of complexity in the 5-stage pipeline: instruction execution is spread across the stages.
Forwarding paths resolve data dependencies without stalling the pipeline.
Even with forwarding, a stall cannot always be avoided:
    LDR rN, [..]
    ADD r2, r1, rN
    A one-cycle stall is required, because rN is available only at the end of the buffer/data stage.
Use instruction-level scheduling: do not put a dependent instruction immediately after a load instruction.
[Figure: the ARM9TDMI 5-stage pipeline organization again, highlighting the forwarding paths.]
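The load-use rule can be checked mechanically. The sketch below counts the stalls a simple in-order pipeline would take, under the assumption that only an instruction immediately following an LDR and reading its destination stalls (one cycle), and that forwarding covers every other dependency. The tuple format and the name `load_use_stalls` are hypothetical.

```python
def load_use_stalls(program):
    """Count one-cycle load-use stalls.

    program: list of (opcode, destination, sources) tuples.
    Assumption: forwarding hides all dependencies except a use of a
    loaded register in the instruction directly after the LDR.
    """
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        opcode, dest, _ = prev
        if opcode == "LDR" and dest in curr[2]:
            stalls += 1
    return stalls


# Dependent ADD directly after the load: one stall.
unscheduled = [
    ("LDR", "r3", ("r0",)),
    ("ADD", "r2", ("r1", "r3")),
    ("SUB", "r5", ("r6", "r7")),
]

# Same work with the independent SUB moved into the load delay slot.
scheduled = [
    ("LDR", "r3", ("r0",)),
    ("SUB", "r5", ("r6", "r7")),
    ("ADD", "r2", ("r1", "r3")),
]

if __name__ == "__main__":
    print(load_use_stalls(unscheduled), load_use_stalls(scheduled))
```

Reordering is only legal when the moved instruction does not itself depend on the instructions it crosses; the example picks a SUB with unrelated registers for that reason.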
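Data Processing Instructions

[Figure: datapath activity for data processing instructions. (a) register - register operations: Rn and Rm are read from the register bank, the shift and ALU function are as specified by the instruction, and the result is written to Rd. (b) register - immediate operations: the second operand is the immediate field [7:0] taken from the instruction pipeline (i. pipe). In both cases the address register and incrementer update the PC.]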
Data Transfer Instructions (STR)

[Figure: datapath activity for a store. (a) 1st cycle - compute address: Rn is combined with the immediate offset [11:0] from the instruction pipeline (shifter set to lsl #0; ALU function A, A + B or A - B). (b) 2nd cycle - store data & auto-index: Rd is driven onto data out through the "byte?" logic while the ALU (A + B or A - B) computes the auto-indexed value for Rn.]
Immediate offset
If a byte is stored, it is replicated four times across the data bus; the lowest two bits of the address are used to select the proper byte.
Branch Instructions
[Figure: datapath activity for a branch. (a) 1st cycle - compute branch target: the offset field [23:0] from the instruction pipeline is shifted left by two (lsl #2) and added to the PC (ALU function A + B). (b) 2nd cycle - save return address: the PC value passes through the ALU (function A) into R14.]
ARM Implementation - 1
Clocking Scheme

Most ARMs do not operate with edge-sensitive registers.
The design is based around 2-phase non-overlapping clocks generated internally from a single input clock signal.
This allows the use of level-sensitive transparent latches.
Data movement is controlled by passing the data alternately through latches open during phase 1 and latches open during phase 2.
The non-overlapping property ensures that there are no race conditions.
[Figure: 2-phase non-overlapping clock waveforms, showing phase 1 and phase 2 within 1 clock cycle.]
Datapath Timing (1)

[Figure: datapath timing against phase 1 and phase 2. Labelled intervals and events: register read time, read bus valid, shift time, shift out valid, ALU operands latched, precharge invalidates buses, ALU time, ALU out, register write time.]
Datapath Timing (2)
The minimum datapath cycle time is the sum of:
    Register read time
    Shifter delay
    ALU delay
        Dominates the cycle time.
        Logical operations are relatively faster than arithmetic operations. Why?
    Register write set-up time
    Phase 2 to phase 1 non-overlap time
Adder Design 1
http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Comb/adder.htm
32-bit addition time has a significant effect on the datapath cycle time, and so it influences the maximum clock rate and the processor's performance.
The first ARM processor prototype
    Ripple-carry adder circuit
    The worst-case carry path is 32 gates long.
[Figure: one bit of the ripple-carry adder, with inputs A, B and Cin and outputs sum and Cout.]
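A bit-level model makes the long carry path visible: each bit position has to wait for the carry from the one below it. This is a behavioural sketch in Python, not the transistor-level circuit of the ARM prototype; `full_adder` and `ripple_add` are illustrative names.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry-out)."""
    s = a ^ b ^ cin
    cout = (a & b) | ((a ^ b) & cin)
    return s, cout


def ripple_add(a, b, cin=0, width=32):
    """Add two width-bit values by chaining full adders from bit 0 upward."""
    result = 0
    carry = cin
    for i in range(width):
        # The carry into bit i is the carry out of bit i - 1, so in the
        # worst case a carry has to ripple through all `width` stages.
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry


if __name__ == "__main__":
    # Worst case: the carry generated at bit 0 propagates to the carry-out.
    print(ripple_add(0xFFFFFFFF, 1))
```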
Adder Design - 2
ARM2 4-bit look-ahead scheme
To reduce the worst-case carry path length
[Figure: 4-bit adder logic with inputs A[3:0], B[3:0] and Cin[0], outputs sum[3:0] and Cout[3], and the generate (G) and propagate (P) signals.]
Carry-Look-Ahead (CLA) Adder 1
The carry signals are calculated in advance.
A carry signal will be generated in two cases:
    when both bits A_i and B_i are 1;
    when one of the two bits is 1 and the carry-in (the carry of the previous stage) is 1.

$C_{out} = C_{i+1} = A_i \cdot B_i + (A_i \oplus B_i) \cdot C_i$   (1)
$C_{i+1} = G_i + P_i \cdot C_i$   (2)
$G_i = A_i \cdot B_i$   (3)   Generate
$P_i = A_i \oplus B_i$   (4)   Propagate

The Propagate and Generate terms depend only on the input bits, so they are valid after one gate delay.
If one uses equation (2) to calculate the carry signals, one does not need to wait for the carry to ripple through all the previous stages to find its proper value.
Let's apply this to a 4-bit adder:

$C_1 = G_0 + P_0 \cdot C_0$   (5)
$C_2 = G_1 + P_1 \cdot C_1 = G_1 + P_1 \cdot G_0 + P_1 \cdot P_0 \cdot C_0$   (6)
$C_3 = G_2 + P_2 \cdot G_1 + P_2 \cdot P_1 \cdot G_0 + P_2 \cdot P_1 \cdot P_0 \cdot C_0$   (7)
$C_4 = G_3 + P_3 \cdot G_2 + P_3 \cdot P_2 \cdot G_1 + P_3 \cdot P_2 \cdot P_1 \cdot G_0 + P_3 \cdot P_2 \cdot P_1 \cdot P_0 \cdot C_0$   (8)

The carry-out bit of the last stage, $C_{i+1}$, is available after three gate delays: one delay to calculate the Propagate signal and two delays for the AND and OR gates.
The sum signal can be calculated as follows:

$S_i = A_i \oplus B_i \oplus C_i = P_i \oplus C_i$   (9)
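The 4-bit look-ahead equations can be transcribed almost literally into code and checked against ordinary addition. This is a behavioural sketch of equations (3) to (9), not a gate-level model; the function name `cla4` is illustrative.

```python
def cla4(a, b, c0=0):
    """4-bit carry-look-ahead addition; returns (sum[3:0], C4)."""
    A = [(a >> i) & 1 for i in range(4)]
    B = [(b >> i) & 1 for i in range(4)]
    G = [x & y for x, y in zip(A, B)]  # generate,  eq. (3)
    P = [x ^ y for x, y in zip(A, B)]  # propagate, eq. (4)

    # Every carry is written directly in terms of G, P and C0, eqs. (5)-(8),
    # so no carry has to wait for the one before it.
    c1 = G[0] | (P[0] & c0)
    c2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & c0)
    c3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & P[0] & c0)
    c4 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
          | (P[3] & P[2] & P[1] & G[0]) | (P[3] & P[2] & P[1] & P[0] & c0))

    carries = [c0, c1, c2, c3]
    # S_i = P_i xor C_i, eq. (9)
    total = sum((P[i] ^ carries[i]) << i for i in range(4))
    return total, c4


if __name__ == "__main__":
    print(cla4(0b1111, 0b0001))
```

The price of the shorter carry path is the growing fan-in of the AND and OR terms, which is one reason the scheme is applied to small groups of bits rather than to all 32 at once.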
4-bit adder
[Figure: 4-bit carry-look-ahead adder circuit.]
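16-bit adder (Group)

PG and GG are the group propagate and group generate signals of one 4-bit block:

$PG = P_3 \cdot P_2 \cdot P_1 \cdot P_0$   (10)
$GG = G_3 + P_3 \cdot G_2 + P_3 \cdot P_2 \cdot G_1 + P_3 \cdot P_2 \cdot P_1 \cdot G_0$   (11)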
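The group signals let each 4-bit block be treated like a single bit: the carry out of a block is GG + PG.Cin. The sketch below computes PG and GG per block with equations (10) and (11) and derives the carry into each block of a 16-bit adder. For brevity it chains the blocks in a loop rather than expanding a second level of look-ahead logic; that simplification and the names `group_pg` and `group_carries16` are assumptions of this example.

```python
def group_pg(a, b):
    """Return (PG, GG) for one 4-bit group, per eqs. (10) and (11)."""
    G = [((a >> i) & 1) & ((b >> i) & 1) for i in range(4)]
    P = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(4)]
    pg = P[3] & P[2] & P[1] & P[0]
    gg = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
          | (P[3] & P[2] & P[1] & G[0]))
    return pg, gg


def group_carries16(a, b, c0=0):
    """Carries at bit positions 0, 4, 8, 12 and 16 of a 16-bit addition."""
    carries = [c0]
    for k in range(4):
        pg, gg = group_pg((a >> (4 * k)) & 0xF, (b >> (4 * k)) & 0xF)
        # A group produces a carry if it generates one itself, or if it
        # propagates the carry that entered it.
        carries.append(gg | (pg & carries[-1]))
    return carries


if __name__ == "__main__":
    print(group_carries16(0xFFFF, 0x0001))
```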
ALU functions
Adder, address computations for memory transfer, branch calculations, bit-wise logical functions, and so on

fs5  fs4  fs3  fs2  fs1  fs0   ALU output
 0    0    0    1    0    0    A and B
 0    0    1    0    0    0    A and not B
 0    0    1    0    0    1    A xor B
 0    1    1    0    0    1    A plus not B plus carry
 0    1    0    1    1    0    A plus B plus carry
 1    1    0    1    1    0    not A plus B plus carry
 0    0    0    0    0    0    A
 0    0    0    0    0    1    A or B
 0    0    0    1    0    1    B
 0    0    1    0    1    0    not B
 0    0    1    1    0    0    zero
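The function-select table above can be exercised with a behavioural model. This sketch simply looks up the listed fs[5:0] codes and evaluates the named function on 32-bit values; it does not model the gate-level logic, and the name `alu` and the dictionary layout are illustrative.

```python
MASK = 0xFFFFFFFF  # 32-bit datapath


def alu(fs, a, b, carry=0):
    """Evaluate the ALU output for a 6-bit function select (fs5 is the MSB).

    Only the eleven codes listed in the table are defined; any other
    code raises KeyError.
    """
    not_a = ~a & MASK
    not_b = ~b & MASK
    functions = {
        0b000100: a & b,
        0b001000: a & not_b,
        0b001001: a ^ b,
        0b011001: a + not_b + carry,  # A - B when carry = 1
        0b010110: a + b + carry,
        0b110110: not_a + b + carry,  # B - A when carry = 1
        0b000000: a,
        0b000001: a | b,
        0b000101: b,
        0b001010: not_b,
        0b001100: 0,
    }
    return functions[fs] & MASK


if __name__ == "__main__":
    print(alu(0b011001, 10, 3, carry=1))  # subtraction as A + not B + 1
```

The two "plus not ... plus carry" rows show how subtraction reuses the adder: in two's complement, A - B equals A + not B + 1.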
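ALU functions
The ARM2 ALU logic for one result bit
[Figure: ARM2 ALU logic for one result bit. Inputs: NA bus, NB bus and the function-select lines fs. Internal signals: G, P and the carry logic. Output: ALU bus.]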
ARM6 Carry-Select Adder
Computes the sums of various fields of the word for a carry-in of both zero and one.
The final result is selected by using the correct carry-in bit.
[Figure: carry-select adder. Each field, from a,b[3:0] up to a,b[31:28], is added twice (+ and +1) to give s and s+1; multiplexers then select sum[3:0], sum[7:4], sum[15:8] and sum[31:16].]
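The select step can be sketched as follows. The field boundaries are taken from the sum labels in the figure (4, 4, 8 and 16 bits); everything else, including the name `carry_select_add` and the use of Python integer addition for each field, is an assumption of this behavioural model rather than the ARM6 circuit.

```python
# (low bit, width) of each field, following sum[3:0], sum[7:4],
# sum[15:8] and sum[31:16] in the figure.
FIELDS = ((0, 4), (4, 4), (8, 8), (16, 16))


def carry_select_add(a, b, cin=0):
    """32-bit addition by carry select; returns (sum, carry-out)."""
    result = 0
    carry = cin
    for low, width in FIELDS:
        mask = (1 << width) - 1
        fa = (a >> low) & mask
        fb = (b >> low) & mask
        # In hardware both candidates are computed in parallel for all
        # fields, before the real carry-in of the field is known.
        sum_if_0 = fa + fb
        sum_if_1 = fa + fb + 1
        # The multiplexer: the actual carry-in picks one candidate.
        chosen = sum_if_1 if carry else sum_if_0
        result |= (chosen & mask) << low
        carry = chosen >> width
    return result, carry


if __name__ == "__main__":
    print(carry_select_add(0xFFFFFFFF, 1))
```

Only the select signal ripples from field to field, so the critical path is a few multiplexer delays plus one field addition instead of a full 32-bit carry chain.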
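ARM6 ALU Organization

[Figure: ARM6 ALU organization. The A and B operand latches feed XOR gates controlled by invert A and invert B; their outputs go to the logic functions block and to the adder (carry input C in; flags C and V). The function and logic/arithmetic controls drive the result mux, which produces the result; zero detect produces Z, and N is also output.]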
Barrel Shifter
The shifter performance is critical: shifter time contributes to the datapath cycle time.
[Figure: 4 x 4 cross-bar switch matrix connecting in[3], in[2], in[1], in[0] to out[0], out[1], out[2], out[3]. The diagonals are labelled right 3, right 2, right 1, no shift, left 1, left 2 and left 3.]
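The cross-bar idea can be modelled with a small switch matrix: enabling one diagonal connects every input to the output a fixed distance away, so any shift amount costs the same single pass. This is a sketch for the 4-bit case in the figure; filling unconnected outputs with zero is an assumption of the example (it ignores rotates and arithmetic shifts), and `barrel_shift4` is an illustrative name.

```python
def barrel_shift4(bits, shift):
    """Shift a 4-bit value through a 4 x 4 cross-bar.

    bits:  [in0, in1, in2, in3], least significant bit first.
    shift: -3..3, positive for left shifts, negative for right shifts.
    Returns [out0, out1, out2, out3].
    """
    width = len(bits)
    out = [0] * width  # assumption: unconnected outputs read as zero
    for i in range(width):      # input column
        for j in range(width):  # output row
            # The switch at (i, j) lies on the diagonal j - i; only the
            # diagonal that matches the requested shift is closed.
            if j - i == shift:
                out[j] = bits[i]
    return out


if __name__ == "__main__":
    print(barrel_shift4([1, 1, 0, 1], -1))  # right 1
    print(barrel_shift4([1, 0, 0, 0], 1))   # left 1
```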
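The ARM register bank

[Figure: ARM register bank floorplan. The A bus read decoders, B bus read decoders and write decoders address the register cells; the PC has additional connections to the PC bus and INC bus; other connections are the ALU bus, A bus, B bus, Vdd and Vss.]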
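Control Structures
[Figure: ARM control logic structure. The instruction and coprocessor signals feed the decode PLA, which together with the cycle count drives multiply control, load/store multiple, address control, register control, ALU control and shifter control.]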
ARM Coprocessor Interface - 1
A general-purpose extension of the ARM instruction set through the addition of hardware coprocessors
Also supports software emulation of coprocessors through the undefined instruction trap
Coprocessor Architecture
    Up to 16 logical coprocessors
    Each coprocessor can have up to 16 private registers of any reasonable size
    Load-store architecture
        Internal operations on registers
        Load and store from and to the memory
        Move data to or from an ARM register
Implementation
    Board-level coprocessor: slow speed
    On-chip coprocessor: high clock speed; cache and memory management, etc.
ARM7TDMI Coprocessor interface
Bus watching
    The coprocessor is attached to a bus where the ARM instruction stream flows into the ARM.
    The coprocessor copies the instructions into an internal pipeline.
Handshake between ARM and coprocessor
    cpi* (from ARM to all coprocessors): coprocessor instruction
    cpa (from the coprocessors to ARM): coprocessor absent
    cpb (from the coprocessors to ARM): coprocessor busy
Handshake outcomes
ARM may decide not to execute the instruction
    It falls in a branch shadow or fails its condition code test; cpi* stays high.
ARM may decide to execute it (cpi* low), but no coprocessor accepts it (cpa high)
    ARM takes the undefined instruction trap.
ARM decides to execute it and a coprocessor accepts it, but cannot execute it yet
    cpa low but cpb high.
    ARM busy-waits, stalling the instruction stream.
    If an enabled interrupt request arrives, ARM handles it and retries the coprocessor instruction later.
ARM decides to execute it and a coprocessor accepts it and executes it immediately
    cpi* low, cpa low, cpb low.
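The four outcomes amount to a priority decode of the three handshake signals, which can be written down as a short function. This is a sketch of the decision table only; the function name `handshake_outcome` and the returned strings are illustrative, and signal levels are passed as booleans (True = high).

```python
def handshake_outcome(cpi_n, cpa, cpb):
    """Classify one coprocessor handshake.

    cpi_n: level of cpi* (low means ARM wants to execute the instruction).
    cpa:   coprocessor absent (high means no coprocessor accepted it).
    cpb:   coprocessor busy (high means it was accepted but cannot start).
    """
    if cpi_n:
        # Branch shadow or failed condition code test.
        return "not executed"
    if cpa:
        return "undefined instruction trap"
    if cpb:
        # ARM stalls; an enabled interrupt may be taken while waiting,
        # after which the coprocessor instruction is retried.
        return "busy-wait"
    return "execute"


if __name__ == "__main__":
    for levels in [(True, True, True), (False, True, True),
                   (False, False, True), (False, False, False)]:
        print(levels, "->", handshake_outcome(*levels))
```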
