Fall, 2008
Daeyoung Kim kimd@icu.ac.kr http://resl.icu.ac.kr/~kimd
Contents
3-stage pipeline ARM organization & implementation
5-stage pipeline ARM organization & implementation
[Figure: 3-stage pipeline ARM organization. Labels: address register, incrementer, register bank, PC, barrel shifter, ALU, data in register]
3-stage pipeline
Fetch
Decode
The instruction is decoded and the datapath control signals are prepared for the next cycle. The instruction owns the decode logic but not the datapath.
Execute
The instruction owns the datapath. The register bank is read, an operand is shifted, and the ALU result is generated and written back into a destination register.
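The overlap of the three stages can be illustrated with a minimal sketch. It assumes every instruction takes exactly one cycle per stage (no multi-cycle instructions, no stalls), and the function name `pipeline_schedule` is hypothetical, introduced only for this illustration.

```python
STAGES = ["fetch", "decode", "execute"]

def pipeline_schedule(instructions):
    """Return one dict per cycle mapping stage name -> instruction (or None).

    Assumption: each instruction spends exactly one cycle in each stage.
    """
    n_cycles = len(instructions) + len(STAGES) - 1
    schedule = []
    for cycle in range(n_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            # The instruction in this stage entered the pipeline `depth` cycles ago.
            idx = cycle - depth
            row[stage] = instructions[idx] if 0 <= idx < len(instructions) else None
        schedule.append(row)
    return schedule

if __name__ == "__main__":
    for cycle, row in enumerate(pipeline_schedule(["ADD", "SUB", "MOV"])):
        print(cycle, row)
```

Once the pipeline is full, three different instructions are in flight in the same cycle, one per stage.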
To allow a higher clock rate, the logic in each pipeline stage has to be simplified and, therefore, the number of pipeline stages increased.
To reduce the average number of cycles per instruction, instructions which occupy more than one pipeline slot are reimplemented to occupy fewer slots, and pipeline stalls caused by dependencies between instructions are reduced.
To get more memory bandwidth, deliver more than 32 bits per access, or use separate instruction and data memories.
Fetch
The instruction is fetched and placed in the instruction pipeline.
Decode
The instruction is decoded and the register operands are read.
Execute
An operand is shifted and the ALU result is generated. For a load or store, the memory address is calculated in the ALU.
Buffer/Data
Data memory is accessed if required. Otherwise the ALU result is simply buffered.
Write-back
The result is written back to the register file.
[Figure: 5-stage pipeline ARM organization. Labels: fetch (I-cache, +4, pc + 4), instruction decode (register read, r15 = pc + 8), execute (shift, reg shift, ALU, mul, LDM/STM, pre-index, post-index, execute mux, forwarding paths), buffer/data (load/store address, D-cache, rot/sgn ex), write-back (register write); next-pc paths for B, BL, MOV pc, SUBS pc and LDR pc]
Data Forwarding
A major source of complexity in the 5-stage pipeline is that instruction execution is spread across the stages.
Forwarding paths resolve data dependencies without stalling the pipeline.
A load followed immediately by a use of the loaded register still needs a stall:
LDR rN, [..]
ADD r2, r1, rN
A one-cycle stall is required, because rN is only available at the end of the buffer/data stage. Use instruction-level scheduling to avoid it.
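The load-use case can be sketched as a simple check over adjacent instructions. This is a minimal illustration, assuming that every other dependency is covered by the forwarding paths and that only an LDR immediately followed by a consumer of its destination costs one stall cycle. The tuple layout and the name `count_load_use_stalls` are hypothetical.

```python
def count_load_use_stalls(program):
    """Count one-cycle stalls caused by a load followed directly by a use.

    `program` is a list of (opcode, dest_register, source_registers) tuples.
    Assumption: all other data dependencies are resolved by forwarding.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "LDR" and prev_dest in cur_sources:
            stalls += 1
    return stalls

# Load immediately followed by its use.
unscheduled = [
    ("LDR", "r3", ("r0",)),
    ("ADD", "r2", ("r1", "r3")),
    ("SUB", "r5", ("r6", "r7")),
]

# Same work, with the independent SUB moved between the load and its use.
scheduled = [
    ("LDR", "r3", ("r0",)),
    ("SUB", "r5", ("r6", "r7")),
    ("ADD", "r2", ("r1", "r3")),
]
```

Moving an independent instruction into the slot after the load is the kind of instruction-level scheduling the slide refers to: the stall disappears without changing the result.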
[Figure: datapath activity diagrams. Labels: Rd, mult, shifter, = A + B / A - B, byte?, data out, data in, i. pipe]
Immediate offset
For a store byte, the byte is replicated four times. Lowest two bits are used for proper
Branch Instructions
[Figure: branch instruction datapath activity. Labels: address register, increment, R14, PC, registers, mult, lsl #2, = A + B, [23:0], shifter, = A, data out, data in, i. pipe]
ARM Implementation - 1
Clocking Scheme
Most ARMs do not operate with edge-sensitive registers. Instead, the design is based around 2-phase non-overlapping clocks, generated internally from a single input clock signal.
This scheme allows the use of level-sensitive transparent latches. Data movement is controlled by passing the data alternately through latches that are open during phase 1 and latches that are open during phase 2. The non-overlapping property of the clocks ensures that there are no race conditions.
ARM Implementation - 2
[Figure: datapath timing diagram. Labels: ALU time, ALU out]
ARM Implementation - 3
Datapath Timing (2)
The minimum datapath cycle time is the sum of several delays, of which the ALU time dominates.
Logical operations are relatively faster than arithmetic operations. Why?
ARM Implementation - 4
Adder Design 1
http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/Comb/adder.htm
The 32-bit addition time has a significant effect on the datapath cycle time, and so influences the maximum clock rate and the processor's performance.
The first ARM processor prototype
[Figure: adder circuit with inputs A, B and Cin and output sum]
ARM Implementation - 5
Adder Design - 2
[Figure: 4-bit adder block with inputs A[3:0], B[3:0] and Cin[0]]
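The carry-out of stage i is 1 in two cases: when both bits Ai and Bi are 1, or when one of the two bits is 1 and the carry-in (the carry of the previous stage) is 1.
\[ C_{OUT} = C_{i+1} = A_i \cdot B_i + (A_i \oplus B_i) \cdot C_i \tag{1} \]
\[ C_{i+1} = G_i + P_i \cdot C_i \tag{2} \]
\[ G_i = A_i \cdot B_i \quad \text{(Generate)} \tag{3} \]
\[ P_i = A_i \oplus B_i \quad \text{(Propagate)} \tag{4} \]
Gi and Pi will be valid after one gate delay. If one uses the above expression to calculate the carry signals, one does not need to wait for the carry to ripple through all the previous stages to find its proper value. Let's apply this to a 4-bit adder.
\[ C_1 = G_0 + P_0 \cdot C_0 \tag{5} \]
\[ C_2 = G_1 + P_1 \cdot C_1 = G_1 + P_1 \cdot G_0 + P_1 \cdot P_0 \cdot C_0 \tag{6} \]
\[ C_3 = G_2 + P_2 \cdot G_1 + P_2 \cdot P_1 \cdot G_0 + P_2 \cdot P_1 \cdot P_0 \cdot C_0 \tag{7} \]
\[ C_4 = G_3 + P_3 \cdot G_2 + P_3 \cdot P_2 \cdot G_1 + P_3 \cdot P_2 \cdot P_1 \cdot G_0 + P_3 \cdot P_2 \cdot P_1 \cdot P_0 \cdot C_0 \tag{8} \]
The carry-out bit, Ci+1, of the last stage will be available after three delays (one delay to calculate the Propagate signal and two delays as a result of the AND and OR gates).
The sum signal can be calculated as follows:
\[ S_i = A_i \oplus B_i \oplus C_i = P_i \oplus C_i \tag{9} \]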
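Equations (5) to (9) can be checked with a short bit-level sketch. It treats generate as Ai AND Bi and propagate as Ai XOR Bi, as implied by equations (1) and (2). The function name `cla_add4` is hypothetical, and the code models only the logic, not gate delays.

```python
def cla_add4(a, b, c0=0):
    """Add two 4-bit values with carry look-ahead; return (sum, carry_out)."""
    A = [(a >> i) & 1 for i in range(4)]
    B = [(b >> i) & 1 for i in range(4)]
    G = [A[i] & B[i] for i in range(4)]  # generate signals
    P = [A[i] ^ B[i] for i in range(4)]  # propagate signals

    # Equations (5) to (8): every carry depends only on G, P and C0.
    c1 = G[0] | (P[0] & c0)
    c2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & c0)
    c3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & P[0] & c0)
    c4 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
          | (P[3] & P[2] & P[1] & G[0])
          | (P[3] & P[2] & P[1] & P[0] & c0))

    # Equation (9): each sum bit is the propagate bit XOR the incoming carry.
    C = [c0, c1, c2, c3]
    total = sum((P[i] ^ C[i]) << i for i in range(4))
    return total, c4
```

Because no carry expression refers to another carry, all four can be evaluated in parallel, which is the point of the look-ahead scheme.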
4-bit adder
ARM Implementation - 6
ALU functions
Adder, address computations for memory transfer, branch calculations, bit-wise logical functions, and so on
fs5  fs4  fs3  fs2  fs1  fs0   ALU output
 0    0    0    1    0    0    A and B
 0    0    1    0    0    0    A and not B
 0    0    1    0    0    1    A xor B
 0    1    1    0    0    1    A plus not B plus carry
 0    1    0    1    1    0    A plus B plus carry
 1    1    0    1    1    0    not A plus B plus carry
 0    0    0    0    0    0    A
 0    0    0    0    0    1    A or B
 0    0    0    1    0    1    B
 0    0    1    0    1    0    not B
 0    0    1    1    0    0    zero
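The function-select table can be modelled behaviourally as a lookup from the six fs bits to an operation. This is a sketch only: it assumes fs5 is the most significant bit of the code and a 32-bit datapath, and it describes what each code produces rather than the gate-level logic. The names `ALU_FUNCTIONS` and `alu` are hypothetical.

```python
MASK = 0xFFFFFFFF  # assumption: 32-bit datapath

# Keys are the bits fs5..fs0 read as a binary number (fs5 is the MSB).
ALU_FUNCTIONS = {
    0b000100: lambda a, b, c: a & b,                  # A and B
    0b001000: lambda a, b, c: a & ~b,                 # A and not B
    0b001001: lambda a, b, c: a ^ b,                  # A xor B
    0b011001: lambda a, b, c: a + (~b & MASK) + c,    # A plus not B plus carry
    0b010110: lambda a, b, c: a + b + c,              # A plus B plus carry
    0b110110: lambda a, b, c: (~a & MASK) + b + c,    # not A plus B plus carry
    0b000000: lambda a, b, c: a,                      # A
    0b000001: lambda a, b, c: a | b,                  # A or B
    0b000101: lambda a, b, c: b,                      # B
    0b001010: lambda a, b, c: ~b,                     # not B
    0b001100: lambda a, b, c: 0,                      # zero
}

def alu(fs, a, b, carry=0):
    """Return the 32-bit ALU result for function-select code `fs`."""
    return ALU_FUNCTIONS[fs](a, b, carry) & MASK
```

With the carry set to 1, "A plus not B plus carry" is two's-complement subtraction A - B, and "not A plus B plus carry" is the reverse subtraction B - A.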
ARM Implementation - 7
ALU functions
The ARM2 ALU logic for one result bit
[Figure: ARM2 ALU logic for one result bit. Labels: function select inputs fs 5 to fs 0, NB bus, carry logic, G]
ARM Implementation - 8
The adder computes the sums of various fields of the word for a carry-in of both zero and one. The final result is then selected by using the correct carry-in bit.
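[Figure: adder split into fields from a,b[3:0] to a,b[31:28]; each field produces s and s+1 (+, +1), and multiplexers (mux) select between them using the carry c]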
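This scheme, commonly called a carry-select adder, can be sketched as follows. The sketch assumes uniform 4-bit fields across a 32-bit word, which is a simplification for illustration and not necessarily the field sizes of the actual design. The name `carry_select_add32` is hypothetical.

```python
def carry_select_add32(a, b, carry_in=0, field_bits=4):
    """Add two 32-bit values field by field; return (sum, carry_out).

    Assumption: the word is split into equal fields of `field_bits` bits.
    """
    mask = (1 << field_bits) - 1
    result = 0
    carry = carry_in
    for shift in range(0, 32, field_bits):
        fa = (a >> shift) & mask
        fb = (b >> shift) & mask
        s0 = fa + fb        # field sum for a carry-in of zero
        s1 = fa + fb + 1    # field sum for a carry-in of one
        chosen = s1 if carry else s0  # the mux: pick using the real carry-in
        result |= (chosen & mask) << shift
        carry = chosen >> field_bits  # carry-out of this field
    return result, carry
```

In hardware both candidate sums of every field are computed in parallel, so only the multiplexer selection waits for the incoming carry. The loop here is sequential purely because it is software.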
ARM Implementation - 9
[Figure: ALU organization. Labels: function, logic functions, adder, logic/arithmetic, C in, and flag outputs C, V, N, Z]
ARM Implementation - 10
ARM Implementation - 11
Control Structures
[Figure: control logic structure. Labels: instruction, coprocessor, decode PLA, cycle count, address control, register control, ALU control, shifter control]
ARM supports a general-purpose extension of its instruction set through the addition of hardware coprocessors.
Coprocessor Architecture
Support for up to 16 logical coprocessors
Each coprocessor can have up to 16 private registers of any reasonable size
Load-store architecture
Internal operations on registers
Load and store from and to the memory
Move data to or from an ARM register
Board-level coprocessor: slow speed
On-chip coprocessor: high clock speed, cache and memory management, etc.
Implementation
Bus watching
The coprocessor is attached to a bus where the ARM instruction stream flows into the ARM.
The coprocessor copies the instructions into an internal pipeline.
Handshake between the ARM and the coprocessor:
cpi* (from ARM to all coprocessors): coprocessor instruction
cpa (from the coprocessors to ARM): coprocessor absent
cpb (from the coprocessors to ARM): coprocessor busy
Handshake outcomes
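ARM decides not to execute the instruction
It falls in a branch shadow or fails the condition code test; cpi* stays high.
ARM decides to execute it, but no coprocessor accepts it
cpa stays high, and ARM takes the undefined instruction trap.
ARM decides to execute it and a coprocessor accepts it, but cannot execute it yet
cpa low but cpb high. ARM busy-waits, stalling the instruction stream. If an enabled interrupt request arrives, ARM handles it and retries the coprocessor instruction later.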
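The handshake outcomes can be summarized as a small decision function over the three signals. This is a sketch under stated assumptions: a high cpi* means ARM is not executing the instruction, a high cpa means no coprocessor claimed it, and a high cpb means the claiming coprocessor is busy. The final case, with all three signals low, is not listed above and is assumed here to mean that the coprocessor executes the instruction straight away. The function name `handshake_outcome` and the returned strings are hypothetical.

```python
def handshake_outcome(cpi_n, cpa, cpb):
    """Classify a coprocessor handshake from the cpi*, cpa and cpb levels.

    Each argument is True for a high signal level and False for low.
    """
    if cpi_n:
        # ARM is not executing this instruction (branch shadow or failed condition).
        return "not executed"
    if cpa:
        # No coprocessor accepted the instruction.
        return "undefined instruction trap"
    if cpb:
        # Accepted, but the coprocessor cannot start yet.
        return "busy-wait"
    # Assumption: all signals low means the coprocessor executes the instruction now.
    return "execute"
```

The checks are ordered so that cpa and cpb are only consulted once ARM has committed to the instruction by driving cpi* low.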