Introduction &
Instruction Set Architecture
Aleksandar Milenkovic
E-mail: milenka@ece.uah.edu
Web: http://www.ece.uah.edu/~milenka
Outline
ARM Architecture
ARM Organization and Implementation
ARM Instruction Set
Thumb Instruction Set
Architectural Support for System Development
ARM Processor Cores
Memory Hierarchy
Architectural Support for Operating Systems
ARM CPU Cores
Embedded ARM Applications
ARM History
ARM: Acorn RISC Machine (1983-1985)
Acorn Computers Limited, Cambridge, England
Registers visible in each mode
o user mode: r0 to r14, r15 (PC), CPSR
o fiq mode: r8_fiq to r14_fiq, SPSR_fiq
o svc mode: r13_svc, r14_svc, SPSR_svc
o abort mode: r13_abt, r14_abt, SPSR_abt
o irq mode: r13_irq, r14_irq, SPSR_irq
o undefined mode: r13_und, r14_und, SPSR_und
CPSR format
o bits 31-28: N Z C V
o bits 27-8: unused
o bits 7-6: I F (interrupt enables)
o bit 5: T
o bits 4-0: mode
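Figure: ARM memory organization. Memory is byte addressed; each row is 32 bits wide (bit 31 to bit 0) and holds four byte addresses (0 to 23 shown). Example items: word16, half-word14, half-word12, word8, byte6, half-word4, and individual bytes.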
Instructions
Data processing: use and change only register values
Data transfer: copy memory values into registers (load) or copy register values into memory (store)
Control flow
o branch
o branch-and-link: save return address to resume the original sequence
o trapping into system code: supervisor calls
I/O system
I/O is memory mapped: internal registers of peripherals (disk controllers, network interfaces, etc.) are addressable locations within the ARM's memory map and may be read and written using the load-store instructions
ARM exceptions
ARM supports a range of interrupts, traps, and supervisor calls; all are grouped under the general heading of exceptions
Handling exceptions
o current state is saved by copying the PC into r14_exc and the CPSR into SPSR_exc (exc stands for the exception type)
o processor operating mode is changed to the appropriate exception mode
o PC is forced to a value between 0x00 and 0x1C, the particular value depending on the type of exception
o the instruction at the location the PC is forced to (the vector address) usually contains a branch to the exception handler; the exception handler will use r13_exc, which is normally initialized to point to a dedicated stack in memory, to save some user registers
o return: restore the user registers and then restore PC and CPSR atomically
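The exception entry and return steps above can be sketched as a small behavioural model. This is only an illustration: the dictionary-based `cpu` state, the function names, and the choice to keep the mode outside the CPSR value are assumptions made here, and details such as interrupt masking and link-register adjustment are left out. The vector addresses are the standard ARM ones in the range 0x00 to 0x1C.

```python
# Standard ARM vector addresses and the exception mode each one enters.
VECTORS = {
    "reset": (0x00, "svc"),
    "undefined": (0x04, "und"),
    "swi": (0x08, "svc"),
    "prefetch_abort": (0x0C, "abt"),
    "data_abort": (0x10, "abt"),
    "irq": (0x18, "irq"),
    "fiq": (0x1C, "fiq"),
}


def enter_exception(cpu, kind):
    """Save PC and CPSR into the banked registers, switch mode, jump to the vector."""
    vector, mode = VECTORS[kind]
    cpu["r14_" + mode] = cpu["pc"]      # PC -> r14_exc
    cpu["spsr_" + mode] = cpu["cpsr"]   # CPSR -> SPSR_exc
    cpu["mode"] = mode                  # change to the exception mode
    cpu["pc"] = vector                  # force PC to the vector address
    return cpu


def return_from_exception(cpu):
    """Restore PC and CPSR together (hypothetical: always returns to user mode)."""
    mode = cpu["mode"]
    cpu["pc"], cpu["cpsr"] = cpu["r14_" + mode], cpu["spsr_" + mode]
    cpu["mode"] = "user"
    return cpu
```

For example, an IRQ taken at address 0x8000 leaves 0x8000 in r14_irq and continues at 0x18.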
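Figure: the structure of the ARM cross-development toolkit. C source is compiled by the C compiler (with C libraries) and asm source is assembled by the assembler; both produce .aof object files, which the linker combines with object libraries into an .axf image. The image is debugged with ARMsd, running either on a system model (ARMulator) or on a development board.
Cross-development: tools run on a different architecture from the one for which they produce code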
Outline
ARM Architecture
ARM Assembly Language Programming
ARM Organization and Implementation
ARM Instruction Set
Architectural Support for High-level Languages
Thumb Instruction Set
Architectural Support for System Development
ARM Processor Cores
Memory Hierarchy
Architectural Support for Operating Systems
ARM CPU Cores
Embedded ARM Applications
Arithmetic operations
Bit-wise logical operations
Register-movement operations
Comparison operations
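Arithmetic Operations
ADD r0, r1, r2    ; r0 := r1 + r2
ADC r0, r1, r2    ; r0 := r1 + r2 + C
SUB r0, r1, r2    ; r0 := r1 - r2
SBC r0, r1, r2    ; r0 := r1 - r2 + C - 1
RSB r0, r1, r2    ; r0 := r2 - r1
RSC r0, r1, r2    ; r0 := r2 - r1 + C - 1
Bit-wise Logical Operations
AND r0, r1, r2    ; r0 := r1 and r2
ORR r0, r1, r2    ; r0 := r1 or r2
EOR r0, r1, r2    ; r0 := r1 xor r2
BIC r0, r1, r2    ; r0 := r1 and (not) r2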
Register Movement
MOV r0, r2        ; r0 := r2
MVN r0, r2        ; r0 := not r2
Comparison Operations
CMP r1, r2        ; set cc on r1 - r2
CMN r1, r2        ; set cc on r1 + r2
TST r1, r2        ; set cc on r1 and r2
TEQ r1, r2        ; set cc on r1 xor r2
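To make "set cc on r1 - r2" concrete, the sketch below computes the N, Z, C and V flags for CMP and CMN on 32-bit values. The function names are hypothetical. It assumes the usual ARM convention that C after a subtraction means "no borrow".

```python
MASK = 0xFFFFFFFF


def flags_cmp(a, b):
    """Return (N, Z, C, V) as set by CMP a, b (flags of a - b)."""
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    n = result >> 31
    z = int(result == 0)
    c = int(a >= b)                          # C = NOT borrow
    v = ((a ^ b) & (a ^ result)) >> 31       # signed overflow on subtract
    return n, z, c, v


def flags_cmn(a, b):
    """Return (N, Z, C, V) as set by CMN a, b (flags of a + b)."""
    a &= MASK
    b &= MASK
    total = a + b
    result = total & MASK
    n = result >> 31
    z = int(result == 0)
    c = int(total > MASK)                    # carry out of bit 31
    v = (~(a ^ b) & (a ^ result) & MASK) >> 31   # signed overflow on add
    return n, z, c, v
```

A following conditional instruction (for example BEQ) then tests these flags; BEQ is taken when Z is 1.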
ADD r3, r3, #3            ; r3 := r3 + 3
ADD r5, r5, r3, LSL r2    ; r5 := r5 + 2^r2 x r3
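Figure: ARM shift operations on a 32-bit operand (bits 31 to 0): LSL #5 (zeros shifted in at the bottom), LSR #5 (zeros shifted in at the top), arithmetic shift right by 5 (top bits filled with 00000 for a positive operand, 11111 for a negative operand), ROR #5, and RRX (rotate right through the carry bit).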
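The shift operations in the figure can be modelled on 32-bit values as follows. This is a sketch with hypothetical function names; it assumes shift amounts between 0 and 31 and does not model the carry-out of LSL, LSR, ASR or ROR.

```python
MASK = 0xFFFFFFFF


def lsl(x, n):
    """Logical shift left: zeros enter at bit 0."""
    return (x << n) & MASK


def lsr(x, n):
    """Logical shift right: zeros enter at bit 31."""
    return (x & MASK) >> n


def asr(x, n):
    """Arithmetic shift right: the sign bit is replicated at the top."""
    x &= MASK
    if x & 0x80000000:
        return ((x >> n) | (MASK << (32 - n))) & MASK
    return x >> n


def ror(x, n):
    """Rotate right: bits leaving bit 0 re-enter at bit 31."""
    x &= MASK
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK


def rrx(x, carry):
    """Rotate right extended by one bit through the carry; returns (result, new_carry)."""
    x &= MASK
    return (carry << 31) | (x >> 1), x & 1
```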
Multiplies
Example (Multiply, Multiply-Accumulate)
MUL r4, r3, r2        ; r4 := [r3 x r2]<31:0>
MLA r4, r3, r2, r1    ; r4 := [r3 x r2 + r1]<31:0>
Note
o least significant 32 bits are placed in the result register, the rest are ignored
o immediate second operand is not supported
o result register must not be the same as the first source register
o if the `S` bit is set, V is preserved and C is rendered meaningless
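The truncation to the least significant 32 bits can be illustrated with a minimal model of the two instructions (the function names are chosen here for illustration; flag effects are not modelled):

```python
MASK = 0xFFFFFFFF


def mul(rm, rs):
    """MUL: keep only bits <31:0> of the product."""
    return (rm * rs) & MASK


def mla(rm, rs, rn):
    """MLA: keep only bits <31:0> of product plus accumulator."""
    return (rm * rs + rn) & MASK
```

For instance, `mul(0x10000, 0x10000)` gives 0 because the only set bit of the full product is bit 32, which is discarded.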
LDR r0, [r1]          ; r0 := mem32[r1]
STR r0, [r1]          ; mem32[r1] := r0
LDRB r0, [r1]         ; r0 := mem8[r1]
Base+offset addressing (offset of up to 4 Kbytes)
LDR r0, [r1, #4]      ; r0 := mem32[r1 + 4]
Auto-indexing addressing
LDR r0, [r1, #4]!     ; r0 := mem32[r1 + 4]
                      ; r1 := r1 + 4
Post-indexed addressing
LDR r0, [r1], #4      ; r0 := mem32[r1]
                      ; r1 := r1 + 4
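The difference between the three addressing forms is only in when the base register is updated. A small sketch, assuming memory is a dictionary from word address to value and registers are a dictionary keyed by name (both hypothetical representations):

```python
def ldr_offset(mem, regs, rd, rn, offset):
    """LDR rd, [rn, #offset]: base register is left unchanged."""
    regs[rd] = mem[regs[rn] + offset]


def ldr_auto_index(mem, regs, rd, rn, offset):
    """LDR rd, [rn, #offset]!: base is updated first, then used."""
    regs[rn] += offset
    regs[rd] = mem[regs[rn]]


def ldr_post_index(mem, regs, rd, rn, offset):
    """LDR rd, [rn], #offset: base is used first, then updated."""
    regs[rd] = mem[regs[rn]]
    regs[rn] += offset
```

With r1 = 0x100, the first two forms load the word at 0x104 and the third loads the word at 0x100; only the last two leave r1 = 0x104.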
COPY:                   ; r1 points to TABLE1
                        ; r2 points to TABLE2
LOOP:   ...
TABLE1: ...
TABLE2: ...
LDMIA r1, {r0, r2, r5}    ; r0 := mem32[r1]
                          ; r2 := mem32[r1 + 4]
                          ; r5 := mem32[r1 + 8]
Stack organizations
o FA: full ascending
o EA: empty ascending
o FD: full descending
o ED: empty descending
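The four stack organizations differ in growth direction and in whether the stack pointer addresses the last item pushed ("full") or the next free slot ("empty"). The sketch below uses those standard definitions; the `push` and `pop` helpers, the word size of 4 bytes, and the dictionary memory are assumptions for illustration.

```python
def push(mem, sp, value, kind):
    """Push one word; kind is 'FA', 'EA', 'FD' or 'ED'. Returns the new sp."""
    step = 4 if kind[1] == "A" else -4
    if kind[0] == "F":
        sp += step          # full: move first, then store
        mem[sp] = value
    else:
        mem[sp] = value     # empty: store first, then move
        sp += step
    return sp


def pop(mem, sp, kind):
    """Pop one word; returns (value, new sp)."""
    step = 4 if kind[1] == "A" else -4
    if kind[0] == "F":
        value = mem[sp]     # full: sp already addresses the top item
        sp -= step
    else:
        sp -= step          # empty: step back to the top item first
        value = mem[sp]
    return value, sp
```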
Figure: multiple register transfer addressing modes. Four panels show r0, r1 and r5 stored relative to the base register r9, with the position of r9 before and after the transfer, over the address range 0x1000, 0x100C, 0x1018.
Conditional execution
Conditional execution avoids branch instructions that are used to skip a small number of non-branch instructions
Example
        CMP r0, #5
        BEQ BYPASS          ; if (r0 != 5) {
        ADD r1, r1, r0      ;   r1 := r1 + r0 - r2
        SUB r1, r1, r2      ; }
BYPASS: ...
        ...
SUBR:
..
; return here
..
Nested subroutines
BL SUB1
..
SUB1:
SUB2:
..
MOV pc, r14 ; copy r14 into r15
Supervisor calls
The supervisor is a program which operates at a privileged level; it can do things that a user-level program cannot do directly
Example: send text to the display
Jump tables
Call one of a set of subroutines depending on a value computed by the program
        BL JTAB
        ...
JTAB:   CMP r0, #0
        BEQ SUB0
        CMP r0, #1
        BEQ SUB1
        CMP r0, #2
        BEQ SUB2
        B ERROR
Note: slow when the list is long and all subroutines are equally frequent
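The cost noted above can be seen by counting comparisons. The sketch below models the compare-and-branch chain and, for contrast, an indexed table lookup; the table variant and all names are assumptions added here for illustration, not the code from the slides.

```python
def dispatch_chain(r0, handlers):
    """Model of the CMP/BEQ chain: one comparison per table entry tried."""
    comparisons = 0
    for index, handler in enumerate(handlers):
        comparisons += 1
        if r0 == index:
            return handler, comparisons
    return "ERROR", comparisons


def dispatch_table(r0, handlers):
    """Hypothetical alternative: a single bounds check, then an indexed lookup."""
    if 0 <= r0 < len(handlers):
        return handlers[r0], 1
    return "ERROR", 1
```

With three entries, selecting SUB2 through the chain takes three comparisons, and the count grows linearly with the length of the list.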
SWI_WriteC  EQU   &0            ; output character in r0
SWI_Exit    EQU   &11           ; finish program
            ENTRY
LOOP:
            CMP   r0, #0
            SWINE SWI_WriteC
            BNE   LOOP
            SWI   SWI_Exit      ; end of execution
TEXT
ARM Organization and Implementation
ARM organization
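Figure: 3-stage pipeline ARM organization. Blocks: address register (driving A[31:0]) and incrementer; register bank (including the PC); multiply register; barrel shifter; ALU; data in register on D[31:0]; instruction decode and control. The blocks are connected by internal buses, including the ALU bus.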
Three-stage pipeline
Fetch
o the instruction is fetched from memory and placed in the instruction pipeline
Decode
o the instruction is decoded and the datapath control signals prepared for the next cycle; in this stage the instruction owns the decode logic but not the datapath
Execute
o the instruction owns the datapath; the register bank is read, an operand shifted, the ALU result generated and written back into a destination register
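The overlap of the three stages can be sketched by listing which instruction occupies each stage in each cycle. The model below assumes every instruction takes exactly one cycle per stage (no multi-cycle instructions or stalls); the function name and the 1-based cycle numbering are choices made here.

```python
STAGES = ("fetch", "decode", "execute")


def pipeline_schedule(instructions):
    """Return {cycle: {stage: instruction}} for single-cycle instructions."""
    schedule = {}
    for i, instruction in enumerate(instructions):
        for s, stage in enumerate(STAGES):
            # instruction i enters fetch in cycle i + 1 and advances one stage per cycle
            schedule.setdefault(i + s + 1, {})[stage] = instruction
    return schedule
```

For a sequence of three instructions the third cycle has all stages busy, and one instruction completes per cycle from then on.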
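Figure: 3-stage pipeline operation with single-cycle instructions. Instructions 1, 2 and 3 each pass through fetch, decode and execute, each starting one cycle after the previous one along the time axis.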
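Figure: pipeline snapshot with sub r2,r3,r6 in fetch, shown together with cmp r2,#3 and add r0,r1,#5.
Figure: 3-stage pipeline operation with a multi-cycle instruction. Instructions 2 to 5 are plotted against time (fetch ADD, decode, execute); some instructions occupy the execute stage for more than one cycle.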
Figure: instruction delayed by a multi-cycle instruction. ldmia r0,{r2,r3} is followed by sub r2,r3,r6 and cmp r2,#3; the fetch, decode and execute of sub and cmp are pushed later in time.
ARM9TDMI: 5-stage pipeline
Fetch
Decode
o instruction is decoded
o register operands read (3 read ports)
Execute
o an operand is shifted and the ALU result generated, or
o address is computed
Buffer/data
o data memory is accessed (load, store)
Write-back
o write to register file
Figure: ARM9TDMI 5-stage pipeline organization. Fetch (I-cache, next pc, +4, pc + 4); decode (instruction decode, register read, r15 = pc+8, immediate fields); execute (mul, shift, ALU, forwarding paths, post-index/pre-index mux for LDM/STM, load/store address); buffer/data (D-cache, byte repl., rot/sgn ex); write-back (register write). PC sources: B, BL, MOV pc; SUBS pc; LDR pc.
ARM9TDMI: Data Forwarding
Examples
o r5 := r5 + 2^r2 x r3
o r8 := r9 + r10
Stall?
o LD r3, [r2]    ; r3 := mem[r2]
o next instruction: r1 := r2 + r3
Figure: ARM9TDMI 5-stage pipeline organization, showing the forwarding paths into the execute stage.
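The "Stall?" case is the load-use hazard: forwarding can hand an ALU result straight to the next instruction, but a loaded value only exists after the buffer/data stage. The sketch below counts such cases under the assumption of a one-cycle interlock whenever an instruction uses the register loaded by the instruction immediately before it; the tuple format and function name are hypothetical.

```python
def count_load_use_stalls(program):
    """program: list of (op, dest, sources). Returns the number of stall cycles,
    assuming one stall per load whose result is used by the very next instruction."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        op, dest, _ = previous
        if op == "load" and dest in current[2]:
            stalls += 1
    return stalls


# LD r3, [r2] followed by r1 := r2 + r3
example = [("load", "r3", ("r2",)), ("add", "r1", ("r2", "r3"))]
```

Placing an independent instruction between the load and its use removes the stall in this model.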
ARM9TDMI: PC generation
3-stage pipeline
o operands are read in the execute stage
o r15 = PC + 8
5-stage pipeline
o operands are read in the decode stage, so r15 = PC + 4?
o incompatibilities between 3-stage and 5-stage implementations => unacceptable
o to avoid this, 5-stage pipeline ARMs emulate the behavior of the older 3-stage designs
Figure: ARM9TDMI 5-stage pipeline organization, showing PC generation (next pc, pc + 4, and pc+8 supplied as r15; PC sources B, BL, MOV pc, SUBS pc, LDR pc).
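The programmer-visible consequence of r15 = PC + 8 can be sketched as follows. The helper names are hypothetical; the branch calculation assumes the standard ARM encoding, in which a signed 24-bit word offset is shifted left by two and added to the value read from r15.

```python
MASK = 0xFFFFFFFF


def read_r15(instruction_address):
    """Value seen when an instruction at this address reads r15 (ARM state)."""
    return (instruction_address + 8) & MASK


def sign_extend_24(value):
    """Interpret the low 24 bits as a signed number."""
    value &= 0xFFFFFF
    return value - 0x1000000 if value & 0x800000 else value


def branch_target(instruction_address, offset24):
    """Target of B/BL: r15 + (signed 24-bit offset << 2)."""
    return (read_r15(instruction_address) + (sign_extend_24(offset24) << 2)) & MASK
```

A branch to itself therefore needs an offset of -2 words (0xFFFFFE), not 0.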
Figure: data processing instruction datapath activity. (a) register-register operations: Rd = Rn op Rm. (b) register-immediate operations: Rd = Rn op Imm, with the immediate taken from instruction bits [7:0]. In both cases r15 = AR + 4 and AR = AR + 4.
Figure: store register datapath activity. (a) first cycle, address calculation: AR = Rn op Disp, with the displacement taken from instruction bits [11:0]; r15 = AR + 4. (b) second cycle, store data (Ex2): AR = PC; mem[AR] = Rd<x:y> (byte?); if auto-indexing, Rn = Rn +/- 4.
Figure: branch instruction datapath activity. (a) target address calculation: AR = PC + Disp, lsl #2, with the displacement taken from instruction bits [23:0]. (b) link register save (if required): r14 = PC; AR = AR + 4. A third cycle follows.
ARM Implementation
Datapath: RTL (Register Transfer Level)
Control unit: FSM (Finite State Machine)
Shift operation
o second operand passes through the barrel shifter
ALU operation
o the ALU has input latches which are open in phase 1, allowing the operands to begin combining in the ALU as soon as they are valid, but they close at the end of phase 1 so that the phase 2 precharge does not get through to the ALU
o the ALU processes the operands during phase 2, producing the valid output towards the end of the phase
o the result is latched in the destination register at the end of phase 2
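Figure: datapath timing over phase 1 and phase 2: read bus valid, shift out valid, ALU out; ALU time and register write time; the phase 2 precharge invalidates the buses.
Figure: ripple-carry adder circuit (carry input Cin, output sum).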
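Figure: 4-bit carry look-ahead scheme: 4-bit adder logic with inputs B[3:0] and Cin[0], outputs sum[3:0] and Cout[3], and the G and P signals.
Figure: ALU logic for one result bit (signals include the ALU bus and the NA bus).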
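Figure: carry-select adder scheme. Each block of input bits (a,b[3:0] up to a,b[31:28]) computes both the sum (+, giving s) and the sum plus one (+1, giving s+1); multiplexers choose between them according to the incoming carry c, producing sum[3:0], sum[7:4], sum[15:8] and sum[31:16].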
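The carry-select idea in the figure can be sketched in software: each block computes its sum for both possible carry-in values, and the real carry then only selects one of the two. The block widths 4, 4, 8 and 16 follow the sum[3:0], sum[7:4], sum[15:8], sum[31:16] outputs of the figure; the function name and argument layout are assumptions.

```python
def carry_select_add(a, b, cin=0, block_bits=(4, 4, 8, 16)):
    """32-bit add built from blocks that precompute sum and sum + 1.
    Returns (sum, carry_out)."""
    result = 0
    carry = cin
    shift = 0
    for width in block_bits:
        mask = (1 << width) - 1
        x = (a >> shift) & mask
        y = (b >> shift) & mask
        s0 = x + y          # block result if carry-in is 0
        s1 = x + y + 1      # block result if carry-in is 1
        chosen = s1 if carry else s0   # the mux controlled by the incoming carry
        result |= (chosen & mask) << shift
        carry = chosen >> width
        shift += width
    return result, carry
```

In hardware the two candidate sums of every block are produced in parallel, so only the multiplexer delay is added per block once the carry arrives.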
Figure: ALU organization: A operand latch and B operand latch; XOR gates (invert A, invert B); logic functions; adder with C in; logic/arithmetic result mux (the multiplexor selects the output according to the function); zero detect; outputs: result and the C, V, N, Z flags.
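Table: carry arbitration encoding. For each bit position the inputs ai, bi determine the carry Ci, encoded as the pair vi, wi: 0, 0 (carry is 0), 1, 1 (carry is 1), or 1, 0 (carry not yet known). A second table combines the encodings of bit positions i and i-1 (inputs ai, bi, ai-1, bi-1).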
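Figure: cross-bar switch barrel shifter principle: inputs in[0], in[1], in[2], ... are routed to the outputs along switch diagonals labelled left 1, left 2, left 3.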
Multiplier design
All ARMs apart from the first prototype have included support for integer multiplication
o older ARM cores include low-cost multiplication hardware that supports only the 32-bit result multiply and multiply-accumulate
o recent ARM cores have high-performance multiplication hardware and support 64-bit result multiply and multiply-accumulate
Figure: carry-propagate (a) and carry-save (b) adder structures, built from full-adder cells with inputs B and Cin and outputs Cout and S.
Figure: high-speed multiplier organization: registers Rs (shifted right 8 bits/cycle) and Rm; carry-save adders producing a partial sum and a partial carry, which are rotated 8 bits/cycle; the ALU adds the partials.
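The role of the carry-save adders and the final "add partials" step can be sketched as below. A carry-save adder reduces three operands to a partial sum and a partial carry without propagating carries; one ordinary addition at the end produces the result. This model processes one multiplier bit per step rather than 8 bits per cycle, and the function names are hypothetical.

```python
def carry_save_add(a, b, c):
    """Reduce three operands to (partial_sum, partial_carry) with no carry propagation."""
    partial_sum = a ^ b ^ c
    partial_carry = ((a & b) | (a & c) | (b & c)) << 1
    return partial_sum, partial_carry


def multiply_csa(rm, rs, bits=32):
    """32-bit result multiply: accumulate partial products in carry-save form,
    then add the two partials once at the end (the ALU step in the figure)."""
    partial_sum, partial_carry = 0, 0
    for i in range(bits):
        if (rs >> i) & 1:
            partial_sum, partial_carry = carry_save_add(partial_sum, partial_carry, rm << i)
    return (partial_sum + partial_carry) & 0xFFFFFFFF
```

The invariant is that partial_sum + partial_carry always equals the sum of the partial products added so far.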
Figure: register cell circuit: one write port from the ALU bus and two read ports (A and B) onto the A bus and B bus.
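Figure: register bank floorplan: register cells between the Vdd and Vss rails, with the ALU bus, PC bus and INC bus as inputs and the A bus and B bus as outputs; the PC is handled separately from the other register cells.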
Figure: datapath layout: address register, incrementer, register bank (with PC), multiplier, shifter (shift out), ALU, data in, instruction pipe, data out.
Figure: control logic structure: the instruction and the cycle count feed the decode PLA, which drives address control, register control, ALU control, shifter control, multiply control and load/store multiple control.