Confidential 2
Introduction
I assume you have heard Shyam’s presentation
Focus on transition vs. porting
Porting is mostly about making it work
Transitioning covers:
Change in peripheral interface
Change in application approach
Porting issues for performance, size, behavior
Will cover issues coming from 8-bit/16-bit, ARM7, and ARM9
Time to unlearn bad habits forced on you
Do not be fooled by MHz or “MIPS”
Application style to best fit the needs
More integrated in HW and more done in SW
Focus on code/data size, performance, BOM cost, power
C/C++ vs. Assembly
Why do you normally end up having to write assembly?
Vector table (needs “call” or “mov”, etc)
Interrupt entry/exit stubs
Interrupt keyword usually will not support priority nesting
Compiled code too big and/or slow – parts of application must be hand coded
Specialized features, not compiler friendly
Initialization code – unless the one in the C runtime lib is acceptable
So, why not with Cortex-M3?
Vector table is C array of pointers. 1st entry is Stack pointer.
All ISRs are normal C functions with no special keyword, even Reset
Priority nesting supported in all cases (including faults, system handling)
Instruction set is compiler friendly.
Compilers can detect cases of special instructions
e.g. REVerse and REV16
((x & 0x00ff) << 8) | ((x & 0xff00) >> 8)
Initialization is a C function (ResetISR) with the stack already set up by HW
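As a sketch of the bullets above – names and table length are illustrative, not any vendor's actual startup file – the whole vector table can be written as an ordinary C array of function pointers, with the initial stack pointer as entry 0:

```c
#include <stdint.h>

typedef void (*handler_t)(void);    /* every entry is a plain C function */

/* Illustrative stand-in for the linker-provided stack top; on a real part
   the array is placed at address 0 by the linker script. */
static uint32_t stack_mem[256];

static volatile int ticks;

void ResetISR(void)  { /* ordinary C: HW has already loaded SP from entry 0 */ }
void TimerISR(void)  { ticks++; }   /* no special keyword; nesting by priority */

const handler_t vectors[] = {
    (handler_t)&stack_mem[256],     /* entry 0: initial stack pointer value */
    ResetISR,                       /* entry 1: reset handler */
    /* faults, SysTick, then one entry per peripheral interrupt ... */
    TimerISR,
};
```

Hardware loads SP from entry 0 and jumps through entry 1 at reset, so no assembly stub is needed anywhere.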
Coming from PIC
Cortex-M3 uses standard C code
No #fuses or #uses, set configuration as and when needed
No #INT_xxx function tags – just normal C functions
Just point to function from one or more vectors in C array
ISRs can call other functions, have all registers available
Stack may be common, or a separate one used for all ISRs
Hardware routes directly to each ISR – no software looking at flags
Can change ISRs dynamically (vector table can be moved to SRAM or elsewhere in Flash, not just one “alternate” as in PIC24+)
At least 8 priorities (vs. 2 on 8-bit, 7 on PIC24+), easily set/changed, with priority masking. Faults can be prioritized also
NMI for safety use - cannot be masked off
All GPIOs/Peripherals are direct writable/readable/configurable.
ARM GPIOs allow up to 8 GPIOs to be accessed in one LDR/STR
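A sketch of that last point, assuming the Stellaris-style GPIODATA aliasing where address bits [9:2] carry the pin mask (the port base address is just an example):

```c
#include <stdint.h>

/* Stellaris-style masked GPIO access: the GPIODATA register is aliased many
   times, and address bits [9:2] select which pins a single LDR/STR touches,
   so up to 8 pins can be read/written in one access with no software
   read-modify-write of the other pins. */
static uintptr_t gpio_masked_addr(uintptr_t port_base, uint8_t pin_mask)
{
    return port_base + ((uintptr_t)pin_mask << 2);
}

/* On target, e.g. write pins 0 and 1 without disturbing pins 2-7:
   *(volatile uint32_t *)gpio_masked_addr(0x40004000u, 0x03) = value;  */
```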
Coming from 8051
One unified address space, divided into ½GB regions
Same instructions used for all locations, no speed penalty
Code (Flash) from 0, SRAM (internal) from 0x20000000, Peripherals (internal) from 0x40000000
External RAM/Peripherals in middle (but not likely used)
System registers (interrupt controller, etc) from 0xE0000000
Bit access for 1st 1MB of RAM and 1st 1MB of Peripherals
Same model as 8051 (can access same location by bits & byte/half/word)
RMW is atomic
Does not need special instructions, so compiler friendly
Any pointer or variable may be used
System/peripheral registers (SFR) are memory mapped
All accessible using normal C code
Similar interrupt model: enables, priority, fixed assignments
More than two levels of priority, no SW save of PSW/ACC/etc, vector pointers
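The bit-band mapping above follows fixed address arithmetic; this host-side sketch computes the word-sized alias for a given bit (an illustration of the math only, not an actual hardware access):

```c
#include <stdint.h>

/* Each bit in the first 1MB of SRAM (0x20000000) and the first 1MB of
   peripherals (0x40000000) has a 32-bit alias in the bit-band region at
   region base + 0x02000000.  Writing 0/1 to the alias word performs an
   atomic read-modify-write of just that bit in hardware. */
static uint32_t bitband_alias(uint32_t byte_addr, unsigned bit)
{
    uint32_t region = byte_addr & 0xF0000000u;   /* 0x20000000 or 0x40000000 */
    uint32_t offset = byte_addr & 0x000FFFFFu;   /* offset within first 1MB */
    return region + 0x02000000u + (offset * 32u) + (bit * 4u);
}

/* On target: *(volatile uint32_t *)bitband_alias(addr, 3) = 1;  */
```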
Coming from MSP430
Cortex-M3 uses standard C/C++
32-bit words vs. 16-bit
Contiguous RAM, starts at ½GB
Contiguous Flash for code and data/configuration, starts at 0
Multiply (and Divide) is safe for interrupts
Interrupts are also vectored
User set priority vs. position in table
Nesting is automatic (by priority) vs. GIE
No special code or special work in C
RISC oriented instruction set
13 general-purpose registers (plus 3 more for SP, LR, PC)
Constants from MOV instruction
No indirect addressing, but PC-relative address “literals” in Flash
Most instructions are 1 cycle, not related to size
ARM GPIOs are similar design
But consistent (all follow same rules) and more pin control
Performance
Do not be fooled by MHz or “MIPS”
Only valid measure is amount of work done in a given period of time
Raising the MHz on many processors/MCUs barely makes them run faster
Introduces code wait states (Flash and/or RAM)
Stalls on peripherals – often big part of application
Non-deterministic behavior
Instruction “prediction”, branch caches, caches, etc
MIPS is measure of instruction set style, not work
E.g. 3-12 cycle hardware 32-bit/32-bit DIVide vs. 50+ cycles in software
50+ cycles has higher MIPS!
Less RISCy instructions get more work done, lower MIPS number
Better code density from less RISCy instructions
DMIPS mainly tells you about 3 things – memcpy(), strcmp(), and div
Div and strcmp() are often “gamed” by compiler vendors
More and more compilers “cheat” using auto-inlining, whole program opt
Peripherals and their bus interface
Wait states on peripherals are a hidden cost
Watch for slower peripheral buses – on any processor
When peripheral bus is slower than core clock – wait states
Impacts even stores when you have to write more than one in a row
Impacts maximum toggle rate of GPIOs, ability to feed/drain data, etc
You want a fast bus, regardless of peripheral rate
Wait states means processor stalled
Affects what you can do, but also interrupt latency
Ideal is to feed/drain the peripheral FIFO quickly, then have lots of time before needing to service the peripheral again
Even more critical if you have to bit-bang through GPIOs
Performance – interrupt overhead
Real measure: Time from HW trigger to 1st line of real user code
Longest instruction which stops interrupt adds to latency (tA)
e.g. LDM of 8 elements on ARM7 holds off interrupts for 10 cycles from 0-wait-state memory, for 26 cycles from a 2-wait-state peripheral, etc
Cortex-M3 uses interrupt-continue for LDM/STM and abandon for DIV/UMLAL/etc
Pushing registers, messing with modes, etc (tB)
Many example applications use direct entry, but that does not scale to multiple interrupts or multiple at the same time (nesting based on priority)
Often more than 20 cycles of difference in timing when allow nesting
Cortex-M3 does in HW in 12 cycles (saves registers and loads pipeline).
So, user code is now running – but be aware of the function prologue of any ISR.
Popping registers and resetting interrupt controller (tD)
Even when leaving one to enter another – pops all then pushes again
Cortex-M3 “tail chains” – skips pop/push and just jumps to new ISR (skips tE)
Higher priority interrupt held off by any of above
This is the case you have to allow for. If no nesting, then add longest ISR! (tC)
Cortex-M3: full priorities and nesting, pre-empt anytime, take over during transitions
[Figure: interrupt response timeline marking tA, tB, tC, tD, tE]
Interrupt jitter
Interrupt jitter is variability of response to the interrupt trigger (external or internal)
Priority jitter is a given (a higher priority interrupt should delay a lower one).
Jitter on high priority interrupt is a serious matter.
Most common jitter cause is a high priority interrupt being held off during the overhead for a lower priority one (register/mode save).
Even worse is case where processor does not allow nesting
High priority interrupt delayed by length of lower priority ISR
[Figure: one trigger, differing time-to-ISR across invocations; the range of time before the ISR is serviced is the jitter]
Interrupt response jitter
If you have two (or more) interrupts, what happens when they intersect?
Gpio1 is higher priority than Gpio2. Gpio2 is fixed periodic.
Both ISRs take the same time (for this example).
Shows skew in start time for Gpio2 and Gpio1.
Expect priority-based jitter for Gpio2
Issue is Gpio1 jitter (purple double line)
[Figure key, for all timing figures: interrupt entry overhead, interrupt exit overhead, pre-empted]
Interrupt response jitter
If you have two (or more) interrupts, what happens when they intersect?
Gpio1 is higher priority than Gpio2. Gpio1 is fixed periodic (this time).
Both ISRs take the same time (for this example).
Shows skew in start time for Gpio2 and Gpio1.
Expect priority based jitter for Gpio2
Issue is jitter for Gpio1 (purple double line)
Effect of Critical sections
Critical sections tend to “pack” interrupts at the enable point.
This is made worse when triggers are result of outputs (cycles)
The input/output cycle moves against enable
Over time, more do this, and inputs tend to land on each other
Long latency instructions, on processors that block, do this too
CM3 provides ways to mitigate most and avoid many
When you need them, use priority masking (BASEPRI) not disable
Don’t punish the ISRs that are not using the critical data!
Performance and size
Code ported from 8-bit/16-bit may bloat on 32-bit
Short/char locals can cause 40%+ increase in size and speed impact
Use ints (unsigned, int, long, unsigned long) – they are optimal
Can up-cast from smaller global/statics (e.g. extern short x; int lx = (int)x;)
Do not take address of local, forces to stack – otherwise in register only
How you access peripherals affects performance and size a lot
Cast constants may be the worst way! (e.g. *((unsigned *)0x40001008))
A smaller number of larger functions is more optimal (opposite of 8-bit)
Back-to-back loads from peripheral is faster and smaller
Avoiding back-to-back stores to peripheral is faster
Use optimizer
Many 8-bit/16-bit compilers have no real optimizer – very important on 32-bit
Code size and performance are dramatically affected (often >30%)
Check if compiler defaults to optimize for size or speed – not consistent
Use volatile for peripheral pointers (#define or not) and peripheral objects
Optimizer may get rid of code, reverse order, or otherwise “optimize”
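A common pattern that covers the peripheral-access points above is a volatile struct overlay instead of cast constants; the register layout and base address here are hypothetical:

```c
#include <stdint.h>

/* Hypothetical UART layout.  One base pointer lets the compiler use cheap
   base+offset addressing (smaller, faster code); volatile keeps the
   optimizer from removing, reordering, or combining the accesses. */
typedef struct {
    volatile uint32_t DATA;     /* offset 0x00 */
    volatile uint32_t STATUS;   /* offset 0x04 */
    volatile uint32_t CTRL;     /* offset 0x08 */
} uart_t;

#define UART0 ((uart_t *)0x40001000u)   /* example device base address */

static void uart_enable(uart_t *u)
{
    u->CTRL |= 1u;              /* explicit read-modify-write of one register */
}
```

On target you would call `uart_enable(UART0)`; the test below points the struct at ordinary memory instead.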
Using locals smaller than register size
Locals of size int:

    typedef int BASE;
    BASE foo(BASE last, BASE x, BASE y) {
       0: 2300      movs r3, #0
       2: e002      b.n a <foo+0xa>
        BASE i;
        for (i = 0; i < last; i++)
            x += (y * x);
       4: fb02 1101 mla r1, r2, r1, r1
       8: 3301      adds r3, #1
       a: 4283      cmp r3, r0
       c: dbfa      blt.n 4 <foo+0x4>
       e: ebc2 0001 rsb r0, r2, r1
        return(x-y);}
      12: 4770      bx lr

Locals of size short int (half word):

    typedef short BASE;
    BASE foo(BASE last, BASE x, BASE y) {
       0: f04f 0c00 mov.w ip, #0       ; 0x0
       4: e004      b.n 10 <foo+0x10>
        BASE i;
        for (i = 0; i < last; i++)
            x += (y * x);
       6: fb02 1301 mla r3, r2, r1, r1
       a: f10c 0c01 add.w ip, ip, #1   ; 0x1
       e: b219      sxth r1, r3
      10: fa0f f38c sxth.w r3, ip
      14: 4283      cmp r3, r0
      16: dbf6      blt.n 6 <foo+0x6>
      18: ebc2 0001 rsb r0, r2, r1
      1c: b200      sxth r0, r0
        return(x-y);}
      1e: 4770      bx lr
Application style
Application design affects performance, size, power use
Three most common types
Pure interrupt
Polling (PLC, DSP style, event/PID loop, etc)
Polling/RTOS with ISRs
Many people move to polling due to processor issues
When 30% or more is lost to interrupts, context switching, etc, what choice is there?
Pure interrupt ideal for many smaller applications
Polling/RTOS with ISRs gives excellent design options
Communications in ISRs
Time critical operations in ISRs
The rest is easier to design and program
Application design – mixed example
Motor control ISRs (e.g. PWM, ADC)
Avoiding interrupt latency on Cortex-M3
I have critical data, don’t I just create latency with int disable?
Three easy ways to avoid this
BASEPRI and BASEPRI_MAX: set priority to mask, don’t disable
If critical data used by priorities 5 to 7, set BASEPRI to 5
Interrupts 0 to 4 can still activate as normal (e.g. motor control)
BASEPRI_MAX only changes BASEPRI if the write makes the mask a higher priority
No compare needed. Set, critical-section, restore w/BASEPRI
Exclusives (LDREX/STREX for byte, half, word)
Much better than test-and-set
ISRs can set/clear data non-locking/non-blocking
main loop and lower priority ISRs just try again – no block/lock
E.g. RTOS queues between thread/ISR with no critical section
Bit band forms atomic read-modify-write on SRAM and Peripherals
Set population/claim/request bits
E.g. Thread-wake population bit + PendSV
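The exclusives bullet above can be sketched as a retry loop. On a Cortex-M3 build the CMSIS intrinsics __LDREXW/__STREXW map to the real LDREX/STREX instructions; here they are stubbed with single-threaded stand-ins (an assumption for host-side illustration) so the loop shape is visible:

```c
#include <stdint.h>

#ifndef __arm__
/* Host stand-ins (illustration only): real LDREX/STREX use an exclusive
   monitor, and STREX returns nonzero if the reservation was lost. */
static uint32_t __LDREXW(volatile uint32_t *p)             { return *p; }
static uint32_t __STREXW(uint32_t v, volatile uint32_t *p) { *p = v; return 0; }
#endif

/* Lock-free add: never disables interrupts.  If an ISR touches *p between
   the load and the store, STREX fails and the loop simply tries again. */
static uint32_t atomic_add(volatile uint32_t *p, uint32_t val)
{
    uint32_t old;
    do {
        old = __LDREXW(p);
    } while (__STREXW(old + val, p) != 0);
    return old;
}
```

The same shape works for queue put/get indices or claim bits, with no critical section and no blocking of higher priority interrupts.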
Polling vs. interrupt
Polling is poor use of processor (wastes time)
Introduces jitter (based on loop size, load time, etc)
Performance degrades quickly as more checks added
Most common reason it is used is that it is easier to understand
If interrupt overhead is low, interrupts are the better use of the processor
Some processors add so much overhead that polling is better
Cortex-M3 offers low overhead and low latency
With multiple priorities and low latency, easily understood behavior
FIFOed communication peripherals offer best of both
Amortize whatever interrupt overhead, but no extra spins polling
If interrupt overhead is too high, then FIFO needed just to work at all
Poll loop – simple example
for (i = 0; i < loops; i++)
{ // poll loop
if(HWREG(GPIO_IN_PORT_CLOCK))
{ // detect high, drive high
HWREG(GPIO_SCOPE_PORT) = PIN_OUT_SCOPE;
break;
}
else
{ // detect low, drive low
HWREG(GPIO_SCOPE_PORT) = 0;
}
}
// now capture data
…
Polling: read of input, write output
[Scope capture: Cortex-M3 at 50MHz – input (clock) vs. output (scope)]
Polling: ARM7 (same clock speed)
[Scope capture: ARM7 at 60MHz – input (clock) vs. output (scope)]
Interrupt driven
CM3 – can nest and prioritize, etc
void GPIO_trigger_ISR(void)
{ // on falling edge
    // define locals for the rest of the routine here
    HWREG(GPIO_SCOPE_PORT) = PIN_OUT_SCOPE; // drive high
    // now capture data, etc
    …
    HWREG(GPIO_SCOPE_PORT) = 0;             // drive low
    return; // done
}
Interrupt: drive high on falling edge (+work)
[Scope capture: Cortex-M3 at 50MHz – frame, clock, and data signals]
RTOS
Concerns about using an RTOS
Efficiency of Task Switching
Extra Memory Used
Cortex-M3 has many RTOS-friendly features
Faster/easier context switch - PendSV
Separation of service call (SVC) and context switch
Option of separate thread vs. interrupt/system stack
User/privilege for those that need it (use SVC vs. call)
Standard timer, SYSTICK
Standard interrupt controller
MPU for safety
PendSV for context switch
PendSV is software triggered exception
Pended, so executes when priority allows
Can be set by scheduler, ISR, or system code
Can be used with SVC or not (all privileged code)
Set at low(est) priority in the system
Ensures it is the last handler to run (tail chaining)
On entry, half of interrupted thread is already saved
Steps are simple:
Save other half on old process stack
Retrieve new process stack from TCB
Switch process stack
Load half of new process context from process stack
Exception return (loads rest in HW)
RTOS using PendSV (and maybe SVC)
Threads privileged: SVC when blocking/thread-change needed. SVCall uses PendSV to cause dispatch.
Thread calls system for request, which then uses PendSV to cause dispatch (all-privileged case).
[Figure: App/System/Kernel timelines for threads T1, T2, T3, showing tail chaining from SVCall into PendSV]
FreeRTOS.org Context Switching
ARM7 (SWI):
  Context Save:
    Save R0
    Get task SP in R0
    Save return address
    Restore R0
    Push all registers (task stack)
    Push SPSR
    Push nesting depth on stack
    Store new task SP in TCB
  (19 ARM instructions, many cycles, ints blocked in Push)
  Context Restore:
    Get task SP from TCB
    Pop nesting depth
    Pop SPSR
    Pop all registers
    Pop return address
    Return (new task)
  (12 ARM instructions, …)

Cortex-M3 (PendSV):
  Context Save:
    Get PSP in R0
    Push R4-R11 on task stack
    Push nesting depth
    Store new SP in TCB
  (11 Thumb-2 instructions, far fewer instructions, ints not blocked)
  Context Restore:
    Get SP from TCB
    Pop nesting depth
    Pop R4-R11
    Load PSP with new task stack
    If non-zero nesting, mask ints
    Return (new task)
  (14 Thumb-2 instructions, 12 or 13 executed)
Example FreeRTOS.org timing
                                  Cortex-M3       ARM7
Time per switch (thread+kernel)   4 µs/switch†    6.9 µs/switch
Switches/second                   250K/sec        145K/sec
Many excellent RTOS ports available
CMX Systems CMX-RTX and CMX-Tiny
Express Logic ThreadX
FreeRTOS.org FreeRTOS
IAR PowerPac
Interniche NicheTask
Keil/ARM RTX
Micrium μC/OS-II
Pumpkin Salvo
Segger embOS
Others…
Lower total BOM cost
Do more in SW
For example, motor control
Bit-bang vs. CPLD or FPGA
High speed serial to accomplish more
Use lower cost components when can offload work
Higher end peripherals
More supportable with Cortex-M3, so can do more
Can service higher rates
e.g. 100baseT, 1Mbps CAN, 1Msps ADC, 25MHz SPI, etc
Safety (e.g. IEC 61508)
Faults, MPU, lock-up, NMI, prioritized ISRs for deterministic response
What was two or three 8-bit MCUs can be done in one
Acts like virtual multi-processor (via ISRs)
Special instructions
Thumb-2 and Cortex-M3 have many special instructions
Many are directly used by compiler
e.g. SDIV/UDIV, MUL/MLA/MLS, UMULL/SMULL/SMLAL/UMLAL,
SBFX/UBFX, BFI/BFC, MOVT/MOVW, SXTH/UXTH/SXTB/UXTB
Some compilers may detect some cases and use:
e.g. REV/REV16/REVSH, CLZ
Else, access via “instruction intrinsics” (e.g. ntohs/htons inlined)
Others available through “instruction intrinsics”
e.g. USAT, SSAT, RBIT, WFI, WFE, SEV, MSR, MRS, CPS, etc
System features available as memory mapped registers
NVIC controls, setup, management
Most system controls, systick, reset control, MPU, etc
MPU optimized to allow STM/LDM to handle multiple regions at once
Also allows sub-regions for better granularity
Sleep primitives
Sleep vs. Deep-sleep – memory mapped register
Deep sleep allows the chip vendor more cycles to wake up
Sleep-on-exit control
When last ISR returns, sleep
Idle thread – skips pop/push for no purpose
WFI – wait for interrupt to wake up, sleep until
WFE – wait for event, sleep until
Trip-latch – remembers previous set (SEV, or event)
Wakes on interrupt pending if SEVONPEND
Used for intelligent polling
Makes for non-bus contending poll
Using SWV to get interrupt trace
Accurate to the cycle (e.g. 20ns at 50MHz)
Can see jitter, variability of execution time, periodicity, etc
Allows seeing nesting behavior (pre-emption)
Can also see related to sleep time and main thread time
Can be intermixed with other traced info, to see real behavior
For example, RTOS trace, watch-trace, host strings, etc
Using SWV for extreme accuracy profiling
HW PC Sampling at speeds such as 48,828 samples/second
CPI calculations add detailed information on mix of instructions and overhead
Concerned about Cortex-M3 maturity?
Cortex-M3 has exceeded the high reliability and maturity standard set by previous cores by a wide margin
The r1p0 core used in Stellaris Sandstorm parts and the r1p1 core used in Stellaris Fury parts have had no application-affecting bugs
Additions/changes have been features and minor trace-related fixes
This stability and lack of errors has shown the high quality of the modern ARM validation and test model
It has also shown the value of the support that Luminary and other lead partners have given ARM in ensuring the highest quality core
Moving forward
Shyam has covered
Goal oriented: focus on end users
ARM and its partners working together to get best benefit
Ultra low power, specific performance, specialized areas
Conclusion
You may move to all C/C++ and off-the-shelf code
Assembly should be unnecessary – you can use intrinsics if needed
If coming from 8-bit/16-bit, make sure you are using ints/unsigned
Optimizer is important – size and/or performance (can mix/match)
Do not be afraid to use interrupts
Use priority masking vs. interrupt disable for critical sections
Do not be afraid to use an RTOS if application suits
Reduce BOM cost by reducing parts on board, reducing the number of MCUs, and doing more in SW
Cortex-M3 based MCUs exceed quality and reliability standards