
Texas Instruments TMS320C54x DSP

Architecture and Programming


Andrew Fernandez April 30, 2001

Why DSP?
Growing Market
A dedicated ASIC is not always the best option for implementing signal processing
Flexibility

Sample Products

Sony VAIO MC-P10 Music Clip
Nokia 8210 Handset
JVC GR-DVM90 Digital Camcorder

May 2, 2001 C54 Architecture and Programming

Different Goals Generate Different Architectures


Differences from µP
    Multiply and Accumulate (MAC) is typical
    Many memory accesses required
    Predictable execution time

Types of Processing
    Continuous / Real-Time
        Limited storage
        Hard constraints
    Offline
        Entire signal stored in memory
        Softer constraints

Von Neumann Architecture

Inefficient for memory-intensive operations
One memory space
Example: 20 Tap FIR
    4 memory accesses per tap
    1 parallel MAC
    At least 80 cycles per output (20 taps x 4 accesses)!

Next Obvious Step: Harvard Architecture

Separate program and data memory
We can do better

Modified Harvard Architecture (C54)

Separate program and data memory
Enables parallel memory access (improves w/ DARAM)
May store coefficients in program memory (ROM)

The Popular TMS320C54x


Technology
1.0 - 5.0 V core
0.25 µm CMOS
30-160 MHz
0.21-0.52 mW/MIPS
4.0 mW standby
(Texas Instruments - ISSCC 1997)

Architecture (From 10,000 ft.)


16-bit fixed instructions
64K/64K Data/Program
1 MAC, 1 ALU, 2 Accumulators
8 Auxiliary Registers (ARs)
DARAM
Compare Select and Store (CSSU) for Viterbi
Source: www.ti.com and ISLPED 2000 Tutorial

C54 Block Diagram


Memory Access
4 Internal Bus Pairs
8 Auxiliary Registers (AR0-AR7)
Address Generation
    Circular buffers
    Increment / decrement

Number Crunching
40-bit Accumulators (A and B)
40-bit Barrel Shifter
Temporary Register
Dedicated support
    CSSU (Viterbi)
    Bit reverse (FFT)

C54x Memory, Buses, and Pipeline


Buses between the CPU, Internal Memory, and the Ext'l Mem I/F:
    Program A/D Bus (P)
    Data Read A/D Bus (D)
    Data Read A/D Bus (C)
    Data Write A/D Bus (E)
    Ext'l Mem I/F (A, D) to External Memory

Pipeline Phases
    P - generate program address
    F - get opcode
    D - decode instruction
    A - generate read address
    R - read operands
    X - execute

Full Pipeline
    Instr 1:  P F D A R X
    Instr 2:    P F D A R X
    Instr 3:      P F D A R X
    Instr 4:        P F D A R X
    Instr 5:          P F D A R X
    Instr 6:            P F D A R X
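The staggered diagram can be reproduced with a few lines of Python. This is only an illustrative sketch: it assumes one instruction enters the pipeline per cycle and that nothing stalls, and the helper names `pipeline_rows` and `total_cycles` are made up for this example.

```python
STAGES = ["P", "F", "D", "A", "R", "X"]  # phase names taken from the slide


def pipeline_rows(n_instructions):
    """One text row per instruction; instruction i enters the pipeline at cycle i."""
    rows = []
    for i in range(n_instructions):
        rows.append(" ".join(["."] * i + STAGES))
    return rows


def total_cycles(n_instructions):
    """Cycles needed to finish n instructions, assuming no stalls."""
    return len(STAGES) + n_instructions - 1


for row in pipeline_rows(6):
    print(row)
print(total_cycles(6))  # 11
```

Under these assumptions the pipeline is full from the sixth cycle on, and one instruction completes per cycle after that.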


C541 Memory Maps


[Figure: C541 memory maps, three 64K pages]

PAGE 0 (64K), Program
    0000-1400   Program RAM? (OVLY bit)
    1400-9000   External memory
    9000-FF80   Internal or External memory
    FF80-FFFF   VECTORS

PAGE 1 (64K), Data
    0000-1400   MMR / RAM
    1400-E000   External memory
    E000-FFFF   External memory or Internal ROM (DROM bit)

PAGE 2 (64K), I/O
    0000-FFFF   External I/O memory

Our Generic Data Memory


[Figure: generic data memory map with two zoomed views]

Data memory
    0000-147F   DARAM and SARAM
    1480-FFFF   External memory

DARAM and SARAM (0000-147F)
    0000-03FF   DARAM Block a
    0400-147F   SARAM

DARAM Block a (0000-03FF)
    0000        MMR
    0060        SPRAM
    0080-03FF   remainder of DARAM Block a

Ground Zero: Programming


Characteristics of DSP routines
    Short
    Repeated very often
    Time/Performance critical
    Assembly!

High Level Language (C)
    Speed in Development and Reuse
    Lower Development Cost

Low Level (Assembly)
    High Performance
    Lower Product Cost

Shorthand Notation
Term    What it means
Smem    16-bit single data memory operand
Xmem    16-bit dual data memory operand used in dual-operand instructions and some single-operand instructions. Read through D bus.
Ymem    16-bit dual data memory operand used in dual-operand instructions. Read through C bus.
lk      16-bit long constant
dmad    16-bit immediate data memory address (0 - 65,535)
pmad    16-bit immediate program memory address (0 - 65,535). This includes extended program memory devices.
src     Source accumulator (A or B)
dst     Destination accumulator (A or B)
PA      16-bit port (I/O) immediate address (0 - 65,535)

C54 Data Addressing


The C54x uses 5 basic data addressing modes:
    Indirect    Uses 16-bit registers as pointers
    Direct      Random access from a specified base address
    Absolute    Specify entire 16-bit address
    Immediate   Instruction contains the data operand
    MMR         Access memory-mapped registers

Indirect Addressing Options


Option                 Syntax          Action                                        Affected by
No Modification        *ARn            no modification to ARn
Increment / Decrement  *ARn+           post increment by 1
                       *ARn-           post decrement by 1
Indexed                *ARn+0          post increment by AR0                         AR0
                       *ARn-0          post decrement by AR0                         AR0
Circular               *ARn+%          post increment by 1 - circular                BK
                       *ARn-%          post decrement by 1 - circular                BK
                       *ARn+0%         post increment by AR0 - circular              BK, AR0
                       *ARn-0%         post decrement by AR0 - circular              BK, AR0
Bit-Reversed           *ARn+0B         post inc. ARn by AR0 with reverse carry       AR0 (= FFT size/2)
                       *ARn-0B         post dec. ARn by AR0 with reverse carry       AR0 (= FFT size/2)
Pre-modify             *ARn(lk)        *(ARn+LK), ARn unchanged
                       *+ARn(lk)       *(ARn+LK), ARn changed
                       *+ARn(lk)%      *(ARn+LK), ARn changed - circular             BK
                       *+ARn           pre-increment by 1, during write only
Absolute               *(lk)           16-bit lk is used as an absolute address (see Absolute Addressing)
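The "reverse carry" rows are the least intuitive entries in the table, so here is a small Python model of the idea. It is a sketch under stated assumptions, not the device's address generator: it treats the pointer as an offset inside a buffer that starts at a suitably aligned base, works only for power-of-two FFT sizes, and the function names are invented for this example.

```python
def reverse_bits(value, width):
    """Mirror the low `width` bits of value."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def reverse_carry_add(pointer, step, width):
    """Add step to pointer with carries propagating from the MSB toward the LSB."""
    total = reverse_bits(pointer, width) + reverse_bits(step, width)
    return reverse_bits(total & ((1 << width) - 1), width)


def bit_reversed_sequence(fft_size):
    """Offsets visited when stepping by fft_size/2 (the AR0 value) with reverse carry."""
    width = fft_size.bit_length() - 1  # log2 of a power-of-two size
    step = fft_size // 2
    pointer, order = 0, []
    for _ in range(fft_size):
        order.append(pointer)
        pointer = reverse_carry_add(pointer, step, width)
    return order


print(bit_reversed_sequence(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Stepping by half the FFT size with reverse carry visits the offsets in bit-reversed order, which is the reordering an FFT needs.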

Indirect Addressing - *
        LD  *AR1+,A
        STL A,*AR2+     ;...

Indirect Addressing allows sequential access to arrays
8 address registers (AR0-7) can be used as 16-bit pointers to data
ARs can be optionally modified
How do we initialize the ARs?

MMR and Immediate Addressing


start:  STM #tbl,AR1
        STM #x,AR2

STM (STore to Memory-mapped register) stores an immediate value to the specified MMR or SPRAM address.
STM writes the value to the register in the access phase of the pipeline to avoid latencies (more later).
#tbl is the 16-bit address of the assembly variable tbl.
Immediate operands, like #tbl, are located in program memory as part of the opcode.
2 words, 2 cycles

Data memory: MMRs start at 0000h, SPRAM runs from 0060h to 007Fh
Instruction words (16 bits each): [STM to AR1] [#tbl]

Immediate Addressing (Cont.) - #


Short immediate instructions are 1 word, 1 cycle:

        LD    #k5,ASM
        LD    #k8,dst   ;A or B
        LD    #k9,DP
        RPT   #k8
        FRAME #k8

All other immediate constants are 16 bits and require 2 words, 2 cycles.

Direct Addressing - @
Instruction:  opcode | 7-bit offset
Address:      9-bit DP | 7-bit offset   (16 bits)

Direct Addressing allows random, single-cycle access to 128 locations positively offset from a base address
The direct 16-bit address is formed by concatenating the base address (DP) with the 7-bit offset contained in the instruction
How is the Data Page (DP) initialized?

Generating Direct Addresses


        LD  #x,DP
        LD  @x+1,A
        ADD @x,A
        ADD @x+2,A

The first instruction loads the upper 9 bits of address x into DP (located in ST0) in a single cycle.

16-bit address of x:  0000 0000 1 000 0101 = 85h
DP = 0000 0000 1 - Data Page 1 - Base Addr = 80h

        LD  @x+1,A      0000 0000 1 000 0110 = 86h
        ADD @x,A        0000 0000 1 000 0101 = 85h
        ADD @x+2,A      0000 0000 1 000 0111 = 87h
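The address arithmetic on this slide can be checked with a short Python sketch. The function names are illustrative only; the 9-bit / 7-bit split and the value 85h for x come from the slide.

```python
def direct_address(dp, offset):
    """Concatenate the 9-bit data page with the 7-bit offset from the instruction."""
    return ((dp & 0x1FF) << 7) | (offset & 0x7F)


def split_address(address):
    """Return (dp, offset) for a 16-bit data address."""
    return address >> 7, address & 0x7F


X = 0x0085  # address of x used on the slide
dp, offset = split_address(X)
print(dp, offset)                            # 1 5
print(hex(direct_address(dp, offset)))       # 0x85  (ADD @x,A)
print(hex(direct_address(dp, offset + 1)))   # 0x86  (LD @x+1,A)
print(hex(direct_address(dp, offset + 2)))   # 0x87  (ADD @x+2,A)
```

Because the offset is only 7 bits wide, every variable reached through one DP value has to live in the same 128-word page.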

Absolute Addressing - *( )

        STL A,*(y)

Guarantees access to any location in the memory map by supplying the entire 16-bit address
Uses the indirect hardware to generate the address, hence the asterisk (*)
Always a MINIMUM of 2 words, 2 cycles

Dual Operand Instructions (X,Y)


Require less code
Execute faster
Dual operand addressing allows only certain pointers and modes:
    Pointers:  AR2, AR3, AR4, AR5
    Modes:     *ARn, *ARn+, *ARn-, *ARn+0%
    Modifiers: BK + AR0
Since the only index offered is circular, regular index is only accessible if BK is set to 0, or made very large, e.g., FFFFh.

Example Program: FIR Filter


[Figure: tapped delay line. Inputs x0, x1, x2, x3, ... are separated by z^-1 delays, scaled by coefficients a0, a1, a2, a3, ..., and summed to give y0.]

$$y_0 = \sum_{n=0}^{19} a_n x_n$$

y0 = a0*x0 + a1*x1 + a2*x2 + ... + a19*x19

20 Tap FIR implementation
Our goal is to compute one output (y0)
First, let's set up the link.cmd file and memory sections...
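Before moving to assembly, it helps to have a reference value to compare against. The Python sketch below computes the same sum of products. The coefficients reuse the 1,2,3,4,5 pattern from the deck's init_a table; the all-ones input is a made-up test signal, and `fir_output` is a name chosen for this example.

```python
def fir_output(coeffs, samples):
    """Compute one FIR output: the sum of a_n * x_n over all taps."""
    if len(coeffs) != len(samples):
        raise ValueError("coeffs and samples must have the same length")
    acc = 0  # plays the role of accumulator A
    for a_n, x_n in zip(coeffs, samples):
        acc += a_n * x_n  # one multiply-accumulate per tap
    return acc


a = [1, 2, 3, 4, 5] * 4  # 20 coefficients, same pattern as init_a
x = [1] * 20             # hypothetical input samples
print(fir_output(a, x))  # 60
```

Each loop iteration corresponds to one MAC, so a 20-tap filter needs 20 multiply-accumulates per output.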

Coding Environment
Link.cmd

    lab1.obj
    -o lab1.out
    -m lab1.map

    MEMORY {
      PAGE 1: /* Data Memory */
        SPRAM:  org=00060h len=0020h
        InRAM:  org=00400h len=0400h
        OutRAM: org=00800h len=0400h
      PAGE 0: /* Program Memory */
        ROM:    org=0F000h len=0F80h
    }
    SECTIONS {
      code   :> ROM    PAGE 0
      init   :> ROM    PAGE 0
      input  :> InRAM  PAGE 1
      output :> OutRAM PAGE 1
      coeff  :> SPRAM  PAGE 1
    }

Overview
    ROM:   code, init_a[20]
    C54x:  InRAM x[20], OutRAM y[1], SPRAM a[20]

FIR.asm

            .mmregs
    x       .usect "input",20
    a       .usect "coeff",20
    y       .usect "output",1
            .sect  "init"
    init_a  .int   1,2,3,4,5
            .int   1,2,3,4,5
            .int   1,2,3,4,5
            .int   1,2,3,4,5
            .sect  "code"

Processing Loop
FIR.asm

fir:
math:   MAC *AR2+,*AR3+,A
done:

Two methods may be used to find y0:
1. Multiply, then add
        MPY *AR2+,*AR3+,B
        ADD B,A
2. Multiply/Accumulate
        MAC *AR2+,*AR3+,A

Dual-operand instructions must use:
    AR2, AR3, AR4, AR5
    Modifiers: none, +, -, +0%

Initialize Pointers
FIR.asm

fir:    STM #a,AR2
        STM #x,AR3
math:   MAC *AR2+,*AR3+,A
done:

AR2 -> Coefficients: a0 a1 a2 ...
AR3 -> Input Data:   x0 x1 x2 ...

STM
    Stores #value to the MMR early in the pipeline to avoid latencies
    2 words, 2 cycles

Load Accumulator
FIR.asm

fir:    STM #a,AR2
        STM #x,AR3
        LD  #0,A
math:   MAC *AR2+,*AR3+,A
done:

We must first initialize A using a load instruction.

        LD source, [leftshift,] dst

source:    constant or memory location
leftshift: none, T[5:0] (use TS), or constant (-16 to +16)
           Ex: LD @x,16,A
dst:       A, B, T, DP, ASM

Accumulator A:  G (bits 39-32) | H (bits 31-16) | L (bits 15-0)

LD:
    Loads dst[15:0] by default
    May be 1 or 2 cycles

Store Result
FIR.asm

fir:    STM #a,AR2
        STM #x,AR3
        LD  #0,A
math:   MAC *AR2+,*AR3+,A
        STL A,*(y)
done:

Memory is 16 bits wide, so we must specify which part of the result to store

        STL/H source, [leftshift,] dst

source:    Accumulators A, B
leftshift: none, ASM, or constant (-16 to 15)
           Ex: STL B,-8,*AR5
dst:       any memory location

STL/STH may be 1 or 2 cycles

Accumulator A:  G (bits 39-32) | H (bits 31-16) | L (bits 15-0)
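To make the G/H/L layout concrete, the following Python sketch splits a 40-bit accumulator value into its three fields and models what a low or high store would write. It is a simplified model: the accumulator is a signed Python integer, a negative shift is treated as an arithmetic right shift, and the names `fields`, `store_low`, and `store_high` are invented here rather than taken from any TI tool.

```python
MASK40 = (1 << 40) - 1


def fields(acc):
    """Split a 40-bit accumulator value into (guard, high, low)."""
    acc &= MASK40
    return (acc >> 32) & 0xFF, (acc >> 16) & 0xFFFF, acc & 0xFFFF


def _shift(acc, shift):
    """Positive shift moves left, negative shift moves right."""
    return acc << shift if shift >= 0 else acc >> -shift


def store_low(acc, shift=0):
    """Model of STL: the 16-bit word taken from bits 15-0 after shifting."""
    return _shift(acc, shift) & 0xFFFF


def store_high(acc, shift=0):
    """Model of STH: the 16-bit word taken from bits 31-16 after shifting."""
    return (_shift(acc, shift) >> 16) & 0xFFFF


acc = 0x12345678
print([hex(f) for f in fields(acc)])  # ['0x0', '0x1234', '0x5678']
print(hex(store_low(acc)))            # 0x5678
print(hex(store_high(acc, 1)))        # 0x2468
```

A full 32-bit result therefore takes two stores, one for each half.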

Streamline Loops
FIR.asm

fir:    STM #a,AR2
        STM #x,AR3
        LD  #0,A
        RPT #(20-1)
math:   MAC *AR2+,*AR3+,A
        STL A,*(y)
done:

Execute the next instruction n+1 times:
    1. RPT #n
    2. RPT Smem
    3. RPTZ src,#n

    RPT:  1 or 2 cycles
    RPTZ: Clears the ACC before repeating. Always 2 words, 2 cycles

Execute the next block of instructions n+1 times:
        STM  #n,BRC
    then...
        RPTB done-1

Copy Coefficients

FIR.asm
fir:    STM  #a,AR2
        RPT  #(20-1)
        MVPD #init_a,*AR2+
        STM  #a,AR2
        STM  #x,AR3
        LD   #0,A
        RPT  #(20-1)
math:   MAC  *AR2+,*AR3+,A
        STL  A,*(y)
done:

Copy values from one memory location to another:

        MVPD #pmad, Smem

    PC  -> init_a: 1 2 3 ...   (PC = PC+1 every access)
    AR2 -> a:      1 ...

Move instructions:
    Prog <-> Data   MVPD, MVDP, READA, WRITA
    Data <-> Data   MVKD, MVDK, MVDD
    MMR  <-> Data   MVMD, MVDM
    MMR  <-> MMR    MVMM

Program Flow
FIR.asm

fir:    STM  #a,AR2
        RPT  #(20-1)
        MVPD #init_a,*AR2+
        STM  #a,AR2
        STM  #x,AR3
        LD   #0,A
        RPT  #(20-1)
math:   MAC  *AR2+,*AR3+,A
        STL  A,*(y)
        RET
    - or -
done:   B    done

Implementing a subroutine requires:
    CALL fir            2w, 4c
    RET                 1w, 4c

Other program flow instructions:
    B    next           2w, 4c
    BACC src            1w, 6c
    CALA src            1w, 6c

Conditional program flow:
    BC next,cnd, ...    2w, 3c/5c
    CC next,cnd, ...    2w, 3c/5c
    RC cnd, ...         1w, 3c/5c

Conditions: 3 max w/ restrictions, ANDed
    Ex: CC fir, AEQ, AOV
    A/B: EQ, NEQ, LEQ, GEQ, LT, GT, OV, NOV
    TC, NTC, C, NC, BIO, NBIO

Fixed Point Processing


Q Point notation for placement of the decimal point in a fixed bit number
40-bit accumulators used to prevent overflow (usually)
Fractional multiplication to retain most of the data in the product (.9 x .9 = .81: value x value yields a double-size result)
Saturation and Rounding
Sign extension

Handling Accumulator Overflow


A or B:  Guard (bits 39-32) | High (bits 31-16) | Low (bits 15-0)

Guard bits increase dynamic range from +/-1 to +/-128

Use Guard Bits (allow at least 128 signed summations)
In a non-gain system temporary overflow is permitted. The output is guaranteed to remain bounded by the input.
In a system with gain, the output is not guaranteed to remain bounded (i.e. the result is larger than 32 bits).

How do you handle a result larger than 32 bits?
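The slide leaves the closing question open. One common answer, and the one hinted at by "Saturation and Rounding" earlier in the deck, is to clamp the value to the 32-bit range before storing it. The Python sketch below illustrates that idea with invented helper names; it models the arithmetic only, not how the C54x is configured to saturate.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def fits_in_40_bits(acc):
    """True if acc is representable in a signed 40-bit accumulator."""
    return -(2**39) <= acc < 2**39


def saturate32(acc):
    """Clamp acc to the signed 32-bit range."""
    return max(INT32_MIN, min(INT32_MAX, acc))


# Worst case: 128 additions of the largest positive 32-bit value.
acc = 0
for _ in range(128):
    acc += INT32_MAX

print(fits_in_40_bits(acc))   # True: the guard bits absorb the growth
print(acc > INT32_MAX)        # True: the sum no longer fits in 32 bits
print(hex(saturate32(acc)))   # 0x7fffffff
```

The guard bits keep the running sum exact while it accumulates; saturation only matters at the moment the result is narrowed to 32 bits.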

Fractional Multiplication
    .9 x .9 = .81  ->  .8
    value times value yields a double-size result; .8 is the result to be stored

We assume F*F < 1
Notice that most of the information is retained

Fractional Multiplication (Cont.)


Fractional model (bit weights: -1, 1/2, 1/4, 1/8)

        0100
      x 1101
    --------
    00000100
    0000000
    000100
    11100
    --------
    11110100

    ACC: 1111 0100
    mem: 1110

Store 1.110 (-1/4) to memory
How is the redundant sign bit eliminated?

        STH  A,1,*AR0   ;MANUAL
    -OR-
        SSBX FRCT       ;AUTO
        STH  A,*AR0

FRCT shifts multiply results left by 1
The tools do not support fractions
To store 0.707 use:

    a0  .int 32768*707/1000

32767 = 7FFFh = ~1
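The 4-bit example and the `.int 32768*707/1000` trick can both be checked numerically. The Python sketch below is a plain arithmetic model with hypothetical helper names: it shifts the double-width product left by one (the step that FRCT or the manual shift performs) and keeps the upper half.

```python
def to_q15(value):
    """Quantize a value in [-1, 1) to a 16-bit Q15 integer, truncating like int()."""
    q = int(value * 32768)
    return max(-32768, min(32767, q))


def frac_multiply(a, b, bits):
    """Multiply two signed `bits`-wide fractions and return a `bits`-wide fraction.

    The double-width product carries a redundant sign bit, so it is shifted
    left by 1 before the upper half is kept.
    """
    product = a * b
    return (product << 1) >> bits


# 4-bit example from the slide: 0100 (1/2) x 1101 (-3/8)
result = frac_multiply(0b0100, -3, 4)
print(format(result & 0xF, "04b"))        # 1110, i.e. -1/4

# Q15 versions of the same operands
print(frac_multiply(16384, -12288, 16))   # -6144, i.e. -0.1875

# The constant produced by .int 32768*707/1000
print(32768 * 707 // 1000, to_q15(0.707))  # 23166 23166
```

The 4-bit result is -1/4 rather than the exact -3/16 because only the upper half of the product is stored; with 16-bit operands the same truncation costs far less precision.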

Implementing Circular Buffering


math:   MAC *AR2+,*AR3+0%,A

AR2 -> Coefficients:  a0 a1 a2 ...
AR3 -> Input Buffer:  x[0] x[1] x[2] ... x[n]   (length BK)

Circular addressing is modulo
First, define the buffer size using BK:

        STM #(N+1),BK

The % modifier indicates circular addressing and is available for all ARs
Why was +0% used?
Because we are forced to use +0%, how do we make it look like +%?

        STM #1,AR0
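The wrap-around behaviour can be sketched in Python. This model assumes the buffer base is the pointer with its low n bits cleared, where n is the smallest value with 2^n greater than BK, and that the step is no larger than BK. The function name and the 0400h base address are illustrative choices, not taken from the slide.

```python
def circular_step(ar, step, bk):
    """Post-modify pointer `ar` by `step` inside a circular buffer of `bk` words."""
    n_bits = bk.bit_length()          # smallest n with 2**n > bk
    mask = (1 << n_bits) - 1
    base = ar & ~mask                 # start of the aligned buffer
    index = (ar & mask) + step        # position inside the buffer
    if index >= bk:
        index -= bk                   # wrap past the end
    elif index < 0:
        index += bk                   # wrap before the start
    return base | index


BK = 20
ar3 = 0x0400 + 19                     # pointing at the last element
ar3 = circular_step(ar3, 1, BK)       # behaves like *AR3+0% with AR0 = 1
print(hex(ar3))                       # 0x400
```

With AR0 loaded with 1, stepping by AR0 gives the same sequence a plain circular increment would.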

Circular Buffer Alignment


Circular Buffer
    x[0] x[1] x[2] ... x[19]   (BK = 20)

Circular buffers must be aligned on the next 2^n boundary greater than BK.
On what boundary should a block size of 20 be aligned? (align 32)

    a   .usect "coeff",20

How? Use the align argument in the linker command file:

    SECTIONS{
        coeff :> DARAM align(32) PAGE 1
    }

The linker will attempt to fill unused memory locations
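The alignment rule reduces to finding the smallest power of two strictly greater than BK. A minimal Python check follows; the function names are made up for this example, and the test addresses are simply plausible data addresses.

```python
def circular_alignment(bk):
    """Smallest power of two strictly greater than bk."""
    return 1 << bk.bit_length()


def is_aligned(address, bk):
    """True if address is a valid start for a circular buffer of bk words."""
    return address % circular_alignment(bk) == 0


print(circular_alignment(20))     # 32
print(is_aligned(0x0060, 20))     # True
print(is_aligned(0x0410, 20))     # False
```

For a 20-word buffer this gives the same value as the `align(32)` argument shown above.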

Pipeline Issues
Analysis:
    Most 'C54x code requires no special attention
    Some MMR writes require care (MMR reads are not a problem)
    Latency requirements are resolved via Latency Tables

Typical 'C54x System Code
    C Code: No Problem
    ASM Code
        CALU Operations: No Problem
        MMR Writes: step through code and use NOPs to resolve conflicts
            Early Writes
                Early: write occurs at least 6 cycles prior to a read
                Example: FIR setup code
            All Other MMR Writes
                Use Latency Tables

References
[1] TMS320C54x User's Guide, available from the Texas Instruments Literature Response Center.
[2] TMS320C54x DSP Design Workshop, Texas Instruments Technical Training.
[3] S. W. Smith, The Scientist and Engineer's Guide to Digital Signal Processing, San Diego: California Technical Publishing, 1999.
[4] Ingrid Verbauwhede, Dave Garrett, Low-Power DSPs for Wireless Communications, ISLPED 2000.
