Advanced Bus

STATE-OF-THE-ART INTERCONNECT FABRICS AND
COMMUNICATION PROTOCOLS
AHB: critical overview

Protocol
-Lacks parallelism
  -In-order completion
  -Address of next transaction just anticipated on the bus
  -No multiple outstanding transactions: cannot hide slave wait states effectively
-High arbitration overhead (min. 2 cycles on single transfers)
-Bus-centric vs. transaction-centric
  -Initiators and targets are exposed to bus architecture (e.g. arbiter)
  -No decoupling, instance-specific bus components
Topology
-Scalability limitation of shared bus solution!
Bus evolution
Protocol
-Toward enhanced parallelism
Topology
-Toward improved utilization of the topology (throughput, latency)
Topology evolution
Shared bus with unidirectional request and response lanes
Crossbar with unidirectional request and response lanes
Topology evolution
Partial crossbar with unidirectional request and response lanes
Multi-layer bus architecture
[Figure: several shared buses, each with its own masters (M0-M9) and slaves (S0-S4), interconnected through a crossbar (xbar)]
The communication bottleneck
Today: multi-layer topology
[Figure: IP blocks (LX, IP 1, IP 2, IP 3, IP 5) and IPTG blocks attached to the system interconnect, which leads to the off-chip memory controller]
Jeopardizing design predictability, feasibility and cost!
Topology evolution
Mesh topologies compared: 4-ary 2-mesh, 2-ary 4-mesh, 2-ary 2-mesh
Metrics compared: switches, bisection bandwidth, tiles per switch, switch arity, max. hops
Values surviving in this copy: 16 switches for both the 4-ary 2-mesh and the 2-ary 4-mesh; a switch arity of 10 (column not recoverable)
[Figure: tile and switch layout of each mesh]
Low latency
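The metrics named on this slide can be estimated for a k-ary n-mesh with a few closed-form expressions. The sketch below is illustrative only: the function name is hypothetical, hops are counted as switch-to-switch links (the slide may use a different convention), and the bisection count assumes an even k.

```python
def mesh_metrics(k: int, n: int) -> dict:
    """Rough figures for a k-ary n-mesh (k switches per dimension, n dimensions).

    Assumptions (not taken from the slides): hops count switch-to-switch
    links only, and k is even so the mesh can be cut into two equal halves.
    """
    if k < 2 or n < 1 or k % 2 != 0:
        raise ValueError("sketch assumes even k >= 2 and n >= 1")
    return {
        "switches": k ** n,                # one switch per mesh node
        "max_hops": n * (k - 1),           # corner to opposite corner
        "bisection_links": k ** (n - 1),   # links cut when halving one dimension
    }


if __name__ == "__main__":
    for k, n in [(4, 2), (2, 4)]:
        print(f"{k}-ary {n}-mesh:", mesh_metrics(k, n))
```

Both the 4-ary 2-mesh and the 2-ary 4-mesh come out at 16 switches, matching the values on the slide; the higher-dimensional mesh trades a larger bisection for more ports per switch.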
Split transactions
A split-transaction bus is a bus where the request and response phases are split and independent to improve bus utilization
-Master must arbitrate for the request phase
-Slave must arbitrate for the response phase
[Figure: timeline in which the master holds the bus for the request, the bus is released while the slave is busy, and the slave then holds the bus for the response]
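A back-of-the-envelope comparison shows why releasing the bus between request and response helps. This is a minimal sketch with made-up cycle counts; the function name and the tuple layout are assumptions for illustration, and arbitration overhead is ignored.

```python
def bus_busy_cycles(transactions, split: bool) -> int:
    """Cycles during which the bus is held.

    transactions: list of (request_cycles, slave_wait_cycles, response_cycles).
    On a non-split bus the master keeps the bus while the slave is busy;
    on a split-transaction bus the bus is released during the wait.
    """
    busy = 0
    for request, wait, response in transactions:
        if split:
            busy += request + response
        else:
            busy += request + wait + response
    return busy


if __name__ == "__main__":
    txns = [(1, 4, 2), (1, 0, 2)]  # hypothetical cycle counts
    print("non-split:", bus_busy_cycles(txns, split=False))
    print("split:    ", bus_busy_cycles(txns, split=True))
```

The cycles saved are exactly the slave wait states, which other masters can now use.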
Multiple outstanding transactions
[Figure: master and slave connected by a queue of pending requests and a queue of pending responses]
The master needs to associate each response to one of its pending requests
The initiator should support multiple outstanding transactions too
Out-of-order completion
[Figure: timeline in which the master sends one request to slow slave S1 and one to fast slave S2, each slave with its own queue of pending requests; the response from S2 returns before the response from S1]
Association between requests and responses is more challenging
The typical case for out-of-order completion is when a fast slave is addressed after a slow slave. The fast slave will return its response earlier.
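One common way to associate responses with pending requests is to tag each request and look the tag up when the response arrives. The sketch below is a hypothetical software model, not a description of any specific bus; the class and method names are invented for illustration.

```python
class TaggedMaster:
    """Toy master that tracks outstanding requests by tag."""

    def __init__(self):
        self.pending = {}   # tag -> (target, address)
        self.next_tag = 0

    def issue(self, target, address):
        """Record a new outstanding request and return its tag."""
        tag = self.next_tag
        self.next_tag += 1
        self.pending[tag] = (target, address)
        return tag

    def complete(self, tag, data):
        """Match a response to its request, whatever order it arrives in."""
        target, address = self.pending.pop(tag)
        return target, address, data


if __name__ == "__main__":
    m = TaggedMaster()
    slow = m.issue("S1", 0x100)   # slow slave addressed first
    fast = m.issue("S2", 0x200)   # fast slave addressed second
    print(m.complete(fast, "fast data"))  # response from S2 returns first
    print(m.complete(slow, "slow data"))
```

With in-order completion a plain FIFO of pending requests would be enough; the tag lookup is what out-of-order completion adds.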
Out-of-order completion
[Figure: timeline in which the master sends requests S11 and S12 to the same slave S1; S12 is anticipated, so the response of S12 returns before the response of S11]
Out-of-order completion even in case multiple outstanding transactions are addressed to the same complex slave
A complex slave may use local optimizations and change the processing order of incoming requests (e.g., serve accesses to an open row first in an SDRAM device)
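The open-row optimization mentioned above can be sketched as a tiny scheduler: serve the oldest request that hits the currently open row, otherwise the oldest request overall. This is a simplified model under stated assumptions (single bank, no timing), and the function name and request format are hypothetical.

```python
def open_row_first(requests, open_row):
    """Return the order in which a toy SDRAM controller serves requests.

    requests: list of (tag, row) in arrival order.
    open_row: the row currently open in the (single, assumed) bank.
    """
    pending = list(requests)
    order = []
    while pending:
        # Prefer the oldest request that hits the open row.
        hit = next((r for r in pending if r[1] == open_row), None)
        chosen = hit if hit is not None else pending[0]
        pending.remove(chosen)
        open_row = chosen[1]      # serving a request leaves its row open
        order.append(chosen[0])
    return order


if __name__ == "__main__":
    # S11 arrives first but targets a closed row; S12 hits the open row.
    print(open_row_first([("S11", 7), ("S12", 3)], open_row=3))
```

Here S12 is served before S11 even though both target the same slave, which is the situation the slide describes.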
Bus-centric architecture
[Figure: master interface and slave interface attached directly to the bus architecture]
Internal bus components are directly exposed to the connected master and slave interfaces
The bus architecture is instance-specific and lacks modularity
Transaction-centric architecture
[Figure: master and slave interfaces connected through a point-to-point communication protocol; the bus architecture and its components are hidden behind the interfaces]
Internal bus components are hidden behind bus interfaces
Modular architecture
Orthogonalization of concerns
Internal bus architecture can freely evolve without impacting the interfaces
The only objective of interfaces: specifying communication transactions! (communication abstraction)
But what is there on the market?
AMBA Multi-layer AHB
Enables parallel access paths between multiple masters and slaves
Fully compatible with AHB wrappers
It is a topology (not protocol) evolution
Pure combinational matrix (scales poorly with no. of I/Os)
[Figure: Master1 and Master2, each on its own AHB layer, connected through the interconnect matrix to the slaves]
Multi-layer AHB implementation
The matrix is completely flexible and can be adapted
MUXes are point arbitration stages
AHB layer can be AHB-lite: single master, no req/grant, no split/retry
Multi-layer AHB implementation
A layer losing arbitration is stalled (wait-stated) by means of HREADY
While a layer is stalled, the input stage samples and holds its pipelined address and control signals
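The point arbitration at each slave port can be pictured with a small model: layers targeting different slaves proceed in parallel, while layers competing for the same slave are arbitrated and the losers are stalled. The sketch assumes fixed priority by layer number, which is only one possible scheme, and the function name is hypothetical.

```python
def arbitrate(requests):
    """One arbitration round of a toy multi-layer matrix.

    requests: dict mapping layer number -> slave it addresses this cycle.
    Returns (grants, stalled): grants maps slave -> winning layer, stalled
    lists the layers that lost and would be held off (HREADY low).
    Assumption: lower layer number means higher priority.
    """
    grants = {}
    stalled = []
    for layer in sorted(requests):
        slave = requests[layer]
        if slave in grants:
            stalled.append(layer)   # slave port already taken this cycle
        else:
            grants[slave] = layer
    return grants, stalled


if __name__ == "__main__":
    # Layers 0 and 2 compete for S1; layer 1 reaches S2 in parallel.
    print(arbitrate({0: "S1", 1: "S2", 2: "S1"}))
```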
Hierarchical systems
Slaves accessed only by masters on a given layer can be made local to the layer
Multiple slaves
Multiple slaves appear as a single slave to the matrix
-combine low bandwidth slaves
-group slaves accessed only by one master (e.g. DMA controller)
Alternatively, a slave can be an AHB-to-APB bridge, thus allowing connection to multiple low-bandwidth slaves
Multiple masters per layer
Combine masters that have low bandwidth requirements
Putting it all together
Interconnect matrix and Slave4 are used for across-layer communication
HW semaphores
Dual port slaves
Common for off-chip SDRAM controllers

Layer1: bandwidth limited high priority traffic with
low latency requirements (e.g., processor cores)
Layer2: Bandwidth-critical traffic
(e.g., hardware accelerators)
The dual-port slave may even be connected to the matrix
AMBA 3.0 (AMBA AXI)

This is an evolution of the communication protocol
High bandwidth low latency designs
High frequency operation
Flexibility in the implementation
Backward compatible with AHB and APB
Novel features with respect to AHB
Burst-based transactions with only first address issued
Address information can be issued before/after actual write data transfer
Multiple outstanding addresses
Out-of-order transaction completion
Easy addition of register stages for timing closure
Design paradigm change
[Figure: initiator (master) and target (slave), each connected through an AXI interface to the communication architecture]
Point-to-point interface specification
Independent of the implementation of the communication architecture
Communication architecture can freely evolve (be customized)
Transaction-based specification of the interface
Open Core Protocol (OCP) is another example of this paradigm
Transaction-centric bus
AXI can be used to interconnect:
-an initiator to the bus
-a target to the bus
-an initiator with a target
The interface definition allows a variety of different interconnect implementations
[Figure: initiator (master) connected through AXI to a target (slave)]
Interconnect approaches
[Figure: masters and slaves connected through AXI interfaces to a crossbar or to a shared bus]
Most systems use one of three interconnect approaches:

-shared address and data buses
Most common
-Shared address buses and multiple data buses
-Multilayer, with multiple address and data buses
Channel-based Architecture
Five groups of signals
-Read Address: AR signal name prefix
-Read Data: R signal name prefix
-Write Address: AW signal name prefix
-Write Data: W signal name prefix
-Write Response: B signal name prefix
Channels are independent and asynchronous wrt each other
Read transaction
Single address for burst transfers
Write transaction
Single response for an entire burst
Channels - One way flow
Write address channel: AWVALID, AWADDR, AWLEN, AWSIZE, AWBURST, AWLOCK, AWCACHE, AWPROT, AWID, AWREADY
Write data channel: WVALID, WLAST, WDATA, WSTRB, WID, WREADY
Read data channel: RVALID, RLAST, RDATA, RRESP, RID, RREADY
Write response channel: BVALID, BRESP, BID, BREADY
Channel: a set of unidirectional information signals
Valid/Ready handshake mechanism
READY is the only return signal
Valid: source IF has valid data/control signals
Ready: destination IF is ready to accept data
Last: indicates last word of a burst transaction
Valid ready handshake
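The valid/ready handshake can be modeled cycle by cycle: a word moves only in a cycle where VALID and READY are both high, and the source keeps VALID (and its payload) stable until then. The following is a minimal behavioral sketch, not RTL; the function name and the trace format are assumptions made for illustration.

```python
def handshake(payloads, ready_trace):
    """Simulate one channel and return the (cycle, payload) transfers.

    payloads: words the source wants to send, in order.
    ready_trace: per-cycle READY value driven by the destination.
    The source holds VALID high, with the same payload, until it is accepted.
    """
    queue = list(payloads)
    transfers = []
    for cycle, ready in enumerate(ready_trace):
        valid = bool(queue)               # VALID high while data is waiting
        if valid and ready:               # transfer happens on VALID && READY
            transfers.append((cycle, queue.pop(0)))
    return transfers


if __name__ == "__main__":
    # Destination is not ready in cycles 0, 1 and 3.
    print(handshake(["D11", "D12"], [0, 0, 1, 0, 1]))
```

Because READY is the only signal flowing back, the same model applies unchanged to each of the five channels.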
AMBA 2.0 AHB Burst
[Timing diagram: ADDRESS shows A11 A12 A13 A14 A21 A22 A23; DATA shows D11 D12 D13 D14 D21 D22 D23 D31]
AHB Burst
Address and Data are locked together
Two pipeline stages
HREADY controls pipeline operation
AXI - One Address for Burst
[Timing diagram: ADDRESS shows A11 A21; DATA shows D11 D12 D13 D14 D21 D22 D23 D31]
AXI Burst
One Address for entire burst
AXI - Outstanding Transactions
[Timing diagram: ADDRESS shows A11 A21; DATA shows D11 D12 D13 D14 D21 D22 D23 D31]
AXI Burst
One Address for entire burst
Allows multiple outstanding addresses
Problem: Slow slave
[Timing diagram: ADDRESS shows A11 A21 A31; DATA shows D11 D12]
If one slave is very slow, all data is held up.
Out-of-Order Completion
[Timing diagram: ADDRESS shows A11 A21; DATA shows D21 D22 D23 and D31 returned ahead of D11 D12 D13 D14]
Out of order completion allowed
Fast slaves may return data ahead of slow slaves
Complex slaves may serve requests out-of-order
Each transaction has an ID attached (given by the master IF)
Channels have ID signals - AID, RID, etc.
Transactions with the same ID must be ordered
The interconnect in a multi-master system must append another tag to ID to make each master's ID unique
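The extra tag appended by the interconnect can be pictured as a few high-order bits carrying the master port number. The sketch below assumes a 4-bit master-side ID and hypothetical helper names; real interconnects choose their own widths and encodings.

```python
ID_BITS = 4  # assumed width of the ID issued by each master interface


def extend_id(master_index: int, txn_id: int) -> int:
    """Prefix the master's own ID with its port number (interconnect side)."""
    if not 0 <= txn_id < (1 << ID_BITS):
        raise ValueError("transaction ID does not fit in ID_BITS")
    return (master_index << ID_BITS) | txn_id


def split_id(extended: int):
    """Recover (master_index, original ID) to route a response back."""
    return extended >> ID_BITS, extended & ((1 << ID_BITS) - 1)


if __name__ == "__main__":
    # Two masters reuse ID 3; the slave sees two distinct IDs.
    a = extend_id(0, 3)
    b = extend_id(1, 3)
    print(a, b, split_id(b))
```

On the response path the interconnect strips the prefix again, so each master only ever sees the IDs it issued.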
Ordering restrictions
Simple rules
A simple master can issue transactions with the same ID (implicitly forcing in-order delivery)
A simple slave can serve requests in the order they arrive, regardless of the ID tag
AXI - Data Interleaving
[Timing diagram: ADDRESS shows A11 A21; DATA shows beats of different bursts interleaved: D21 D22 D11 D23 D12 D31 D13 D14]
Returned data can even be interleaved
Gives maximum use of data bus
Note - Data within a burst is always in order
Burst read
Valid high until ready high
The valid-ready handshake regulates data transfer

This is clearly a split transaction bus!
Overlapping burst read

Address of second burst issued:
True outstanding transactions
Burst write
Register slices for max frequency
Channels are asynchronous
Register slices can be applied across any channel
Allows maximum frequency of operation by changing delay into latency
[Figure: register slice on the write data channel signals WID, WDATA, WSTRB, WLAST, WVALID, WREADY]
Other AXI features
No early burst termination, but fine granularity specification of burst beats (1-16)
Burst types:
-Fixed (FIFO-like)
-Incremental
-Wrapping
Support for system caches
-Bufferable vs. Cacheable transactions
Support for privileged vs. normal and secure vs. non-secure transactions
Support for exclusive accesses
-Read exclusive, followed by write exclusive
Support for locked accesses
-Terminated by an unlocked access
Write data interleaving (of transactions with different IDs)
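Since only the first address of a burst is issued, the slave has to derive the address of every later beat from the burst type. A minimal sketch of that calculation for the three burst types listed above follows; it assumes a start address aligned to the beat size, and the function name and string constants are hypothetical.

```python
def burst_addresses(start, beats, size_bytes, burst_type):
    """Per-beat addresses of a burst, given only its first address.

    Assumptions: start is aligned to size_bytes; 1 <= beats <= 16.
    """
    if not 1 <= beats <= 16:
        raise ValueError("a burst has 1 to 16 beats")
    if burst_type == "FIXED":       # FIFO-like: same address every beat
        return [start] * beats
    if burst_type == "INCR":        # incremental: address grows each beat
        return [start + i * size_bytes for i in range(beats)]
    if burst_type == "WRAP":        # wrapping: wraps at an aligned boundary
        total = beats * size_bytes
        low = (start // total) * total
        return [low + (start - low + i * size_bytes) % total
                for i in range(beats)]
    raise ValueError(f"unknown burst type: {burst_type}")


if __name__ == "__main__":
    print([hex(a) for a in burst_addresses(0x1008, 4, 4, "WRAP")])
```

A wrapping burst of this kind is typical of cache line fills, where the critical word is fetched first.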
Comparison
2 wait states memories
[Figure: initiators Init1-Init3 and memories Mem1-Mem3 attached to the bus]
AHB: it is impossible to hide slave response latency
STBUS low buf: while the previous response phase is in progress, a new request can be processed by the next addressed slave
STBUS high buf: more data pre-accessed while previous response phase is in progress
AXI: interleaving support in interfaces and interconnect allows better interconnect exploitation
Scalability
Highly parallel benchmark (no slave bottlenecks)
1 memory wait state
[Figure: two bar charts of relative execution time for AHB, AXI, STBus and STBus (B) with 2, 4, 6 and 8 cores; one panel for a 1 kB cache (low bus traffic), one for a 256 B cache (high bus traffic)]
Scalability
[Figure: two bar charts, interconnect busy and interconnect usage efficiency (0-100%), for AHB, AXI, STBus and STBus (B) with 2, 4, 6 and 8 cores]
Increasing contention: AXI, STBus show 80%+ efficiency, AHB < 50%
Saturation of shared bus architectures
Networks-on-Chip (NoCs)
Same paradigm as Wide Area Networks and large scale multi-processors
[Figure: master and slave IP cores, each attached through a network interface (NI) to a NoC of switches; a packet travels as header, payload and tail flits]
Clean separation at session layer
-Core issues end-to-end transactions (through AXI, OCP, ...)
-Network deals with lower level issues
Modularity at HW level
-Only 2 building blocks: network interface, switch
Physical design aware
-Path segmentation
-Regular routing
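The network interface is where an end-to-end transaction becomes a packet of flits. The sketch below is one possible encoding, assumed for illustration only: the header flit carries the destination, and the last data word travels in the tail flit. The function names and tuple layout are hypothetical.

```python
def packetize(destination, words):
    """Split a transaction into header, payload and tail flits (toy NI)."""
    if not words:
        raise ValueError("sketch assumes at least one data word")
    flits = [("HEADER", destination)]                 # routing information
    flits += [("PAYLOAD", w) for w in words[:-1]]     # body of the packet
    flits.append(("TAIL", words[-1]))                 # last word closes packet
    return flits


def depacketize(flits):
    """Rebuild (destination, words) at the receiving network interface."""
    kind, destination = flits[0]
    if kind != "HEADER" or flits[-1][0] != "TAIL":
        raise ValueError("malformed packet")
    return destination, [value for _, value in flits[1:]]


if __name__ == "__main__":
    packet = packetize(3, [10, 20, 30])
    print(packet)
    print(depacketize(packet))
```

The switches only need to look at the header flit to route the packet, which is what keeps them simple and reinstantiable.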
Shared buses vs NoCs

NoCs Pros.
- Each integrated IP core adds bus load capacitance
+ Only point-to-point one-way links are used
- Bus timing problems in deep sub-micron designs
+ Better suited for GALS paradigm
- Arbiter delay grows with no of masters. Instance-specific arbiter
+ Distributed routing decisions. Reinstantiable switches
- Bus bandwidth is shared among all masters
+ Aggregate bandwidth scales with network dimension
Shared buses vs NoCs

NoCs Cons.
+ After bus is granted, bus access latency is null
- Unpredictable latency due to network congestion problems
+ Very low silicon cost
- High area cost
+ Simple bus-IP core interface
- Network-IP core interface can be very complex (e.g. packetization,..)
+ Design guidelines are well known
- Design guidelines start to consolidate
