
Friday, October 12th, 2012

An introduction to on-chip networks (NoC)

Davide Zoni PhD Student


email: zoni@elet.polimi.it
webpage: home.dei.polimi.it/zoni

Outline

Introduction to Network-on-Chip

New challenges

Scenario

Cache implications

Topologies and abstract metrics

Routing algorithms

Types

Deadlock free property

Limitations

Router microarchitecture

Flit based

Optimization dimensions

Tiled multi-core architecture with shared memory

Source: Natalie Jerger, ACACES Summer School, 2012

Some slides adapted from ...


Specific References

Timothy M. Pinkston, University of Southern California,


http://ceng.usc.edu/smart/slides/appendixE.html

On-Chip Networks, Natalie E. Jerger and Li-Shiuan Peh

Principles and Practices of Interconnection Networks, William J. Dally and Brian Towles

Other people

Chita R. Das, Penn State NoC Research Group

Li-Shiuan Peh, MIT

Onur Mutlu, CMU

Keren Bergman, Columbia

Bill Dally, Stanford

Rajeev Balasubramonian, Utah

Steve Keckler, UT Austin

Valeria Bertacco, University of Michigan

What about an interconnection network ?

Applications: low-latency, high-bandwidth, dedicated channels between logic and memory

Technology: dedicated channels are too expensive in terms of area, power and reliability

What about an interconnection network ?

An interconnection network is a programmable system that transports data between terminals

Technology: Interconnection network helps efficiently utilize scarce resources

Application: Managing communication can be critical to performance

What about a classification ?


Interconnection networks can be grouped into four domains
depending on number and proximity of devices to be
connected

Networks on Chip (NoCs or OCNs)


Devices include: microarchitectural elements (functional units, register files), caches,
directories, processors
Current/Future systems: dozens to hundreds of devices
Ex: Intel TeraFLOPS research prototype (80 cores), Intel Single-chip Cloud Computer (48 cores)
Proximity: millimeters

System/Storage Area Networks (SANs)

Multiprocessor and multicomputer systems

Interprocessor and processor-memory interconnections


Server and data center environments

Storage and I/O components


Hundreds to thousands of devices interconnected

IBM Blue Gene/L supercomputer (64K nodes, each with 2 processors)


Maximum interconnect distance

tens of meters (typical) to a few hundred meters

Examples (standards and proprietary): InfiniBand, Myrinet, Quadrics, Advanced Switching Interconnect

LANs and WANs


Local Area Networks (LANs)

Interconnect autonomous computer systems

Machine room or throughout a building or campus

Hundreds of devices interconnected (1,000s with bridging)

Maximum interconnect distance

few kilometers to few tens of kilometers

Example (most popular): Ethernet, with 10 Gbps over 40 km

Wide Area Networks (WANs)

Interconnect systems distributed across globe

Internet-working support required

Many millions of devices interconnected

Max distance: many thousands of kilometers

Example: ATM (asynchronous transfer mode)

Network scenario (figure slides)

Why networks ? (figure slide)

What about computing demands ? (figure slide)

The energy-performance wall (figure slides)

Why on-chip networks?

They provide external connectivity from system to outside world

Also, connectivity within a single computer system at many levels

I/O units, boards, chips, modules and blocks inside chips

Trends: high demand on communication bandwidth

Increased computing power and storage capacity

Switched networks are replacing buses

Integral part of many-core architectures

Energy consumed by communication will exceed that of computation in future systems

Lots of innovation needed!

Computer architects/engineers must understand interconnect problems and solutions in order to more effectively design and evaluate systems

On-chip vs off-chip
Significant research in multi-chassis interconnection networks (off-chip)

Supercomputers and Clusters of workstations

Internet routers

Leverage research and insight but...

Constraints are different

Pin-limited bandwidth

Mix of short and long packets on-chip

Inherent overheads of off-chip I/O transmission

New research area to meet performance, area, thermal, power and reliability
needs (On-chip)

Wiring constraints and metal layer limitations

Horizontal and vertical layout

Short, fixed length

Repeater insertion limits routing of wires

Avoid routing over dense logic

Impact wiring density


Some examples

BlueGene/L:
- Huge power consumption (about one million Watts)
- Complicated network structure

Mellanox Server Blade (IB 4X, CPU, system logic):
- Total power budget constrained by packaging and cooling costs <= 30W
- Network power consumption ~10 to 15 W

IP Routers:
- Constrained by costs + regulatory limits
- ~200W line card, ~60W interconnection network

Alpha 21364 (& its thermal profile):
- Packaging and cooling costs, Dell's law <= $25
- Router+link ~25W

MIT Raw CMP:
- Complicated communication networks
- On-chip network consumes about 36% of total chip power

(Also pictured: Intel SCC 48-core)

On-chip Networks

(figure: a 4x4 grid of processing elements (PEs) connected by an on-chip network)

On-chip Networks: outline

Topology

Routing

Properties

Deadlock avoidance

Router microarchitecture

Baseline model

Optimizations

Metrics

Power

Performance

On-chip Network: Where we are ...

General Purpose
Multi-cores

Shared
Memory

Distributed memory
(or Message Passing)


On-chip Network: Where we are ...

Here we are

Shared
Memory

General Purpose
Multi-cores

Distributed memory
(or Message Passing)


Shared memory multi-core

Memory Model in CMPs

Message Passing

Explicit movement of data between nodes and address spaces

Programmers manage communication

Shared Memory

Communication occurs implicitly through load/store (memory-accessing) instructions

Will focus on shared memory

Look at optimization for cache coherence protocols


Memory Model in CMPs

Logically

Practically...

All processors access some shared memory


cache hierarchies reduce access latency to improve performance

Requires cache coherence protocol

to maintain coherent view in presence of multiple shared copies


Consistency model: the behaviour of the memory system in a multi-core environment, i.e. which behaviours are allowed and which are not
Coherence: hides the cache hierarchy from the programmer (without losing the performance improvement it brings)


Tiled multi-core architecture with shared memory

Source: Natalie Jerger, ACACES Summer School, 2012

Intel SCC

2D mesh with state-of-the-art VC routers

2 cores per tile

Multiple voltage islands

1 Vdd per tile

1 NoC Vdd island

Source: Natalie Jerger, ACACES Summer School, 2012


Coherence Protocol on Network Performance

Coherence protocol shapes communication needed by system

Single writer, multiple reader invariant

Requires:

Data requests

Data responses

Coherence permissions

Suggested reading for a quick review of coherence:

A Primer on Memory Consistency and Cache Coherence, Daniel Sorin, Mark Hill and David Wood. Morgan Claypool Publishers, 2011.


Hardware cache coherence

Rough goal:

all caches have same data at all times

Minimal flushing and maximum caching give the best performance

Two solutions:

Broadcast-based protocol:

All processors see all requests at the same time, same order.

Often relies on bus

But can broadcast on unordered interconnect


Directory-based protocol:

Order of the requests relies on a different mechanism than bus

Maybe better flexibility and scalability

Maybe higher latency


Broadcast-based coherence

Source: Natalie Jerger, ACACES Summer School, 2012


Coherence Bandwidth Requirements

How much address bus bandwidth does snooping need?

Well, coherence events generated on...

Misses (only in L2, not so bad)

Dirty replacements

Some parameters:

2 GHz CPUs, 2 IPC

33% memory operations, 2% of which miss in L2

50% of evictions are dirty

Some results:

(0.33 * 0.02) + (0.33 * 0.02 * 0.50) = 0.01 events/insn

0.01 events/insn * 2 insn/cycle * 2 cycle/ns = 0.04 events/ns

Request: 0.04 events/ns * 4 B/event = 0.16 GB/s = 160 MB/s

Data response: 0.04 events/ns * 64 B/event = 2.56 GB/s

What about scalability ? That's 2.5 GB/s ... per processor

With 16 processors, that's 40 GB/s!

With 128 processors, that's 320 GB/s!!
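The arithmetic above is easy to check; here is a minimal Python sketch using only the numbers quoted on this slide (4 B per request and 64 B per data response come from the example, everything else is arithmetic):

```python
# Back-of-the-envelope snoop bandwidth, reproducing the slide's numbers.
ipc, freq_ghz = 2, 2                  # 2 IPC at 2 GHz
mem_frac, l2_miss = 0.33, 0.02        # 33% memory ops, 2% miss in L2
dirty_frac = 0.50                     # 50% of evictions are dirty

events_per_insn = mem_frac * l2_miss * (1 + dirty_frac)   # misses + dirty replacements
events_per_ns = events_per_insn * ipc * freq_ghz          # insn/cycle * cycle/ns

print(round(events_per_insn, 3))          # ~0.01 events/insn
print(round(events_per_ns * 4, 2))        # request bandwidth, ~0.16 GB/s
print(round(events_per_ns * 64, 2))       # data bandwidth, ~2.5 GB/s per processor
for n in (16, 128):
    print(n, round(n * events_per_ns * 64), "GB/s aggregate")   # ~41 and ~324 GB/s (the slide rounds to 40 and 320)
```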


Scalable Cache Coherence

Two-part solution:

Bus-based interconnect:

Replace the non-scalable bandwidth substrate (bus)... with a scalable bandwidth substrate (point-to-point network, e.g. mesh)

Processor 'snooping' bandwidth:

Interesting: most snoops result in no action

Replace the non-scalable broadcast protocol (it spams everyone)... with a scalable directory protocol (it only contacts the processors that care)

NOTE: physical address space statically partitioned (Still shared!!)

Can easily determine which memory module holds a given line

That memory module is sometimes called the home (see the sketch after this list)

Can't easily determine which processors have the line in their caches

Bus-based protocol: broadcast events to all processors/caches

Simple and fast, but non-scalable
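To make the "home" idea concrete, here is a minimal sketch of one common static mapping (block-interleaved by physical address); the 64-byte line size and 16 home nodes are assumptions for illustration, not values from the slides:

```python
# Minimal sketch of a static home-node mapping for a directory protocol.
BLOCK_BITS = 6        # 64-byte cache lines (assumption)
NUM_HOMES = 16        # number of memory modules / directory slices (assumption)

def home_node(paddr: int) -> int:
    """Return the node that owns the directory entry for this physical address."""
    block_addr = paddr >> BLOCK_BITS
    return block_addr % NUM_HOMES

# Any node can compute the home locally, with no lookup traffic:
print(home_node(0x0000), home_node(0x0040), home_node(0x4000))   # 0, 1, 0
```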


Scalable Cache Coherence

Source: Natalie Jerger, ACACES Summer School, 2012


Coherence Protocol Requirements

Different message types

Unicast, multicast, broadcast

Directory protocol

Majority of requests: Unicast

Lower bandwidth demands on network

More scalable due to point-to-point communication

Broadcast protocol

Majority of requests: Broadcast

Higher bandwidth demands

Often rely on network ordering


Impact of Cache Hierarchy

Sharing of injection/ejection port among cores and caches

Caches reduce average memory latency

Private caches

Multiple L2 copies

Data can be replicated to be close to processor


Shared caches

Data can only exist in one L2 bank

Addresses are striped across banks (lots of different ways to do this)

Aside: lots of research on cache block placement, replication and migration

Caches serve as a filter for interconnect traffic


Private vs. Shared Caches

Private caches

Reduce latency of L2 cache hits

keep frequently accessed data close to processor

Increase off-chip pressure

Shared caches

Better use of storage

Non-uniform L2 hit latency

More on-chip network pressure

all L1 misses go onto network


On-chip Network: Private L2 Cache Hit

(figure: the core issues LD A (1), misses in the L1 I/D cache (2), and hits in the private L2 cache (3); the request is satisfied locally, without a network traversal)

Source: Chita Das, ACACES Summer School, 2011

On-chip Network: Private L2 Cache Miss


(figure: the core issues LD A, which misses in the L1 I/D cache (2) and in the private L2 cache (3); a message is formatted to the memory controller and the request is sent off-chip; when the data is received (6) it is sent to the L2)

Source: Chita Das, ACACES Summer School, 2011

On-chip Network: Shared L2 Local Cache Miss


(figure: the core issues LD A (1), which misses in the local L1 I/D cache; a request message is formatted (3) and sent over the network to the shared L2 bank that A maps to; the remote tile receives the message and forwards it to its L2 (4), which hits; the data is sent back to the requestor (6), which receives it and passes it to the L1 and the core)

Source: Chita Das, ACACES Summer School, 2011

Network-on-Chip details


Topology nomenclature 1

Two broad classes: Direct and Indirect Networks


Direct Networks: every node is both a terminal and a switch
Examples: mesh, torus, k-ary n-cubes
Indirect Networks: the network is composed of switches that connect the end nodes
Examples: MIN, crossbar, etc.

Direct and indirect network examples (figure)
Source: Natalie Jerger, ACACES Summer School, 2012

Topology abstract metrics 1

Switch Degree: Number of links/edges incident on a node

Proxy for estimating cost

Higher degree requires more links and port counts at each router

Source: Natalie Jerger, ACACES Summer School, 2012

(figure: example topologies with switch degrees 2, 3 and 4)

Topology abstract metrics 2

Hop Count: number of hops a message takes from source to destination

Proxy for network latency
Every node and link incurs some propagation delay, even when there is no contention
Network diameter: largest minimal hop count in the network
Average minimum hop count: average across all source/destination pairs
Minimal hop count: smallest hop count connecting two nodes
Implementations may incorporate non-minimal paths (increases avg hop count)

(figures: three example topologies with Max=4 / Avg=2.2, Max=4 / Avg=1.77 and Max=2 / Avg=1.33)
Source: Natalie Jerger, ACACES Summer School, 2012
Avg=1.33

Topology abstract metrics implications

Abstract metrics are just proxies: they do not always correlate with the real metric they represent
Example:
Network A with 2 hops, 5-stage pipeline, 4-cycle link traversal vs.
Network B with 3 hops, 1-stage pipeline, 1-cycle link traversal
Hop count says A is better than B
But A has an 18-cycle latency vs. a 6-cycle latency for B
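A two-line check of the example above (per-hop cost = pipeline depth + link traversal is the simplifying assumption of this sketch):

```python
# Hop count is only a proxy: per-hop cost matters.
def no_load_latency(hops, pipeline_stages, link_cycles):
    return hops * (pipeline_stages + link_cycles)

print(no_load_latency(2, 5, 4))   # Network A: 18 cycles
print(no_load_latency(3, 1, 1))   # Network B: 6 cycles
```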

Topologies typically trade-off hop count and node degree


Traffic patterns

How to stress a NoC?


Synthetic traffic patterns
Uniform random
Optimistic: it can make a bad network look good
Matrix transpose
Many others based on probabilistic distributions and pattern selection
algorithms
Real traffic patterns
Real benchmarks executed on the simulated architecture
More accurate
Complete evaluation of the system performance
Time consuming simulation
Is the selected traffic suitable for my application?
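As a rough illustration, here is a sketch of two of the synthetic patterns named above; the node numbering (id = y * k + x on a k x k mesh) is an assumption of this sketch:

```python
# Sketch of two synthetic traffic patterns for a k x k mesh.
import random

def uniform_random_dest(src, num_nodes):
    """Uniform random: every other node is an equally likely destination."""
    dst = random.randrange(num_nodes)
    while dst == src:
        dst = random.randrange(num_nodes)
    return dst

def transpose_dest(src, k):
    """Matrix transpose: node (x, y) always sends to node (y, x)."""
    x, y = src % k, src // k
    return x * k + y

k = 4
print([transpose_dest(s, k) for s in range(k * k)])
print(uniform_random_dest(0, k * k))
```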


Routing, Arbitration, and Switching


Routing

Defines the allowed path(s) for each packet (Which paths?)


Problems
Livelock and Deadlock

Arbitration

Determines use of paths supplied to packets (When allocated?)

Problems

Starvation
Switching

Establishes the connection of paths for packets (How allocated?)

Switching techniques

Circuit switching, Packet switching


Until now old wine in a new bottle...but for caches

(figure word cloud: deadlock, packets, routing algorithm, flow control, router/switch, throughput, latency)

Where is the difference?

Until now old wine in a new bottle...but for caches

On-chip network criticalities:

Low power

Limited resources

High performance

High reliability

Thermal issues

NoC granularity overview

Messages: composed of one or more packets

(NOTE: if the message fits in the maximum packet size, only one packet is created)

Packets: composed of one or more flits

Flit: flow control digit

Phit: physical digit

(subdivides a flit into chunks equal to the link width)

Off-chip: channel width limited by pins


On-chip: abundant wiring means phit size == flit size
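A small sketch of the message / packet / flit breakdown; the 64-byte maximum packet size and 16-byte flit width are invented for illustration:

```python
# Sketch of the message -> packet -> flit breakdown (sizes are assumptions).
MAX_PACKET_BYTES = 64    # maximum packet payload
FLIT_BYTES = 16          # flit width; on-chip we assume phit size == flit size

def packetize(message: bytes):
    """Split a message into packets, each packet into flits."""
    packets = [message[i:i + MAX_PACKET_BYTES]
               for i in range(0, len(message), MAX_PACKET_BYTES)]
    return [[p[j:j + FLIT_BYTES] for j in range(0, len(p), FLIT_BYTES)]
            for p in packets]

msg = bytes(100)                           # a 100-byte message
pkts = packetize(msg)
print(len(pkts), [len(p) for p in pkts])   # 2 packets: 4 flits + 3 flits
```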


Routing overview

Usually the topology discussion assumes ideal routing, while routing algorithms are not ideal in practice
Once the topology is fixed, routing determines the path from source to destination

GOAL: distribute traffic evenly among paths

Avoid hot spots and contention

The more balanced the algorithm is, the closer it gets to ideal throughput

Keep complexity in mind


Routing algorithm attributes

Types

Oblivious: the route is chosen without regard to network state (e.g. at random); very efficient to implement

Adaptive: the algorithm uses the network state to modify the routing path, so packets with the same (source, destination) pair may follow different paths

Routing path

Deterministic: all packets from a given (source, destination) pair always use the same path, regardless of the network state

Minimal: all packets use a shortest path from source to destination
Non-minimal: packets may be routed along a longer path, depending for example on the network state

Number of destinations

Unicast: typical and easy solution in NoC

Multicast: useful with cache coherence messages

Broadcast: typical in bus-based architectures


The deadlock avoidance property

Each packet is occupying a link and waiting for a link

Without routing restrictions, a resource cycle can occur

Leads to deadlock

This is because resources are shared


Deterministic routing

All messages from Source to Destination traverse the same path

Common example: Dimension Order Routing (DOR)

Message traverses network dimension by dimension

Aka XY routing

Pros:

Simple and inexpensive to implement

Deadlock-free (why???)

Cons:

Eliminates any path diversity provided by the topology

Poor load balancing


Deterministic routing

aka X-Y Routing

Traverse network dimension by dimension

Can only turn to Y dimension after finished X

It removes a number of turns to ensure the deadlock-free property
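A minimal sketch of the XY (DOR) decision at each hop; the coordinates and port names (E/W/N/S/LOCAL) are assumptions of this sketch:

```python
# Minimal XY (dimension-order) routing function for a 2D mesh.
def xy_route(cur, dst):
    """Return the output port for the current hop: one of E, W, N, S, LOCAL."""
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:                      # finish the X dimension first ...
        return 'E' if dx > cx else 'W'
    if cy != dy:                      # ... only then move in Y
        return 'N' if dy > cy else 'S'
    return 'LOCAL'                    # arrived: eject to the local PE

# Route from (0, 0) to (2, 1): E, E, N, then eject.
hop, path = (0, 0), []
while True:
    port = xy_route(hop, (2, 1))
    path.append(port)
    if port == 'LOCAL':
        break
    step = {'E': (1, 0), 'W': (-1, 0), 'N': (0, 1), 'S': (0, -1)}[port]
    hop = (hop[0] + step[0], hop[1] + step[1])
print(path)   # ['E', 'E', 'N', 'LOCAL']
```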


Adaptive routing

Exploits path diversity

Uses network state to make routing decisions

Buffer occupancies often used

Coupled with flow control mechanism

Local information readily available

Global information more costly to obtain

Network state can change rapidly

Use of local information can lead to non-optimal choices

Can be minimal or non-minimal


Minimal adaptive routing

Local information can result in sub-optimal choices


Non-minimal adaptive routing

Fully adaptive

Not restricted to take shortest path

Misrouting: directing packet along non-productive channel

Priority given to productive output

Some algorithms forbid U-turns

Livelock potential: traversing network without ever reaching destination

Limit number of misroutings

What about power consumption ?


Turn model for adaptive routing

In a 2D mesh the possible turns form two cycles; DOR eliminates 4 turns

N to E, N to W, S to E, S to W

No adaptivity

Is it possible to do better?

Hint: some models relax this and eliminate only 2 turns instead of 4 in a 2D mesh

Turn model


Turn model for adaptive routing 1

Basic steps

Partition channels according to the direction in which they route packets

Identify possible turns

Identify the cycles formed by combining turns, i.e. the simple cycles

Break each simple cycle

Check whether the combination of simple cycles allows the formation of complex cycles

Example on a 2D mesh: 2 simple cycles

Turn model for adaptive routing 2

The DOR algorithm prohibits 4 turns to ensure the deadlock-free property

What about removing just 1 turn per cycle ?

Maybe the deadlock-free property still holds


Turn model for adaptive routing 3

Not all turns are valid to remove cycles and preserve deadlock free property

Theorem: the minimum number of turns that must be prohibited to prevent deadlock in an n-dimensional mesh is n*(n-1), or a quarter of the possible turns

NOTE: however, the prohibited turns must be chosen carefully


Turn model: west-first routing algorithm

The first direction to take is west, if the packet needs to go west at all

Never possible to turn back west later on!!!

An example (figure)
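A minimal sketch of the west-first restriction (minimal-path version, no misrouting); the coordinate convention (+x = East, +y = North) is an assumption of this sketch:

```python
# Sketch of the west-first rule: which output ports a router may offer a packet.
def west_first_ports(cur, dst):
    (cx, cy), (dx, dy) = cur, dst
    if dx < cx:
        return ['W']                  # must go west first, no adaptivity
    ports = []
    if dx > cx: ports.append('E')     # once heading east, never turn back west
    if dy > cy: ports.append('N')
    if dy < cy: ports.append('S')
    return ports or ['LOCAL']

print(west_first_ports((3, 1), (0, 2)))   # ['W']  (destination lies to the west)
print(west_first_ports((1, 1), (3, 3)))   # ['E', 'N']  (adaptive choice allowed)
```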


Turn model: north-last routing algorithm

Going north is the last thing to do

Once a packet goes north, it can no longer turn!!!

An example


Turn model: negative-first routing algorithm


Travel in the negative directions (-x, -y) first

Never possible to turn from a positive direction to a negative one!!!

An example


Issues in routing algorithms

Unbalanced traffic in DOR

North: top-right

West: top-left

South: bottom-left

East: bottom-right


NoC granularity overview

Messages: composed of one or more packets

(NOTE: if the message fits in the maximum packet size, only one packet is created)

Packets: composed of one or more flits

Flit: flow control digit

Phit: physical digit

(subdivides a flit into chunks equal to the link width)

Off-chip: channel width limited by pins


On-chip: abundant wiring means phit size == flit size


NoC microarchitecture based on granularity

Message-based: allocation made at message granularity

circuit switching
Packet-based: allocation made to whole packets

Store and forward (SaF)

Large latency and buffer required

Virtual Cut Through (VCT)

Improves SaF but still large buffers and latency


Flit-based: allocation made on a flit-by-flit basis

Wormhole

Efficient buffer utilization, low latency

Suffers from Head-of-Line (HoL) blocking

Virtual channels

Introduced primarily to address deadlock

Also mitigate HoL blocking


Switch/Router Wormhole Microarchitecture

Flit-based, i.e. packets are divided into flits

Pipelined in 4 stages (plus link traversal): BW, RC, SA, ST (+ LT)

Buffers organized on a flit basis

Single buffer per port

Buffer state fields:

G: global state (idle, routing, active, waiting)

R: output port (route)

C: credit count

P: pointers to data


Switch/Router Virtual Channel Microarchitecture

Router components

Input buffers, route computation logic, virtual channel allocator, switch allocator, crossbar switch

Most OCN routers are input buffered

Use single-ported memories

Buffers store flits for their whole duration in the router

Contrast with a processor pipeline, which latches between stages

Basic router pipeline (Canonical 5-stage pipeline)

BW: Buffer Write

RC: Routing computation

VA:Virtual Channel Allocation

SA: Switch Allocation

ST: Switch Traversal

LT: Link Traversal


Router components

Routing computation performed once per packet

Virtual channel allocated once per packet

Body and tail flits inherit this info from head flit

Router performance

Baseline (no-load) delay: (5 cycles + link delay) x hops + T_serialization

How to reduce latency ?


Pipeline optimization: lookahead router

Overlap with BW

Precomputing the route allows flits to compete for VCs immediately after BW

RC decodes route header

Routing computation needed at next hop

Can be computed in parallel with VA


Pipeline optimization: speculation

Assume that Virtual Channel Allocation stage will be successful

Valid under low to moderate loads

Entire VA and SA in parallel

If VA unsuccessful (no virtual channel returned)

Must repeat VA/SA in next cycle

Prioritize non-speculative requests


Router Pipeline: module dependencies

Dependence between output of one module and input of another

Determine critical path through router

Cannot bid for switch port until routing performed

Li-Shiuan Peh and William J. Dally. 2001. A Delay Model and Speculative Architecture for Pipelined Routers


Router Pipeline: delay model

Li-Shiuan Peh and William J. Dally. 2001. A Delay Model and Speculative Architecture for Pipelined Routers


Switch/Router Flow Control

Flow control determines how network resources, such as bandwidth, buffer capacity and control state, are allocated to packets traversing the network

Resource allocation problem: from the resources' point of view

Contention resolution: from the packets' point of view

Bufferless, buffered

Switch/Router Bufferless Flow Control

No buffers

Allocate channels and bandwidth to competing packets

Two modes

Dropping flow control

Circuit switching flow control

William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


Bufferless Dropping Flow Control 1

Simplest form of flow control

Allocate channel and bandwidth to competing packets

In case of collision, packets are dropped

Collisions can be signalled (or not) using ack/nack messages

William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


Bufferless Dropping Flow Control 2

With no ack messages, the only viable mechanism is a timeout timer

Ack messages can reduce latency

William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


Bufferless Circuit switching Flow Control 1

It allocates all needed resources before sending the message

When no further packets must be sent, the circuit is deallocated

The head flit arbitrates for resources; if it stalls, no resend is needed

William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


Switch/Router Buffered Flow Control

Buffers

More flexibility, with the possibility to decouple resource allocation in steps

Two modes

Wormhole flow control

Virtual channel flow control

William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


Switch/Router Buffered Wormhole Flow Control

Allocate on a per flit basis

More efficient in buffer consumption

Head of Line (HOL) blocking issues

Buffered solutions allow resource allocation to be decoupled into steps

Figure legend: U = upper output port, L = lower output port

Input port states: I (idle), W (waiting), A (allocated)

Flits: H (head), B (body), T (tail)

William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


Switch/Router Virtual Channel Flow Control

Multiple buffers on the same input port

Needs per-virtual-channel state

More complex to manage than wormhole

Allows different flows to be handled at the same time

Solves the HoL issue

Provides the deadlock avoidance property

William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


Wormhole HoL issues

William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


Buffer Management and Backpressure

How to manage buffers between neighbours (i.e. how do I know whether the downstream router's buffer is full?)
Three ways:

Credit based (sketched below)

The upstream router keeps track of the free flit slots available in the downstream router

The upstream router decrements a counter when it sends a flit, while the downstream router returns a credit (incrementing the counter) when a flit leaves the router

Accurate, fine-grained flow control, but a lot of credit messages

On/off

Threshold mechanism with a single bit: low overhead to signal the upstream router the permission to send

Ack/nack

No state kept in the upstream node

Send and wait for ack/nack, no net buffering gain

Waste of bandwidth: flits are sent without any guarantee of an ack
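A minimal sketch of the credit-based scheme described above, for a single downstream buffer with no virtual channels (buffer depth and flit names are invented):

```python
# Credit-based flow control between an upstream and a downstream router port.
class DownstreamPort:
    def __init__(self, depth):
        self.buffer = []
        self.depth = depth

    def accept(self, flit):
        self.buffer.append(flit)      # buffer slot consumed

    def drain(self):
        self.buffer.pop(0)            # flit leaves the router ...
        return 1                      # ... one credit travels back upstream

class UpstreamPort:
    def __init__(self, credits):
        self.credits = credits        # tracks free slots downstream

    def try_send(self, flit, down):
        if self.credits == 0:
            return False              # would overflow the downstream buffer
        self.credits -= 1             # decrement on send
        down.accept(flit)
        return True

    def credit_return(self, n):
        self.credits += n             # increment when the credit comes back

down, up = DownstreamPort(depth=2), UpstreamPort(credits=2)
print(up.try_send('H', down), up.try_send('B', down), up.try_send('T', down))
up.credit_return(down.drain())        # a flit leaves downstream, credit returns
print(up.try_send('T', down))         # now the tail flit can go
```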

Credit-based flow control

William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


On-off flow control

William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


Ack-nack flow control

William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


Evaluation metrics for NOCs


Performance
Network centric
Latency
Throughput
Application Centric
System throughput (Weighted Speedup)
Application throughput (IPC)
Power/Energy
Watts/Joules
Energy Delay Product (EDP)
Fault-Tolerance
Process variation/Reliability
Thermal
Temperature
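Two of the application-centric metrics above are simple to compute; a sketch with invented numbers (weighted speedup taken as the sum of per-application IPC ratios, EDP as energy times delay):

```python
# Sketch of two application-centric metrics (all numbers invented).
def weighted_speedup(ipc_shared, ipc_alone):
    """System throughput: sum over apps of IPC(shared) / IPC(alone)."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

def edp(energy_joules, delay_seconds):
    """Energy-Delay Product: lower is better."""
    return energy_joules * delay_seconds

print(weighted_speedup([0.8, 1.1, 0.6], [1.2, 1.5, 1.0]))   # ~2.0
print(edp(2.5, 0.04))                                        # 0.1 J*s
```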


Network-on-Chip power consumption

Network power breakdown (figure):

- Buffer power, crossbar power and link power are comparable
- Arbiter power is negligible

Source: Chita Das, ACACES Summer School, 2011


Bibliography 2

Dally, W. J., and B. Towles [2004]. Principles and Practices of Interconnection Networks, Morgan Kaufmann Publishers, San Francisco.
C. A. Nicopoulos, N. Vijaykrishnan, and C. R. Das, Network-on-Chip Architectures: A Holistic Design Exploration, Lecture Notes in Electrical Engineering Book Series, Springer, October 2009.
G. De Micheli, L. Benini, Networks on Chips: Technology and Tools, Morgan Kaufmann, 2006.
J. Duato, S. Yalamanchili, and L. Ni, Interconnection Networks: An Engineering Approach, Morgan Kaufmann, 2002.
R. Marculescu, U. Y. Ogras, L.-S. Peh, N. E. Jerger, Y. Hoskote, 'Outstanding Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspectives', IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 28, pp. 3-21, Jan. 2009.
T. Bjerregaard and S. Mahadevan, A survey of research and practices of network-on-chip, ACM Comput. Surv., vol. 38, no. 1, pp. 1-51, Mar. 2006.
Natalie Enright-Jerger and Li-Shiuan Peh, "On-Chip Networks", Synthesis Lecture, Morgan-Claypool Publishers, Aug. 2009.
Agarwal, A. [1991]. Limits on interconnection network performance, IEEE Trans. on Parallel and Distributed Systems 2:4 (April), 398-412.
Dally, W. J., and B. Towles [2001]. Route packets, not wires: On-chip interconnection networks, Proc. of the Design Automation Conference, Las Vegas (June).
Ho, R., K. W. Mai, and M. A. Horowitz [2001]. The future of wires, Proc. of the IEEE 89:4 (April).
Hangsheng Wang, Xinping Zhu, Li-Shiuan Peh and Sharad Malik, "Orion: A Power-Performance Simulator for Interconnection Networks", In Proceedings of MICRO 35, Istanbul, November 2002.
D. Brooks, R. Dick, R. Joseph, and L. Shang, "Power, thermal, and reliability modeling in nanometer-scale microprocessors," IEEE Micro, 2007.


Thank you
Any questions?
