
Friday, October 12th, 2012

An introduction to on-chip networks (NoC)

Davide Zoni PhD Student


email: zoni@elet.polimi.it
webpage: home.dei.polimi.it/zoni

Outline

Introduction to Network-on-Chip

New challenges

Scenario

Cache implications

Topologies and abstract metrics

Routing algorithms

Types

Deadlock free property

Limitations

Router microarchitecture

Flit based

Optimization dimensions

Tiled multi-core architecture with shared memory

Source: Natalie Jerger, ACACES Summer School, 2012

Some slides adapted from ...


Specific References

Timothy M. Pinkston, University of Southern California,


http://ceng.usc.edu/smart/slides/appendixE.html

On-Chip Networks, Natalie E. Jerger and Li-Shiuan Peh

Principles and Practices of Interconnection Networks, William J. Dally and Brian Towles

Other people

Chita R. Das, Penn State NoC Research Group

Li-Shiuan Peh, MIT

Onur Mutlu, CMU

Keren Bergman, Columbia

Bill Dally, Stanford

Rajeev Balasubramonian, Utah

Steve Keckler, UT Austin

Valeria Bertacco, University of Michigan

What about an interconnection network ?

Applications: low-latency, high-bandwidth, dedicated channels between logic and memory

Technology: dedicated channels are too expensive in terms of area, power and reliability

What about an interconnection network ?

An interconnection network is a programmable system that transports data between terminals

Technology: Interconnection network helps efficiently utilize scarce resources

Application: Managing communication can be critical to performance

What about a classification ?


Interconnection networks can be grouped into four domains
depending on number and proximity of devices to be
connected

Networks on Chip (NoCs or OCNs)


Devices include: microarchitectural elements (functional units, register files), caches,
directories, processors
Current/Future systems: dozens to hundreds of devices
Ex: Intel TeraFLOPS research prototype (80 cores), Intel Single-chip Cloud Computer (48 cores)
Proximity: millimeters

System/Storage Area Networks (SANs)

Multiprocessor and multicomputer systems

Interprocessor and processor-memory interconnections


Server and data center environments

Storage and I/O components


Hundreds to thousands of devices interconnected

IBM Blue Gene/L supercomputer (64K nodes, each with 2 processors)


Maximum interconnect distance

tens of meters (typical) to a few hundred meters

Examples (standards and proprietary): InfiniBand, Myrinet, Quadrics, Advanced Switching Interconnect

LANs and WANs


Local Area Networks (LANs)

Interconnect autonomous computer systems

Machine room or throughout a building or campus

Hundreds of devices interconnected (1,000s with bridging)

Maximum interconnect distance

few kilometers to few tens of kilometers

Example (most popular): Ethernet, with 10 Gbps over 40 km

Wide Area Networks (WANs)

Interconnect systems distributed across globe

Internet-working support required

Many millions of devices interconnected

Max distance: many thousands of kilometers

Example: ATM (asynchronous transfer mode)

Network scenario (figure slides)

Why networks ? (figure slide)

What about computing demands ? (figure slide)

The energy-performance wall (figure slides)

Why on-chip networks?

They provide external connectivity from system to outside world

Also, connectivity within a single computer system at many levels

I/O units, boards, chips, modules and blocks inside chips

Trends: high demand on communication bandwidth

Increased computing power and storage capacity

Switched networks are replacing buses

Integral part of many-core architectures

Energy consumed by communication will exceed that of computation in future systems

Lots of innovation needed!

Computer architects/engineers must understand interconnect problems and solutions in order to more effectively design and evaluate systems

On-chip vs off-chip
Significant research in multi-chassis interconnection networks (off-chip)

Supercomputers and Clusters of workstations

Internet routers

Leverage research and insight but...

Constraints are different

Pin-limited bandwidth

Mix of short and long packets on-chip

Inherent overheads of off-chip I/O transmission

New research area to meet performance, area, thermal, power and reliability
needs (On-chip)

Wiring constraints and metal layer limitations

Horizontal and vertical layout

Short, fixed length

Repeater insertion limits routing of wires

Avoid routing over dense logic

Impact wiring density


Some examples

BlueGene/L:
- Huge power consumption (about one million Watts)
- Complicated network structure

Mellanox Server Blade (IB 4X, CPU, system logic):
- Total power budget constrained by packaging and cooling costs <= 30W
- Network power consumption ~10 to 15 W

IP Routers:
- Constrained by costs + regulatory limits
- ~200W line card, ~60W interconnection network

Alpha 21364 (& its thermal profile):
- Packaging and cooling costs, Dell's law <= $25
- Router+link ~25W

MIT Raw CMP:
- Complicated communication networks
- On-chip network consumes about 36% of total chip power

(Also pictured: Intel SCC 48-core)

On-chip Networks

(figure: a 4x4 grid of processing elements (PEs) connected by an on-chip network)

On-chip Networks: outline

Topology

Routing

Properties

Deadlock avoidance

Router microarchitecture

Baseline model

Optimizations

Metrics

Power

Performance

On-chip Network: Where we are ...

General Purpose
Multi-cores

Shared
Memory

Distributed memory
(or Message Passing)


On-chip Network: Where we are ...

Here we are

Shared
Memory

General Purpose
Multi-cores

Distributed memory
(or Message Passing)


Shared memory multi-core

Memory Model in CMPs

Message Passing

Explicit movement of data between nodes and address spaces

Programmers manage communication

Shared Memory

Communication occurs implicitly through load/store (memory-accessing) instructions

Will focus on shared memory

Look at optimization for cache coherence protocols


Memory Model in CMPs

Logically

Practically...

All processors access some shared memory


cache hierarchies reduce access latency to improve performance

Requires cache coherence protocol

to maintain coherent view in presence of multiple shared copies


Consistency model: the behaviour of the memory system in a multi-core environment, i.e. which behaviours are allowed and which are not
Coherence: hides the cache hierarchy from the programmer (without losing the performance improvement it brings)


Tiled multi-core architecture with shared memory

Source: Natalie Jerger, ACACES Summer School, 2012

Intel SCC

2D mesh with state-of-the-art VC routers

2 cores per tile

Multiple voltage islands

1 Vdd per tile

1 NoC Vdd island

Source: Natalie Jerger, ACACES Summer School, 2012


Coherence Protocol on Network Performance

Coherence protocol shapes communication needed by system

Single writer, multiple reader invariant

Requires:

Data requests

Data responses

Coherence permissions

Suggested reading for a quick review of coherence:

A Primer on Memory Consistency and Cache Coherence, Daniel Sorin, Mark Hill and David Wood. Morgan Claypool Publishers, 2011.


Hardware cache coherence

Rough goal:

all caches have same data at all times

Minimal flushing and maximum caching give the best performance

Two solutions:

Broadcast-based protocol:

All processors see all requests at the same time, same order.

Often relies on bus

But can broadcast on unordered interconnect


Directory-based protocol:

Order of the requests relies on a different mechanism than bus

Maybe better flexibility and scalability

Maybe higher latency


Broadcast-based coherence

Source: Natalie Jerger, ACACES Summer School, 2012


Coherence Bandwidth Requirements

How much address bus bandwidth does snooping need?

Well, coherence events generated on...

Misses (only in L2, not so bad)

Dirty replacements

Some parameters:

2 GHz CPUs, 2 IPC

33% memory operations, 2% of which miss in L2

50% of evictions are dirty

Some results:

(0.33 * 0.02) + (0.33 * 0.02 * 0.50) = 0.01 events/insn

0.01 events/insn * 2 insn/cycle * 2 cycle/ns = 0.04 events/ns

Request: 0.04 events/ns * 4 B/event = 0.16 GB/s = 160 MB/s

Data response: 0.04 events/ns * 64 B/event = 2.56 GB/s

What about scalability ? That's 2.5 GB/s ... per processor

With 16 processors, that's 40 GB/s!

With 128 processors, that's 320 GB/s!!
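The arithmetic above is easy to check; here is a minimal Python sketch using only the numbers quoted on this slide (4 B per request and 64 B per data response come from the example, everything else is arithmetic):

```python
# Back-of-the-envelope snoop bandwidth, reproducing the slide's numbers.
ipc, freq_ghz = 2, 2                  # 2 IPC at 2 GHz
mem_frac, l2_miss = 0.33, 0.02        # 33% memory ops, 2% miss in L2
dirty_frac = 0.50                     # 50% of evictions are dirty

events_per_insn = mem_frac * l2_miss * (1 + dirty_frac)   # misses + dirty replacements
events_per_ns = events_per_insn * ipc * freq_ghz          # insn/cycle * cycle/ns

print(round(events_per_insn, 3))          # ~0.01 events/insn
print(round(events_per_ns * 4, 2))        # request bandwidth, ~0.16 GB/s
print(round(events_per_ns * 64, 2))       # data bandwidth, ~2.5 GB/s per processor
for n in (16, 128):
    print(n, round(n * events_per_ns * 64), "GB/s aggregate")   # ~41 and ~324 GB/s (the slide rounds to 40 and 320)
```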


Scalable Cache Coherence

Two-part solution:

Bus-based interconnect:

Replace the non-scalable bandwidth substrate (bus)... with a scalable bandwidth substrate (point-to-point network, e.g. mesh)

Processor 'snooping' bandwidth:

Interesting: most snoops result in no action

Replace the non-scalable broadcast protocol (it spams everyone)... with a scalable directory protocol (it only contacts the processors that care)

NOTE: physical address space statically partitioned (Still shared!!)

Can easily determine which memory module holds a given line

That memory module is sometimes called the home (see the sketch after this list)

Can't easily determine which processors have the line in their caches

Bus-based protocol: broadcast events to all processors/caches

Simple and fast, but non-scalable
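To make the "home" idea concrete, here is a minimal sketch of one common static mapping (block-interleaved by physical address); the 64-byte line size and 16 home nodes are assumptions for illustration, not values from the slides:

```python
# Minimal sketch of a static home-node mapping for a directory protocol.
BLOCK_BITS = 6        # 64-byte cache lines (assumption)
NUM_HOMES = 16        # number of memory modules / directory slices (assumption)

def home_node(paddr: int) -> int:
    """Return the node that owns the directory entry for this physical address."""
    block_addr = paddr >> BLOCK_BITS
    return block_addr % NUM_HOMES

# Any node can compute the home locally, with no lookup traffic:
print(home_node(0x0000), home_node(0x0040), home_node(0x4000))   # 0, 1, 0
```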


Scalable Cache Coherence

Source: Natalie Jerger, ACACES Summer School, 2012


Coherence Protocol Requirements

Different message types

Unicast, multicast, broadcast

Directory protocol

Majority of requests: Unicast

Lower bandwidth demands on network

More scalable due to point-to-point communication

Broadcast protocol

Majority of requests: Broadcast

Higher bandwidth demands

Often rely on network ordering


Impact of Cache Hierarchy

Sharing of injection/ejection port among cores and caches

Caches reduce average memory latency

Private caches

Multiple L2 copies

Data can be replicated to be close to processor


Shared caches

Data can only exist in one L2 bank

Addresses are striped across banks (lots of different ways to do this)

Aside: lots of research on cache block placement, replication and migration

Caches serve as a filter for interconnect traffic


Private vs. Shared Caches

Private caches

Reduce latency of L2 cache hits

keep frequently accessed data close to processor

Increase off-chip pressure

Shared caches

Better use of storage

Non-uniform L2 hit latency

More on-chip network pressure

all L1 misses go onto network


On-chip Network: Private L2 Cache Hit

(figure: the core issues LD A (1), misses in the L1 I/D cache (2), and hits in the private L2 cache (3); the request is satisfied locally, without a network traversal)

Source: Chita Das, ACACES Summer School, 2011

On-chip Network: Private L2 Cache Miss


(figure: the core issues LD A, which misses in the L1 I/D cache (2) and in the private L2 cache (3); a message is formatted to the memory controller and the request is sent off-chip; when the data is received (6) it is sent to the L2)

Source: Chita Das, ACACES Summer School, 2011

On-chip Network: Shared L2 Local Cache Miss


(figure: the core issues LD A (1), which misses in the local L1 I/D cache; a request message is formatted (3) and sent over the network to the shared L2 bank that A maps to; the remote tile receives the message and forwards it to its L2 (4), which hits; the data is sent back to the requestor (6), which receives it and passes it to the L1 and the core)

Source: Chita Das, ACACES Summer School, 2011

Network-on-Chip details


Topology nomenclature 1

Two broad classes: Direct and Indirect Networks


Direct Networks: every node is both a terminal and a switch
Examples: mesh, torus, k-ary n-cubes
Indirect Networks: the network is composed of switches that connect the end nodes
Examples: MIN, crossbar, etc.

Direct and indirect network examples (figure)
Source: Natalie Jerger, ACACES Summer School, 2012

Topology abstract metrics 1

Switch Degree: Number of links/edges incident on a node

Proxy for estimating cost

Higher degree requires more links and port counts at each router

Source: Natalie Jerger, ACACES Summer School, 2012

(figure: example topologies with switch degrees 2, 3 and 4)

Topology abstract metrics 2

Hop Count: number of hops a message takes from source to destination

Proxy for network latency
Every node and link incurs some propagation delay, even when there is no contention
Network diameter: largest minimal hop count in the network
Average minimum hop count: average across all source/destination pairs
Minimal hop count: smallest hop count connecting two nodes
Implementations may incorporate non-minimal paths (increases avg hop count)

(figures: three example topologies with Max=4 / Avg=2.2, Max=4 / Avg=1.77 and Max=2 / Avg=1.33)
Source: Natalie Jerger, ACACES Summer School, 2012
Avg=1.33

Topology abstract metrics implications

Abstract metrics are just proxies: they do not always correlate with the real metric they represent
Example:
Network A with 2 hops, 5-stage pipeline, 4-cycle link traversal vs.
Network B with 3 hops, 1-stage pipeline, 1-cycle link traversal
Hop count says A is better than B
But A has an 18-cycle latency vs. a 6-cycle latency for B
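A two-line check of the example above (per-hop cost = pipeline depth + link traversal is the simplifying assumption of this sketch):

```python
# Hop count is only a proxy: per-hop cost matters.
def no_load_latency(hops, pipeline_stages, link_cycles):
    return hops * (pipeline_stages + link_cycles)

print(no_load_latency(2, 5, 4))   # Network A: 18 cycles
print(no_load_latency(3, 1, 1))   # Network B: 6 cycles
```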

Topologies typically trade-off hop count and node degree


Traffic patterns

How to stress a NoC?


Synthetic traffic patterns
Uniform random
Optimistic: it can make a bad network look good
Matrix transpose
Many others based on probabilistic distributions and pattern selection
algorithms
Real traffic patterns
Real benchmarks executed on the simulated architecture
More accurate
Complete evaluation of the system performance
Time consuming simulation
Is the selected traffic suitable for my application?
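As a rough illustration, here is a sketch of two of the synthetic patterns named above; the node numbering (id = y * k + x on a k x k mesh) is an assumption of this sketch:

```python
# Sketch of two synthetic traffic patterns for a k x k mesh.
import random

def uniform_random_dest(src, num_nodes):
    """Uniform random: every other node is an equally likely destination."""
    dst = random.randrange(num_nodes)
    while dst == src:
        dst = random.randrange(num_nodes)
    return dst

def transpose_dest(src, k):
    """Matrix transpose: node (x, y) always sends to node (y, x)."""
    x, y = src % k, src // k
    return x * k + y

k = 4
print([transpose_dest(s, k) for s in range(k * k)])
print(uniform_random_dest(0, k * k))
```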


Routing, Arbitration, and Switching


Routing

Defines the allowed path(s) for each packet (Which paths?)


Problems
Livelock and Deadlock

Arbitration

Determines use of paths supplied to packets (When allocated?)

Problems

Starvation
Switching

Establishes the connection of paths for packets (How allocated?)

Switching techniques

Circuit switching, Packet switching


Until now old wine in a new bottle...but for caches

(figure word cloud: deadlock, packets, routing algorithm, flow control, router/switch, throughput, latency)

Where is the difference?

Until now old wine in a new bottle...but for caches

On-chip network criticalities:

Low power

Limited resources

High performance

High reliability

Thermal issues

NoC granularity overview

Messages: composed of one or more packets

(NOTE: if the message fits in the maximum packet size, only one packet is created)

Packets: composed of one or more flits

Flit: flow control digit

Phit: physical digit

(subdivides a flit into chunks equal to the link width)

Off-chip: channel width limited by pins


On-chip: abundant wiring means phit size == flit size
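A small sketch of the message / packet / flit breakdown; the 64-byte maximum packet size and 16-byte flit width are invented for illustration:

```python
# Sketch of the message -> packet -> flit breakdown (sizes are assumptions).
MAX_PACKET_BYTES = 64    # maximum packet payload
FLIT_BYTES = 16          # flit width; on-chip we assume phit size == flit size

def packetize(message: bytes):
    """Split a message into packets, each packet into flits."""
    packets = [message[i:i + MAX_PACKET_BYTES]
               for i in range(0, len(message), MAX_PACKET_BYTES)]
    return [[p[j:j + FLIT_BYTES] for j in range(0, len(p), FLIT_BYTES)]
            for p in packets]

msg = bytes(100)                           # a 100-byte message
pkts = packetize(msg)
print(len(pkts), [len(p) for p in pkts])   # 2 packets: 4 flits + 3 flits
```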


Routing overview

Usually the topology discussion assumes ideal routing, while routing algorithms are not ideal in practice
Once the topology is fixed, routing determines the path from source to destination

GOAL: distribute traffic evenly among paths

Avoid hot spots and contention

The more balanced the algorithm is, the closer it gets to ideal throughput

Keep complexity in mind


Routing algorithm attributes

Types

Oblivious: the route is chosen without regard to network state (e.g. at random); very efficient to implement

Adaptive: the algorithm uses the network state to modify the routing path, so packets with the same (source, destination) pair may follow different paths

Routing path

Deterministic: all packets from a given (source, destination) pair always use the same path, regardless of the network state

Minimal: all packets use a shortest path from source to destination
Non-minimal: packets may be routed along a longer path, depending for example on the network state

Number of destinations

Unicast: typical and easy solution in NoC

Multicast: useful with cache coherence messages

Broadcast: typical in bus-based architectures


The deadlock avoidance property

Each packet is occupying a link and waiting for a link

Without routing restrictions, a resource cycle can occur

Leads to deadlock

This is because resources are shared


Deterministic routing

All messages from Source to Destination traverse the same path

Common example: Dimension Order Routing (DOR)

Message traverses network dimension by dimension

Aka XY routing

Pros:

Simple and inexpensive to implement

Deadlock-free (why???)

Cons:

Eliminates any path diversity provided by the topology

Poor load balancing


Deterministic routing

aka X-Y Routing

Traverse network dimension by dimension

Can only turn to Y dimension after finished X

It removes a number of turns to ensure the deadlock-free property
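A minimal sketch of the XY (DOR) decision at each hop; the coordinates and port names (E/W/N/S/LOCAL) are assumptions of this sketch:

```python
# Minimal XY (dimension-order) routing function for a 2D mesh.
def xy_route(cur, dst):
    """Return the output port for the current hop: one of E, W, N, S, LOCAL."""
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:                      # finish the X dimension first ...
        return 'E' if dx > cx else 'W'
    if cy != dy:                      # ... only then move in Y
        return 'N' if dy > cy else 'S'
    return 'LOCAL'                    # arrived: eject to the local PE

# Route from (0, 0) to (2, 1): E, E, N, then eject.
hop, path = (0, 0), []
while True:
    port = xy_route(hop, (2, 1))
    path.append(port)
    if port == 'LOCAL':
        break
    step = {'E': (1, 0), 'W': (-1, 0), 'N': (0, 1), 'S': (0, -1)}[port]
    hop = (hop[0] + step[0], hop[1] + step[1])
print(path)   # ['E', 'E', 'N', 'LOCAL']
```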


Adaptive routing

Exploits path diversity

Uses network state to make routing decisions

Buffer occupancies often used

Coupled with flow control mechanism

Local information readily available

Global information more costly to obtain

Network state can change rapidly

Use of local information can lead to non-optimal choices

Can be minimal or non-minimal


Minimal adaptive routing

Local information can result in sub-optimal choices


Non-minimal adaptive routing

Fully adaptive

Not restricted to take shortest path

Misrouting: directing packet along non-productive channel

Priority given to productive output

Some algorithms forbid U-turns

Livelock potential: traversing network without ever reaching destination

Limit number of misroutings

What about power consumption ?


Turn model for adaptive routing

In a 2D mesh the possible turns form two cycles; DOR eliminates 4 turns

N to E, N to W, S to E, S to W

No adaptivity

Is it possible to do better?

Hint: some models relax this and eliminate only 2 turns instead of 4 in a 2D mesh

Turn model


Turn model for adaptive routing 1

Basic steps

Partition channels according to the direction in which they route packets

Identify possible turns

Identify the cycles formed by combining turns, i.e. the simple cycles

Break each simple cycle

Check whether the combination of simple cycles allows the formation of complex cycles

Example on a 2D mesh: 2 simple cycles

Turn model for adaptive routing 2

The DOR algorithm prohibits 4 turns to ensure the deadlock-free property

What about removing just 1 turn per cycle ?

Maybe the deadlock-free property still holds


Turn model for adaptive routing 3

Not all turns are valid to remove cycles and preserve deadlock free property

Theorem: the minimum number of turns that must be prohibited to prevent deadlock in an n-dimensional mesh is n*(n-1), or a quarter of the possible turns

NOTE: however, the prohibited turns must be chosen carefully


Turn model: west-first routing algorithm

The first direction to take is west, if the packet needs to go west at all

Never possible to turn back west later on!!!

An example (figure)
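A minimal sketch of the west-first restriction (minimal-path version, no misrouting); the coordinate convention (+x = East, +y = North) is an assumption of this sketch:

```python
# Sketch of the west-first rule: which output ports a router may offer a packet.
def west_first_ports(cur, dst):
    (cx, cy), (dx, dy) = cur, dst
    if dx < cx:
        return ['W']                  # must go west first, no adaptivity
    ports = []
    if dx > cx: ports.append('E')     # once heading east, never turn back west
    if dy > cy: ports.append('N')
    if dy < cy: ports.append('S')
    return ports or ['LOCAL']

print(west_first_ports((3, 1), (0, 2)))   # ['W']  (destination lies to the west)
print(west_first_ports((1, 1), (3, 3)))   # ['E', 'N']  (adaptive choice allowed)
```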


Turn model: north-last routing algorithm

Going north is the last thing to do

Once a packet goes north, it can no longer turn!!!

An example


Turn model: negative-first routing algorithm


Travel in the negative directions (-x, -y) first

Never possible to turn from a positive direction to a negative one!!!

An example


Issues in routing algorithms

Unbalanced traffic in DOR

North: top-right

West: top-left

South: bottom-left

East: bottom-right


NoC granularity overview

Messages: composed of one or more packets

(NOTE: if the message fits in the maximum packet size, only one packet is created)

Packets: composed of one or more flits

Flit: flow control digit

Phit: physical digit

(subdivides a flit into chunks equal to the link width)

Off-chip: channel width limited by pins


On-chip: abundant wiring means phit size == flit size


NoC microarchitecture based on granularity

Message-based: allocation made at message granularity

circuit switching
Packet-based: allocation made to whole packets

Store and forward (SaF)

Large latency and buffer required

Virtual Cut Through (VCT)

Improves SaF but still large buffers and latency


Flit-based: allocation made on a flit-by-flit basis

Wormhole

Efficient buffer utilization, low latency

Suffers from Head-of-Line (HoL) blocking

Virtual channels

Introduced primarily to address deadlock

Also mitigate HoL blocking


Switch/Router Wormhole Microarchitecture

Flit-based, i.e. packets are divided into flits

Pipelined in 4 stages (plus link traversal): BW, RC, SA, ST (+ LT)

Buffers organized on a flit basis

Single buffer per port

Buffer state fields:

G: global state (idle, routing, active, waiting)

R: output port (route)

C: credit count

P: pointers to data


Switch/Router Virtual Channel Microarchitecture

Router components

Input buffers, route computation logic, virtual channel allocator, switch allocator, crossbar switch

Most OCN routers are input buffered

Use single-ported memories

Buffers store flits for their whole duration in the router

Contrast with a processor pipeline, which latches between stages

Basic router pipeline (Canonical 5-stage pipeline)

BW: Buffer Write

RC: Routing computation

VA:Virtual Channel Allocation

SA: Switch Allocation

ST: Switch Traversal

LT: Link Traversal


Router components

Routing computation performed once per packet

Virtual channel allocated once per packet

Body and tail flits inherit this info from head flit

Router performance

Baseline (no-load) delay: (5 cycles + link delay) x hops + T_serialization

How to reduce latency ?


Pipeline optimization: lookahead router

Overlap with BW

Precomputing the route allows flits to compete for VCs immediately after BW

RC decodes route header

Routing computation needed at next hop

Can be computed in parallel with VA


Pipeline optimization: speculation

Assume that Virtual Channel Allocation stage will be successful

Valid under low to moderate loads

Entire VA and SA in parallel

If VA unsuccessful (no virtual channel returned)

Must repeat VA/SA in next cycle

Prioritize non-speculative requests


Router Pipeline: module dependencies

Dependence between output of one module and input of another

Determine critical path through router

Cannot bid for switch port until routing performed

Li-Shiuan Peh and William J. Dally. 2001. A Delay Model and Speculative Architecture for Pipelined Routers


Router Pipeline: delay model

Li-Shiuan Peh and William J. Dally. 2001. A Delay Model and Speculative Architecture for Pipelined Routers


Switch/Router Flow Control

Flow control determines how network resources, such as bandwidth, buffer capacity and control state, are allocated to packets traversing the network

Resource allocation problem: from the resources' point of view

Contention resolution: from the packets' point of view

Bufferless, buffered

Switch/Router Bufferless Flow Control

No buffers

Allocate channels and bandwidth to competing packets

Two modes

Dropping flow control

Circuit switching flow control

William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


Bufferless Dropping Flow Control 1

Simplest form of flow control

Allocate channel and bandwidth to competing packets

In case of collision, packets are dropped

Collisions can be signalled (or not) using ack/nack messages

William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


Bufferless Dropping Flow Control 2

With no ack messages, the only viable mechanism is a timeout timer

Ack messages can reduce latency

William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


Bufferless Circuit switching Flow Control 1

It allocates all needed resources before sending the message

When no further packets must be sent, the circuit is deallocated

The head flit arbitrates for resources; if it stalls, no resend is needed

William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


Switch/Router Buffered Flow Control

Buffers

More flexibility, with the possibility to decouple resource allocation in steps

Two modes

Wormhole flow control

Virtual channel flow control

William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


Switch/Router Buffered Wormhole Flow Control

Allocate on a per flit basis

More efficient in buffer consumption

Head of Line (HOL) blocking issues

Buffered solutions allow resource allocation to be decoupled into steps

Figure legend: U = upper output port, L = lower output port

Input port states: I (idle), W (waiting), A (allocated)

Flits: H (head), B (body), T (tail)

William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


Switch/Router Virtual Channel Flow Control

Multiple buffers on the same input port

Needs per-virtual-channel state

More complex to manage than wormhole

Allows different flows to be handled at the same time

Solves the HoL issue

Provides the deadlock avoidance property

William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


Wormhole HoL issues

William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


Buffer Management and Backpressure

How to manage buffers between neighbours (i.e. how do I know whether the downstream router's buffer is full?)
Three ways:

Credit based (sketched below)

The upstream router keeps track of the free flit slots available in the downstream router

The upstream router decrements a counter when it sends a flit, while the downstream router returns a credit (incrementing the counter) when a flit leaves the router

Accurate, fine-grained flow control, but a lot of credit messages

On/off

Threshold mechanism with a single bit: low overhead to signal the upstream router the permission to send

Ack/nack

No state kept in the upstream node

Send and wait for ack/nack, no net buffering gain

Waste of bandwidth: flits are sent without any guarantee of an ack
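A minimal sketch of the credit-based scheme described above, for a single downstream buffer with no virtual channels (buffer depth and flit names are invented):

```python
# Credit-based flow control between an upstream and a downstream router port.
class DownstreamPort:
    def __init__(self, depth):
        self.buffer = []
        self.depth = depth

    def accept(self, flit):
        self.buffer.append(flit)      # buffer slot consumed

    def drain(self):
        self.buffer.pop(0)            # flit leaves the router ...
        return 1                      # ... one credit travels back upstream

class UpstreamPort:
    def __init__(self, credits):
        self.credits = credits        # tracks free slots downstream

    def try_send(self, flit, down):
        if self.credits == 0:
            return False              # would overflow the downstream buffer
        self.credits -= 1             # decrement on send
        down.accept(flit)
        return True

    def credit_return(self, n):
        self.credits += n             # increment when the credit comes back

down, up = DownstreamPort(depth=2), UpstreamPort(credits=2)
print(up.try_send('H', down), up.try_send('B', down), up.try_send('T', down))
up.credit_return(down.drain())        # a flit leaves downstream, credit returns
print(up.try_send('T', down))         # now the tail flit can go
```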

Credit-based flow control

William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


On-off flow control

William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


Ack-nack flow control

William Dally and Brian Towles. 2003. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.


Evaluation metrics for NOCs


Performance
Network centric
Latency
Throughput
Application Centric
System throughput (Weighted Speedup)
Application throughput (IPC)
Power/Energy
Watts/Joules
Energy Delay Product (EDP)
Fault-Tolerance
Process variation/Reliability
Thermal
Temperature
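Two of the application-centric metrics above are simple to compute; a sketch with invented numbers (weighted speedup taken as the sum of per-application IPC ratios, EDP as energy times delay):

```python
# Sketch of two application-centric metrics (all numbers invented).
def weighted_speedup(ipc_shared, ipc_alone):
    """System throughput: sum over apps of IPC(shared) / IPC(alone)."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

def edp(energy_joules, delay_seconds):
    """Energy-Delay Product: lower is better."""
    return energy_joules * delay_seconds

print(weighted_speedup([0.8, 1.1, 0.6], [1.2, 1.5, 1.0]))   # ~2.0
print(edp(2.5, 0.04))                                        # 0.1 J*s
```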


Network-on-Chip power consumption

Network power breakdown (figure):

- Buffer power, crossbar power and link power are comparable
- Arbiter power is negligible

Source: Chita Das, ACACES Summer School, 2011


Bibliography 2

Dally, W. J., and B. Towles [2004]. Principles and Practices of Interconnection Networks, Morgan Kaufmann Publishers, San Francisco.
C. A. Nicopoulos, N. Vijaykrishnan, and C. R. Das, Network-on-Chip Architectures: A Holistic Design Exploration, Lecture Notes in Electrical Engineering Book Series, Springer, October 2009.
G. De Micheli, L. Benini, Networks on Chips: Technology and Tools, Morgan Kaufmann, 2006.
J. Duato, S. Yalamanchili, and L. Ni, Interconnection Networks: An Engineering Approach, Morgan Kaufmann, 2002.
R. Marculescu, U. Y. Ogras, L.-S. Peh, N. E. Jerger, Y. Hoskote, 'Outstanding Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspectives', IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 28, pp. 3-21, Jan. 2009.
T. Bjerregaard and S. Mahadevan, A survey of research and practices of network-on-chip, ACM Comput. Surv., vol. 38, no. 1, pp. 1-51, Mar. 2006.
Natalie Enright-Jerger and Li-Shiuan Peh, "On-Chip Networks", Synthesis Lecture, Morgan-Claypool Publishers, Aug. 2009.
Agarwal, A. [1991]. Limits on interconnection network performance, IEEE Trans. on Parallel and Distributed Systems 2:4 (April), 398-412.
Dally, W. J., and B. Towles [2001]. Route packets, not wires: On-chip interconnection networks, Proc. of the Design Automation Conference, Las Vegas (June).
Ho, R., K. W. Mai, and M. A. Horowitz [2001]. The future of wires, Proc. of the IEEE 89:4 (April).
Hangsheng Wang, Xinping Zhu, Li-Shiuan Peh and Sharad Malik, "Orion: A Power-Performance Simulator for Interconnection Networks", In Proceedings of MICRO 35, Istanbul, November 2002.
D. Brooks, R. Dick, R. Joseph, and L. Shang, "Power, thermal, and reliability modeling in nanometer-scale microprocessors," IEEE Micro, 2007.


Thank you
Any questions?
