Cycle Router
Nimmy Joseph, Ramesh Reddy C, Keshavan Varadarajan, Mythri Alle, Alexander Fell, S K Nandy
CAD Lab, SERC, Indian Institute of Science, Bangalore
{jnimmy, crreddy, keshavan, mythri, alefel, nandy}@cadl.iisc.ernet.in
Ranjani Narayan
Morphing Machines, Bangalore, India
ranjani.narayan@morphingmachines.com
Abstract
A polymorphic ASIC is a runtime reconfigurable hardware substrate comprising compute and communication elements. It is a future-proof custom hardware solution for multiple applications and their derivatives in a domain. Interoperability between application derivatives at runtime is achieved through hardware reconfiguration. In this paper we present the design of a single cycle Network on Chip (NoC) router that is responsible for effecting runtime reconfiguration of the hardware substrate. The router design is optimized to avoid FIFO buffers at the input port and loop-back at the output crossbar. It provides virtual channels to emulate a non-blocking network and supports a simple X-Y relative addressing scheme that limits the control overhead to 9 bits per packet. The 8x8 honeycomb NoC (RECONNECT), implemented in a 130nm UMC CMOS standard cell library, operates at 500MHz and has a bisection bandwidth of 28.5GBps. The network is characterized for random, self-similar and application specific traffic patterns that model the execution of multimedia and DSP kernels with varying network loads and numbers of virtual channels. Our implementation with 4 virtual channels has an average network latency of 24 clock cycles and a throughput of 62.5% of the network capacity for random traffic. For application specific traffic the latency is 6 clock cycles and the throughput is 87% of the network capacity.
...
solutions. While ASICs offer significant power and performance advantages over other programmable solutions, the NRE costs of ASICs can only be amortized by high production volumes. A flexible and reprogrammable ASIC would be future-proof and cost effective, since a single ASIC can then support multiple applications and their derivatives. We refer to such a flexible ASIC as a polymorphic ASIC. In FPGAs, alteration of applications at runtime cannot be achieved easily because of the latency of the configuration reload required at each application switch. On the other hand, a polymorphic ASIC (an abstract model is shown in Figure 1) is a programmable multiprocessor on a chip, where both the data path and the control path are redefined during execution to meet the performance requirements of applications. A polymorphic ASIC is a hardware substrate comprising a regular array of tiles. A tile consists of a Compute Element (CE) and a network router. These tiles are connected through a high performance Network on Chip (NoC) which can be reconfigured at runtime.
...
Introduction
Figure 1. Abstract model of polymorphic ASIC
Given an application specification in a high-level language, it is fragmented into substructures called HyperOps [7]. A HyperOp is a subgraph of the application data flow graph. At runtime the CEs are programmed according to the
[Figure 2: express lanes from the HyperOp Launcher to the boundary tiles; links are bundles of 57-bit channels (399 = 57 x 7; 57 = header + 48-bit payload).]
2.1 Express Lanes
The honeycomb topology is chosen as the interconnection network on the fabric since it has a lower degree per node than a 2-D mesh [9]. This reduces the complexity and area of the network router. A detailed comparison of the honeycomb and mesh topologies is provided in [10].
The interconnects are divided into two logical sets. The first set of interconnections is called Express Lanes (Figure 2). These facilitate parallel instruction transfer from the HyperOp Launcher to the boundary tiles and reduce the load on the internal network. If the destination is not a boundary tile, the instructions are routed from the boundary tiles to the destination CEs through the internal network. The second set of wires connects the tiles; these are used for inter-CE data transfer in addition to instruction transfer to the
[Figure 2 labels: Tile = CE + Router; internal links are 114 bits wide (57 x 2).]
2.2
[Figure 3: packet format (57 bits, bit 0 to bit 56) - New Data Indicator (1 bit), X Relative Address (4 bits), Y Relative Address (4 bits), Payload (48 bits).]
The packet size (Figure 3) is taken as the flit size. The network bandwidth between two adjacent routers is equal to the packet size. Splitting the packet into multiple flits would make the network inefficient in our case, as the CE would be idle until the whole packet is received. Further, it would require additional logic for splitting packets into flits and reassembling them at the destination.
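A minimal sketch of how such a 57-bit packet could be assembled and parsed follows. The exact bit ordering is an assumption (the paper numbers the MSB as bit 0); only the field widths, the 9-bit control overhead, and the two's complement addresses come from the text:

```python
PAYLOAD_BITS = 48

def to_twos(v, bits=4):
    # Encode a signed value in `bits`-bit two's complement
    return v & ((1 << bits) - 1)

def from_twos(v, bits=4):
    # Decode a `bits`-bit two's complement value
    return v - (1 << bits) if v & (1 << (bits - 1)) else v

def make_packet(new_data, x_rel, y_rel, payload):
    """57-bit packet: 1-bit new-data indicator, 4-bit X and Y relative
    addresses (two's complement), 48-bit payload -> 9 control bits."""
    assert 0 <= payload < (1 << PAYLOAD_BITS)
    header = (new_data << 8) | (to_twos(x_rel) << 4) | to_twos(y_rel)
    return (header << PAYLOAD_BITS) | payload

def parse_packet(pkt):
    header = pkt >> PAYLOAD_BITS
    return ((header >> 8) & 1,
            from_twos((header >> 4) & 0xF),
            from_twos(header & 0xF),
            pkt & ((1 << PAYLOAD_BITS) - 1))
```

The 4-bit two's complement addresses cover relative offsets of -8 to +7 hops in each direction, which is consistent with an 8x8 fabric.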
Each input port has a port status out bit (Figure 4) indicating whether one of the virtual channels is free. The synchronization between two adjacent routers is based on this bit. In the router, received packets are stored in one of the free virtual channels in a round-robin fashion. The router outputs a packet only if one of the virtual channels of the adjacent router is free; otherwise the packet is blocked at the input of the current router. Thus packets are not dropped in the network. The non-availability of virtual channels in the adjacent router increases the average latency of the overall network. Once a packet is sent to an adjacent router, the corresponding virtual channel status of the current router is updated by the Port Update Logic to indicate that it is free to receive another packet. A toggle in the MSB (bit 0) of the received packet indicates a new packet; a packet is stored only if its MSB differs from that of the previous packet.
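The round-robin virtual channel bookkeeping and the new-data toggle check could be modeled as follows. The class and method names are illustrative, not taken from the paper's RTL:

```python
class InputPort:
    """Sketch of one router input port with K virtual channels."""
    def __init__(self, num_vcs=4):
        self.vcs = [None] * num_vcs   # None = free virtual channel
        self.rr = 0                   # round-robin pointer
        self.last_msb = None          # MSB of the previously stored packet

    def port_status_out(self):
        # Single status bit: is at least one virtual channel free?
        return any(v is None for v in self.vcs)

    def accept(self, pkt):
        msb = (pkt >> 56) & 1
        if msb == self.last_msb:
            return False              # no toggle -> not a new packet
        for i in range(len(self.vcs)):
            k = (self.rr + i) % len(self.vcs)   # round-robin scan
            if self.vcs[k] is None:
                self.vcs[k] = pkt
                self.rr = (k + 1) % len(self.vcs)
                self.last_msb = msb
                return True
        return False                  # all VCs busy: blocked, not dropped
```

Blocking rather than dropping is what makes the adjacent-router status bit sufficient for flow control: a sender simply holds the packet until the neighbour advertises a free VC.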
Routing Algorithm: The routing algorithm is described with respect to Figure 4. Packets are routed along the shortest path to the destination. The honeycomb topology has horizontal links only on every alternate node; therefore the routing algorithm prioritizes horizontal links over vertical ones (Figure 2). The output request signal (output_req_from_vc) generated from the packets in the virtual channels indicates the direction in which the packet has to travel. The K:1 virtual channel arbiter at each input port selects one of the virtual channels for which the requested adjacent router input port is free and sends the corresponding request (req_from_east etc.) to the Request Decoder, which generates decoded_req for the Output Arbiter.

At each router, the output port to which the packet is to be sent is determined from the relative address. At the source, a packet is formed by concatenating the X and Y relative addresses of the destination and the payload. 4-bit 2's complement arithmetic is used for address updating in the router. If a packet has to travel in the X direction while a horizontal link is not available and the Y relative address is zero, it takes the south (north) direction and the Y relative address is modified accordingly, so that after traveling in the X direction the packet
moves in the north (south) direction.

[Figure 4: data and control path of the router. Each input port (north_in/south_in/east_in/west_in[56:0]) is demultiplexed (demux_sel[1:0]) into four virtual channels VC1-VC4; a VC Arbiter (VC_sel[1:0], VC_status[3:0]) selects among them, a Req Decoder turns the per-VC output requests (req_from_north/south/east/west[1:0]) into decoded_req for the Output Arbiter, which drives the output multiplexers (mux_sel[1:0]) for north_out/south_out/east_out/west_out[56:0]. Address Update Logic modifies the relative address, and port_status_out_{north,south,east,west} advertise free VCs. P = port_status_in[3:0]; R1, R2, R3, R4 = output_req_from_VC.]
Figure 4. Data and control path scheme of the router with four virtual channels (VC)

The relative address updating logic for the four directions is described below.
North :
South :
West :
East :
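Since the update equations themselves are not reproduced here, the following is only a plausible reconstruction of the decision and update step under the stated rules (horizontal priority, vertical detour with Y compensation when no horizontal link exists). The direction names, the detour choice, and the sign conventions are assumptions:

```python
def route(x_rel, y_rel, has_horizontal_link):
    # Relative address (0, 0) means the packet has arrived
    if x_rel == 0 and y_rel == 0:
        return "local"
    # Horizontal links are prioritized where they exist
    if x_rel != 0 and has_horizontal_link:
        return "east" if x_rel > 0 else "west"
    if y_rel != 0:
        return "north" if y_rel > 0 else "south"
    # X travel needed, no horizontal link, Y already zero: detour
    # vertically; the update below compensates Y so the packet moves
    # back after its X travel
    return "south"

def update_address(direction, x_rel, y_rel):
    # Move the relative address one hop toward zero in the travel
    # direction (4-bit two's complement in hardware; plain ints here)
    dx = {"east": -1, "west": 1}.get(direction, 0)
    dy = {"north": -1, "south": 1}.get(direction, 0)
    return x_rel + dx, y_rel + dy
```

In the detour case the Y relative address becomes nonzero (here +1), so once a horizontal link is reached and the X travel completes, the residual Y offset steers the packet back north, matching the behaviour described in the text.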
Performance Analysis
We evaluated the proposed NoC for different traffic patterns that are representative of real applications. The NoC is constructed in RTL using Verilog HDL and simulated using Mentor Graphics ModelSim, driven through a test-bench written in Verilog HDL. Synopsys Design Compiler is used for synthesis of the NoC.

The NoC is simulated using four types of traffic, viz. random, self-similar, and two application specific traffic patterns that model the execution of multimedia and DSP kernels on the polymorphic ASIC. In random traffic, each tile generates packets with random destinations. Packets are injected into the network from all ports every clock cycle, whenever the ports are available.
Self-similar traffic has been observed in the bursty traffic between on-chip modules in typical MPEG-2 video applications [4]. It has been shown that self-similar traffic can be modeled by aggregating a large number of ON-OFF message sources. The length of time each message spends in either the ON or the OFF state should be selected according to a distribution which exhibits long-range dependence. The Pareto distribution (F(x) = 1 - x^(-alpha), with 1 < alpha < 2) has been found to fit this kind of traffic well. A packet train remains in the ON state for t_ON = (1 - r)^(-1/alpha_ON) and in the OFF state for t_OFF = (1 - r)^(-1/alpha_OFF), where r is a random number uniformly distributed in [0, 1].

3.1

[Figure 5: throughput (fraction of network capacity, 0 to 1) versus number of virtual channels for DSP kernel, multimedia kernel, self-similar and random traffic.]

[Figure 6: average latency (0 to 45 cycles) versus number of virtual channels for the same four traffic patterns.]
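The ON-OFF Pareto model described above can be sketched with inverse-transform sampling. The alpha values in the example call are arbitrary placeholders within the stated 1 < alpha < 2 range, not the paper's parameters:

```python
import random

def pareto_period(alpha, rng):
    # Inverse-transform sample of F(x) = 1 - x**(-alpha):
    # x = (1 - r)**(-1/alpha) for r uniform in [0, 1)
    r = rng.random()
    return (1.0 - r) ** (-1.0 / alpha)

def on_off_trace(alpha_on, alpha_off, periods, seed=0):
    """Alternating (t_ON, t_OFF) durations for one self-similar source."""
    rng = random.Random(seed)
    return [(pareto_period(alpha_on, rng), pareto_period(alpha_off, rng))
            for _ in range(periods)]

# Example: 100 ON/OFF periods with placeholder shape parameters
trace = on_off_trace(alpha_on=1.9, alpha_off=1.25, periods=100)
```

With 1 < alpha < 2 the distribution has a finite mean but infinite variance, which is what gives the aggregated traffic its long-range dependence.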
We compare the throughput and latency for various traffic patterns on RECONNECT. Figure 5 shows the variation of throughput with the number of virtual channels for different types of traffic. Throughput is the maximum traffic accepted by the network and relates to the peak data rates sustainable by the system. The accepted traffic depends on the rate at which data is injected into the network. Ideally, accepted traffic should increase linearly with injection load. However, due to the limitation of routing resources, accepted traffic saturates at a certain injection load.

Throughput, expressed as a fraction of network capacity, is measured by applying a saturation source to each port that injects a new packet into the network whenever the port is free; more packets are inserted when the router contains more virtual channels. As can be observed in Figure 5, throughput saturates when the number of virtual channels exceeds 4. Higher throughput is achieved for application specific traffic than for random or self-similar traffic patterns, due to greater near-neighbor communication across tiles. The variation of throughput with virtual channels shows a similar trend for all four types of traffic.
Figure 6 shows the variation of latency with the number of virtual channels. Network latency is measured as the average difference between the arrival time at the destination and the launch time at the source, including the time
spent buffered at the source. The average packet latency increases with injection load and approaches infinity when the injected traffic is beyond the saturation throughput. For example, in the case of 4 virtual channels the saturation throughput is 62% of the total network capacity (Figure 5). Beyond this point the latency approaches infinity (Figure 7) because of the limitation of routing resources.

[Figure 7: average latency (0 to 100 cycles) versus injection load (0 to 0.9 of network capacity) for 2, 4 and 8 virtual channels.]

Conclusion
...
3.2
References
[1] ITRS 2001. In International Technology Roadmap for Semiconductors.
[2] P. Bai et al. A 65nm Logic Technology featuring 35nm gate
lengths, Enhanced channel strain, 8 Cu Interconnect layers,
Low-k ILD and 0.57m2 SRAM cell. In IEDM 04, pages
657660, Dec 2004.
[3] W. J. Dally. Virtual-Channel Flow Control. IEEE Trans.
Parallel Distributed Syst., 3(2):194205, 1992.
[4] G.Varatkar and R.Marculescu. Traffic Analysis for On-Chip
Networks Design of Multimedia Applications. In Design
Automation Conference, pages 510217. IEEE, June 2002.
[5] Kees Goossens et al. Ethreal Network on Chip: Concepts,
Architectures and Implementation. In IEEE Design and Test
of Computers. IEEE Computer Society, 2005.
[6] Mikael Millberg et al. Guarenteed Bandwidth using Looped
Containers in Temporally Disjoint Networks within the Nostrun Network on Chip. In Proceedings of the Design, Automation and Test. IEEE Computer Society, 2004.
[7] Mythri Alle et al. Synthesis of Application Accelerators on
Runtime Reconfigurable Hardware. In ASAP 08, 2008.
[8] K. Park and W.Willinger. Self-Similar Network Traffic and
Performance Evaluation. In John Wiley & Sons, 2002.
[9] A. N. Satrawala et al. Redefine: Architecture of a SOC Fabric for Runtime Composition of Computation Structures. In
FPL 07, 2007.
[10] I. Stojmenovic. Honeycomb Networks: Topological Properties and Communication Algorithms. In IEEE 97: Parallel
Distributed Systems, volume 8, pages 10361042, 1997.
Topology     Latency (cycles)   Area (mm2)   Power (mW)   Frequency (MHz)
Honeycomb    6                  0.134481     36.33        500
Mesh         4                  0.153767     43.93        500