Abstract—This paper examines the feasibility of utilizing a 2-dimensional (2-D) mesh of run-time reconfigurable modules (RTRMs) on a dynamically and partially reconfigurable (DPR) FPGA for throughput- and real-time-driven tasks. To utilize a 2-D mesh of RTRMs, efficient communication architectures (CAs) are required, which are presented in this work. Such a 2-D mesh of RTRMs on a DPR-capable FPGA can be utilized for throughput-driven tasks to dynamically offload compute functions on a host-coupled system, providing multi-user and multi-context execution according to user demands. For embedded systems, it can be utilized as a highly dynamic platform that provides functional enhancement through module replacement during run-time. The exploration also includes a CA for real-time communication between RTRMs in a 2-D mesh. The presented CA design is based on a novel methodology that applies run-time reconfiguration to increase the performance. The design, the implementation, the performance and the resource utilization are shown for throughput- and real-time-driven CAs. As proof of concept, a case study is conducted for the presented CAs on state-of-the-art Virtex-5 FPGAs.
Keywords-dynamic reconfiguration; run-time reconfiguration; FPGA; 2-D mesh; communication architecture;
I. INTRODUCTION
Field programmable gate arrays (FPGAs), which are programmable logic devices, can be found in various fields of application, e.g. reconfigurable computing (RC) and embedded systems. The fundamental idea is based on the creation of application-defined processing engines with a programmable logic device, in contrast to a microprocessor program, which runs on a fixed instruction set. In a hybrid system consisting of a processor and an FPGA, the programmable logic device is able to accelerate user applications by creating self-defined, highly parallel and energy-efficient hardware processing engines. Woods et al. [1] gained a speedup of more than 50 compared with a CPU when accelerating a Quasi-Monte Carlo simulation. Zang et al. [2] reached a 25 times speedup on another Monte Carlo simulation. In embedded systems, FPGAs are capable of providing stand-alone replacement solutions for expensive silicon ASICs. Because the configuration of the FPGAs is based on SRAM, they can be reconfigured to their specific task within milliseconds over and over again. With the emergence of dynamically and partially reconfigurable (DPR) FPGAs, another degree of freedom was added to the resource utilization of configurable logic devices. Distinct regions of the FPGA can be reconfigured during run-time without affecting the configuration and the functionality of other parts of the FPGA. This feature of run-time reconfiguration (RTR) allows parts of the design, i.e. the functionality, of an already configured FPGA to be changed during run-time according to the demands of the user or the environment. The adaptability is increased in that functionality for tasks unknown at design time can be added later during the run-time of the system. Applying RTR can lead to higher energy efficiency and lower costs when parts of the design can be deployed on the FPGA by time division multiplexing (TDM), so that FPGAs with a lower amount of logic cells can be utilized. This paper is devoted to coarse-grained RTR based on run-time reconfigurable modules (RTRMs) on DPR-capable FPGAs.
An RTR system architecture is investigated which allows the arbitrary placement of RTRMs in a 2-D mesh at run-time without interfering with other RTRMs. As multiple RTRMs should be supported, an adequate communication architecture on the FPGA must be investigated. In this architecture all RTRMs should operate and communicate with other RTRMs or static modules independently, i.e. potentially all RTRMs can act as an initiator (master) of communication. Two different communication architectures (CAs) for RTRMs based on FPGA logic and routing resources will be examined. The first one targets throughput-driven RTR systems, like RC systems, where multiple RTRMs acting as offloading compute kernels can be run on a single FPGA. These could be systems with host-coupled FPGAs, where a user can create multiple accelerator modules on a single FPGA or different users are allowed to utilize the same FPGA for computational tasks. The second field of application targets real-time-driven embedded systems. The CA for this type of system needs to be designed in such a way that the communication of RTRMs complies with real-time constraints. The exploration of efficient CAs for RTRM-based RTR systems is crucial for the implementation
requests should be served as soon as possible when they
appear.
A. Design
Packet switching is applied for the transmission from sender to receiver. Each payload is preceded by a short header, including source, destination and packet size, occupying for simplicity half a word each. A word is defined here as the number of bits representing the native data link width of a router. Hence, every packet has a constant overhead of two words for the header. As routing strategy, deterministic dimension-based X-Y routing is applied. This is mainly due to the efficient implementation of X-Y routing in contrast to more complex strategies. Data is forwarded in a way combining the advantages of wormhole routing and flit-based routing. However, in contrast to true wormhole routing, packet data is propagated through the network as the route is built, thus minimizing the startup time. When a route is established by a router, it is reserved for the packet until all data has passed. To improve the performance with minimal buffer requirements, the routers are equipped with small FIFOs that store some flits rather than whole packets, as would be the case with true store-and-forward routing, which would increase both latency and resource costs since a packet is stored as a whole in each router before being propagated. Two small FIFOs are utilized as a compromise between throughput performance and costs of resource utilization.
B. Implementation
The main part of the throughput-driven router is the route generation and arbitration logic as depicted in Figure 2. The arbitration logic is responsible for a fair distribution of the router’s bandwidth using round-robin to poll the input ports. In every clock cycle the logic checks the currently selected input for a new request. If one is discovered, a source-destination tuple is generated describing the route the request will take.
Figure 2: Architecture of a throughput (TP) driven router (N=North, E=East, S=South, M=Module (RTRM), W=West)
Routing requests are managed via two request FIFOs keeping them in order. New request tuples are inserted into the first request FIFO, from which they can be assigned to one of the channels described later in this section. If a request is not feasible since its route is currently blocked by another packet being transmitted, it is moved from the first to the second request FIFO. This raises its priority, ensuring it will be served as soon as the blocking transmission is finished while allowing other feasible requests to be served in the meantime.
The data flow is managed by two independent channels, each consisting of a FIFO and control logic. These data FIFOs are not to be confused with the request FIFOs used by the route generation logic. FIFOs have been introduced to the design to allow a continuous flow of data, thus minimizing wait states. They are implemented using Virtex SliceM distributed RAM, allowing fast operation while keeping resource utilization low. Since buffers are limited and packet data is transmitted on-line as the route is established, our design requires flow control. It is implemented using the valid-stop protocol, allowing every RTRM and router to control the data flow. These signals are generated and used by the channel state machines to control FIFO operation.
V. REAL-TIME-DRIVEN COMMUNICATION ARCHITECTURE
The term real-time-driven in this context describes a communication architecture providing predictable latencies for the communication between RTRMs or between an RTRM and the controller unit for memory, I/O or host access.
Fields of application are embedded systems with real-time demands or RTRMs, acting as processing elements, which have low buffer capabilities and therefore have to process and deliver data just in time. The goal is to provide a very resource-efficient implementation of a communication architecture for the real-time communication of RTRMs in a 2-D mesh. This is achieved by the application of the following two techniques. First, the design is driven bottom-up by the device primitives of the FPGA. Secondly, the logic complexity and dependencies are reduced by applying RTR not only for the RTRMs but also for updating routing paths during run-time.
A. Design
Real-time tasks based on RTRMs rely on a CA with deterministic behavior and low worst-case latencies for the communication, whereas the resource utilization should be kept at a minimum. As a basic approach, time division
multiplexing (TDM) in combination with circuit switching (channel switching) is applied for each connection. A minimal routing logic is applied, where for each output (N_O, E_O, S_O, M_O, W_O) of a router, refer to Figures 3 and 4, an input is selected based on a slot within a communication cycle. Furthermore, no message format, i.e. header, that would have to be processed is needed. To notify the receiver that new data is available, the additional signal 'valid' is forwarded alongside the data signals of the links. This allows the protocol stack to be reduced to a minimum. A communication cycle is divided into a fixed number of slots. If there is no change of slots, the slots are equal for the next cycle. For each slot, one input is assigned for each output port of a router. Figure 3 depicts the division and an example of an assignment for a router. The route, respectively channel, has to be established before the start of the communication of the RTRM to the selected target. This can be done before the RTRM is run-time reconfigured in case of static communication relations or during run-time for dynamic relations. The task of establishing channels is assigned to the controller unit, which also manages RTR. Due to the external route generation by the controller unit, different routing strategies are possible, whereas dimension-based X-Y routing is chosen for simplicity. When a channel is set up, it cannot be interrupted or disturbed by the communication of other RTRMs during the duration of a slot, i.e. real-time constraints can be met with this design. The configurations for the switches within the router are stored in distributed RAM, i.e. SliceMs for Xilinx Virtex FPGAs. Run-time reconfiguration is taken to the extreme on DPR-capable FPGAs, e.g. Xilinx Virtex FPGAs, in the way that RTR is not only applied for the configuration of RTRMs, but also for changing routes of a channel for a slot during run-time, which is done by the controller unit. Typically 49 clock cycles at 100 MHz, i.e. 490 ns, including the CRC value, are required on Virtex-5 FPGAs to change all slots within one router, due to the frame-based DPR behavior of Virtex FPGAs. It should be noted that changing routes during run-time is not required in the standard case, because routes between communication partners are set up in a slot before or during the configuration of an RTRM and are kept alive until the termination of a communication partner. Removing the active routing algorithm, e.g. dimension-based, within the routers dramatically decreases their complexity. Furthermore, an external route generation allows the utilization of more sophisticated routing strategies since powerful devices such as embedded processors can be used for route generation.
Figure 3: Example of an assignment of outputs (N_O, E_O, S_O, M_O, W_O) of a router, here R-10, to inputs originating from sender RTRMs on a slot basis within a communication cycle for real-time communication
B. Implementation
Due to the novel methodology of utilizing RTR even for configuring the slots of a cycle, already mentioned in Section V-A, the implementation relies on only a few components, shown in Figure 4. This simplifies the design and allows higher overall clock speeds. Two shift registers, one for counting the clock cycles within a slot and a second for the slots within a communication cycle, are required. The latter addresses the memory where the configuration of the routers is stored on a slot basis. Two
bits are required for the configuration of each switch, i.e. it takes a total of 10 bits for each slot and each router. For a resource-efficient implementation, the shift registers are instantiated as SRLC32E Virtex primitives. For the RAM, two RAM32X8S (RAM32M) primitives are chosen, which are based on distributed RAM available in every second logic cell (SliceMs) in Virtex FPGAs. With the RAM32X8S
[Figure: speed (MHz) for TP 32-bit distr-1, TP 32-bit distr-2, TP 64-bit distr-1 and TP 64-bit distr-2]
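The slot-based switching of the real-time router described in Section V can be illustrated with a minimal Python sketch of what the controller unit might compute when it sets up a channel. It packs one 2-bit input selection per output (N, E, S, M, W) into a 10-bit slot word and derives the per-router selections from an X-Y route. Several details are assumptions made only for this illustration and are not taken from the paper: the order of the ports inside the word, the mapping of the 2-bit codes to the four other ports, the coordinate convention (E increases x, N increases y), and all function names.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]

# Assumed port order inside the 10-bit slot word (2 bits per output).
PORTS = ("N", "E", "S", "M", "W")
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}


def inputs_for(output: str) -> List[str]:
    """Assumed encoding: an output selects one of the four other ports."""
    return [p for p in PORTS if p != output]


def set_switch(word: int, output: str, inp: str) -> int:
    """Return the slot word with the 2-bit field of `output` set to `inp`."""
    shift = 2 * PORTS.index(output)
    code = inputs_for(output).index(inp)
    return (word & ~(0b11 << shift)) | (code << shift)


def get_switch(word: int, output: str) -> str:
    """Read back which input feeds `output` in this slot word."""
    shift = 2 * PORTS.index(output)
    return inputs_for(output)[(word >> shift) & 0b11]


def xy_route(src: Coord, dst: Coord) -> List[Tuple[Coord, str, str]]:
    """X-then-Y route as (router, input port, output port) per hop."""
    if src == dst:
        raise ValueError("source and destination must differ")
    (x, y), (dx, dy) = src, dst
    hops = []
    in_port = "M"  # data enters the mesh from the sender module
    while (x, y) != (dx, dy):
        if x != dx:
            out_port, step = ("E", (1, 0)) if dx > x else ("W", (-1, 0))
        else:
            out_port, step = ("N", (0, 1)) if dy > y else ("S", (0, -1))
        hops.append(((x, y), in_port, out_port))
        x, y = x + step[0], y + step[1]
        in_port = OPPOSITE[out_port]
    hops.append(((x, y), in_port, "M"))  # deliver to the receiver module
    return hops


def program_channel(tables: Dict[Coord, Dict[int, int]], slot: int,
                    src: Coord, dst: Coord) -> None:
    """Write the switch settings of one channel into the given slot."""
    for router, inp, out in xy_route(src, dst):
        table = tables.setdefault(router, {})
        table[slot] = set_switch(table.get(slot, 0), out, inp)


def active_slot(clock_cycle: int, cycles_per_slot: int, slots: int) -> int:
    """Slot index addressed by the two counters at a given clock cycle."""
    return (clock_cycle // cycles_per_slot) % slots
```

For example, `program_channel(tables, 3, (0, 0), (2, 1))` fills slot 3 of the four routers along the route, and `get_switch(tables[(2, 1)][3], "M")` then reports which input feeds the module output of the destination router. In the design described above these words would reside in the routers' distributed RAM and be updated by RTR; the dictionary here only stands in for that memory.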
design complexity. The outstanding performance to resource utilization ratio for the RT-CA is achieved by a consistent bottom-up design based on FPGA primitives in combination with the novel methodology of applying RTR. Due to this high performance to resource utilization ratio the RT-CA
[Figure: speed (MHz) for RT 32-bit distr-1, RT 32-bit distr-2, RT 64-bit distr-1 and RT 64-bit distr-2]