
A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation

Xianmin Chen and Niraj K. Jha, Fellow, IEEE

Abstract— The power budget is expected to limit the portion of the chip that we can power ON at the upcoming technology nodes. This problem, known as the utilization wall or dark silicon, is becoming increasingly serious. With the introduction of 3-D integrated circuits (ICs), it is likely to become more severe. Thus, how to take advantage of the extra transistors, made available by Moore's law and the onset of 3-D ICs, within the power budget poses a significant challenge to system designers. To address this challenge, we propose a 3-D hybrid architecture consisting of a CPU layer with multiple cores, a field-programmable gate array (FPGA) layer, and a DRAM layer. The architecture is designed for low power without sacrificing performance. The FPGA layer is capable of supporting a large number of accelerators. It is placed adjacent to the CPU layer, with a communication mechanism that allows it to access CPU data caches directly. This enables fast switches between these two layers. This architecture reduces the power and energy significantly, at better or similar performance. This then alleviates the dark silicon problem by letting us power ON more components to achieve higher performance. We evaluate the proposed architecture through a new framework we have developed. Relative to the out-of-order CPU, the accelerators on the FPGA layer can reduce function-level power by 6.9× and energy-delay product (EDP) by 7.2×, and application-level power by 1.9× and EDP by 2.2×, while delivering similar performance. For the entire system, this translates to a 47.5% power reduction relative to a baseline system that consists of a CPU layer and a DRAM layer. This also translates to a 72.9% power reduction relative to an alternative system that consists of a CPU layer, an L3 cache layer, and a DRAM layer.

Index Terms— 3-D integrated circuit (IC), CPU, DRAM, field-programmable gate array (FPGA), hybrid architecture, low power.

Manuscript received July 7, 2015; revised September 17, 2015; accepted September 23, 2015. This work was supported by the National Science Foundation under Grant CCF-1216457. The authors are with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: xianminc@princeton.edu; jha@princeton.edu). Digital Object Identifier 10.1109/TVLSI.2015.2483525

I. INTRODUCTION

POWER consumption is among the top concerns of today's chip designers. Although Moore's law continues to enable increased transistor count, the power budget constrains the portion of the chip we can power ON. This problem is known as the utilization wall or dark silicon [1], [2]. It is expected to be a major showstopper for high-performance designs at the upcoming technology nodes. Another recent trend toward 3-D integrated circuits (ICs) aggravates this problem further. In the 3-D ICs, multiple dies are stacked together in one package, making more transistors available than predicted by Moore's law. This means designers have more transistors than they can afford to power ON. Thus, they face the challenge of how the extra transistors can be translated to better performance under the given power budget.

Adding specialized accelerators to aid general-purpose CPUs has been proven to be an effective method for reducing power consumption. The power saved by these accelerators can be used to power extra computational components to deliver better performance. These accelerators include coprocessors with special instructions, general-purpose graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and application-specific ICs (ASICs). As opposed to CPUs, they have specialized computation and control logic, thereby reducing the number of active transistors and, hence, power consumption. Moreover, the specialization may also offer a performance advantage over traditional CPUs. This can be traded for further power reduction by operating accelerators at a lower frequency.

Among different types of accelerators, ASICs are the most power-efficient but offer almost no functional flexibility. By contrast, FPGAs provide higher flexibility and reasonable power efficiency, making them perfect candidates for low-power computation. Previous studies have shown the advantage of FPGAs over CPUs and GPUs in this respect. For example, Kestur et al. [3] show that using the FPGAs for basic linear algebra subroutines can achieve 2.7× to 293× better energy efficiency than using GPUs and CPUs. Chen and Singh [4] show that using the FPGAs, based on high-level synthesis (HLS), for a document filtering algorithm can achieve 5.5× and 5.25× better energy efficiency than using GPUs and CPUs, respectively. Other works also report similar energy efficiency improvements of FPGAs over GPUs and/or CPUs [5], [6]. Although these works establish the advantage of FPGAs in reducing energy consumption, the accelerators used in them are highly customized for certain applications. In addition, these FPGA accelerators are only suitable for large computational jobs, since FPGAs are located off-chip and switching computation to them is costly.

In this paper, we propose a 3-D hybrid architecture for low-power computation to forestall the dark silicon problem while enabling the highest performance possible without hitting the power wall. The architecture is easy to use and meant for general-purpose computations. It consists of a CPU layer, an FPGA layer, and a DRAM layer. It takes advantage of through-silicon via (TSV)-based 3-D IC technology to provide vast FPGA resources close to CPUs. We use a mechanism to allow FPGA accelerators to access CPU data caches directly. This enables fast switches between a CPU and an FPGA accelerator. The FPGA layer is quite flexible in terms of how it is used. It could be configured as a large number of small accelerators, a small number of large accelerators, or something in between.


Moreover, the use of a separate FPGA layer allows it to be fabricated separately using the state-of-the-art FPGA design. Since the architecture is geared toward general-purpose computation, we use a state-of-the-art HLS tool to generate the accelerators directly. This eliminates the costly manual design process. Note that the proposed architecture is generic in the sense that it can be extended to include multiple FPGA layers and multiple DRAM layers. We have developed an evaluation framework to simulate systems with such an architecture. Our experimental results show that, compared with the out-of-order CPU, the accelerators on the FPGA layer can reduce function-level power and EDP by 6.9× and 7.2×, respectively, and application-level power and EDP by 1.9× and 2.2×, respectively, while delivering similar performance. Compared with a baseline system consisting of a CPU layer and a DRAM layer, the proposed architecture can achieve 47.5% power reduction for the entire system. Compared with an alternative system consisting of a CPU layer, an L3 cache layer, and a DRAM layer, the proposed architecture can achieve 72.9% power reduction for the entire system.

We summarize the contributions of this paper as follows.
1) We propose a new 3-D architecture with the CPU, FPGA, and DRAM layers for low-power computation.
2) We demonstrate a communication mechanism that allows the FPGA layer to access CPU data caches directly, crossing clock domains.
3) We develop a framework to evaluate the proposed architecture.
4) We evaluate performance, power, and thermal impacts of the architecture, both at the function and application levels.

The rest of this paper is organized as follows. Section II reviews related work on reconfigurable computing and 3-D ICs. Section III presents our proposed architecture. Section IV describes the framework we have developed to evaluate the architecture. Section V presents the experimental results. Finally, the conclusion is drawn in Section VI.

II. RELATED WORK

This section reviews related work on the reconfigurable computing and the 3-D ICs that will help place our work in the proper context.

A. Reconfigurable Computing

Reconfigurable computing uses reconfigurable fabrics to perform computations. A reconfigurable computing system can be built using two methods: 1) using customized reconfigurable fabrics and 2) using commodity FPGAs. Several designs, such as Garp, NAPA1000, and Chimaera, were constructed using the first method [7]–[11]. Customized reconfigurable fabrics can usually run at higher frequencies than commodity FPGAs due to the customization. They can also be embedded very close to the CPU to enable fast communication. However, one drawback of this method is that only a small amount of reconfigurable fabric can be placed near the CPU due to die area limitations. This limits the complexity of accelerators implemented on the reconfigurable fabric. Another drawback is that this method cannot take advantage of the latest developments in the FPGA industry. In contrast, systems built using the second method use the state-of-the-art commodity FPGAs [12]–[16]. The FPGA is off-chip and connects to the rest of the system through a bus or other mechanisms. The off-chip FPGA offers a considerable amount of reconfigurable resources. However, its drawback is that the communication between the FPGA and the CPU is costly in terms of both time and energy. We will show that our proposed architecture can overcome these drawbacks using 3-D stacking. This not only enables fast communication but also provides enough FPGA resources and takes advantage of the state-of-the-art FPGA designs.

Designing FPGA-based accelerators requires hardware expertise, which is a barrier to wider adoption of reconfigurable computing systems. Fortunately, recent developments in the HLS have significantly lowered this barrier. HLS tools directly convert functions written in high-level languages to hardware accelerators [17]–[22]. Some researchers have also studied direct compilation of compute unified device architecture (CUDA) or OpenCL (designed for heterogeneous parallel computing) to hardware [23], [24]. This takes advantage of the parallelism exposed by both languages and is suitable for systems with off-chip FPGAs. In this paper, we use a state-of-the-art HLS tool called Vivado HLS (known as AutoESL before Xilinx's acquisition) [25]. A third-party report indicates that it can achieve a quality level similar to that of manual designs by hardware experts, for several applications [26].

B. 3-D ICs

The 3-D ICs have multiple layers of dies, which are connected by the TSVs, stacked in a single package. This makes the so-called More-than-Moore scaling possible. In the 3-D ICs, different layers can be manufactured and tested separately. This not only increases the yield but also allows different layers to be optimized and manufactured in different technologies. Researchers have studied various approaches for using the 3-D IC architectures to improve performance and/or reduce power consumption. One popular approach is to put caches or memories in a layer separate from the CPU layer [27]–[32]. Although this approach is easy to implement, it is confronted with diminishing returns after memory/cache sizes exceed certain thresholds. Another approach for using the 3-D ICs is to split computational components, such as CPU cores, into multiple layers. This can reduce the wire length compared with the original 2-D designs, thus leading to performance and power benefits. However, this approach has two drawbacks: 1) it requires a very large number of TSVs, which are costly and 2) since computational components generally consume more power than caches or memories, stacking multiple layers of computational components may create thermal problems [33]. Compared with the above two approaches, as an alternative, adding an FPGA as a separate layer has two unique benefits: 1) the FPGA layer adds computational power and 2) since an FPGA layer consumes much less power than a CPU layer, adding it is much less likely to lead to thermal problems.

Fig. 1. Proposed 3-D hybrid architecture with a CPU layer, an FPGA layer, and a DRAM layer.

Because of these benefits, a recent effort, the FlexTiles platform, makes a move in this direction [34]. However, it uses a special embedded FPGA, which does not take advantage of recent FPGA developments. In this platform, the FPGA accelerators are connected to CPUs through a network-on-chip, incurring a large communication overhead. It also uses a dataflow programming model, which limits its use. As opposed to the FlexTiles platform, our proposed architecture takes advantage of the state-of-the-art FPGA and HLS technologies. It is easy to use and suitable for general-purpose computation. We also conduct experiments to evaluate its power and thermal impact, which the FlexTiles work lacks.

III. HYBRID ARCHITECTURE

This section presents our hybrid 3-D architecture. We first introduce the system architecture and then present its implementation details, e.g., FPGA acceleration, instruction set architecture (ISA) extension, CPU-accelerator communication, and memory access optimization. Finally, we discuss manufacture, test, cost, and reliability of the proposed architecture.

A. System Architecture

As shown in Fig. 1, our proposed hybrid 3-D architecture consists of three layers: 1) CPU; 2) FPGA; and 3) DRAM. An extended configuration may consist of multiple FPGA and DRAM layers. For simplicity, we will focus only on the basic configuration in this paper. The cores located on the CPU layer run at a high frequency (1 GHz or higher) and generate the most heat. Therefore, the CPU layer is located closest to the heat sink to expedite heat dissipation. Adjacent to the CPU layer is the FPGA layer. It contains accelerators for low-power computation. These accelerators run at a much lower frequency (several hundred megahertz) than the CPU cores. As a result, they generate much less heat and are placed farther from the heat sink. Above the FPGA layer is the DRAM layer. It contains the system's main memory. It usually generates the least heat and is, hence, placed farthest from the heat sink.

Adjacency of the CPU and FPGA layers enables fast communication between the CPU cores and the accelerators via TSVs that connect the two layers. Each CPU core can have, associated with it, one or multiple accelerators, which are implemented above the core on the FPGA layer, for the given application. An accelerator accesses the main memory via the data cache of the CPU core it belongs to. Such a design has several advantages: 1) accelerators can use existing infrastructure, including the CPU memory hierarchy and memory controllers, which saves hardware resources; 2) caches on the CPU layer can expedite memory accesses and increase accelerator performance; 3) sharing caches provides a uniform view of the memory space among CPU cores and accelerators, simplifying programming; and 4) sharing caches eliminates the need for copying data, enabling fast switches between a CPU core and an accelerator.

B. FPGA Accelerator

The accelerators implemented on the FPGA layer are generated directly from C/C++ code using an HLS tool. The tool we chose is Vivado HLS, a state-of-the-art tool from Xilinx [25]. During HLS, the tool maps the input arguments and return values of C/C++ functions to input and output signals of the generated accelerator. If an input argument is a value, it is mapped to an input signal that passes the value directly. If an input argument is a pointer, it is mapped to a memory access interface, which consists of the address, data, and handshaking signals. A return value, if it exists, is mapped to an output signal, passing the value directly. Besides these signals, Vivado HLS also adds necessary control signals to the accelerator, including clock, reset, and handshaking signals.

Inside the accelerator, sequential C/C++ code is mapped to a hardware structure based on the data and control flows. Fig. 2(a) and (b) shows an HLS example for an N × N matrix multiplication. Fig. 2(a) shows the original function in C, which multiplies matrices A and B and stores the result as matrix C. All matrices are stored in row-major order. In the function, the two outer loops iterate through each element of matrix C, and the innermost loop iterates through the corresponding row of matrix A and column of matrix B to calculate the matrix C element. Fig. 2(b) shows the generated accelerator structure. Inputs A, B, and C are pointers and mapped to three memory access interfaces. Input argument N is mapped directly to a value-passing signal. In the generated accelerator, the data path contains four multipliers to multiply four pairs of inputs and three adders to sum up the products. It is capable of handling four iterations of the original function's inner loop. The data path is controlled by the generated finite-state machine.

During accelerator generation, we use three types of optimizations supported by Vivado HLS to increase accelerator performance: 1) pipelining; 2) loop unrolling; and 3) dataflow. The pipelining optimization splits the data path into stages by inserting pipeline registers, allowing it to tackle one batch of inputs per cycle to increase throughput. The loop unrolling optimization unrolls a loop to allow multiple iterations of the loop body to be finished in parallel. This increases the performance at the cost of extra hardware resources.

Fig. 2. Example of HLS-generated accelerator. (a) Matrix multiplication source code. (b) Generated accelerator.

TABLE I
ACCELERATOR INSTRUCTIONS

Fig. 3. Accelerator interface.

For instance, in the N × N matrix multiplication example, the innermost loop of the original function is unrolled four times. The dataflow optimization enables task-level (functions and loops) pipelining by inserting buffers, such as first-in first-out (FIFO) or ping-pong buffers. All these optimizations can be performed via directives, which do not require any source code change [35].
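Since Fig. 2(a) itself is not reproduced in the text above, the following sketch shows what such a matrix-multiplication function and the discussed optimizations might look like when expressed as in-source Vivado HLS pragmas instead of external directives; the exact loop structure and data types are illustrative assumptions, not the code from the figure.

    /* Minimal sketch of an HLS-ready N x N matrix multiplication in the
     * style of Fig. 2(a). Matrices are in row-major order; A, B, and C are
     * pointer arguments (mapped to memory access interfaces) and N is
     * passed by value. Unrolling the innermost loop by 4 corresponds to
     * the four multipliers and three adders of the generated data path,
     * and pipelining lets the data path accept a new batch of operands
     * each cycle. */
    void matmul(const int *A, const int *B, int *C, int N)
    {
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                int sum = 0;
                for (int k = 0; k < N; k++) {
    #pragma HLS PIPELINE
    #pragma HLS UNROLL factor=4
                    sum += A[i * N + k] * B[k * N + j];
                }
                C[i * N + j] = sum;
            }
        }
    }

In the actual flow described above, the same effect is obtained with Tcl directives so that the C source stays untouched.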
The generated accelerator usually works at a lower frequency than the CPU core. However, it may still perform computations faster than the CPU core due to the following reasons: 1) it can have parallel data paths; 2) it can use customized pipelines to achieve higher throughput; 3) it can merge multiple operations into a single operation in its data path; and 4) it can have customized communication mechanisms inside, which are not constrained by the memory bottleneck.

To connect an accelerator to a CPU core working in another clock domain, we add a wrapper to the generated accelerator for interclock-domain communication, as shown in Fig. 3. The wrapper has two parts: 1) configuration/status registers and 2) interface protocol conversion. The configuration/status registers store the inputs and internal states. The interface protocol conversion part converts the generated accelerator interface to an interface consisting of FIFOs to provide a simple interface to connect to the rest of the system. Each pair of an inbound FIFO and an outbound FIFO is grouped into a channel. Based on the functionality, a channel belongs to one of two possible categories: 1) control channel and 2) data channel. A control channel is responsible for the communication of control information, including the information to set up the accelerator and the execution result. A data channel is responsible for memory access when the accelerator is running. The outbound FIFO of each data channel conveys the memory access request, and the inbound FIFO conveys the response. Each accelerator has only one control channel but may have one or more data channels. With the wrapper added, an accelerator can be easily connected to the rest of the system.

C. CPU ISA Extension

Using instructions to directly configure and call accelerators also allows fast switches between a CPU and an accelerator. Therefore, we added two new instructions to the CPU ISA to access the accelerators: 1) WRITE_ACC and 2) RUN_ACC. WRITE_ACC writes configuration registers in the accelerator. It has two operands: 1) the value to be written and 2) the register index. RUN_ACC starts the accelerator. It also has two operands: the accelerator index and another that indicates where the execution result (the return value of the original function) will be stored. Table I summarizes the two instructions. In the ALPHA instruction format, each instruction can have at most three operands, but only two are used in our design. As we will see, the ISA extension is simple enough to implement, yet supports complex accelerators.

A typical call to an accelerator has two steps: 1) the CPU uses the WRITE_ACC instruction to write accelerator registers and 2) it uses the RUN_ACC instruction to start the accelerator and receive the execution result. During instruction execution, an accelerator may access memory through the CPU data cache. Meanwhile, the CPU waits until the accelerator finishes or a translation lookaside buffer (TLB) miss occurs (discussed later). A more sophisticated implementation may support the simultaneous running of the CPU and accelerator. However, we did not choose this implementation, since our main goal is to reduce power.
D. CPU-Accelerator Communication

An FPGA accelerator can access the CPU data cache directly. To enable this sharing, a switch is added to enable multiplexing, as shown in Fig. 4. At any given time, either the CPU core or the accelerator has access to the data cache. Since the switch only adds a 2-to-1 multiplexer between the CPU kernel and the data TLB (DTLB)/cache, it only incurs a minor increase in path delay, which we analyze further in Section V.

Fig. 4. CPU-accelerator communication mechanism.

The CPU and accelerator are connected through a pair of FIFOs. These FIFOs are used as buffers to alleviate the speed mismatch, because the CPU core operates at a higher frequency and the accelerator at a lower one. On the CPU side, we use an adapter, called the CPU-side adapter, to connect the FIFOs with the data cache switch. On the FPGA side, another adapter, called the accelerator-side adapter, is used to connect the FIFOs to accelerator channels. Each accelerator may have more than one channel. Therefore, the accelerator-side adapter embeds both multiplexing and demultiplexing functions inside. Because the function of the accelerator-side adapter is fixed, it can be prefabricated directly into the FPGA, as in the case of the 3-D network interface [36]. This enables the accelerator-side adapter to work at the same frequency as that of the CPU core. As a result, the CPU data cache and both adapters work at a higher frequency than the accelerator, thus providing enough bandwidth for the accelerator. This is similar to the idea of multipumping [37], where the cache operates at a higher frequency than the FPGA accelerator, allowing multiple memory accesses per cycle by the accelerator. Note that our communication mechanism does not require any extra resources for caches and memory controllers. As a result, it saves resources compared with the multipumping design.

Fig. 5. Packet format.

Six types of packets are used for CPU-accelerator communication, as shown in Fig. 5. Three go from the CPU to the accelerator and the other three in the opposite direction. In each packet, the first two bits indicate the packet type. The remaining bits provide other information, such as the address. Details of each type of packet are given as follows (a code-level sketch of this encoding appears after the list).
1) Start (Starting an Accelerator): The index field of the packet indicates the index of the accelerator (in a multiaccelerator scenario).
2) Write (Writing an Accelerator Register): The index and value fields of the packet indicate the register index and value to be written, respectively.
3) Memory Response (Returning Memory Read Result): The value field in the packet is the memory read result.
4) Done (Indicating Accelerator Execution Is Finished): The value and size fields in the packet indicate the value and data size of the return value (if it exists), respectively. Binary codes 00, 01, 10, and 11 indicate 1-, 2-, 4-, and 8-byte data size, respectively.
5) Memory Read (Reading Memory): The address and size fields correspond to the address and data size of the read operation, respectively. Data size encoding is the same as above.
6) Memory Write (Writing Memory): The address, value, and size fields of the packet correspond to the address, value, and data size, respectively. Data size encoding is the same as above.
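A small sketch of how these packet types could be named and laid out in C follows. The text specifies only the 2-bit type field, the field names, and the 2-bit size encoding; the exact bit widths and code assignments below are assumptions for illustration, not the actual packet encoding.

    #include <stdint.h>

    /* Three packet types per direction, so a 2-bit type code suffices
     * in each direction (assumed code assignment). */
    enum cpu_to_acc_type { PKT_START, PKT_WRITE, PKT_MEM_RESPONSE };
    enum acc_to_cpu_type { PKT_DONE, PKT_MEM_READ, PKT_MEM_WRITE };

    /* Assumed layout of an accelerator-to-CPU memory write packet. */
    struct mem_write_pkt {
        uint8_t  type;   /* 2 bits used: PKT_MEM_WRITE                  */
        uint64_t addr;   /* virtual address of the access               */
        uint64_t value;  /* data to be written                          */
        uint8_t  size;   /* 2-bit encoding: 00/01/10/11 = 1/2/4/8 bytes */
    };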
Fig. 6(a) shows an example of CPU-accelerator communication using the packets introduced above. First, the CPU executes a WRITE_ACC instruction, which sends a Write packet (packet a) to the accelerator. Then, the CPU executes a RUN_ACC instruction, sending a Start packet (packet b) to the accelerator. During accelerator execution, it accesses memory twice: for a memory read and a memory write. The memory read leads to a Memory read packet (packet c) from the accelerator to the CPU and a Memory response packet (packet d) in the opposite direction as a response. The memory write leads to only a Memory write packet (packet e) from the accelerator to the CPU. When the accelerator has finished, it sends a Done packet (packet f) to the CPU. In our design, the accelerator accesses the virtual memory space directly. This means that a TLB miss may occur during memory accesses. Fig. 6(b) shows an example of such a scenario where the Memory read packet (packet i) triggers a TLB miss exception.

Fig. 6. CPU-accelerator communication examples. (a) Without TLB miss. (b) With TLB miss.

Fig. 7. Memory access optimization. (a) Duplicate data channels. (b) Insert scratchpad memories.

In this situation, the CPU immediately suspends the RUN_ACC instruction and starts the TLB miss exception handling procedure. At the same time, the CPU-side adapter stores the packet that triggered the exception and stops receiving any new packets. When the CPU is handling an exception, the accelerator keeps running until it can no longer continue. Once the CPU has finished exception handling, it re-executes the RUN_ACC instruction, which resends a Start packet (packet j) to the accelerator. When the CPU-side adapter receives it, it does not forward the packet to the accelerator. Instead, it deletes the packet and sends the stored packet (renamed as packet k) to the CPU to reread the address that caused the TLB miss. At this time, the memory read succeeds and the accelerator continues its execution. Since packets j and k only exist between the CPU and the CPU-side adapter, we have marked them with a dotted line in Fig. 6(b).

E. Memory Access Optimization

Besides the optimizations performed with directives mentioned in Section III-B, we also perform optimizations specifically targeted at speeding up memory access. During HLS, data channels (memory interfaces) are generated for pointer-passed arguments in the original function. Since the accelerator operates at a lower frequency than the CPU core and each data channel can only access one batch of data per cycle, memory access may become a performance bottleneck. Note that the root of this problem is the low-frequency nature of FPGA accelerators. In our architecture, the accelerator accesses the CPU data cache directly, which operates at a higher frequency and provides enough bandwidth. To tap into this bandwidth, we need to duplicate the data channels to access the cache data in parallel. Fig. 7(a) shows an example of an accelerator with duplicated channels for pointer argument A. As mentioned earlier, the CPU data cache has enough bandwidth to feed the duplicated channels. If this bandwidth is still not sufficient, we can create multiport scratchpad memories inside the FPGA. They load needed data from the CPU data cache at the beginning of accelerator execution, provide these data to the accelerator and store results during execution, and store the execution results back to the CPU data cache at the end. They have multiple ports to deliver sufficient bandwidth during accelerator execution. Fig. 7(b) shows an example of a scratchpad memory added for the data channel for A. Choi et al. [37] and Putnam et al. [38] show that inserting caches or scratchpad memories is possible with the help of HLS tools. In our implementation, since we do not have access to the source code of Vivado HLS, instead of modifying the HLS tool, we modify the source code of the accelerated functions to direct the HLS tool to synthesize the desired scratchpad memories. We use this to mimic direct HLS support.
F. Manufacture, Test, Cost, and Reliability

In this section, we provide a qualitative analysis of manufacture, test, cost, and reliability of our proposed architecture. First, to manufacture a chip based on our proposed architecture, the die for each layer can be manufactured independently and then bonded together using face-to-back bonding. The number of TSVs needed for CPU-accelerator communication is determined by the data width of the FIFOs. Based on the packet sizes targeted in Section III-D, the two FIFOs used by each CPU core are 71- and 132-bit wide, respectively. For a typical design with eight cores, this translates to 1624 [(71 + 132) × 8] TSVs. Together with 608 TSVs (512 bits for data, 64 bits for address, and 32 bits for control) needed by the memory controller, the total number of TSVs needed is 2232. A TSV redundancy scheme may require a slightly higher number to cope with TSV failures in manufacture. However, since the increase is minor and depends on the manufacturing process (e.g., the redundancy scheme presented in [39] achieves a 95% recovery rate by adding one redundant TSV per 25-TSV block), we have not modeled TSV redundancy in this paper.

Second, 3-D ICs pose significant test challenges [40]. To achieve a better yield, the die for each layer needs to be tested before bonding. In our proposed architecture, since each layer contains complete functionality, prebond testing is easier compared with the case when fine-grain partitioning of functionality is employed across multiple die layers. In particular, we can add an independent test controller to each layer for this purpose, and each module on each layer can be treated as an isolated test island and tested separately [41].

Third, the cost of a 3-D IC consists of wafer cost, bonding cost, packaging cost (depends on both pin count and die size), and cooling cost [42]. Our proposed architecture incurs little overhead for the first three costs, given the relatively small number of TSVs it uses. Since it reduces power consumption and alleviates hotspots on chip, as shown in Section V-F, it also reduces the cooling cost.

TABLE II
SYSTEM CONFIGURATION

Fig. 8. Evaluation framework.

Finally, in a 3-D IC, the difference in thermal expansion coefficients between the TSV copper and the surrounding silicon causes thermal-induced stresses that lead to mechanical reliability issues. We can use the techniques and tools proposed in [43] to deal with such issues.

IV. EVALUATION FRAMEWORK

We designed a customized framework to evaluate our proposed architecture. Fig. 8 shows the structure of the framework. It consists of four parts: 1) HLS; 2) performance simulation; 3) power simulation; and 4) thermal simulation. The HLS part contains Vivado HLS, which converts a C/C++ function into accelerator hardware at the register-transfer level. The function to be converted is chosen based on software profiling results. The generated accelerator has multiple versions with the same underlying hardware structure but described in different languages: Verilog hardware description language (HDL), VHDL, and SystemC. Then, a script parses the accelerator interface and changes the benchmark source code to use the generated accelerator instead of the original function. The new source code is cross-compiled into an executable binary for performance simulation.

The performance simulator we use is gem5 [44], an open-source full-system simulator. Its open-source nature allows us to add accelerator simulation support. In particular, we integrate the SystemC version of the accelerator code, as well as wrappers generated by our scripts, into the gem5 code and simulate them together. We choose to use the SystemC version, since it can be compiled with the gem5 source code (written mainly in C++ and Python). Since the gem5 simulator and the SystemC simulation engine have two separate event queues, we employ the method described in [45] to synchronize the two event queues and enable joint simulation.

After the integration, gem5 yields performance results, including component-level utilization statistics and waveforms for the accelerator (output of the SystemC simulation engine). For most system components except accelerators, TSVs, adapters, and DRAM, we use McPAT to estimate power [46]. McPAT takes the component-level utilization statistics, together with the system configuration, to estimate power for each component. As for the accelerator, we first synthesize, place, and route the accelerator's Verilog HDL code to get its low-level implementation details. Then, we use them with the dumped waveform to estimate the accelerator power. This is done using the Xilinx Vivado tool [25]. For TSVs and adapters, we build power macromodels by adding up the power consumptions of their internal components, similar to the way McPAT is built. We use the macromodels, along with their utilization statistics, to estimate their power. For DRAM, we use Micron's DRAM power calculator to estimate power (not shown in the figure) [47]. After we have all the power data, we use HotSpot (with the 3-D IC support extension from [31]) to perform thermal simulation [48]. We have automated the entire process, shown in Fig. 8, to expedite experimentation.

Footnote 1: We assume a 50-cycle memory controller latency (half of the 100-cycle latency assumed in an off-chip 2-D version), the same as that in [31]. Memory delay is calculated by summing up tRAS and tRP (35 and 13.75 ns, respectively, for Micron DDR3-1600 synchronous dynamic random-access memory (SDRAM) [49]), leading to 98 cycles at 2-GHz frequency. In all, the memory delay is 148 cycles.

TABLE III
BENCHMARKS

TABLE IV
FPGA RESOURCE UTILIZATION

V. EXPERIMENTAL RESULTS

This section presents the experimental results. We first discuss the experimental setup. Then, we show the resource utilization rate, function-level/application-level performance, and power of the FPGA accelerators. Next, we provide results for performance and power optimization. Finally, we evaluate the system-level power and thermal impact of the proposed architecture.

A. Experimental Setup

To evaluate the proposed architecture, we use an implementation with an eight-core CPU layer, an FPGA layer, and a DRAM layer. We refer to it as CPU + Acc henceforth. To understand how accelerators in the proposed architecture impact performance and power, we compare CPU + Acc with a baseline system (referred to as CPU-only) consisting of a CPU layer and a DRAM layer, but no FPGA layer. Since CPU-only has one less layer than CPU + Acc, we also compare CPU + Acc with another system that has the same number of layers. This system, referred to as CPU + L3, consists of a CPU layer, an L3 cache layer, and a DRAM layer. Table II shows the configurations of the three systems. They are all assumed to be implemented in the 28-nm CMOS technology except the DRAM layers, which are implemented in the 32-nm CMOS technology. We further assume that memory cells of all caches are implemented with low-standby-power transistors to reduce leakage power, while the rest of the system is implemented with high-performance transistors to improve performance. Such a feature is supported by the cache-modeling tool, CACTI [50], which is used inside McPAT.

We assume 1/2/6-μm dimensions for TSV diameter/pitch/depth, which are the minimum ITRS diameter/pitch/depth guidelines for years 2012–2014 (they are also within the predicted range for years 2015–2018) [51]. We calculate the TSV capacitance using the method given in [52] for power estimation. For the CPU layer of CPU + Acc, the estimated die size is 84.07 mm², of which TSVs consume 0.02%. In CPU-only and CPU + L3, the CPU layers have slightly smaller (0.62%) die sizes than that of CPU + Acc, because they do not have the CPU-side adapters. For the FPGA layer of CPU + Acc, we use a Xilinx Kintex FPGA (xc7k160tfbg484-3; implemented in the 28-nm technology). Its die size is 8.54 × 12.05 mm² [53]. We scale it to the same size as the L3 cache layer (87.76 mm²) of CPU + L3 for a fair comparison. For the DRAM layers of all three systems, since the die size information of commercial DRAMs is not easily accessible, we assume 6F² area per memory cell and 56.08% area efficiency to estimate the die size [54], where F is the feature size. Following this method, we estimate the die size of a 1-GB DRAM to be 94.11 mm². In all three systems, the CPU cores run at 2 GHz. In CPU + Acc, the FPGA accelerators run at 300 MHz, which is the best we can get on the FPGA. The switch between the CPU kernel and the DTLB/cache adds 16.5 ps to the path delay (based on McPAT), which amounts to 3.3% of the CPU clock period. We assume that a margin exists in CPU + Acc to tolerate this delay increase and allow the CPU cores to run at 2 GHz. Another method to estimate the delay overhead is to express both the clock period and the switch delay in terms of fanout-of-4 inverter delays, as in [55]. This method also yields similar results.
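As a check on the quoted DRAM figure, the die-size estimate can be worked out explicitly, assuming a 1-GB (8-Gbit) die and F = 32 nm for the DRAM layer:

    A_{cells} = 8 \times 2^{30} \times 6F^2
              = 8.59 \times 10^{9} \times 6 \times (32\,\mathrm{nm})^2
              \approx 52.8\,\mathrm{mm}^2
    A_{die}   = A_{cells} / 0.5608 \approx 94.1\,\mathrm{mm}^2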
We use the benchmarks listed in Table III to test CPU-only, CPU + L3, and CPU + Acc. The first two benchmarks are microbenchmarks: 1) integer and 2) floating-point 512 × 512 matrix multiplications. Both split the 512 × 512 matrices into 16 × 16 submatrices to increase cache performance. The next five benchmarks (gsm to bzip2) are from the CHStone suite plus bzip2 [56]. Since several of these benchmarks are very short, their main functions are repeated multiple times using loops to extend their duration in our experiments. The remaining benchmarks, blackscholes and x264, are from the PARSEC benchmark suite [57], [58]. Table III includes descriptions of the benchmarks.

Fig. 9. Function-level power, energy, execution time, and EDP comparison. (a) Function power. (b) Function energy. (c) Function execution time. (d) Function EDP.

The function and percentage columns show one or two of the most time-consuming functions in each benchmark and their percentages of the total execution time, respectively.

B. Resource Utilization

Table IV shows the percentage of resources used by the generated accelerators. In the name column, the -1 and -2 suffixes indicate the first and second functions when two functions are accelerated in the same benchmark. These suffixes are also used in subsequent figures in this section. Four types of resources are listed in the table: lookup table (LUT), register, block RAM (BRAM), and DSP blocks. Most accelerators use more than 10% of the LUTs except x264. For x264, the accelerated functions, x264_pixel_sad_x4_16x16 and pixel_satd_wxh, are small functions and thus consume only 2.3% and 8.2% of the LUTs, respectively.

C. CPU Versus Accelerator: Function-Level Power, Energy, Performance, and EDP

In this section, we compare the power, energy, performance, and EDP of the out-of-order CPU core and the FPGA accelerators when running the same functions. The first two bars, labeled CPU and Acc, in Fig. 9(a)–(d) compare the power, energy, execution time, and EDP of the CPU (in CPU-only) and the FPGA accelerators (in CPU + Acc).

Fig. 10. Function-level power (absolute values).

All results are normalized to that of the CPU (the operating temperature is assumed to be 65 °C for power estimation). When estimating CPU power and energy, we include the data cache/TLB and instruction cache/TLB inside the core. When estimating accelerator power and energy, we only include the data cache (D$) and DTLB, since accelerators only access them. From Fig. 9(a), we can see that the accelerators consume significantly less power than the CPU. This is because the accelerators contain dedicated computational structures, whereas the out-of-order CPU contains components that do not contribute directly to computation, such as components dedicated to instruction fetching and decoding, branch prediction, and register renaming. When accelerators are used, the data cache and DTLB consume a very small amount of power, owing to the fact that the leakage power of memory cells inside the data cache is reduced using low-standby-power transistors. To better understand the impact of using accelerators, we also show the absolute power values in Fig. 10. On an average, the accelerators consume only 22.8% of the power consumed by the CPU. The power reduction translates to energy reduction in Fig. 9(b), where the accelerators can be seen to consume only 20.2% of the energy consumed by the CPU on an average.

As mentioned earlier, although a CPU core runs at a much higher frequency than an accelerator, an accelerator may still deliver better performance due to its dedicated structure. This is evident from Fig. 9(c), where, compared with the CPU, accelerators deliver similar or even better performance for many benchmarks except fmmul, aes, and bzip2. On an average, the execution time of the accelerators is 82.9% of that of the CPU. When combining energy and execution time into one metric, EDP, we see that the accelerators outperform the CPU on all benchmarks. The average EDP of the accelerators is 21.5% of that of the CPU. Note that in the accelerator power consumption calculation, the leakage power of the entire FPGA is included, which makes the comparison unfair to the accelerators. When multiple accelerators exist on the FPGA, this leakage power gets amortized. Thus, for each accelerator, we should only count the portion of power consumed by the accelerator based on the portion of resources (LUTs) it uses. In Fig. 9(a)–(d), the third bar, labeled +Partial L., shows the result after this adjustment. On an average, the power, energy, and EDP drop to 14.5%, 13.2%, and 13.9% of the CPU, translating to 6.9×, 7.6×, and 7.2× reductions, respectively. The execution time remains the same after the adjustment.
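For reference, the combined metric used in these comparisons is the energy-delay product; for a given benchmark, the normalized EDP is simply the product of the normalized energy and normalized execution time:

    \mathrm{EDP} = E \times t = P \times t^2, \qquad
    \frac{\mathrm{EDP}_{\mathrm{Acc}}}{\mathrm{EDP}_{\mathrm{CPU}}}
      = \frac{E_{\mathrm{Acc}}}{E_{\mathrm{CPU}}} \times \frac{t_{\mathrm{Acc}}}{t_{\mathrm{CPU}}}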
Fig. 11. Cache performance. (a) Data cache miss rate. (b) Data cache accesses.

To evaluate the benefit of accessing data directly from the CPU data cache, we measure the data cache miss rates of the CPU and accelerators. The result is shown in Fig. 11(a). For most benchmarks, the accelerators have higher miss rates than the CPU. This is because the CPU runs software functions, which have intermediate variables. These variables usually end up in the data cache and drive the average miss rate down. By contrast, accelerators only access the cache for pointer-passed arguments. These are inputs from outside and are less likely to be in the cache. This makes their cache miss rates higher. To show this difference, Fig. 11(b) compares the amount of data cache accesses by the CPU and accelerators. On an average, the amount of data cache accesses by accelerators is only 23.6% of that of the CPU. In summary, all accelerator miss rates are <9%, and the average is 3.6%. This means that the CPU data cache provides very good caching for accelerators.

D. CPU Versus Accelerator: Application-Level Power, Energy, Performance, and EDP

In this section, we evaluate the power, energy, performance, and EDP of the out-of-order CPU core and the FPGA accelerators when running entire benchmark applications on the proposed architecture. When using accelerators, the CPU is responsible for the unaccelerated parts of each benchmark.

Fig. 12. Application-level power, energy, execution time, and EDP comparison. (a) Benchmark power. (b) Benchmark energy. (c) Benchmark execution time. (d) Benchmark EDP.

Due to this reason, we use CPU + Acc to label results, the same as the system name. Similarly, we use CPU-only and CPU + L3 to label the results when running benchmarks on CPUs of CPU-only and CPU + L3, respectively. Fig. 12(a)–(d) shows the power, energy, execution time, and EDP, respectively, when running each benchmark. On an average, power, energy, execution time, and EDP of CPU + Acc are 59.7%, 53.5%, 93.8%, and 53.5% of CPU-only, respectively. As in the case of function-level results, we also include results when the FPGA leakage power is scaled according to resources used and show them as the bar labeled +Partial L. in these figures. On an average, the power, energy, and EDP of CPU + Acc are now 52.3%, 46.3%, and 46.3% of CPU-only, translating to 1.9×, 2.2×, and 2.2× reduction, respectively. As expected, the power reduction at the application level is less than at the function level. This is because the accelerators only account for part of the execution of each benchmark, and the CPU is still responsible for the rest.

Fig. 13. Benchmark execution time versus frequency.

To deal with this issue, we could use multiple accelerators to cover a larger part of the execution of each benchmark. This has already been addressed in our results by accelerating multiple functions for certain benchmarks.

From Fig. 12(a)–(d), we found that CPU + L3 performs very similarly to CPU-only. The average execution time of CPU + L3 is 99.8% of that of CPU-only. This is because the L2 cache of CPU + L3 is big enough to accommodate the working data sets, rendering little advantage from adding an L3 cache. It confirms the fact that adding caches has diminishing returns after cache sizes exceed certain thresholds. As a result, the comparison results between CPU + L3 and CPU + Acc are very similar to those between CPU-only and CPU + Acc. On an average, the power, energy, execution time, and EDP of CPU + Acc (with +Partial L.) are 52.2%, 46.3%, 94%, and 46.3% of CPU + L3, respectively.

E. Performance and Power Optimization

As the performance results indicate, CPU + Acc runs slower than CPU-only for some benchmarks. This may not be allowed when execution time is a top concern. To solve this problem, we exploit an automatic selection mechanism to direct the benchmark to run on the CPU if the accelerator is slower. This mechanism ensures that the execution time is better than or the same as that of CPU-only. Note that once an accelerator is generated, whether it is faster or slower than the CPU can be determined by simulations beforehand. For situations where execution speed is data-dependent, a dynamic selection method can be used, which chooses the CPU or accelerator dynamically, epoch by epoch, using the history data. In our experiments, we assume that the time budget for a benchmark is the same as its execution time on CPU-only. We further assume that the execution time using the CPU or accelerators can be predetermined by simulations.
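A sketch of such an epoch-by-epoch policy is shown below; the structure, epoch granularity, and decision rule are assumptions used only to make the idea concrete (the experiments in this paper instead use static selection based on presimulated execution times).

    #include <stdint.h>
    #include <stdbool.h>

    struct selector {
        uint64_t cpu_cycles_last;   /* measured cost of the last CPU epoch         */
        uint64_t acc_cycles_last;   /* measured cost of the last accelerator epoch */
        bool     use_acc;           /* engine chosen for the next epoch            */
    };

    /* Called once per epoch with the cycle count just measured for whichever
     * engine ran; the next epoch is given to the engine with the lower
     * recently observed cost. */
    static void update_choice(struct selector *s, uint64_t measured_cycles)
    {
        if (s->use_acc)
            s->acc_cycles_last = measured_cycles;
        else
            s->cpu_cycles_last = measured_cycles;
        s->use_acc = (s->acc_cycles_last <= s->cpu_cycles_last);
    }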
With the time budget assumed above, for benchmarks on which CPU + Acc runs faster, we also apply dynamic voltage and frequency scaling (DVFS) to the CPU core to further reduce its power. To show that this is applicable, Fig. 13 shows the benchmark execution time versus the CPU frequency when using CPU + Acc. The values are normalized to CPU-only. For six out of the nine benchmarks, even when we lower the CPU frequency, CPU + Acc can still deliver better performance than CPU-only. This means we can use DVFS to achieve more power reduction for these benchmarks. To incorporate DVFS into power estimation, we scale the McPAT results based on the voltage and frequency obtained using the method given in [59]. We assume that the frequency changes from 2 to 1.5 GHz in 0.1-GHz steps, and Vdd changes linearly with frequency in this range. The bar labeled +Auto. Sel. in Fig. 12(a)–(d) shows the results of using automatic selection and DVFS. In Fig. 12(a), we can see that the average power increases to 58.9% of CPU-only, from 52.3% before. This is because the CPU is chosen to run some benchmarks due to the execution time constraint. Similarly, the average energy per benchmark in Fig. 12(b) increases to 55.5% of CPU-only, from 46.3% before. In Fig. 12(c), all the execution times can be seen to be equal to or shorter than those of CPU-only. This ensures no adverse impact on execution time. On an average, the execution time is reduced to 86.7% of CPU-only, from 93.8% before. From Fig. 12(d), we see that the average EDP increases to 53.9% of CPU-only, from 46.3% before.

Fig. 14. Comparison of system power and maximum temperature. (a) System power. (b) Maximum temperature.

F. System Power and Thermal Impact Evaluation

In this section, we evaluate the system power and thermal impact of the proposed architecture. The system power consists of the power of all the components in the system, including all caches and main memory. To fully take advantage of the FPGA resources, we evaluate a scenario where multiple homogeneous accelerators exist on the FPGA layer of CPU + Acc. We use multithread settings for the benchmarks with multithread support (blackscholes and x264), and duplicate the single-thread version for those without this support. A CPU core and an accelerator (two accelerators if two functions are accelerated) are allocated to run each thread. Note that the number of threads is limited by the FPGA resource utilization shown in Table IV and the number of CPU cores. We keep increasing the number of threads until we exhaust at least one type of FPGA resource or reach the maximum number of CPU cores. The final thread/accelerator numbers are given in the last column of Table IV. Fig. 14(a) shows the comparison of the system power of CPU-only, CPU + L3, and CPU + Acc. We found that CPU + Acc consumes much less power than CPU-only. On an average, CPU + Acc reduces power consumption by 47.5% relative to CPU-only.

CPU + Acc reduces power consumption by 47.5% relative [7] J. R. Hauser and J. Wawrzynek, Garp: A MIPS processor with
to CPU-only. By contrast, CPU + L3 consumes much higher a reconfigurable coprocessor, in Proc. 5th Annu. IEEE Symp.
Field-Program. Custom Comput. Mach., Apr. 1997, pp. 1221.
power than CPU-only and CPU + Acc. This is because the L3 [8] C. R. Rupp et al., The NAPA adaptive processing architecture, in Proc.
cache of CPU + L3 consumes considerable amount of power, IEEE Symp. FPGAs Custom Comput. Mach., Apr. 1998, pp. 2837.
even though we choose to use low-standby-power transistors [9] S. C. Goldstein et al., PipeRench: A coprocessor for streaming
multimedia acceleration, in Proc. 26th Int. Symp. Comput. Archit.,
for memory cells in the L3 cache. On an average, CPU + Acc May 1999, pp. 2839.
reduces power consumption by 72.9% relative to CPU + L3. [10] Z. A. Ye, A. Moshovos, S. Hauck, and P. Banerjee, CHIMAERA:
The differences of power consumption lead to differences A high-performance architecture with a tightly-coupled reconfigurable
in maxium temperatures, or hotspots, on the chip. Fig. 14(b) functional unit, in Proc. 27th Int. Symp. Comput. Archit., Jun. 2000,
pp. 225235.
shows comparisons of maximum temperatures. We can see that [11] V. Govindaraju, C.-H. Ho, and K. Sankaralingam, Dynamically spe-
the maximum temperature of CPU + Acc is much lower than cialized datapaths for energy efficient computing, in Proc. IEEE 17th
those of both CPU-only and CPU + L3. On an average, the Int. Symp. High Perform. Comput. Archit., Feb. 2011, pp. 503514.
[12] P. Bertin, D. Roncin, and J. Vuillemin, Introduction to programmable
maximum temperature of CPU + Acc is 49.7 C, 6.4 C lower active memories, Paris Res. Lab., Paris, France, Tech. Rep., pp. 18,
than that of CPU-only (56.1 C), and 9.5 C lower than that of Jun. 1989.
CPU + L3 (59.2 C). This indicates that CPU + Acc does not [13] P. M. Athanas and H. F. Silverman, Processor reconfiguration through
instruction-set metamorphosis, Computer, vol. 26, no. 3, pp. 1118,
aggravate the thermal problem, which is a major problem in Mar. 1993.
the 3-D ICs. On the contrary, it helps to alleviate the thermal [14] Cray, Inc. (2004). Cray XD1 Supercomputer. [Online]. Available:
problem. http://www.cray.com
[15] C. Chang, J. Wawrzynek, and R. W. Brodersen, BEE2: A high-end
reconfigurable computing system, IEEE Des. Test Comput., vol. 22,
VI. CONCLUSION
In this paper, we proposed a 3-D CPU-FPGA-DRAM hybrid architecture for low-power computation. The architecture features an FPGA layer between the CPU and DRAM layers. The FPGA layer implements accelerators that work with the CPU cores to achieve low-power computation, which helps address the dark silicon problem. We also take advantage of HLS to make the proposed architecture easy to use and suitable for general-purpose computation. To evaluate the architecture, we created an evaluation framework and conducted various experiments. At the function level, experimental results show that the accelerators on the FPGA layer reduce power and EDP by 6.9× and 7.2×, respectively, compared with an out-of-order CPU, while delivering similar performance. At the application level, they reduce power and EDP by 1.9× and 2.2×, respectively. For the entire system, this translates to a 47.5% power reduction compared with the baseline system without the FPGA layer and a 72.9% power reduction compared with an alternative system in which the FPGA layer is replaced with an L3 cache layer. These results demonstrate that the proposed architecture is effective in achieving its power reduction goal.
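As a consistency check on these factors (our own back-of-the-envelope calculation, not part of the paper's evaluation framework), note that under the standard definition EDP = energy × delay = power × delay², a 6.9× power reduction combined with a 7.2× EDP reduction implies a CPU-to-accelerator delay ratio of sqrt(7.2/6.9) ≈ 1.02, i.e., essentially unchanged execution time, which is consistent with the similar-performance claim above. A minimal Python sketch of this arithmetic:

import math

# Function-level reduction factors quoted above (CPU value divided by accelerator value).
power_factor = 6.9   # P_cpu / P_acc
edp_factor = 7.2     # EDP_cpu / EDP_acc, with EDP = power * delay^2

# EDP_cpu / EDP_acc = (P_cpu / P_acc) * (t_cpu / t_acc)^2, so the implied delay ratio is:
delay_ratio = math.sqrt(edp_factor / power_factor)
print(round(delay_ratio, 3))  # ~1.021: execution times are essentially the same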
Xianmin Chen received the B.S. degree in information engineering from Shanghai Jiao Tong University, Shanghai, China, in 2007, and the M.S. degrees from Shanghai Jiao Tong University and the Georgia Institute of Technology, Atlanta, GA, USA, in 2010. He is currently pursuing the Ph.D. degree in computer engineering with Princeton University, Princeton, NJ, USA. His current research interests include FinFET-based system design and reconfigurable computing.
Niraj K. Jha (S'85–M'85–SM'93–F'98) received the B.Tech. degree in electronics and electrical communication engineering from IIT Kharagpur, Kharagpur, India, in 1981, the M.S. degree in electrical engineering from SUNY at Stony Brook, Stony Brook, NY, USA, in 1982, and the Ph.D. degree in electrical engineering from the University of Illinois at Urbana-Champaign, Champaign, IL, USA, in 1985.
He is currently a Professor of Electrical Engineering with Princeton University, Princeton, NJ, USA. He has co-authored and co-edited five books entitled Testing and Reliable Design of CMOS Circuits (Kluwer, 1990), High-Level Power Analysis and Optimization (Kluwer, 1998), Testing of Digital Systems (Cambridge University Press, 2003), Switching and Finite Automata Theory, Third Edition (Cambridge University Press, 2009), and Nanoelectronic Circuit Design (Springer, 2010). He has authored 15 book chapters. He has authored or co-authored over 400 technical papers. He holds 14 U.S. patents. His current research interests include FinFETs, low-power hardware/software design, computer-aided design of integrated circuits and systems, digital system testing, quantum computing, and secure computing.
Prof. Jha is a Fellow of the Association for Computing Machinery. He received the Distinguished Alumnus Award from IIT Kharagpur. He is a recipient of the AT&T Foundation Award and the NEC Preceptorship Award for research excellence, the NCR Award for teaching excellence, and the Princeton University Graduate Mentoring Award. A paper of his was selected for "The Best of ICCAD: A Collection of the Best IEEE International Conference on Computer-Aided Design Papers of the Past 20 Years," two papers by IEEE Micro magazine as top picks from the 2005 and 2007 Computer Architecture Conferences, and two others as being among the most influential papers of the last ten years at the IEEE Design Automation and Test in Europe Conference. He has received nine other best paper awards and six best paper award nominations. He has served as the Editor-in-Chief of the IEEE Transactions on VLSI Systems and an Associate Editor of the IEEE Transactions on Circuits and Systems I and II, the IEEE Transactions on VLSI Systems, the IEEE Transactions on Computer-Aided Design, and the Journal of Electronic Testing: Theory and Applications. He serves as an Associate Editor of the IEEE Transactions on Computers, the Journal of Low Power Electronics, and the Journal of Nanotechnology. He served as the Program Chairman of the 1992 Workshop on Fault-Tolerant Parallel and Distributed Systems, the 2004 International Conference on Embedded and Ubiquitous Computing, and the 2010 International Conference on VLSI Design. He has served as the Director of the Center for Embedded System-on-a-Chip Design, funded by the New Jersey Commission on Science and Technology. He has served on the program committees of more than 150 conferences and workshops. He has given several keynote speeches in the area of nanoelectronic design and test.
