Abstract: The power budget is expected to limit the portion of the chip that we can power ON at the upcoming technology nodes. This problem, known as the utilization wall or dark silicon, is becoming increasingly serious. With the introduction of 3-D integrated circuits (ICs), it is likely to become more severe. Thus, how to take advantage of the extra transistors, made available by Moore's law and the onset of 3-D ICs, within the power budget poses a significant challenge to system designers. To address this challenge, we propose a 3-D hybrid architecture consisting of a CPU layer with multiple cores, a field-programmable gate array (FPGA) layer, and a DRAM layer. The architecture is designed for low power without sacrificing performance. The FPGA layer is capable of supporting a large number of accelerators. It is placed adjacent to the CPU layer, with a communication mechanism that allows it to access CPU data caches directly. This enables fast switches between these two layers. This architecture reduces the power and energy significantly, at better or similar performance. This then alleviates the dark silicon problem by letting us power ON more components to achieve higher performance. We evaluate the proposed architecture through a new framework we have developed. Relative to the out-of-order CPU, the accelerators on the FPGA layer can reduce function-level power by 6.9× and energy-delay product (EDP) by 7.2×, and application-level power by 1.9× and EDP by 2.2×, while delivering similar performance. For the entire system, this translates to a 47.5% power reduction relative to a baseline system that consists of a CPU layer and a DRAM layer. This also translates to a 72.9% power reduction relative to an alternative system that consists of a CPU layer, an L3 cache layer, and a DRAM layer.

Index Terms: 3-D integrated circuit (IC), CPU, DRAM, field-programmable gate array (FPGA), hybrid architecture, low power.

I. INTRODUCTION

challenge of how the extra transistors can be translated to better performance under the given power budget.

Adding specialized accelerators to aid general-purpose CPUs has been proven to be an effective method for reducing power consumption. The power saved by these accelerators can be used to power extra computational components to deliver better performance. These accelerators include coprocessors with special instructions, general-purpose graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and application-specific ICs (ASICs). As opposed to CPUs, they have specialized computation and control logic, thereby reducing the number of active transistors and, hence, power consumption. Moreover, the specialization may also offer a performance advantage over traditional CPUs. This can be traded for further power reduction by operating accelerators at a lower frequency.

Among different types of accelerators, ASICs are the most power-efficient but offer almost no functional flexibility. By contrast, FPGAs provide higher flexibility and reasonable power efficiency, making them strong candidates for low-power computation. Previous studies have shown the advantage of FPGAs over CPUs and GPUs in this respect. For example, Kestur et al. [3] show that using FPGAs for basic linear algebra subroutines can achieve 2.7× to 293× better energy efficiency than using GPUs and CPUs. Chen and Singh [4] show that using FPGAs, based on high-level synthesis (HLS), for a document filtering algorithm can achieve 5.5× and 5.25× better energy efficiency than using GPUs and CPUs, respectively. Other works also report similar energy efficiency improvements of FPGAs over GPUs and/or
of small accelerators, a small number of large accelerators, or something in between. Moreover, the use of a separate FPGA layer allows it to be fabricated separately using the state-of-the-art FPGA design. Since the architecture is geared toward general-purpose computation, we use a state-of-the-art HLS tool to generate the accelerators directly. This eliminates the costly manual design process. Note that the proposed architecture is generic in the sense that it can be extended to include multiple FPGA layers and multiple DRAM layers.

We have developed an evaluation framework to simulate systems with such an architecture. Our experimental results show that, compared with the out-of-order CPU, the accelerators on the FPGA layer can reduce function-level power and EDP by 6.9× and 7.2×, respectively, and application-level power and EDP by 1.9× and 2.2×, respectively, while delivering similar performance. Compared with a baseline system consisting of a CPU layer and a DRAM layer, the proposed architecture can achieve 47.5% power reduction for the entire system. Compared with an alternative system consisting of a CPU layer, an L3 cache layer, and a DRAM layer, the proposed architecture can achieve 72.9% power reduction for the entire system.

We summarize the contributions of this paper as follows.
1) We propose a new 3-D architecture with CPU, FPGA, and DRAM layers for low-power computation.
2) We demonstrate a communication mechanism that allows the FPGA layer to access CPU data caches directly, crossing clock domains.
3) We develop a framework to evaluate the proposed architecture.
4) We evaluate performance, power, and thermal impacts of the architecture, both at the function and application levels.

The rest of this paper is organized as follows. Section II reviews related work on reconfigurable computing and 3-D ICs. Section III presents our proposed architecture. Section IV describes the framework we have developed to evaluate the architecture. Section V presents the experimental results. Finally, the conclusion is drawn in Section VI.

II. RELATED WORK

This section reviews related work on reconfigurable computing and 3-D ICs that will help place our work in the proper context.

A. Reconfigurable Computing

Reconfigurable computing uses reconfigurable fabrics to perform computations. A reconfigurable computing system can be built using two methods: 1) using customized reconfigurable fabrics and 2) using commodity FPGAs. Several designs, such as Garp, NAPA1000, and Chimaera, were constructed using the first method [7]-[11]. Customized reconfigurable fabrics can usually run at higher frequencies than commodity FPGAs due to the customization. They can also be embedded very close to the CPU to enable fast communication. However, one drawback of this method is that only a small amount of reconfigurable fabric can be placed near the CPU due to die area limitations. This limits the complexity of accelerators implemented on the reconfigurable fabric. Another drawback is that this method cannot take advantage of the latest developments in the FPGA industry. In contrast, systems built using the second method use state-of-the-art commodity FPGAs [12]-[16]. The FPGA is off-chip and connects to the rest of the system through a bus or other mechanisms. The off-chip FPGA offers a considerable amount of reconfigurable resources. However, its drawback is that the communication between the FPGA and the CPU is costly in terms of both time and energy. We will show that our proposed architecture can overcome these drawbacks using 3-D stacking. This not only enables fast communication but also provides enough FPGA resources and takes advantage of state-of-the-art FPGA designs.

Designing FPGA-based accelerators requires hardware expertise, which is a barrier to wider adoption of reconfigurable computing systems. Fortunately, recent developments in HLS have significantly lowered this barrier. HLS tools directly convert functions written in high-level languages to hardware accelerators [17]-[22]. Some researchers have also studied direct compilation of compute unified device architecture (CUDA) or OpenCL (designed for heterogeneous parallel computing) to hardware [23], [24]. This takes advantage of the parallelism exposed by both languages and is suitable for systems with off-chip FPGAs. In this paper, we use a state-of-the-art HLS tool called Vivado HLS (known as AutoESL before Xilinx's acquisition) [25]. A third-party report indicates that it can achieve a quality level similar to that of manual designs by hardware experts, for several applications [26].

B. 3-D ICs

3-D ICs have multiple layers of dies, connected by through-silicon vias (TSVs) and stacked in a single package. This makes the so-called More-than-Moore scaling possible. In 3-D ICs, different layers can be manufactured and tested separately. This not only increases the yield but also allows different layers to be optimized and manufactured in different technologies. Researchers have studied various approaches for using 3-D IC architectures to improve performance and/or reduce power consumption. One popular approach is to put caches or memories in a layer separate from the CPU layer [27]-[32]. Although this approach is easy to implement, it is confronted with diminishing returns after memory/cache sizes exceed certain thresholds. Another approach for using 3-D ICs is to split computational components, such as CPU cores, into multiple layers. This can reduce the wire length compared with the original 2-D designs, thus leading to performance and power benefits. However, this approach has two drawbacks: 1) it requires a very large number of TSVs, which are costly, and 2) since computational components generally consume more power than caches or memories, stacking multiple layers of computational components may create thermal problems [33]. Compared with the above two approaches, as an alternative, adding an FPGA as a separate layer has two unique benefits: 1) the FPGA layer adds computational power and 2) since an FPGA layer consumes
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
CHEN AND JHA: 3-D CPU-FPGA-DRAM HYBRID ARCHITECTURE FOR LOW-POWER COMPUTATION 3
Fig. 2. Example of HLS-generated accelerator. (a) Matrix multiplication source code. (b) Generated accelerator.
TABLE I
ACCELERATOR INSTRUCTIONS
but may have one or more data channels. With the wrapper
added, an accelerator can be easily connected to the rest of
the system.
the CPU kernel and the data TLB (DTLB)/cache, it only incurs a minor increase in path delay, which we analyze further in Section V. The CPU and accelerator are connected through a pair of FIFOs. These FIFOs are used as buffers to alleviate the speed mismatch, because the CPU core operates at a higher frequency and the accelerator at a lower one. On the CPU side, we use an adapter, called the CPU-side adapter, to connect the FIFOs with the data cache switch. On the FPGA side, another adapter, called the accelerator-side adapter, is used to connect the FIFOs to accelerator channels. Each accelerator may have more than one channel. Therefore, the accelerator-side adapter embeds both multiplexing and demultiplexing functions inside. Because the function of the accelerator-side adapter is fixed, it can be prefabricated directly into the FPGA, as in the case of the 3-D network interface [36]. This enables the accelerator-side adapter to work at the same frequency as that of the CPU core. As a result, the CPU data cache and both adapters work at a higher frequency than the accelerator, thus providing enough bandwidth for the accelerator. This is similar to the idea of multipumping [37], where the cache operates at a higher frequency than the FPGA accelerator, allowing multiple memory accesses per cycle by the accelerator. Note that our communication mechanism does not require any extra resources for caches and memory controllers. As a result, it saves resources compared with the multipumping design.

Fig. 5. Packet format.

Six types of packets are used for CPU-accelerator communication, as shown in Fig. 5. Three go from the CPU to the accelerator and the other three in the opposite direction. In each packet, the first two bits indicate the packet type. The remaining bits provide other information, such as the address. Details of each type of packet are given as follows.
1) Start (Starting an Accelerator): The index field of the packet indicates the index of the accelerator (in a multiaccelerator scenario).
2) Write (Writing an Accelerator Register): The index and value fields of the packet indicate the register index and value to be written, respectively.
3) Memory Response (Returning Memory Read Result): The value field in the packet is the memory read result.
4) Done (Indicating Accelerator Execution Is Finished): The value and size fields in the packet indicate the value and data size of the return value (if it exists), respectively. Binary codes 00, 01, 10, and 11 indicate 1-, 2-, 4-, and 8-byte data size, respectively.
5) Memory Read (Reading Memory): The address and size fields correspond to the address and data size of the read operation, respectively. Data size encoding is the same as above.
6) Memory Write (Writing Memory): The address, value, and size fields of the packet correspond to the address, value, and data size, respectively. Data size encoding is the same as above.

Fig. 6(a) shows an example of CPU-accelerator communication using the packets introduced above. First, the CPU executes a WRITE_ACC instruction, which sends a Write packet (packet a) to the accelerator. Then, the CPU executes a RUN_ACC instruction, sending a Start packet (packet b) to the accelerator. During accelerator execution, it accesses memory twice: for a memory read and a memory write. The memory read leads to a Memory read packet (packet c) from the accelerator to the CPU and a Memory response packet (packet d) in the opposite direction as a response. The memory write leads to only a Memory write packet (packet e) from the accelerator to the CPU. When the accelerator has finished, it sends a Done packet (packet f) to the CPU. In our design, the accelerator accesses the virtual memory space directly. This means that a TLB miss may occur during memory accesses. Fig. 6(b) shows an example of such a scenario where the Memory read packet (packet i) triggers a TLB miss exception. In this situation,
Fig. 7. Memory access optimization. (a) Duplicate data channels. (b) Insert scratchpad memories.
TABLE II
SYSTEM CONFIGURATION
Fig. 8. Evaluation framework.

and cooling cost [42]. Our proposed architecture incurs little overhead for the first three costs, given the relatively small number of TSVs it uses. Since it reduces power consumption and alleviates hotspots on chip, as shown in Section V-F, it also reduces the cooling cost.

Finally, in a 3-D IC, the difference in thermal expansion coefficients between the TSV copper and the surrounding silicon causes thermal-induced stresses that lead to mechanical reliability issues. We can use the techniques and tools proposed in [43] to deal with such issues.

IV. EVALUATION FRAMEWORK

We designed a customized framework to evaluate our proposed architecture. Fig. 8 shows the structure of the framework. It consists of four parts: 1) HLS; 2) performance simulation; 3) power simulation; and 4) thermal simulation.

The HLS part contains Vivado HLS, which converts a C/C++ function into accelerator hardware at the register-transfer level. The function to be converted is chosen based on software profiling results. The generated accelerator has multiple versions with the same underlying hardware structure but described in different languages: Verilog hardware description language (HDL), VHDL, and SystemC. Then, a script parses the accelerator interface and changes the benchmark source code to use the generated accelerator instead of the original function. The new source code is cross-compiled into an executable binary for performance simulation.

The performance simulator we use is gem5 [44], an open-source full-system simulator. Its open-source nature allows us to add accelerator simulation support. In particular, we integrate the SystemC version of the accelerator code, as well as wrappers generated by our scripts, into gem5 code and simulate them together. We choose to use the SystemC version, since it can be compiled with the gem5 source code (written mainly in C++ and Python). Since the gem5 simulator and the SystemC simulation engine have two separate event queues, we employ the method described in [45] to synchronize the two event queues and enable joint simulation.

After the integration, gem5 yields performance results, including component-level utilization statistics and waveforms for the accelerator (output of the SystemC simulation engine). For most system components except accelerators, TSVs, adapters, and DRAM, we use McPAT to estimate power [46]. McPAT takes the component-level utilization statistics, together with the system configuration, to estimate power for each component. As for the accelerator, we first synthesize, place, and route the accelerator's Verilog HDL code to get its low-level implementation details. Then, we use them with the dumped waveform to estimate the accelerator power. This is done using the Xilinx Vivado tool [25]. For TSVs and adapters, we build power macromodels by adding up the power consumptions of their internal components, similar to the way McPAT is built. We use the macromodels, along with their utilization statistics, to estimate their power. For DRAM, we use Micron's DRAM power calculator to estimate power (not shown in the figure) [47]. After we have all the power data, we use HotSpot (with the 3-D IC support extension from [31]) to perform thermal simulation [48]. We have automated the entire process, shown in Fig. 8, to expedite experimentation.

1 We assume a 50-cycle memory controller latency (half of the 100-cycle latency assumed in an off-chip 2-D version), the same as that in [31]. Memory delay is calculated by summing up tRAS and tRP (35 and 13.75 ns, respectively, for Micron DDR3-1600 synchronous dynamic random-access memory (SDRAM) [49]), leading to 98 cycles at 2-GHz frequency. In all, the memory delay is 148 cycles.
TABLE III
BENCHMARKS
Fig. 9. Function-level power, energy, execution time, and EDP comparison. (a) Function power. (b) Function energy. (c) Function execution time. (d) Function EDP.
descriptions of the benchmarks. The function and percentage columns show one or two of the most time-consuming functions in each benchmark and their percentages of the total execution time, respectively.

B. Resource Utilization

Table IV shows the percentage of resources used by the generated accelerators. In the name column, the -1 and -2 suffixes indicate the first and second functions when two functions are accelerated in the same benchmark. These suffixes are also used in subsequent figures in this section. Four types of resources are listed in the table: lookup table (LUT), register, block RAM (BRAM), and DSP blocks. Most accelerators use more than 10% of the LUTs, except x264. For x264, the accelerated functions, x264_pixel_sad_x4_16x16 and pixel_satd_wxh, are small functions and thus consume only 2.3% and 8.2% of the LUTs, respectively.

C. CPU Versus Accelerator: Function-Level Power, Energy, Performance, and EDP

In this section, we compare the power, energy, performance, and EDP of the out-of-order CPU core and the FPGA accelerators when running the same functions. The first two bars, marked as CPU and Acc, in Fig. 9(a)-(d) compare
Fig. 12. Application-level power, energy, execution time, and EDP comparison. (a) Benchmark power. (b) Benchmark energy. (c) Benchmark execution time. (d) Benchmark EDP.
Due to this reason, we use CPU + Acc to label the results, the same as the system name. Similarly, we use CPU-only and CPU + L3 to label the results when running benchmarks on the CPUs of CPU-only and CPU + L3, respectively. Fig. 12(a)-(d) shows the power, energy, execution time, and EDP, respectively, when running each benchmark. On average, the power, energy, execution time, and EDP of CPU + Acc are 59.7%, 53.5%, 93.8%, and 53.5% of CPU-only, respectively. As in the case of the function-level results, we also include results when the FPGA leakage power is scaled according to the resources used and show them as the bar labeled +Partial L. in these figures. On average, the power, energy, and EDP of CPU + Acc are now 52.3%, 46.3%, and 46.3% of CPU-only, translating to 1.9×, 2.2×, and 2.2× reductions, respectively. As expected, the power reduction at the application level is less than at the function level. This is because the accelerators only account
CPU + Acc reduces power consumption by 47.5% relative to CPU-only. By contrast, CPU + L3 consumes much higher power than CPU-only and CPU + Acc. This is because the L3 cache of CPU + L3 consumes a considerable amount of power, even though we choose to use low-standby-power transistors for memory cells in the L3 cache. On average, CPU + Acc reduces power consumption by 72.9% relative to CPU + L3.

The differences in power consumption lead to differences in maximum temperatures, or hotspots, on the chip. Fig. 14(b) shows comparisons of maximum temperatures. We can see that the maximum temperature of CPU + Acc is much lower than those of both CPU-only and CPU + L3. On average, the maximum temperature of CPU + Acc is 49.7 °C, 6.4 °C lower than that of CPU-only (56.1 °C), and 9.5 °C lower than that of CPU + L3 (59.2 °C). This indicates that CPU + Acc does not aggravate the thermal problem, which is a major concern in 3-D ICs. On the contrary, it helps to alleviate the thermal problem.

VI. CONCLUSION

In this paper, we proposed a 3-D CPU-FPGA-DRAM hybrid architecture for low-power computation. The architecture features an FPGA layer between the CPU and DRAM layers. The FPGA layer implements accelerators that work with CPU cores to achieve low-power computation. This helps address the dark silicon problem. We also take advantage of HLS to make the proposed architecture easy to use and suitable for general-purpose computation. To evaluate the architecture, we created an evaluation framework and conducted various experiments. At the function level, experimental results show that the accelerators on the FPGA layer can reduce power and EDP by 6.9× and 7.2×, respectively, compared with an out-of-order CPU while delivering similar performance. At the application level, the accelerators on the FPGA layer can reduce power and EDP by 1.9× and 2.2×, respectively. For the entire system, this translates to 47.5% power reduction compared with the baseline system without the FPGA layer and 72.9% power reduction compared with an alternative system which has the FPGA layer replaced with an L3 cache layer. These results demonstrate that the proposed architecture is effective in achieving the power reduction goal.

REFERENCES

[1] G. Venkatesh et al., "Conservation cores: Reducing the energy of mature computations," in Proc. 15th Int. Conf. Archit. Support Program. Lang. Oper. Syst. (ASPLOS), 2010, pp. 205-218.
[2] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in Proc. 38th Annu. Int. Symp. Comput. Archit., 2011, pp. 365-376.
[3] S. Kestur, J. D. Davis, and O. Williams, "BLAS comparison on FPGA, CPU and GPU," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, Jul. 2010, pp. 288-293.
[4] D. Chen and D. Singh, "Using OpenCL to evaluate the efficiency of CPUs, GPUs and FPGAs for information filtering," in Proc. 22nd Int. Conf. Field Program. Logic Appl., Aug. 2012, pp. 5-12.
[5] E. S. Chung, P. A. Milder, J. C. Hoe, and K. Mai, "Single-chip heterogeneous computing: Does the future include custom logic, FPGAs, and GPGPUs?" in Proc. 43rd Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2010, pp. 225-236.
[6] B. Betkaoui, D. B. Thomas, and W. Luk, "Comparing performance and energy efficiency of FPGAs and GPUs for high productivity computing," in Proc. Int. Conf. Field-Program. Technol., Dec. 2010, pp. 94-101.
[7] J. R. Hauser and J. Wawrzynek, "Garp: A MIPS processor with a reconfigurable coprocessor," in Proc. 5th Annu. IEEE Symp. Field-Program. Custom Comput. Mach., Apr. 1997, pp. 12-21.
[8] C. R. Rupp et al., "The NAPA adaptive processing architecture," in Proc. IEEE Symp. FPGAs Custom Comput. Mach., Apr. 1998, pp. 28-37.
[9] S. C. Goldstein et al., "PipeRench: A coprocessor for streaming multimedia acceleration," in Proc. 26th Int. Symp. Comput. Archit., May 1999, pp. 28-39.
[10] Z. A. Ye, A. Moshovos, S. Hauck, and P. Banerjee, "CHIMAERA: A high-performance architecture with a tightly-coupled reconfigurable functional unit," in Proc. 27th Int. Symp. Comput. Archit., Jun. 2000, pp. 225-235.
[11] V. Govindaraju, C.-H. Ho, and K. Sankaralingam, "Dynamically specialized datapaths for energy efficient computing," in Proc. IEEE 17th Int. Symp. High Perform. Comput. Archit., Feb. 2011, pp. 503-514.
[12] P. Bertin, D. Roncin, and J. Vuillemin, "Introduction to programmable active memories," Paris Res. Lab., Paris, France, Tech. Rep., pp. 1-8, Jun. 1989.
[13] P. M. Athanas and H. F. Silverman, "Processor reconfiguration through instruction-set metamorphosis," Computer, vol. 26, no. 3, pp. 11-18, Mar. 1993.
[14] Cray, Inc. (2004). Cray XD1 Supercomputer. [Online]. Available: http://www.cray.com
[15] C. Chang, J. Wawrzynek, and R. W. Brodersen, "BEE2: A high-end reconfigurable computing system," IEEE Des. Test Comput., vol. 22, no. 2, pp. 114-125, Mar./Apr. 2005.
[16] M. Lavasani, H. Angepat, and D. Chiou, "An FPGA-based in-line accelerator for Memcached," Comput. Archit. Lett., vol. 13, no. 2, pp. 57-60, Jul./Dec. 2014.
[17] J. L. Tripp, M. B. Gokhale, and K. D. Peterson, "Trident: From high-level language to hardware circuitry," Computer, vol. 40, no. 3, pp. 28-37, Mar. 2007.
[18] A. Putnam, D. Bennett, E. Dellinger, J. Mason, P. Sundararajan, and S. Eggers, "CHiMPS: A C-level compilation flow for hybrid CPU-FPGA architectures," in Proc. Int. Conf. Field Program. Logic Appl., Sep. 2008, pp. 173-178.
[19] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang, "High-level synthesis for FPGAs: From prototyping to deployment," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 30, no. 4, pp. 473-491, Apr. 2011.
[20] A. Canis et al., "LegUp: An open-source high-level synthesis tool for FPGA-based processor/accelerator systems," ACM Trans. Embedded Comput. Syst., vol. 13, no. 2, Sep. 2013, Art. ID 24.
[21] Calypto Design Systems. Catapult. [Online]. Available: http://calypto.com/en/products/catapult/overview, accessed Mar. 2015.
[22] Cadence, Inc. C-to-Silicon Compiler. [Online]. Available: http://www.cadence.com/products/sd/silicon_compiler, accessed Mar. 2015.
[23] A. Papakonstantinou, K. Gururaj, J. A. Stratton, D. Chen, J. Cong, and W.-M. W. Hwu, "FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs," in Proc. IEEE 7th Symp. Appl. Specific Processors, Jul. 2009, pp. 35-42.
[24] Altera Corporation. (Nov. 2012). Implementing FPGA Design With the OpenCL Standard. [Online]. Available: http://www.altera.com/literature/wp/wp-01173-opencl.pdf
[25] Xilinx, Inc. (2014). Vivado Design Suite System Edition (Version 2014.1). [Online]. Available: http://www.xilinx.com
[26] Berkeley Design Technology, Inc. (2010). An Independent Evaluation of: The AutoESL AutoPilot High-Level Synthesis Tool. [Online]. Available: http://www.bdti.com/MyBDTI/pubs/AutoPilot.pdf
[27] C. C. Liu, I. Ganusov, M. Burtscher, and S. Tiwari, "Bridging the processor-memory performance gap with 3D IC technology," IEEE Des. Test Comput., vol. 22, no. 6, pp. 556-564, Nov./Dec. 2005.
[28] G. H. Loh, "3D-stacked memory architectures for multi-core processors," in Proc. 35th Int. Symp. Comput. Archit., Jun. 2008, pp. 453-464.
[29] D. H. Woo, N. H. Seong, D. L. Lewis, and H.-H. S. Lee, "An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth," in Proc. IEEE 16th Int. Symp. High Perform. Comput. Archit., Jan. 2010, pp. 1-12.
[30] J. Zhao, X. Dong, and Y. Xie, "An energy-efficient 3D CMP design with fine-grained voltage scaling," in Proc. IEEE Design, Autom., Test Europe Conf., Mar. 2011, pp. 1-4.
[31] J. Meng, K. Kawakami, and A. K. Coskun, "Optimizing energy efficiency of 3-D multicore systems with stacked DRAM under power and thermal constraints," in Proc. 49th ACM/EDAC/IEEE Design Autom. Conf., Jun. 2012, pp. 648-655.
[32] G. Sun, H. Yang, and Y. Xie, "Performance/thermal-aware design of 3D-stacked L2 caches for CMPs," ACM Trans. Design Autom. Electron. Syst., vol. 17, no. 2, Apr. 2012, Art. ID 13.
[33] B. Black et al., "Die stacking (3D) microarchitecture," in Proc. 39th Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2006, pp. 469-479.
[34] R. Brillu, S. Pillement, A. Abdellah, F. Lemonnier, and P. Millet, "FlexTiles: A globally homogeneous but locally heterogeneous manycore architecture," in Proc. 6th Workshop Rapid Simul. Perform. Eval., Methods Tools, 2014, Art. ID 3.
[35] Xilinx, Inc. (May 2014). Vivado Design Suite User Guide: High-Level Synthesis. [Online]. Available: http://www.xilinx.com
[36] F. Lemonnier et al., "Towards future adaptive multiprocessor systems-on-chip: An innovative approach for flexible architectures," in Proc. Int. Conf. Embedded Comput. Syst., Jul. 2012, pp. 228-235.
[37] J. Choi, K. Nam, A. Canis, J. Anderson, S. Brown, and T. Czajkowski, "Impact of cache architecture and interface on performance and area of FPGA-based processor/parallel-accelerator systems," in Proc. IEEE 20th Annu. Int. Symp. Field-Program. Custom Comput. Mach., Apr./May 2012, pp. 17-24.
[38] A. Putnam et al., "Performance and power of cache-based reconfigurable computing," in Proc. 36th Annu. Int. Symp. Comput. Archit., 2009, pp. 395-405.
[39] A.-C. Hsieh and T. Hwang, "TSV redundancy: Architecture and design issues in 3-D IC," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 4, pp. 711-722, Apr. 2012.
[40] H.-H. S. Lee and K. Chakrabarty, "Test challenges for 3D integrated circuits," IEEE Des. Test, vol. 26, no. 5, pp. 26-35, Sep. 2009.
[41] D. L. Lewis and H.-H. S. Lee, "A scan-island based design enabling pre-bond testability in die-stacked microprocessors," in Proc. IEEE Int. Test Conf., Oct. 2007, pp. 1-8.
[42] X. Dong, J. Zhao, and Y. Xie, "Fabrication cost analysis and cost-aware design space exploration for 3-D ICs," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 29, no. 12, pp. 1959-1972, Dec. 2010.
[43] D. Z. Pan et al., "Design for manufacturability and reliability for TSV-based 3D ICs," in Proc. IEEE 17th Asia South Pacific Design Autom. Conf., Jan./Feb. 2012, pp. 750-755.
[44] N. Binkert et al., "The gem5 simulator," ACM SIGARCH Comput. Archit. News, vol. 39, no. 2, pp. 1-7, Aug. 2011.
[45] M. Yu, J. Song, F. Fu, S. Sun, and B. Liu, "A fast timing-accurate MPSoC HW/SW co-simulation platform based on a novel synchronization scheme," in Proc. Int. MultiConf. Eng. Comput. Sci., Mar. 2010, pp. 1-5.
[46] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," in Proc. 42nd Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2009, pp. 469-480.
[57] C. Bienia, "Benchmarking modern multiprocessors," Ph.D. dissertation, Dept. Comput. Sci., Princeton Univ., Princeton, NJ, USA, Jan. 2011.
[58] M. Gebhart, J. Hestness, E. Fatehi, P. Gratz, and S. W. Keckler, "Running PARSEC 2.1 on M5," Dept. Comput. Sci., Univ. Texas Austin, Austin, TX, USA, Tech. Rep. TR-09-32, Oct. 2009.
[59] R. Miftakhutdinov, E. Ebrahimi, and Y. N. Patt, "Predicting performance impact of DVFS for realistic memory systems," in Proc. 45th Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2012, pp. 155-165.

Xianmin Chen received the B.S. degree in information engineering from Shanghai Jiao Tong University, Shanghai, China, in 2007, and the M.S. degrees from Shanghai Jiao Tong University and the Georgia Institute of Technology, Atlanta, GA, USA, in 2010. He is currently pursuing the Ph.D. degree in computer engineering with Princeton University, Princeton, NJ, USA. His current research interests include FinFET-based system design, reconfigurable computing, 3-D integrated circuits, and networks-on-chip.

Niraj K. Jha (S'85-M'85-SM'93-F'98) received the B.Tech. degree in electronics and electrical communication engineering from IIT Kharagpur, Kharagpur, India, in 1981, the M.S. degree in electrical engineering from SUNY at Stony Brook, Stony Brook, NY, USA, in 1982, and the Ph.D. degree in electrical engineering from the University of Illinois at Urbana-Champaign, Champaign, IL, USA, in 1985. He is currently a Professor of Electrical Engineering with Princeton University, Princeton, NJ, USA. He has co-authored and co-edited five books entitled Testing and Reliable Design of CMOS Circuits (Kluwer, 1990), High-Level Power Analysis and Optimization (Kluwer, 1998), Testing of Digital Systems (Cambridge University Press, 2003), Switching and Finite Automata Theory, Third Edition (Cambridge University Press, 2009), and Nanoelectronic Circuit Design
framework for multicore and manycore architectures, in Proc. 42nd (Springer, 2010). He has authored 15 book chapters. He has authored or
Annu. IEEE/ACM Int. Symp. Microarchitecture, Dec. 2009, pp. 469480. co-authored over 400 technical papers. He holds 14 U.S. patents. His current
[47] Micron Technology, Inc. DDR3 SDRAM System-Power Calculator. research interests include FinFETs, low power hardware/software design,
[Online]. Available: http://www.micron.com, accessed Mar. 2015. computer-aided design of integrated circuits and systems, digital system
[48] K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, testing, quantum computing, and secure computing.
and D. Tarjan, Temperature-aware microarchitecture, in Proc. 30th Prof. Jha is a fellow of the Association for Computing Machinery.
Annu. Int. Symp. Comput. Archit., Jun. 2003, pp. 213. He received the Distinguished Alumnus Award from IIT Kharagpur. He
[49] Micron Technology, Inc. 1GB DDR3 SDRAM Datasheet. [Online]. is a recipient of the AT&T Foundation Award and the NEC Preceptorship
Available: http://www.micron.com, accessed Mar. 2015 Award for research excellence, the NCR Award for teaching excellence, and
[50] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, Architecting the Princeton University Graduate Mentoring Award. A paper of his was
efficient interconnects for large caches with CACTI 6.0, IEEE Micro, selected for The Best of ICCAD: A collection of the best IEEE International
vol. 28, no. 1, pp. 6979, Jan./Feb. 2008. Conference on Computer-Aided Design papers of the past 20 years, two
[51] International Roadmap Committee. (2012). International Technology papers by the IEEE Micro Magazine as one of the top picks from the 2005 and
Roadmap for Semiconductors (2012 Update). [Online]. Available: 2007 Computer Architecture Conferences, and two others as being among the
http://www.itrs.net most influential papers of the last ten years at the IEEE Design Automation
[52] D. H. Kim, S. Mukhopadhyay, and S. K. Lim, Fast and accurate and Test in Europe Conference. He has received nine other best paper awards
analytical modeling of through-silicon-via capacitive coupling, IEEE and six best paper award nominations. He has served as the Editor-in-Chief
Trans. Compon., Packag., Manuf. Technol., vol. 1, no. 2, pp. 168180, of the IEEE T RANSACTIONS ON VLSI S YSTEMS and an Associate Editor
Feb. 2011. of the IEEE T RANSACTIONS ON C IRCUITS AND S YSTEMS I AND II, the
[53] Xilinx, Inc. (Mar. 2014). 7 Series FPGAs Packaging and Pinout (v1.11). IEEE T RANSACTIONS ON VLSI S YSTEMS , the IEEE T RANSACTIONS
[Online]. Available: http://www.xilinx.com ON C OMPUTER -A IDED D ESIGN , and the Journal of Electronic Testing:
[54] Semiconductor Industries Association. (2008). International Technology Theory and Applications. He serves as an Associate Editor of the IEEE
Roadmap for Semiconductors (2008 Update). [Online]. Available: T RANSACTIONS ON C OMPUTERS , the Journal of Low Power Electronics,
http://http://www.itrs.net and the Journal of Nanotechnology. He served as the Program Chairman
[55] Z. Chishti and T. N. Vijaykumar, Optimal power/performance pipeline of the 1992 Workshop on Fault-Tolerant Parallel and Distributed Systems,
depth for SMT in scaled technologies, IEEE Trans. Comput., vol. 57, the 2004 International Conference on Embedded and Ubiquitous Computing,
no. 1, pp. 6981, Jan. 2008. and the 2010 International Conference on VLSI Design. He has served as the
[56] Y. Hara, H. Tomiyama, S. Honda, and H. Takada, Proposal and quan- Director of the Center for Embedded System-on-a-Chip Design funded by
titative analysis of the CHStone benchmark program suite for practical the New Jersey Commission on Science and Technology. He has served on
C-based high-level synthesis, J. Inf. Process., vol. 17, pp. 242254, the program committees of more than 150 conferences and workshops. He has
Oct. 2009. given several keynote speeches in the area of nanoelectronic design and test.