
2008 International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems

A PLD Architecture for High Performance Computing


Masayuki Sato
Graduate School of Engineering
Tokyo Metropolitan University
Tokyo, Japan
m sato@iel.sd.tmu.ac.jp

Naoki Hirakawa, Masanori Yoshihara, Kazuya Tanigawa, Tetsuo Hironaka
Graduate School of Information Sciences
Hiroshima City University
Hiroshima, Japan
mpld@csys.ce.hiroshima-cu.ac.jp

MPLD for HPC are shown and the MPLD architecture is
presented. After that, the structure of the prototype MPLD chip
for evaluation is described and the evaluation results are
shown. Finally, we conclude our paper.

Abstract-In recent years, Field Programmable Gate Arrays
(FPGAs) have been used for High Performance Computing
(HPC). Because there is a significant difference between the
configuration speed of an FPGA and the execution speed of a Central
Processing Unit (CPU), the difference causes performance
degradation. To resolve this problem, we proposed MPLD
as a new Programmable Logic Device (PLD) architecture with
high speed reconfiguration. The merits of the MPLD in HPC
are high speed configuration and easy partial configuration.
This is achieved by a configuration method which is the same as
the write memory access of a conventional parallel memory. In this
paper, we describe the problems of using FPGAs in HPC,
and present the MPLD architecture which solves the problems.
Some evaluation results of the prototype MPLD chip, which
was implemented using ROHM 0.18 µm CMOS technology with
five metal layers, are also presented. As results, the memory capacity of
the prototype MPLD was 49152 bit, the core area was
1767.54 × 1690.96 µm², and the number of metal layers used for
wiring was three. The achieved configuration time is about
6.6 µsec for the whole prototype MPLD. The configuration speed
of the prototype MPLD is about 11.7 times higher than the AS
configuration used for Altera FPGAs.

II. PROBLEMS OF HPC USING FPGAS

Conventional HPC achieves high speed, large scale calculation
by using multiple CPUs for parallel computing. However,
the cost of such HPC is high. To resolve this problem, HPC with
FPGAs as reconfigurable devices has attracted attention in the past few
years. Using FPGAs, which can be configured as various types of circuits,
in HPC has the possibility of reducing the area,
the power consumption and the cost of HPC. An FPGA
executes some part of an application at hardware speed, so it
achieves higher speed execution than conventional HPC. However,
an FPGA needs configuration time before it can execute the application.
This means the performance of HPC with FPGAs depends on the
configuration time. The significance of the configuration time of the
FPGA is explained in the following subsection.

Keywords-PLD; FPGA; MPLD; high speed configuration; easy partial configuration;

A. Significance of Configuration Time on HPC

Figure 1 shows the simplest model of an HPC system with an
FPGA. In Figure 1, this model consists of a CPU, an I/O controller,
memory and an FPGA. In the configuration of the FPGA, configuration data stored in the memory is transported from the memory
to the FPGA by the I/O controller. The memory has two regions. One is
for execution of the CPU; the other is for preservation of configuration data. The significance of configuration time on HPC
is shown in Figure 2. In case the configuration needs
less time than the execution time of the CPU, hiding the performance
degradation is possible. However, in case the configuration needs
more time than the execution time of the CPU, hiding the performance
degradation is difficult.
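The condition above can be illustrated with a small timing model. This is only a sketch under an assumption that is not stated in the paper: configuration of the FPGA is taken to overlap completely with the preceding CPU execution, so only the part of the configuration time that exceeds the CPU execution time is exposed. The function names and the example numbers are hypothetical.

```python
def exposed_config_time(cpu_exec_ns: float, config_ns: float) -> float:
    """Configuration time that cannot be hidden behind CPU execution.

    ASSUMPTION: configuration runs fully in parallel with the CPU phase.
    """
    return max(0.0, config_ns - cpu_exec_ns)


def total_time(cpu_exec_ns: float, config_ns: float, fpga_exec_ns: float) -> float:
    """CPU phase, then any exposed configuration time, then the FPGA phase."""
    return cpu_exec_ns + exposed_config_time(cpu_exec_ns, config_ns) + fpga_exec_ns


if __name__ == "__main__":
    # Configuration shorter than CPU execution: fully hidden.
    print(exposed_config_time(cpu_exec_ns=100.0, config_ns=60.0))
    # Configuration longer than CPU execution: the excess stalls the system.
    print(exposed_config_time(cpu_exec_ns=100.0, config_ns=250.0))
```

In this model a faster configuration method helps only until the configuration time drops below the CPU execution time; after that point the remaining cost is fully hidden.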

I. INTRODUCTION
Field Programmable Gate Arrays (FPGAs) have been used
for implementing various types of logic functions. Especially, with the appearance of high performance FPGAs and PLDs
like the Virtex series of Xilinx[1] and the Stratix series of Altera[2],
research on High Performance Computing using FPGAs is
becoming a more exciting field than before. In HPC,
the FPGA is used as an accelerator of the Central Processing
Unit (CPU). However, there is a significant performance gap
between FPGA and CPU caused by the difference between
the configuration speed of the FPGA and the execution speed of the
CPU. To resolve this problem, we proposed MPLD as a new
Programmable Logic Device (PLD) architecture with high
speed reconfiguration[3]. The merits of the MPLD in HPC
are high speed configuration and easy partial configuration.
This is achieved by a configuration method which is the same
as the write memory access of a conventional parallel memory.
In this paper, the problem of FPGA on HPC and the
structure of FPGA are described at first. Next, merits of
0-7695-3770-7/70 $25.00 3770 IEEE
DOI 10.1109/IWIA.2008.12

III. FPGA
To clarify the usage of MPLD, the structure of the FPGA is
described and the problems of the FPGA are shown here. Figure 3
shows the basic structure of the FPGA. In the figure, the FPGA
consists of LBs, CBs, SBs and IOBs[4].
Logic Block (LB)
An LB is a fundamental element for logic. There are
Figure 1. The Simplest Model of HPC with FPGA

Figure 2. Significance of Configuration Time on HPC

Figure 3. Basic Structure of the Conventional FPGA (labels: I/O Block (IOB), Switch Block (SB), Logic Block (LB), Connection Block (CB))

the method of configuration. The configuration method of the
FPGA is serial configuration, and partial reconfiguration is
difficult because of the structure of the configuration memory.
The configuration memory is connected as a daisy
chain. Because of this structure of the configuration memory,
the entire area of the FPGA has to be reconfigured even for partial
reconfiguration of the FPGA.
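The cost difference between the two configuration styles can be sketched as follows. This is a toy model, not a description of any real FPGA bitstream format: the daisy chain is modeled as a plain shift register that takes one bit per clock, and the memory-style configuration as a word-addressed array. Both function names are hypothetical.

```python
def daisy_chain_reconfigure(new_config):
    """Shift a complete bitstream through a serial chain.

    Returns the final chain contents and the number of clock cycles used.
    Every cell is rewritten, even if only one bit differs from before.
    """
    chain = [0] * len(new_config)
    cycles = 0
    for bit in reversed(new_config):  # the bit for the last cell enters first
        chain = [bit] + chain[:-1]    # every stored bit moves one cell along
        cycles += 1
    return chain, cycles


def addressed_reconfigure(memory, updates):
    """Write only the changed words, as in an ordinary memory write access.

    `updates` maps word address -> new word. Returns the number of writes.
    """
    for address, word in updates.items():
        memory[address] = word
    return len(updates)


if __name__ == "__main__":
    target = [0] * 1024
    target[5] = 1                      # a single changed configuration bit
    _, cycles = daisy_chain_reconfigure(target)
    writes = addressed_reconfigure([0] * 64, {5: 0xFF})
    print(cycles, writes)
```

In the serial model the cost grows with the size of the whole chain, while in the addressed model it grows only with the number of changed words.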

various types of structure like the Configurable Logic
Block (CLB), the Logic Element (LE) and so on. CLBs and
LEs are implemented by using the LUT method or the MUX
method. The LUT method uses SRAM cells.
In this method, the SRAM cells store the truth table of the
logic. Input data refer to the truth table in the SRAM
cells and output data are obtained. The MUX method
uses combinational circuits like AND, OR
and MUX. A small combinational circuit which consists
of AND, OR and MUX is called a fundamental gate.
A fundamental gate is not programmable, but the wires
which connect the fundamental gates are programmable.
Using the programmable wires achieves the functions of the
LB.
Connection Block (CB)
A CB programmably connects the wires between LBs and SBs.
Switch Block (SB)
An SB routes inputs to outputs in arbitrary directions. An SB
consists of programmable switches.
I/O Block (IOB)
An IOB is an interface between an external I/O pin and the
internal logic. Each IOB controls one I/O pin and makes the
I/O pin behave as input, output or input-output.

IV. MPLD ARCHITECTURE

Figure 4. Basic Structure of MPLD

We proposed the MPLD architecture as a replacement of the
FPGA for HPC. MPLD is a PLD architecture with high
speed partial reconfiguration. The merits of MPLD are the following.
• Configuration is fast because the configuration
method of MPLD is the same as the write access of a
conventional parallel memory.
• Partial reconfiguration is easy and fast because the
method of configuration is the same as the write access of a
conventional parallel memory.
• MPLD can be reconfigured while acting as a logic
circuit because MPLD is based on 2-port memory.

A. Problems of FPGA
The Switch Matrix consists of SBs and CBs and achieves the
flexibility of wiring on the FPGA. However, Figure 3 shows the problem
that the area of the Switch Matrix, which includes the CBs and SBs, is
huge, which means the cost of the FPGA is not low. As
another problem, configuration is not fast because of

• MPLD can behave as a conventional parallel memory
because of the structure of MPLD.
The basic structure of MPLD is shown in Figure 4. In the
figure, MPLD consists of MLUTs, which are the fundamental
elements of MPLD. In the following subsections, we describe
the basic structure and usage of MPLD.

A. Structure of MLUT

Figure 7. MLUT with 4 Address-Data pairs

Figure 7 is an example of an MLUT with 4 Address-Data pairs.
Figure 8 shows the basic structure of MPLD, which consists
of MLUTs with 4 Address-Data pairs. In Figure 8, the
Address-Data pairs are connected to the MLUTs at the upper,
lower, right and left sides, respectively.
Figure 5. 2-port memory

The fundamental element of MPLD is the MLUT, which has the following functions.
• LUT
• Switch Matrix
• Memory
To meet these requirements, the basic design of the MLUT is
based on 2-port memory. Since a 2-port memory can be
read and written simultaneously, the functions of
LUT and memory can be realized simultaneously. Figure
5 shows the conventional 2-port memory. In Figure 5, let us
assume that input LAD and output LDATA of port 1 are used
for logic, and input MAD and MDATA of port 2 are used for
memory access. The key idea to achieve the requirements of
the MLUT is to pair LAD and LDATA as in Figure 6. In Figure
6, each bit of LAD and LDATA are grouped as Address-Data
pairs, such as a pair of LAD1 and LDATA1, a pair of
LAD2 and LDATA2, ..., and a pair of LADN and LDATAN.
Each Address-Data pair works as an I/O port for the MLUT.

Figure 8. Basic Structure of MPLD

B. Behavior of MLUT
To explain the behavior of the MLUT, the MLUT with 4
Address-Data pairs shown in Figure 7 is used as an example.
Configuration
Figure 9 shows the wires for configuration in the MLUT. Since the
method of configuration is the same as the write access of a
conventional parallel memory, MAD can be used to
select the MLUT to reconfigure, with the data provided
from MDATA as in a memory access. The configuration of the
MLUT is provided as a truth table, like the conventional LUT.
Behavior as logic circuit
Let us assume the MLUT behaves as the logic circuit in
Figure 10. The logic circuit in Figure 10 is expressed
as a truth table. The truth table of this logic circuit
is shown in Table I. In Table I, each Input and Output
corresponds to a bit of the address and data ports
of the conventional parallel memory. On configuration,
this truth table is written to the MLUT, then the MLUT

Figure 6. Address-Data pair

Configuring this truth table into the MLUT, the MLUT
works as shown in Figure 13.

Figure 9. I/O to configure MLUT

works as the logic circuit shown in Figure 10. Figure
11 shows the MLUT after configuration. In the figure,
LAD works as the input of the truth table, and LDATA
works as the output of the truth table.

Figure 10. Behavior as Logic

Figure 11. Behavior as Logic

Table I
TRUTH TABLE FOR LOGIC
Input               | Output
LAD0 LAD1 LAD2 LAD3 | LDATA0 LDATA1 LDATA2 LDATA3
0    0    0    0    | 0      *      *      *
0    0    0    1    | 0      *      *      *
0    0    1    0    | 0      *      *      *
0    0    1    1    | 0      *      *      *
0    1    0    0    | 0      *      *      *
0    1    0    1    | 0      *      *      *
0    1    1    0    | 0      *      *      *
0    1    1    1    | 1      *      *      *
1    0    0    0    | 0      *      *      *
1    0    0    1    | 0      *      *      *
1    0    1    0    | 0      *      *      *
1    0    1    1    | 1      *      *      *
1    1    0    0    | 0      *      *      *
1    1    0    1    | 0      *      *      *
1    1    1    0    | 0      *      *      *
1    1    1    1    | 1      *      *      *

Figure 12. Switch Matrix

Table II
TRUTH TABLE FOR SWITCH MATRIX
Input               | Output
LAD0 LAD1 LAD2 LAD3 | LDATA0 LDATA1 LDATA2 LDATA3
0    0    0    0    | 0      0      0      0
0    0    0    1    | 1      0      0      0
0    0    1    0    | 0      0      0      1
0    0    1    1    | 1      0      0      1
0    1    0    0    | 0      0      1      0
0    1    0    1    | 1      0      1      0
0    1    1    0    | 0      0      1      1
0    1    1    1    | 1      0      1      1
1    0    0    0    | 0      1      0      0
1    0    0    1    | 1      1      0      0
1    0    1    0    | 0      1      0      1
1    0    1    1    | 1      1      0      1
1    1    0    0    | 0      1      1      0
1    1    0    1    | 1      1      1      0
1    1    1    0    | 0      1      1      1
1    1    1    1    | 1      1      1      1

Figure 13. Behavior as Switch Matrix in MLUT
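The two uses of the MLUT can be reproduced with a toy software model. The sketch below treats an MLUT with 4 Address-Data pairs as a 16-word by 4-bit memory, which is what Tables I and II imply; the class and function names are hypothetical, and the don't-care entries ('*') of Table I are stored as 0 here purely for convenience.

```python
class MLUT:
    """Toy model of an MLUT with 4 Address-Data pairs (16 words x 4 bits)."""

    def __init__(self):
        self.words = [[0, 0, 0, 0] for _ in range(16)]

    def write(self, mad, mdata):
        """Configuration port: an ordinary memory write (MAD selects the word)."""
        self.words[mad] = list(mdata)

    def logic(self, lad):
        """Logic port: (LAD0, LAD1, LAD2, LAD3) is the address, LAD0 being the
        most significant bit; the stored word is (LDATA0..LDATA3)."""
        address = lad[0] * 8 + lad[1] * 4 + lad[2] * 2 + lad[3]
        return tuple(self.words[address])


def all_inputs():
    """All 16 input combinations in the row order of Tables I and II."""
    return [(a, b, c, d) for a in (0, 1) for b in (0, 1)
            for c in (0, 1) for d in (0, 1)]


def configure(mlut, truth_table):
    """Write a complete truth table with 16 memory write accesses."""
    for address, lad in enumerate(all_inputs()):
        mlut.write(address, truth_table(lad))


# LDATA0 column of Table I, top row first.
TABLE_I_LDATA0 = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]


def table_one(lad):
    """Table I: only LDATA0 is defined; don't-cares are stored as 0."""
    address = lad[0] * 8 + lad[1] * 4 + lad[2] * 2 + lad[3]
    return (TABLE_I_LDATA0[address], 0, 0, 0)


def table_two(lad):
    """Table II: LAD0->LDATA1, LAD1->LDATA2, LAD2->LDATA3, LAD3->LDATA0."""
    lad0, lad1, lad2, lad3 = lad
    return (lad3, lad0, lad1, lad2)


if __name__ == "__main__":
    mlut = MLUT()
    configure(mlut, table_two)
    print(mlut.logic((1, 0, 0, 0)))   # the input on LAD0 appears on LDATA1
    configure(mlut, table_one)        # same memory, now used as logic
    print(mlut.logic((0, 1, 1, 1))[0])
```

The same write port is used in both cases; only the stored contents decide whether the element acts as logic or as wiring.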

C. Compilation of MPLD
Currently, the compiler environment for MPLD is under
development. The structure of MPLD is different from that of the
FPGA, so not all conventional methods of compiling to FPGA
can be used for MPLD. Figure 14 shows the conventional
method of compiling to FPGA. To implement an application
on an FPGA, an HDL description of the algorithm is written
first. After functional simulation, the HDL description is
transformed to a netlist by synthesis. From the
netlist, mapping, placing and routing are done. After that,
the implementation information is transported to the FPGA as a
bitstream. Synthesis includes independent processes and
dependent processes. The processes that follow the dependent processes of synthesis are also dependent processes.
An independent process does not depend on the technology.
This means the conventional process can be used in the

Switch Matrix
Assuming the MLUT behaves as the switch matrix
in Figure 12, the switch matrix in Figure 12 is
expressed as a truth table. The truth table of this
switch matrix is shown in Table II. In Table II, Inputs
LAD0, LAD1, LAD2 and LAD3 correspond to
Outputs LDATA1, LDATA2, LDATA3 and LDATA0, respectively.

Figure 14. Configuration to FPGA

independent process. A dependent process depends on the
technology. This means the dependent process differs between
types of devices. For implementing compilation of MPLD,
the dependent processes have to be considered. Dependent processes include a part of synthesis, mapping, placing and
routing. The FPGA uses LBs for LUTs and CBs and SBs for the Switch
Matrix, while the MLUT, which is the fundamental element of
MPLD, behaves as both LUT and Switch Matrix to implement
logic circuits. Because the element for Switch Matrix
and LUT is the same in MPLD, the current routing algorithms for
FPGA cannot be adopted for MPLD. Besides, in case a lot of
MLUTs in MPLD are used as Switch Matrix, the wires between
MLUTs become crowded and degrade the efficiency
of wiring. So, researching methods for reducing the wires
between MLUTs is important.

Figure 15. Structure of Prototype MPLD

Figure 16. Connection of MLUTs with 6 Address-Data pairs

V. DESIGN OF PROTOTYPE MPLD

To evaluate the MPLD architecture, a prototype MPLD was
designed[5]. Figure 15 shows the structure of the prototype
MPLD. In Figure 15, the prototype MPLD consists of a 16x4
MLUT array, decoders for configuration and D-FFs which
make it possible for the MPLD to behave as a sequential circuit.
Each MLUT of the prototype MPLD has six Address-Data
pairs. Table III shows the I/O of the prototype MPLD. In the
table, LAD is 24 bit, LDATA is 24 bit, MAD is 8 bit, and
MDATA is 48 bit. Each MLUT neighboring a
D-FF has its output connected to the D-FF, and also has
its input connected to the output of the D-FF, so the MLUT
can feed back its output to itself. Each component of the
prototype MPLD is described in the following subsections.

A. MLUT
Figure 17 shows the structure of the MLUT with six Address-Data
pairs. As seen in Figure 17, the MLUT with six Address-Data
pairs consists of six 2-port memory blocks, a decoder for
logic and MUXs. Each memory block is implemented as
a 16x8 bit 2-port memory. This means that the prototype
MPLD has extra memory capacity for single cycle dynamic
context switch. So, the capacity of this MLUT is twice
the context size.
decoder for logic
The function of the decoder for logic is selecting a row. In
Figure 17, the input for logic is used as the address to select the
row. SEL is for context switch.
MUX
The MUX selects the output from the 2-port memory block. Figure 18 shows the structure of the 8 to 1 MUX. In the figure,

Table III
I/O OF PROTOTYPE MPLD
Name   I/O           Explanation
MAD    Input         Input for configuration
MDATA  Input/Output  Input/output for configuration
LAD    Input         Input for logic
LDATA  Output        Output for logic
WE     Input         Input to control writing
RE     Input         Input to control reading
PRE    Input         Input to control precharge
SEL    Input         Input for context switch
CLK    Input         Clock for D-FF

cells. Because the MLUT has SEL for context switch, the 2-port
memory block has two areas for configuration.
Figure 19 shows the structure of the 2-port SRAM cell in
the MLUT and Table IV shows the I/O of the 2-port SRAM
cell. BL is the input/output for memory, qBL is the inverse
signal of BL and qRL is the output for logic. In this 2-port
SRAM cell, one nMOS transistor is removed
compared with the conventional 2-port SRAM cell. This
is because the logic function of the MLUT needs only an output for
logic.
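The role of SEL can be sketched with a toy model of one 16x8 memory block. The paper states only that the block has two areas for configuration and that the MLUT capacity is twice the context size; how the six logic inputs and SEL are divided between row selection and the 8 to 1 MUX is an assumption made here for illustration (SEL picks one half of the 16 rows, the upper three logic inputs pick the row inside that half, and the lower three pick the column). All names are hypothetical.

```python
class TwoContextBlock:
    """16-row x 8-column block holding two 64-entry truth tables.

    ASSUMED layout: rows 0-7 belong to context 0, rows 8-15 to context 1.
    """
    ROWS = 16
    COLUMNS = 8

    def __init__(self):
        self.cells = [[0] * self.COLUMNS for _ in range(self.ROWS)]

    def write_row(self, row, bits):
        """Configuration port: write one 8-bit row."""
        if len(bits) != self.COLUMNS:
            raise ValueError("a row holds exactly 8 bits")
        self.cells[row] = list(bits)

    def read_logic(self, sel, lad):
        """Logic port: `lad` is the 6-bit logic input as an integer 0..63."""
        row = sel * 8 + (lad >> 3)   # ASSUMPTION: SEL is the top row bit
        column = lad & 0b111         # ASSUMPTION: low 3 bits drive the MUX
        return self.cells[row][column]


def load_context(block, sel, truth_table):
    """Fill one context (8 rows) from a function of the 6-bit input."""
    for r in range(8):
        block.write_row(sel * 8 + r, [truth_table(r * 8 + c) for c in range(8)])


if __name__ == "__main__":
    block = TwoContextBlock()
    load_context(block, 0, lambda lad: int(lad == 63))  # 6-input AND
    load_context(block, 1, lambda lad: int(lad != 0))   # 6-input OR
    # The same input gives different outputs depending only on SEL.
    print(block.read_logic(0, 0b000001), block.read_logic(1, 0b000001))
```

Under this layout a context switch needs no memory write at all, which is consistent with the single cycle dynamic context switch mentioned for the prototype.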

Figure 17. Structure of MLUT with 6 Address-Data pairs (labels: Input for Logic, Decoder for Logic, 2-port memory block (16*8), MUX (IN0 to IN7), Output for Logic (6bit), SEL, I/Os for Configuration (8bit * 6*2), Input from decoders for configuration)

Figure 19. 2-port SRAM cell (labels: WL, BL, qBL, RE, qRL)

Table IV
I/O OF 2-PORT SRAM CELL
Name  I/O           Explanation
WL    Input         Input for configuration
BL    Input/Output  Input/output for configuration
qBL   Input-output  Opposite input/output of BL
LE    Input         Input for logic
qRL   Output        Output for logic

B. Row Decoder
The row decoder behaves in the same way as the decoders of a conventional parallel memory to select a row. Figure 20 shows the
structure of the row decoder. The address input of the row decoder is
divided into an upper address and a lower address. The upper address
is used for selecting the MLUT. The lower address is used for
selecting the internal row of the MLUT.
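The address split can be sketched as below. The paper does not state how many bits belong to each part; the 4-bit lower address used here is an assumption derived from the 16 rows of each memory block, and the function names are hypothetical.

```python
LOWER_BITS = 4  # ASSUMPTION: 16 internal rows per MLUT -> 4 lower address bits


def decode_row_address(address, lower_bits=LOWER_BITS):
    """Split a row address into (upper, lower).

    upper selects the MLUT, lower selects the internal row of that MLUT.
    """
    upper = address >> lower_bits
    lower = address & ((1 << lower_bits) - 1)
    return upper, lower


def one_hot(index, width):
    """Row-select lines of a decoder: exactly one of `width` lines is active."""
    return [1 if i == index else 0 for i in range(width)]


if __name__ == "__main__":
    upper, lower = decode_row_address(0xA7)
    print(upper, lower)
    lines = one_hot(0xA7, 256)
    print(lines.index(1) == upper * 16 + lower)
```

With this split, rewriting one MLUT means keeping the upper address fixed and stepping the lower address through its range, which is how partial reconfiguration reduces to a short burst of ordinary writes.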

the MUX has eight inputs and one output. A 3
bit input signal is decoded to 8 bits, which are used
as control signals. One of IN0 to IN7 is selected by C0 to C7, and the
output DX is amplified by an inverter.
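The selection scheme described above can be written as a short behavioral sketch: the 3-bit select value is first decoded to eight one-hot control signals, and each control signal gates one input. The function names are hypothetical, and the output inverter is not modeled since it only restores signal strength.

```python
def decode_3_to_8(select):
    """Decode a 3-bit value into the one-hot control signals C0..C7."""
    if not 0 <= select <= 7:
        raise ValueError("select must fit in 3 bits")
    return [1 if i == select else 0 for i in range(8)]


def mux_8_to_1(inputs, select):
    """Pass the input IN0..IN7 whose control signal is active."""
    if len(inputs) != 8:
        raise ValueError("an 8 to 1 MUX needs exactly 8 inputs")
    controls = decode_3_to_8(select)
    output = 0
    for in_bit, control in zip(inputs, controls):
        output |= in_bit & control   # only the enabled input reaches DX
    return output


if __name__ == "__main__":
    row = [0, 1, 0, 0, 1, 0, 0, 1]   # one 8-bit row read from a memory block
    print([mux_8_to_1(row, s) for s in range(8)])
```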

C. Column Decoder
Figure 21 shows the structure of the column decoder. In the
figure, the column decoder consists of a decoder to select the column
and R/W units for reading and writing data. Figure 22 shows
the structure of the R/W unit. In the figure, the R/W unit consists
of a CM unit, a PRE unit, a Read unit and a Write unit. The functions
of each unit are the following.
CM Unit
The CM Unit is a sense amplifier. The sense amplifier is a current
mirror type and amplifies the voltage of BL.
PRE Unit
The PRE Unit controls precharge. The PRE Unit adopts the
VD/2 precharge method and precharges BL and qBL to
VD/2.

Figure 18. 8 to 1 MUX (labels: IN0 to IN7, C0 to C7, DX)

2-port memory block
The 2-port memory block consists of 16x8 2-port SRAM

Figure 20. Row decoder

Figure 21. Column Decoder

Figure 22. Basic Block of R/W unit

and the number of metal layers used for wiring was three.
From the evaluation results, the latency of a memory write access was
12.8 nsec. This means that the configuration speed of MPLD
is about 78.1 MHz because it depends on the memory write
access speed. Since configuration of each MLUT requires
sixteen memory write accesses and the prototype
MPLD consists of 64 MLUTs, the achieved configuration
time is about 6.6 µsec for the whole prototype MPLD. The transfer
rate of the configuration data in the prototype MPLD
is 48 bit / 12.8 nsec = 3.75 Gbit per second (Gbps). The transfer
rate of the configuration data in Altera FPGAs using
Active Parallel(AS) configuration is 16 bit / 50 nsec = 0.32
Gbps[6]. So, the configuration speed of the prototype MPLD
is about 11.7 times higher than the AS configuration used for
Altera FPGAs.
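The arithmetic behind these figures can be checked with a few lines. The constants below are taken from the text (12.8 nsec write latency, 48 bit MDATA, 16 bit per 50 nsec for the FPGA, 64 MLUTs, six 16x8 blocks per MLUT); the variable and function names are hypothetical, and the number of write accesses is left as a parameter rather than fixed.

```python
WRITE_LATENCY_NS = 12.8      # latency of one memory write access
MDATA_BITS = 48              # bits written per access
FPGA_BITS, FPGA_PERIOD_NS = 16, 50.0

# One write every 12.8 ns expressed as a rate in MHz.
write_rate_mhz = 1e3 / WRITE_LATENCY_NS

# Bits per nanosecond equals Gbit per second.
mpld_gbps = MDATA_BITS / WRITE_LATENCY_NS
fpga_gbps = FPGA_BITS / FPGA_PERIOD_NS
speedup = mpld_gbps / fpga_gbps

# Memory capacity: 64 MLUTs x 6 blocks x (16 x 8) bits.
capacity_bits = 64 * 6 * 16 * 8


def write_time_us(n_writes, latency_ns=WRITE_LATENCY_NS):
    """Time for n back-to-back write accesses, in microseconds."""
    return n_writes * latency_ns / 1e3


if __name__ == "__main__":
    print(f"write rate  : {write_rate_mhz:.1f} MHz")
    print(f"MPLD rate   : {mpld_gbps:.2f} Gbps")
    print(f"FPGA rate   : {fpga_gbps:.2f} Gbps")
    print(f"speedup     : {speedup:.1f}x")
    print(f"capacity    : {capacity_bits} bit")
```

The capacity computed this way matches the 49152 bit reported for the prototype, and the ratio of the two transfer rates reproduces the factor of about 11.7.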

Read Unit
The Read Unit consists of a 3-state buffer and a powerful buffer
to amplify the output signal for reading.
Write Unit
The Write Unit consists of a 3-state buffer and a powerful
buffer to amplify the input signal for writing.
VI. EVALUATION

Table V
EVALUATION RESULTS
Behavior          Latency
Read              16.4 nsec
Write             12.8 nsec
32bit Counter     9.35 nsec
32bit Full Adder  121.6 nsec

Figure 23. Prototype MPLD

VII. CONCLUSIONS AND FUTURE WORK

In this paper, we presented the MPLD architecture as a low
cost PLD with high speed partial reconfiguration. MPLD
does not use a switch matrix, which means it can be implemented at low cost. Since the method of configuration is the
same as the write access of a conventional parallel memory,
the configuration speed is faster than that of the conventional
FPGA and, at the same time, fine grain partial reconfiguration
is possible too.
In the future, we plan to improve the structure of MPLD

We implemented a prototype MPLD using ROHM 0.18 µm CMOS
technology with five metal layers, and confirmed its functions as memory and PLD
by configuring it as a 32 bit counter as an example application. Evaluation results are shown in Figure 23 and Table
V. As results, the memory capacity of the prototype MPLD
was 49152 bit, and the core area was 1767.54 × 1690.96 µm²

from the results of the prototype MPLD chip and also develop
the compiler of the MPLD for practical use.

Acknowledgement
The VLSI chip in this study has been fabricated in the
chip fabrication program of VLSI Design and Education
Center(VDEC), the University of Tokyo in collaboration
with Rohm Corporation and Toppan Printing Corporation.
REFERENCES
[1] http://www.xilinx.com/
[2] http://www.altera.com/
[3] Naoki Hirakawa, Masanori Yoshihara, Masayuki Sato, Kazuya
Tanigawa and Tetsuo Hironaka, "Low Cost PLD with High
Speed Partial Reconfiguration," ITC-CSCC 2008, July 6-9,
2008, to appear
[4] Toshinori Sueyosi, Hideharu Amano, "Reconfigurable System," Ohmsya (in Japanese), 2005
[5] Masanori Yoshihara, Naoki Hirakawa, Kazuya Tanigawa,
Tetsuo Hironaka and Masayuki Sato, "Implementation of
Memory(MPLD) with the Ability to Work as a Reconfigurable Device," IEICE Technical Report RECONF2007-16 (in
Japanese), pp.7-12, 2007
[6] http://www.altera.com/support/devices/conguration/cfgcompare.html
