A Dual-Threaded Architecture For Interval Arithmetic Coprocessor With Shared Floating Point Units


Virgil E. Petcu, Alexandru Amaricai, Mircea Vladutiu
Department of Computer Science and Engineering
University Politehnica of Timisoara
virgil.e.petcu@gmail.com; alexandru.amaricai@cs.upt.ro; mvlad@cs.upt.ro
Abstract - This paper presents a new type of coprocessor architecture suited for both conventional floating point and interval arithmetic. The coprocessor is composed of two logical processors (LPs). The floating point units are shared between these two LPs in order to reduce the area overhead. Some functional units implement two or more operations (for example, the multiply-add fused (MAF) unit can be used for addition, multiplication or multiply-add fused). The set of functional units can thus help reduce the number of structural hazards and increase the resource utilization (for example, if addition occurs on both LPs, one can be executed on the adder, while the other on the MAF). In order to further reduce the data and structural hazards, a scheduler for this architecture is also proposed.

Keywords - Simultaneous multi-threading, interval arithmetic, parallel architectures

I. INTRODUCTION

Interval arithmetic represents a more reliable alternative to conventional floating point (FP) arithmetic. Applications in a wide range of fields have been developed over the last decades. Several approaches for dedicated interval arithmetic units have been proposed, like the ones in [1][5]. This paper presents a coprocessor architecture with support for interval arithmetic which can be used for improving the performance of these applications.

The basic architecture is that of a dual-threaded processor. The architecture of each LP is similar to a MIPS type architecture [3]. We chose this architecture to take advantage of the particular structure of the functional unit set. Each of the processor's threads has in-order issue, and each thread potentially issues an instruction every clock cycle. Thus, the difficulty of identifying data hazards is the same as in a standard single-threaded in-order issue RISC, yet the potential IPC is double.

The functional units of this processor are: one adder, one multiplier, one multiply-add fused (MAF), one comparator and one divide-add fused (DAF). Some units can perform multiple operations: the multiplier can also be used for comparisons; the MAF can implement additions and multiplications, while the DAF can be used for divisions and additions. By having 3 FUs that can execute an addition, and 2 that can execute comparison or multiplication, we can reduce the number of structural hazards and increase resource utilization. Since the latencies of the addition and multiplication are different on different FUs, a complication in the scheduling scheme appears. We try to solve this problem using a new scheduling technique. This technique is able to identify all the available FU parallelism, at the FU pipeline stage level. The novelty of the scheduling technique consists in the ability to dynamically verify whether the FU resources needed for a certain instruction are free. The availability of those resources varies dynamically with the instructions currently executed on the FU's micro FUs.

This paper is organized as follows: in Section II the overall architecture is presented, Section III is dedicated to the scheduling mechanism, while in Section IV the floating point units are presented. The last section is dedicated to the concluding remarks.

II. OVERALL ARCHITECTURE

The goal was to devise an architecture that can balance high functional unit usage and low overhead. To achieve this goal we chose a baseline simultaneous dual-threaded architecture similar to [4]. Simultaneous multi-threading (SMT) and its advantages are presented in [7][9][10]. The basic idea that is of interest to us is SMT's ability to raise functional unit usage (and hence instructions per cycle) by issuing instructions from multiple threads in the same clock cycle. This feature increases the level of parallelism available.

As shown in Fig. 1, the physical processor is made up of two logical processors (LPs), each having its own pipeline. The issue logic and functional units (FUs) are shared between the logical processors. Each LP features in-order fetch and issue of at most one instruction per cycle. An independent thread runs on each logical processor. In our context, by thread we understand a contiguous set of FP and/or interval arithmetic instructions. Inside the coprocessor, the independence of the threads is implicit since the two LPs have separate register files and the arithmetic operands can only be located in the register file.

The coprocessor's structure is based on standard RISC architecture, more precisely on the MIPS architecture [3]. Each pipeline stage takes one clock cycle. The exception is the execution stage, which can take a variable number of clock cycles. The proposed pipeline contains four pipeline stages: instruction fetch (IF), instruction decode (ID), execute (EX) and write back (WB). IF and WB are duplicated for each LP. The register files are also duplicated, one per LP.
978-1-4244-2277-7/08/$25.00 ©2008 IEEE

Fig. 1 - The overall architecture

The proposed architecture can perform 14 types of FP and interval arithmetic operations. The operations directly implemented by the instruction set are addition (ADD), subtraction (SUB), multiplication (MUL), division (DIV), multiply and accumulate (MAF), divide and accumulate (DAF) and comparison (COM). The FP and interval arithmetic variations of the RISC load and store operations are not implemented since we focused on the arithmetic capabilities of the processor.

The coprocessor has 5 pipelined arithmetic FUs. These are the ADD, MULT, COMP, MAF and DAF units. Each of the FUs is optimized for executing the operation that its name suggests. Four of the 5 FUs are implemented so as to permit the execution of more than one type of operation. A complete mapping of which instructions can be executed on each FU is provided in Table 1.

Addition and multiplication are the most common operations found in interval arithmetic and FP arithmetic. The proposed implementation of the FUs permits the simultaneous issue of two additions (or subtractions), two multiplications and also two comparisons. The fact that there are effectively 3 FUs that can issue additions further increases the chance to avoid an FU structural hazard when the workload of the coprocessor features many addition (subtraction) operations.

Since the LPs have separate hardware resources, apart from the functional units, inter-thread structural hazards can only appear as FU structural hazards. This single source of additional hazards induced by issuing two threads on the same set of FUs is further reduced by the above-mentioned multi-operation design of the FUs. An LP's register file is organized as a set of entries, each of which can accommodate either an FP number or 2 of them (representing an interval). Apart from the FUs, the issue logic is the only shared resource of the two LPs. This means the issue logic is the only hardware resource that needs serious modification from the standard (single-threaded, in-order issue) RISC architecture.

TABLE 1 - Operations mapping on functional units

Operation             FUs
Addition              ADD, MAF, DAF
Multiplication        MUL, MAF
Comparison            COM, MUL
Division              DAF
Multiply-add fused    MAF
Divide-add fused      DAF

The problems encountered when trying to schedule instructions for issue in a classic in-order fetch and issue machine are detailed in various works [3][8]. In this section we present our approach to solving those problems, namely all the possible hazards. Another type of problem is present because one instruction can be executed on different functional units, with different latencies. Thus, selecting the best allocation of FUs for the two instructions waiting to be issued becomes more complicated compared to a standard machine where instruction latencies are fixed.

The purpose of our scheduler is to exploit all the parallelism available in the FUs. Since the FUs are pipelined, each has a set of stages (micro FUs). These micro FUs can be used by different instructions in different combinations, as long as no 2 instructions try to use the same micro FU in the same clock cycle (CC). Each instruction that can be executed on a certain FU can use a different subset of micro FUs, and in a different order. Our scheduler checks available FU resources at the micro FU level.

Another important feature of our issue logic is the reusability of the structure, since any modification to the micro FUs can be easily reflected in the scheduler by simply modifying the initialization of a number of look-up tables (LUTs).

The instruction issue block can be seen as logically divided in two stages. The first stage is represented by the WB hazard check, WAW (write-after-write) & RAW (read-after-write) hazard check and micro FUs hazard check blocks (Fig. 2). The information contained in the signals fed out to the second logical stage completely identifies all hazards. The second logical stage is represented by the actual FU selection block. This stage generates control signals for FUs and stall signals for LPs.
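The micro FU rule described above (no two instructions may occupy the same micro FU in the same clock cycle) can be illustrated with a small software model. This is only a sketch under stated assumptions: the stage numbering, the per-operation usage patterns in `USAGE` and the class name `MicroFuTracker` are hypothetical and are not taken from the paper, which implements the check in hardware with LUTs and shift registers.

```python
from collections import defaultdict

# Hypothetical usage patterns (not from the paper):
# USAGE[(fu, op)][k] is the set of micro FUs of `fu` that `op`
# occupies k clock cycles after it is issued.
USAGE = {
    ("ADD", "ADD"): [{0}, {1}],
    ("MAF", "MAF"): [{0}, {1}, {2}, {3}],
    ("MAF", "ADD"): [{0}, {2}, {3}],  # assumed to skip micro FU 1
}


class MicroFuTracker:
    """Tracks which micro FUs are reserved in future clock cycles."""

    def __init__(self):
        # busy[fu][k] = micro FUs of `fu` already reserved k cycles from now
        self.busy = defaultdict(list)

    def conflicts(self, fu, op):
        """True if issuing `op` on `fu` now would reuse a reserved micro FU."""
        reserved = self.busy[fu]
        return any(
            k < len(reserved) and bool(stages & reserved[k])
            for k, stages in enumerate(USAGE[(fu, op)])
        )

    def issue(self, fu, op):
        """Reserve the micro FUs that `op` needs on `fu`."""
        if self.conflicts(fu, op):
            raise ValueError("micro FU structural hazard")
        reserved = self.busy[fu]
        for k, stages in enumerate(USAGE[(fu, op)]):
            if k == len(reserved):
                reserved.append(set())
            reserved[k] |= stages

    def tick(self):
        """Advance one clock cycle: drop the reservations for the current cycle."""
        for reserved in self.busy.values():
            if reserved:
                reserved.pop(0)
```

With these assumed patterns, a multiply-add issued on the MAF blocks an addition on the same unit in the following cycle (both would need micro FU 2 at the same time), while a second multiply-add can follow immediately because it moves through the stages in the same order.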
Fig. 2 - General structure of operation scheduler

Fig. 3 - Structure of micro FU hazard check unit
The WB hazard check block verifies whether the two instructions waiting to be issued will write to the register file in the same clock cycle as a previously issued instruction from the same thread. A LUT contains the WB clock cycle number for each valid FU - instruction combination, and has two read ports, addressed by an opcode for each thread. The information in the LUT is used to check if a WB hazard exists or to update the clock cycles in which there is WB. A shift register per LP (with each bit corresponding to a future clock cycle) is used to hold the clock cycles in which there is WB.

The RAW & WAW hazard check block verifies the existence of the hazards specified in its name. We use one register per LP to hold information about the registers to which the instructions in execution will write back. The information is used to check if any of the operands or the result register of the operations waiting to be issued are used by the operations in execution.

The micro FUs hazard check block (Fig. 3) verifies the potential hazards if the instructions in the ID stage are to be issued on each possible FU. Two LUTs and a structure based on shift registers (FU operation shift register - FUOSR) are used for this verification. The first LUT (FU CAN EXEC OP LUT) contains information about possible operation - FU combinations. The input address is an opcode and the output is a set of bits in the LUT that correspond to each FU - input opcode combination. The second LUT (FU OP LUT) contains information on conflicts between two instructions that could start their execution any number of clock cycles apart on the same FU. This LUT is addressed by a coding of the two selected FUs and the two selected operations. The output consists of the lines of information that indicate the possible conflicts between the selected operation and any possible future operation on the selected FU. The FUOSR is divided into groups of shift registers, one group for each functional unit. Each of the groups contains as many shift registers as there are operations that can be executed on the corresponding functional unit. Also, each of the shift registers contains as many bits as the number of clock cycles required for the lengthiest operation. For output purposes, the FUOSR is addressed with an opcode. The opcode selects its corresponding shift register from every group (if it can be issued on the group's FU) and outputs from these registers the bits corresponding to the next clock cycle. For input, the registers are addressed by the selected functional unit signals, and the whole groups selected this way are OR-ed with the output of the FU OP LUT.

The second logic stage of the scheduler (actual FU selection block) uses all the information resulting from the first stage to generate FU selection signals and stall signals for the two threads. Since on the proposed architecture various operations can be issued on different FUs, with different latencies, a selection criterion is needed for the choice of FUs for the 2 instructions waiting to be issued in the ID stage. Our chosen criterion is to match each instruction with an FU optimized for it. For this purpose we use a priority encoding scheme. If a match is found for both instructions on the same FU at the same priority level, an LP priority flip-flop makes the choice of which LP uses the FU. The flip-flop then changes value to give the other LP high priority. The highest priority matches are selected.

The employed scheme takes as inputs all the outputs of the first logic stage. The outputs are the selected FUs and the selected operations (if an LP pipeline is stalled, the selected instruction for that LP is set to NOP), and also the stall signals for the two LPs. The selected FU and instruction signals are fed to the ID/EX register and also to the inputs of the blocks in the first logic stage, so that the machine's state can be updated. Each of the LUTs in the scheduler can be initialized with different values, permitting the modification of every FU's set of micro FUs. Also configurable in this way is the combination and order of micro FUs used for the execution of a certain instruction.

III. ARITHMETIC UNITS

Important features of this coprocessor are the arithmetic floating point units (FUs). The FUs are designed for both interval and conventional floating point arithmetic. Five FUs are used: interval adder, interval multiplier, floating point comparison unit, floating point MAF and floating point DAF.
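As background for the interval operations these units support, the standard definitions of interval addition and multiplication can be sketched in software. This is an illustrative sketch, not the authors' hardware design: the function names are hypothetical, and `math.nextafter` (Python 3.9+) is used as a stand-in for directed rounding by widening each bound by one unit in the last place, which is safe but slightly wider than true round-down and round-up.

```python
import math


def down(x):
    """Stand-in for rounding toward -infinity (assumption: widen by one ulp)."""
    return math.nextafter(x, -math.inf)


def up(x):
    """Stand-in for rounding toward +infinity (assumption: widen by one ulp)."""
    return math.nextafter(x, math.inf)


def interval_add(a, b):
    """[a_lo, a_hi] + [b_lo, b_hi]: two FP additions, one per bound."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return (down(a_lo + b_lo), up(a_hi + b_hi))


def interval_mul(a, b):
    """[a_lo, a_hi] * [b_lo, b_hi]: min and max over the four bound products."""
    products = [x * y for x in a for y in b]
    return (down(min(products)), up(max(products)))
```

For example, `interval_mul((-1.0, 2.0), (3.0, 4.0))` returns bounds that enclose [-4, 8]. The addition needs two independent FP additions and the multiplication needs comparisons to pick the extreme products, which is consistent with units that pair FP datapaths with comparators.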
The interval adder design is the one proposed in [2]. This adder is based on a classical double path adder. The main characteristic of this type of adder is that the two floating point operations needed for an interval addition can be done simultaneously, each on a different path. This adder design is also suitable for SMT architectures, because two floating point additions can be performed in parallel.

The interval multiplier is based on a design presented in [1] and follows an algorithm suitable for pipeline structures. The structure of the interval multiplier is based on a dual result multiplier (a floating point multiplier with two differently rounded results for the same multiplication) and on two floating point comparators. Therefore, comparisons can also be performed using this type of multiplier.

Comparisons are very important in interval arithmetic, because they are used in interval set operations, like the interval hull, interval intersection and interval inclusion.

Two FUs for combined operations are used: MAF and DAF. For the MAF unit, a structure based on the design in [6] is used, which is a high performance floating point MAF unit. For interval MAF, a combination of the interval addition and interval multiplication has to be done. Thus, the interval MAF is defined as:

\[
[x_{lo}, x_{hi}] \cdot [y_{lo}, y_{hi}] + [z_{lo}, z_{hi}] = [\min(x_{lo}y_{lo} + z_{lo};\ x_{lo}y_{hi} + z_{lo};\ x_{hi}y_{lo} + z_{lo};\ x_{hi}y_{hi} + z_{lo}),\ \max(x_{lo}y_{lo} + z_{hi};\ x_{lo}y_{hi} + z_{hi};\ x_{hi}y_{lo} + z_{hi};\ x_{hi}y_{hi} + z_{hi})]
\]

\[
[x_{lo}, x_{hi}] \cdot [y_{lo}, y_{hi}] - [z_{lo}, z_{hi}] = [\min(x_{lo}y_{lo} - z_{hi};\ x_{lo}y_{hi} - z_{hi};\ x_{hi}y_{lo} - z_{hi};\ x_{hi}y_{hi} - z_{hi}),\ \max(x_{lo}y_{lo} - z_{lo};\ x_{lo}y_{hi} - z_{lo};\ x_{hi}y_{lo} - z_{lo};\ x_{hi}y_{hi} - z_{lo})]
\]

In order to decrease the number of operations, a sign examination of the multiplication operands is done. This is similar to the interval multiplier design.

A new unit was designed for this coprocessor: the DAF unit. The floating point DAF was meant to improve Newton's interval method for solving equations and systems of equations, which is one of the high performance interval arithmetic methods [5]. Newton's interval method relies on a division followed by a subtraction at every iteration, thus justifying such a combined unit. Because interval division is simpler than interval multiplication, the interval DAF is simpler than the interval MAF, requiring only two floating point DAF operations.

IV. CONCLUSIONS AND FUTURE WORK

This paper presents a new type of SMT coprocessor with in-order fetch and issue, suitable for both interval and conventional floating point arithmetic. This coprocessor uses a specialized set of floating point functional units, some of which can perform two or three interval and floating point operations specific to other units. The specialized units designed are: the interval adder, the interval multiplier, the comparator, floating point MAF and floating point DAF. The coprocessor design has two logical processors. In order to reduce the area overhead, the functional unit set is common to both logical processors. The units in this set are each specialized for one operation, but can execute other operations, although less optimally. Therefore, it is possible to execute simultaneously two operations of the same type if they occur on the two LPs. Thus, an increase in the throughput is obtained. A specialized operation scheduler was designed, which tries to select for each operation the FU best suited for it. It also parallelizes tasks at the FU stage level.

The next step in the development and optimization of the coprocessor will be the construction and analysis of an interval benchmark (similar to SPEC-FPU). This way, a rigorous performance analysis can be made and further optimization can be achieved. Improvements can be made both in the operation scheduler and the functional units. Regarding the operation scheduler, a further research direction of this project is to minimize the area occupied by the LUTs and shift registers and to experiment with different degrees of configurability. Regarding the functional units, improvements can be made on the DAF unit, by including higher-performance SRT based dividers.

ACKNOWLEDGEMENTS

This work was supported by the Romanian Second Research and Development National Plan (PNII) grants IDEI-17/2007 and TD-26/2007.

REFERENCES

[1] A. Amaricai, M. Vladutiu, L. Prodan, M. Udrescu, O. Boncalo, "Hardware Support for Combined Interval and Floating Point Multiplication", Proceedings 14th Mixed Design of Integrated Circuits and Systems, 2007, pp. 278-282
[2] A. Amaricai, M. Vladutiu, L. Prodan, M. Udrescu, O. Boncalo, "Exploiting Parallelism in Double Path Adders' Structure for Increased Throughput of Floating Point Addition", Proceedings 10th EUROMICRO Conference on Digital System Design, Architectures, Methods and Tools, 2007, pp. 132-137
[3] J. L. Hennessy, D. A. Patterson, "Computer Architecture, Fourth Edition: A Quantitative Approach", Morgan-Kaufmann, 2006
[4] H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase, T. Nishizawa, "An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads", Proceedings of the 19th Annual International Symposium on Computer Architecture, 1992, pp. 136-145
[5] U. W. Kulisch, "Advanced Arithmetic for the Digital Computer", Springer-Verlag, 2002
[6] T. Lang, J. Bruguera, "Floating-Point Multiply-Add-Fused with Reduced Latency", IEEE Transactions on Computers, Vol. 53, No. 8, 2004, pp. 988-1003
[7] H. Levy, J. Lo, J. Emer, R. Stamm, S. Eggers, D. M. Tullsen, "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor", Proceedings 23rd Annual International Symposium on Computer Architecture (ISCA'96), 1996, pp. 191-203
[8] C. V. Ramamoorthy, H. F. Li, "Pipeline Architecture", ACM Computing Surveys (CSUR), Vol. 9, Issue 1, 1977, pp. 61-102
[9] D. M. Tullsen, S. J. Eggers, H. Levy, "Simultaneous multithreading: maximizing on-chip parallelism", Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA'95), 1995, pp. 392-403
[10] T. Ungerer, B. Robic, J. Silc, "A survey of processors with explicit multithreading", ACM Computing Surveys (CSUR), Vol. 35, Issue 1, 2003, pp. 29-63
