
A Decoupled Scheduled Dataflow Multithreaded Architecture†

Krishna M. Kavi Hyong-Shik Kim Joseph Arul


Dept of Electrical and Computer Engineering
University of Alabama in Huntsville
Huntsville, AL 35899
{kavi, hskim, arulj}@ece.uah.edu
Abstract
In this paper we propose a new approach to building
multithreaded uniprocessors that become building blocks
in high-end computing architectures. Our innovativeness
stems from a multithreaded architecture with non-blocking
threads where all memory accesses are decoupled from the
thread execution. Data is pre-loaded into the thread context
(registers), and all results are post-stored after the completion of the thread execution. The decoupling of memory accesses from thread execution requires a separate unit
to perform the necessary pre-loads and post-stores, and to
control the allocation of hardware thread contexts to enabled threads. This separation facilitates achieving high locality and minimizes the impact of distribution and hierarchy in large memory systems. The non-blocking nature of threads eliminates the need for thread switching, thus reducing thread-scheduling overhead. The functional execution paradigm eliminates the complex hardware required for instruction scheduling in modern superscalar architectures. We will present our preliminary results obtained
from Monte Carlo simulations of the proposed architectural
features.

1 Introduction
Multithreading has been touted as the solution to the ever
increasing performance gap between processors and memory systems (i.e., the memory wall). The memory latency
incurred by an access to memory can be tolerated by performing useful work on a different thread. While there is no
single best approach to multithreading applications, there
is a consensus that on conventional superscalar-based architectures, using conventional single-threaded or coarse-grained multithreading models, the performance peaks for
a small number of threads (2-4).

† This work is supported in part by the following NSF grants: MIPS 9796310, EIA 9729889 and EIA 9895216.

Ali R. Hurson
Dept of Computer Science and Engineering
Pennsylvania State University
University Park, PA 16802
hurson@cse.psu.edu

This implies that adding more pipelines, functional units, or hardware contexts is not cost effective, since the instruction issue width is limited by the available parallelism (viz., 2-4). It is our contention that the advantages of superscalar and wide-issue architectures can only be realized by finding an appropriate multithreaded model and implementation to achieve the best possible performance. We believe that the use of non-blocking
and fine-grained threads will improve the performance of multithreaded systems for a much larger number of threads than four. In our ongoing research, we have been investigating
architectural innovations for improving the performance of
multithreaded dataflow systems[9]. This paper describes a
new architecture that utilizes decoupled memory/execution
units, and scheduling of dataflow instructions somewhat
like control-flow execution models. We will present our
preliminary results.

2 Decoupling Memory Accesses From Execution Pipeline


Jim Smith[15] presented an architecture that separated
memory accesses and execution of instructions. It is interesting to observe that this approach was advanced in an attempt to overcome the ever-increasing processor-memory communication cost. For some time thereafter, the memory latency problem was alleviated by using cache memories. However, increasing cache capacities, while consuming an increasingly
large silicon area on processor chips, have only resulted in
diminishing returns. The gap between processor speed and
average memory access speed is once again the major limitation in achieving high performance. Decoupled architectures may yet present a solution in leaping over the memory wall. In addition, combining the decoupled architecture with multithreading allows for a wide-range of implementations for next-generation architectures. In this section
we describe two multithreaded architectures that support
decoupled memory accesses (Rhamma and PL/PS). In the
next section we show how the decoupled access/execution
units can be utilized with our new dataflow architecture.

Figure 1. Rhamma Processor (a Memory Processor and an Execute Processor, each an IF/ID/OF/EX-WB pipeline, sharing the data cache, instruction cache, scoreboard, and register contexts)

2.1 Rhamma Processor


A multithreaded architecture (called Rhamma) that implements the decoupled memory access/execution was designed in Germany[6]. Figure 1 shows the overall structure
of the Rhamma processor. Rhamma uses two separate processors: a Memory Processor that performs all load and store instructions, and an Execute Processor that executes all other instructions. A single sequence of instructions (thread) is generated for both processors: when a memory-access instruction is decoded by the Execute Processor, a context switch returns the thread to the Memory Processor; when the Memory Processor decodes a non-memory-access instruction, a context switch hands the thread over to the Execute Processor. Threads
are blocking and additional context switches due to data
dependencies may be incurred during the execution of a
thread.

2.2 PL/PS Architecture


Another multithreaded architecture that decouples memory accesses from execution can be found in [10]. In
this architecture, threads are non-blocking, and all memory
accesses are done by the Memory Processor, which delivers enabled threads to the Execute Processor. Each thread
is enabled when the required inputs are available and all
operands are pre-loaded into a register context. Once enabled, a thread executes to completion on the Execute Processor without blocking. The results from completed threads
are post-stored by the Memory Processor.
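The pre-load/execute/post-store lifecycle described above can be sketched as follows. This is a minimal Python abstraction of the idea only, not the PL/PS hardware or its instruction set; all names and data structures here are our own:

```python
from collections import deque

class Thread:
    def __init__(self, name, inputs_needed, body):
        self.name = name
        self.sync_count = inputs_needed   # inputs still outstanding
        self.registers = {}               # per-thread register context
        self.body = body                  # register-to-register computation

def memory_processor(threads, memory, ready_queue):
    """Pre-load operands and deliver only fully enabled threads."""
    for t in threads:
        for key, value in memory.items():
            t.registers[key] = value      # pre-load into the register context
            t.sync_count -= 1
        if t.sync_count <= 0:
            ready_queue.append(t)         # enabled: will run to completion

def execute_processor(ready_queue, memory):
    """Run each enabled thread to completion; no blocking, no memory ops."""
    results = {}
    while ready_queue:
        t = ready_queue.popleft()
        results.update(t.body(t.registers))
    memory.update(results)                # post-store after completion

memory = {"a": 3, "b": 4}
ready = deque()
memory_processor([Thread("t0", 2, lambda r: {"c": r["a"] + r["b"]})],
                 memory, ready)
execute_processor(ready, memory)          # memory["c"] becomes 7
```

The point of the separation is visible in the sketch: the execute step touches only the register context, so it can never stall on memory.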

3 Scheduled Dataflow
Even though the dataflow model and architectures have
been studied for more than two decades and held the
promise of an elegant execution paradigm with the ability to exploit inherent parallelism in applications, the actual implementations of the model have failed to deliver the promised performance. Most modern processors

have brought the execution engine closer to an idealized


dataflow engine to achieve high performance albeit by utilizing complex hardware (e.g., instruction scheduling, register renaming, out-of-order instruction issue and retirement, non-blocking caches, branch prediction and predicated branches). It is our contention that such complexities
can be eliminated if a more suitable implementation of the
dataflow model can be discovered. We feel that the primary
limitations of the pure dataflow model that prevented commercially viable implementations are:
(a) Too fine-grained (instruction level) multithreading
(b) Difficulty in using memory hierarchies and registers
(c) Asynchronous triggering of instructions
Many researchers have addressed the first two limitations
of dataflow architectures [9, 16, 17, 18]. There have been
several research projects that demonstrated how coarser-grained threads can be utilized within the dataflow execution model. The benefits of cache memories within the context of the Explicit Token Store (ETS) dataflow paradigm were
presented in [9]. In this section we propose a new dataflow
architecture that addresses the third limitation by deviating
from the asynchronous triggering of dataflow instructions,
and by scheduling instructions for synchronous execution.
The new model also decouples all memory accesses from thread execution (like PL/PS) to alleviate memory latencies and further exploit multithreading.
There have been several hybrid architectures proposed
where the dataflow scheduling was applied only at thread
level (i.e., macro-dataflow) with conventional control-flow
instructions comprising threads (e.g., [5], [7], [14]). In
such systems, the instructions within a thread do not retain functional properties, and introduce side-effects, WAW
and WAR dependencies. Lacking dataflow properties at instruction level requires complex hardware for the detection
of data dependencies and dynamic scheduling of instructions. In our system, the instructions within a thread still retain dataflow (functional) properties, and thus eliminate the
need for complex hardware. The results (or data) flow from instruction to instruction, and each instruction specifies a location for the data to be stored. This is contrary to the control-flow model, where the results (or data) are stored in locations with no specific connection to a destination instruction; the flow of data only defines control points. Our proposed Scheduled Dataflow system deviates from the data-driven (or token-driven) execution traditionally used in pure dataflow systems: it is instruction driven, where program-counter-style sequencing is used to execute instructions.
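The distinction can be illustrated with a toy token-matching sketch (entirely our own abstraction, not an actual dataflow machine):

```python
frame = {}                                # offset -> waiting operand

def token_arrives(offset, value, fire):
    """Token-driven firing: execute the moment the partner token matches."""
    if offset in frame:
        fire(frame.pop(offset), value)    # match found: fire immediately
    else:
        frame[offset] = value             # first token waits in the frame

fired = []
token_arrives(8, 2, lambda a, b: fired.append(a * b))  # first token waits
token_arrives(8, 5, lambda a, b: fired.append(a * b))  # matches and fires
# fired == [10]: the multiply ran the instant its second token arrived.
```

In Scheduled Dataflow, by contrast, the same multiply would wait its turn under a program counter even when both operands are already available; data still flows to named operand locations, but firing is sequenced rather than asynchronous.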

3.1 Overview of Scheduled Dataflow Architecture


We feel that it should be possible to define a dataflow architecture that executes instructions in a prescribed order instead of executing them as soon as data is available. Compile-time analysis of the source program can be used to define an expected order in which instructions may be executed (even though data may already be available for these instructions).

Figure 2. Instruction Formats: (a) the ETS format (Opcode, Offset(R), Dest-Instr-1 and Port, Dest-Instr-2 and Port); (b) the Scheduled Dataflow format (Opcode, Offset(R), Dest-Data-1 and Port, Dest-Data-2 and Port)

3.2 Instruction Formats


Before describing the architecture of Scheduled
Dataflow, it is necessary to understand the instruction
format. Our architecture is derived from the Explicit Token Store (ETS) dataflow system [13, 9], where each instruction specifies a
memory location by providing an offset (R) with respect
to an activation frame pointer (FP). The first data token
destined to the instruction will be stored in this memory
location, waiting for its match. When a matching data
token arrives for the instruction, the previously stored data
is retrieved, and the instruction is immediately scheduled
for execution. The result of the instruction is converted
into (up to) two tokens by tagging the data with the
address of its destination instruction (IP). The format of
the ETS instructions is shown in figure 2(a). Instructions
for Scheduled Dataflow differ from ETS instructions only
slightly (figure 2(b)). In ETS, the destinations refer to
the destination instructions (i.e., IP values); in Scheduled
Dataflow the destinations refer to the operand locations of
the destination instructions (i.e., offset value into activation
frames or register contexts). This change also permits the
detection of RAW data dependencies among instructions
in the execution pipeline and the use of result forwarding
so that results from an instruction can be sent directly
to dependent instructions. The result forwarding is not
applicable in ETS dataflow since instructions are token
driven. The source operands for Scheduled Dataflow are
specified by a single offset (R) value and refer to a pair of
registers where the data values are stored by the predecessor
instructions.
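The difference between the two formats in figure 2 can be made concrete with a small encoding sketch; the field names and types below are our own illustration, not a defined binary layout:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ETSInstruction:        # destinations name instructions (IP values)
    opcode: str
    offset_r: int            # frame offset used for token matching
    dests: Tuple[Tuple[int, int], ...]   # up to two (dest IP, port) pairs

@dataclass
class SDFInstruction:        # destinations name operand *locations*
    opcode: str
    offset_r: int            # names the register pair holding both operands
    dests: Tuple[Tuple[int, int], ...]   # up to two (operand offset, port) pairs

# Because an SDF destination is a storage location rather than an IP,
# the pipeline can compare it against later source offsets to detect
# RAW dependencies and forward results directly to the consumer.
add = SDFInstruction("ADD", 4, ((6, 0), (8, 1)))
```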

3.3 Pipeline Structure of Scheduled Dataflow


Figure 3 describes the architecture of the Scheduled
Dataflow (the figure does not show all the data paths for
the Synchronization Processor). The functionality of the
pipeline stages of our Execution Processor is described
here.
Figure 3. General organization of the Scheduled Dataflow architecture (Execution Pipeline with Register Contexts, Synchronization Processor, I-Structure Cache, and Write Back)

Instruction Fetch: The instruction fetch behaves like a traditional fetch stage, relying on a program counter to fetch the next instruction. The context information can be viewed as part of the thread id, <FP, IP>, and is used for accessing the register file specific to the thread.
Operand Fetch: The operand fetch retrieves a double
word from a register file that contains the two operands
for the instruction. Each instruction specifies an offset (R) that refers to a pair of registers (as described
above). Thus, a read port should supply a double word.
Execute: The execute stage executes the instruction and sends the results to write-back along with the destination addresses (identifying the registers of the destination instructions).
Write-back: The write-back stage writes up to two values to the register file; the two values may go to two different locations in the register file. This necessitates two write ports to the register file.
As can be seen, the execution pipeline described above behaves very much like conventional RISC pipelines while retaining the primary dataflow properties: functional nature, side-effect freedom, and non-blocking threads. The functional and side-effect-free nature of dataflow eliminates the need for complex hardware (e.g., a scoreboard) for detecting write-after-read (WAR) and write-after-write (WAW) dependencies and for register renaming. In fact, the double-word operand memory (or registers) can be viewed as implicit register renaming and a variation of the reservation stations utilized in Tomasulo's algorithm. In our architecture, scheduling and register renaming are implied by the programming
model and hence defined statically. The non-blocking nature of our thread model and the use of a separate processor
(Synchronization Processor, SP) for thread synchronization
and memory accesses (see section 3.4 for more details)
eliminates unnecessary thread context switches on long latency operations or cache misses. Our architecture does not
prevent superscalar or multiple instruction issue implementations for the Execution Pipeline (EP).

3.4 Separate Synchronization Processor

Using multiple hardware units for the coordination and execution of instructions is not new. We have described three examples of decoupled architectures in section 2. There are other systems where separate hardware units have been proposed to handle the synchronization among threads in multithreaded architectures (e.g., Alewife[1], StarT-NG[3], EARTH[7]). We follow this tradition and propose two hardware units for Scheduled Dataflow. One of the hardware units (EP) is similar to conventional RISC pipelines, as described previously. The other hardware unit (SP) is responsible for accessing memory to load the initial operands of enabled threads into registers (pre-load) and to store the results produced by threads from registers (post-store); for maintaining synchronization counts for threads; and for scheduling enabled threads (including allocation of register contexts and placing enabled threads on the ready queue of the execution unit). We have developed a complete instruction set[11] for the Scheduled Dataflow architecture and hand-coded several example programs, including some Livermore Loops, matrix multiplication, Fibonacci, and factorial functions. We rely on this experience in generating appropriate parameters for the Monte Carlo simulations described in the next section.

4 Analytical Model Evaluating the New Architecture

4.1 Overview of the experiment

There have been many analytical formulations to predict the performance of multithreaded programs on conventional architectures (see, for example, [2] and [4]). In this section, we show a preliminary performance analysis of our Scheduled Dataflow architecture using Monte Carlo simulations. In order to analyze the architecture in a more realistic light, we generated synthetic workloads and applied these workloads to simulations representing the different architectures. The workload generation is based on parameters chosen from published data (e.g., [8]), our observations based on specific architectural characteristics, and observations based on hand-coded programs. Our intention is to emphasize the fundamental differences in the programming and execution paradigms of the architectures (viz., data-driven threads, non-blocking vs. blocking threads, no stalls on memory access vs. stalls due to cache misses, no branch stalls vs. stalls on misprediction, token-driven vs. scheduled instructions, etc.). Thus, it is not possible to use the same set of parameters for all architectures or to come up with a common metric. Instead, we used the same normalized workload for all architectures. That is, all architectures execute the same amount of useful work, but different architectures have different amounts of overhead instructions, stalls, and context switches.

Figure 4. Effect of thread parallelism (execution time vs. number of concurrent threads for Conv, Rhamma, and SDF, with L = 1R, 3R, 5R)

A detailed explanation of our simulation models representing the three architectures (conventional processor, Scheduled Dataflow processor, and Rhamma processor) can be found in [12].

4.2 Thread parallelism


In order to measure the effect of thread level parallelism
on the performance of the different architectures, we generated a sequence of threads for each architecture. We took
the simple performance model for multithreaded processors
suggested by Agarwal[2] to introduce the latency between
a pair of threads (the time difference between the termination of a thread and the initiation of a successive thread).
We considered three values for latencies, 1, 3, 5 times the
length of a thread (L = 1R, L = 3R, L = 5R in figure 4
above). Note that figure 4 shows the execution times for the same total workload but for varying numbers of threads comprising the workload. The execution time of the conventional processor is not affected by the degree of thread parallelism, since this processor executes single-threaded programs only. However, as the degree of thread parallelism increases, both Scheduled Dataflow and Rhamma show performance gains. As expected, with only one thread at a time (degree of thread parallelism = 1), multithreaded architectures perform poorly when compared to single-threaded architectures. The figure also shows that Scheduled Dataflow
executes the multithreaded workload faster than Rhamma
for all values of thread parallelism.
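The intuition behind the latency parameter can be rendered as a back-of-the-envelope formula in the spirit of Agarwal's model [2]. This is our own simplification for illustration; the actual simulations in [12] are considerably more detailed:

```python
def exec_time(total_work, R, L, n_threads):
    """Time units to finish total_work, with run length R per thread
    activation and latency L between one thread's end and the next start."""
    n_runs = total_work // R           # thread activations needed
    hidden = (n_threads - 1) * R       # work from other threads overlaps L
    stall = max(0, L - hidden)         # exposed latency per activation
    return total_work + n_runs * stall

print(exec_time(1200, R=30, L=90, n_threads=1))  # 4800: every activation stalls
print(exec_time(1200, R=30, L=90, n_threads=4))  # 1200: L = 3R fully hidden
```

This captures the qualitative shape of figure 4: execution time falls as thread parallelism rises, until enough threads exist to cover the inter-thread latency, after which additional threads buy nothing.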
In the above experiment we used the same cache miss
rates (5%) and the same cache miss penalty (50 cycles)
for all architectures.

Figure 5. Effect of thread length (execution time vs. normalized thread length for Conv, Rhamma, and SDF)

We feel, however, that Scheduled Dataflow would have smaller cache miss rates, since the pre-loads and post-stores of thread data facilitate better cache prefetching than in other architectures, as well as better data grouping and placement that can be achieved by compilers. It should also be mentioned that Scheduled Dataflow will provide a higher degree of thread parallelism than Rhamma, since the non-blocking nature of Scheduled Dataflow leads to finer-grained threads. These two observations indicate that we can expect even better performance for Scheduled Dataflow than shown in figure 4. In
the remaining experiments we will use L = 3R for both
Scheduled Dataflow and Rhamma architectures.

4.3 Thread granularity


In the previous experiments we set the average thread
length to 30, 20, and 50 functional instructions for conventional architecture, Scheduled Dataflow, and Rhamma respectively. It should be noted that the average thread lengths
are based upon our observations from analyzing some actual
programs written using Scheduled Dataflow instructions. In
this section, we varied the average thread lengths and the results are shown in figure 5.
Note that normalized thread length includes only functional instructions and does not include architecture-specific
overhead instructions. For conventional and Scheduled
Dataflow architectures, increasing thread run-lengths shows
performance gains to a certain degree, since longer threads
imply fewer context switches. With Rhamma, however, longer threads do not guarantee shorter execution times: the blocking nature of Rhamma threads causes proportionally more thread blockings (and context switches) per thread as run length increases. Thus, increasing thread granularity without considering other optimizations for blocking multithreaded systems with decoupled access/execute processors may adversely impact performance.
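This behavior can be captured by a rough counting model. The functions below are entirely our own illustration (with an assumed 30% memory-access fraction), not the model used in [12]:

```python
def switches_nonblocking(total_work, R):
    """One pre-load and one post-store handoff per thread, regardless of R."""
    return (total_work // R) * 2

def switches_blocking(total_work, R, mem_fraction=0.3):
    """A blocking thread switches on each group of memory accesses it
    contains, and that count grows with R as threads get longer."""
    threads = total_work // R
    per_thread = max(1, round(R * mem_fraction))
    return threads * per_thread

# Doubling R halves the thread count but doubles the per-thread switch
# count in the blocking model, so its total barely changes:
print(switches_nonblocking(1200, 20), switches_nonblocking(1200, 40))  # 120 60
print(switches_blocking(1200, 20), switches_blocking(1200, 40))        # 360 360
```

Under these assumptions the non-blocking design converts longer threads directly into fewer handoffs, while the blocking design does not, which matches the trend in figure 5.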

4.4 Fraction of memory access instructions


Since both Scheduled Dataflow and Rhamma decouple
memory accesses from pipeline execution, we explored the
impact of the number of memory access instructions per
thread. Figure 6 shows the results, where the x-axis indicates the fraction of load/store instructions.

Figure 6. Effect of the fraction of memory access instructions (execution time vs. fraction of load/store instructions for Conv, Rhamma, and SDF)

For the conventional architecture, increasing the number of memory access instructions leads to more cache misses, thus increasing the execution time. However, the decoupling permits the two multithreaded processors to tolerate the cache
miss penalties. Note that Scheduled Dataflow outperforms
Rhamma for all values of memory access instructions. This
is primarily because of the pre-loading and post-storing
performed by Scheduled Dataflow. We feel that the decoupling of memory accesses from execution is more useful if
the memory accesses can be grouped together (as done in
Scheduled Dataflow).

4.5 Effect of Cache memories


Figure 7 shows the effect of cache memories on the performance of the three architectures. We assumed a 50-cycle cache miss penalty for figure 7(a) and a 5% cache miss rate for figure 7(b). As observed in the previous section, both multithreaded processors are less sensitive to memory access delays than the conventional processor.
When a cache miss occurs in Rhamma, a context
switch (switch on use) of the faulting thread occurs. In
Scheduled Dataflow (note that only pre-load and post-store operations access the memory), assuming non-blocking caches, a cache miss does not prevent memory accesses on behalf of other threads. Note that this is not possible
in Rhamma since memory accesses are not separated into
pre-loads and post-stores. The delays incurred by pre-loads and post-stores in Scheduled Dataflow do not lead to
additional context switches since threads are enabled for execution only when the pre-loading is complete, and once enabled for execution, they complete without blocking. Once
again, we feel that the decoupling of memory accesses provides better tolerance of memory latencies when used with non-blocking multithreading models and when memory accesses are grouped into pre-loads and post-stores.

Figure 7. Effect of cache memories for Conv, Rhamma, and SDF: (a) effect of miss rates (execution time vs. cache miss rate, %); (b) effect of miss penalties (execution time vs. cache miss penalty, cycles)

5 Conclusions
In this paper we presented a dataflow architecture that utilizes control-flow-like scheduling of instructions and separates memory accesses from instruction execution to tolerate the long latencies incurred by memory accesses. Our primary goal is to show that it is possible to design efficient multithreaded dataflow implementations. While decoupled access/execute implementations are possible with single-threaded architectures, a multithreading model presents better opportunities for exploiting the decoupling of memory accesses from the execution pipeline. We feel that, even among
multithreaded alternatives, non-blocking models are more
suited for the decoupled execution. Furthermore, grouping
memory accesses (e.g., pre-load and post-store) for threads
eliminates unnecessary delays (stalls) caused by memory
accesses. We strongly favor the use of dataflow instructions to reduce the complexity of the processor by eliminating complex logic needed for resolving data dependencies, branch prediction, register renaming and instruction
scheduling on superscalar implementations. Although the
results presented here are based on synthetic benchmarks
and Monte Carlo simulations, the benchmarks are driven by
either published data (e.g., load/store instruction frequencies, branch frequencies, cache miss rates and penalties) or
information obtained from analyzing several programs written for the architectures under evaluation. We are currently
developing detailed instruction simulations of the proposed
Scheduled Dataflow architecture to investigate the performance based on instruction traces.

References
[1] A. Agarwal et al., "The MIT Alewife machine: Architecture and performance," Proc. of the 22nd Int'l Symp. on Computer Architecture (ISCA-22), 1995, pp. 2-13.
[2] A. Agarwal, "Performance tradeoffs in multithreaded processors," IEEE Transactions on Parallel and Distributed Systems, vol. 3, no. 5, pp. 525-539, September 1992.
[3] D. Chiou et al., "StarT-NG: Delivering seamless parallel computing," Proc. of the First Int'l EURO-PAR Conference, Aug. 1995, pp. 101-116.
[4] D.E. Culler, "Multithreading: Fundamental limits, potential gains and alternatives," Proc. of Supercomputing '91, Workshop on Multithreading, 1992.
[5] R. Govindarajan, S.S. Nemawarkar and P. LeNir, "Design and performance evaluation of a multithreaded architecture," Proc. of HPCA-1, Jan. 1995, pp. 298-307.
[6] W. Grunewald and T. Ungerer, "A multithreaded processor design for distributed shared memory systems," Proc. Int'l Conf. on Advances in Parallel and Distributed Computing, 1997.
[7] H.H.-J. Hum et al., "A design study of the EARTH multiprocessor," Proc. of the Conference on Parallel Architectures and Compilation Techniques, Limassol, Cyprus, June 1995, pp. 59-68.
[8] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, 1996, p. 105.
[9] K.M. Kavi and A.R. Hurson, "Performance of cache memories in dataflow architectures," Euromicro Journal on Systems Architecture, June 1998, pp. 657-674.
[10] K.M. Kavi, D. Levine and A.R. Hurson, "PL/PS: A non-blocking multithreaded architecture," Proc. of the Fifth International Conference on Advanced Computing (ADCOMP '97), Madras, India, Dec. 1997.
[11] H.-S. Kim, "Instruction Set Architecture of Scheduled Dataflow," Technical Report, Dept. of Electrical and Computer Engineering, University of Alabama in Huntsville, April 1998.
[12] H.-S. Kim and K.M. Kavi, "Preliminary Performance Analysis of Decoupled Multithreaded Architectures," Technical Report, Dept. of Electrical and Computer Engineering, University of Alabama in Huntsville, October 1998.
[13] G.M. Papadopoulos and K.R. Traub, "Multithreading: A revisionist view of dataflow architectures," Proc. of the 18th International Symposium on Computer Architecture, pp. 342-351.
[14] S. Sakai et al., "Super-threading: Architectural and software mechanisms for optimizing parallel computations," Proc. of the 1993 Int'l Conference on Supercomputing, July 1993, pp. 251-260.
[15] J.E. Smith, "Decoupled access/execute computer architectures," Proc. of the 9th Annual Symp. on Computer Architecture, May 1982, pp. 112-119.
[16] M. Takesue, "A unified resource management and execution control mechanism for dataflow machines," Proc. of the 14th Int'l Symp. on Computer Architecture, June 1987, pp. 90-97.
[17] S.A. Thoreson and A.N. Long, "A feasibility study of a memory hierarchy in a data flow environment," Proc. of the Int'l Conference on Parallel Processing, June 1987, pp. 356-360.
[18] M. Tokoro, J.R. Jagannathan and H. Sunahara, "On the working set concept for data-flow machines," Proc. of the 10th Int'l Symp. on Computer Architecture, July 1983, pp. 90-97.