
Supporting Ordered Multiprefix Operations in

Emulated Shared Memory CMPs


Martti Forsell
Platform Architectures Team
VTT
Oulu, Finland

Jussi Roivainen
Digital Systems Design Team
VTT
Oulu, Finland
Abstract - Shared memory emulation is a promising technique to address the programmability and performance scalability concerns of chip multiprocessors (CMP) because it provides implied synchrony in the execution of machine instructions, an efficient latency hiding technique, and enough effective bandwidth to route all the memory references even with the heaviest workloads. In our earlier research we have proposed an architectural solution to support concurrent memory access and multioperations on emulated shared memory CMPs with the help of active memory units attached to memory modules. While this solution provides faster memory access than other known solutions with minor silicon area and power consumption overheads, the results of multiprefixes are unfortunately not in order, forcing one to use a relatively slow logarithmic multiprefix algorithm for ordered multiprefixes. In this paper we propose an architectural technique for supporting a limited number of concurrent ordered multiprefix operations in emulated shared memory CMPs. The solution is based on adding special multiprefix arrays to active memory units. Performance, silicon area, and power consumption evaluations are given.
Keywords: Parallel computing, CMP, ordered multiprefix, concurrent memory access, computer architecture

1. Introduction
Shared memory emulation [Ranade91] is a promising technique to address programmability and performance scalability
concerns of chip multiprocessors (CMP). This is because it provides implied synchrony in the execution of machine instructions, an efficient latency hiding technique, and enough effective bandwidth to route all the memory references even with the heaviest
random and concurrent access workloads. Synchronous execution is considered to make programming easier because a programmer does not need to synchronize threads of execution
explicitly after each global memory access but can rely on the
hardware to take care of that automatically. Latency hiding used
in shared memory emulation makes use of the high-throughput
computing scheme [Beck97], where other threads are executed
while a thread refers to the global shared memory. Since the
throughput computing scheme employs parallel slackness
extracted from available thread-level parallelism, it is considered
to provide remarkably better scalability than traditional symmetric multiprocessors and non-uniform memory access systems relying on snooping or directory-based cache coherence mechanisms and therefore suffering from limited bandwidth, directory access delays and heavy coherence traffic.
We have proposed an architectural solution to support concurrent memory access and multioperations on emulated shared memory CMPs with the help of active memory units attached to

memory modules [Forsell06a]. While the solution indeed provides faster memory access than other known solutions with
minor silicon area and power consumption overheads, the
results of multiprefixes are unfortunately not in order, forcing
one to use a relatively slow logarithmic multiprefix algorithm
for ordered multiprefixes. In this paper we propose an architectural technique for supporting a limited number of concurrent
ordered multiprefix operations in emulated shared memory
CMPs. The solution is based on adding special multiprefix
arrays to active memory units. Performance, silicon area, and
power consumption evaluations are given.

1.1 Related work


Architectures for shared memory emulation, also known as emulated shared memory (ESM) architectures, have been studied since the 1970s, when the ideal shared memory machine, the parallel random access machine (PRAM) [Fortune78], was invented: Schwartz proposed ultracomputers with network switches that combine requests destined for the same memory location [Schwartz80]. Ranade outlined a method to emulate PRAM-like shared memory [Ranade91]. Forsell outlined a scalable on-chip computing architecture with efficient instruction-level parallelism exploitation for general-purpose parallel computers employing the PRAM model [Forsell02]. The idea of partial and limited concurrent memory access for synchronous CMPs was presented in [Forsell05]. It takes, however, as many as three steps to make a full concurrent access and provides only a low number of memory locations for which concurrent access is allowed. The idea was further extended to full concurrent memory access and multioperation support with the help of step caches and scratchpads in [Forsell06a], but the technique does not preserve the ordering of multiprefixes. Vishkin introduced the explicit multithreaded architecture, including a multiprefix computation unit, for realizing PRAM-like computing, but the synchronization scheme used is more relaxed than in strict PRAM, limiting its applicability [Vishkin11].
The rest of the article is organized as follows: in Section 2 we describe the emulated shared memory system, in Section 3 we propose the novel architectural technique supporting ordered multiprefix operations, in Section 4 we evaluate the proposed technique on our emulated shared memory CMP framework and give rough silicon area and power consumption estimates, and finally in Section 5 we give our conclusions.

2. Shared memory emulation


The main idea in shared memory emulation is to provide the user with an illusion of ideal shared memory although the underlying architecture has a physically distributed memory.

Figure 1. Emulated shared memory system supporting concurrent memory access, multioperations and arbitrarily ordered multiprefixes. (Processors P1...Pp with low-overhead multithreading, driven by a common clock or independent clocks, are each attached to a local instruction memory module I1...Ip and a scratchpad step cache unit S1...Sp, and connected via a high-bandwidth synchronous network to shared data memory modules M1...Mp equipped with active memory units A, which together form the distributed shared data memory.)

The properties
of ideal shared memory are best captured by the PRAM model, which abstracts away asynchronicity in the execution of threads and the non-uniformities (latency and the need for partitioning) of memory access. PRAM lets a programmer focus on the intrinsic parallelism of the computational problem and on parallel algorithm design instead of being forced to orchestrate asynchronous, possibly non-uniform and implementation-dependent low-level issues. Unfortunately, a direct implementation of an ideal, PRAM-style shared memory has proved to be physically infeasible with current silicon technology if the number of processors is higher than, say, 4 [Forsell94]. This is because the wiring area (and the power dissipation) of a multiport memory chip grows quadratically as the number of processors increases, with respect to a single-ported memory of the same capacity. In this section, we describe the main principles and architectural techniques of shared memory emulation and introduce existing solutions for implementing concurrent memory access and multiprefix operations.

2.1 Principles
A typical scalable architecture to emulate shared memory on
a silicon platform consists of a set of processor cores connected
to a distributed shared memory via a physically scalable high-bandwidth interconnection network (see Figure 1). The main idea is to provide each processor core with a set of threads that are executed efficiently in an interleaved manner to hide the latency of the network. As a thread makes a memory reference, the processor switches to the next thread, which can issue its own memory request, and so on. No memory delay occurs as long as the reply to a thread's memory reference arrives at the processor core before the thread is put back into execution [Ranade91]. This requires that the bandwidth of the network is high enough and that hot spots can be avoided in the pipelined memory access traffic. Synchronicity between consecutive instructions can be guaranteed by using an elastic synchronization wave between the steps [Leppänen96].
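To make the latency hiding principle concrete, the following C sketch (an illustration only; TP, LATENCY and STEPS are assumed values, not parameters of any actual implementation) interleaves TP threads in round-robin fashion and counts stall cycles: as long as the memory round trip LATENCY does not exceed TP cycles, every reply is back before its thread gets its next execution slot and no stalls occur.

#include <stdio.h>

#define TP      512   /* threads per processor (assumed value)              */
#define LATENCY 300   /* memory round trip in clock cycles (assumed value)  */
#define STEPS   4     /* number of simulated steps                          */

int main(void)
{
    long reply_ready[TP] = {0};  /* cycle at which each thread's reply is back */
    long cycle = 0, stalls = 0;

    for (int step = 0; step < STEPS; step++) {
        for (int t = 0; t < TP; t++) {          /* round-robin interleaving   */
            if (reply_ready[t] > cycle) {       /* reply not yet back: stall  */
                stalls += reply_ready[t] - cycle;
                cycle = reply_ready[t];
            }
            /* thread t executes one instruction with a shared memory reference */
            reply_ready[t] = cycle + LATENCY;
            cycle++;                            /* one thread issued per cycle  */
        }
    }
    /* With LATENCY <= TP the printed stall count is zero. */
    printf("cycles=%ld stalls=%ld\n", cycle, stalls);
    return 0;
}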

2.2 Implementation techniques


In order to efficiently emulate shared memory on top of a distributed memory system, processors need to be multithreaded [Valiant90, Leppänen96]. Such multithreading can be implemented as a Tp-stage, cyclic, in-order (interleaved) interthread pipeline, which provides hazard-free execution for hiding the latency of the memory system, maximizes overlapping of the execution of threads, and minimizes the register access delay. Switching between threads does not slow down operation of the processor, because threads proceed in the pipeline only during the forward time. If a thread tries to refer to memory when the network is busy, the pipeline is suspended until the network becomes available again. After issuing a memory read, the thread can wait for the reply for at most Mw < Tp clock cycles before the pipeline freezes until the reply arrives. A processor is composed of F functional units, a hash address calculation unit, and Tp sets of R registers. The scheduling of operations is static, since dynamic techniques might conflict with the synchronous thread-level parallel (TLP) execution. The PRAM model is linked to the architecture so that a full cycle in the pipeline typically corresponds to a single PRAM step. During a step, each thread of each processor of the CMP executes an instruction including at most one shared memory reference subinstruction. Therefore a step lasts for multiple, at least Tp, clock cycles.
There are two types of memory modules, data memory modules and instruction memory modules, that are accessed via the
data and instruction memory ports of processors, respectively
(see Figure 1). All data is located in physically distributed but logically shared data memory modules emulating the ideal PRAM memory. Instruction memory modules are intended to hold the program code for each processor. The data and instruction memory modules, of sizes Ssd and Si bytes respectively, are isolated from each other to guarantee parallel high-bandwidth data
and instruction streams to processors.
The communication network connects processors to distributed memory modules so that sufficient throughput and low
enough latency can be achieved for random communication
patterns with a high probability as outlined in [Ranade91,
Leppänen96]. Suitable scalable intercommunication topologies include sparse or underpopulated networks, e.g. variants of two-dimensional meshes providing fixed-degree nodes as well as a fixed length of interconnection lines independently of the number of processors. To maximize the throughput for read-intensive portions of code, one can use separate lines for references going from processors to memories and for replies from memories to processors. Memory locations are distributed across the data modules by a randomly chosen polynomial hashing function for avoiding congestion of messages and hot spots [Ranade91, Dietzfelbinger94].

Figure 2. Multiprefix using the two-level approach that does not preserve the ordering of multiprefix operations. (1. Determine intra-processor multiprefixes with the BMPxx instruction. 2. Send processor-wise results to the modules to determine inter-processor multiprefixes, one result per processor only; the first reference of the EMPxx instruction triggers an external memory reference, and ordering is lost here since memory references arrive in non-deterministic order. 3. Spread and compute the final arbitrary ordered multiprefixes within the processors as the EMPxx instruction continues; the processor-wise offset is computed to the thread-wise results, and threads that have already used their execution slot are updated at the end of the memory reply pipeline segment.)
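To illustrate the idea of randomized polynomial hashing (this is only a sketch; the degree, the prime and the module count are assumptions and not the actual parameters used in the framework), the following C program evaluates a degree-3 polynomial with randomly drawn coefficients over a prime field and reduces the result to a module index, scattering consecutive addresses across the modules.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define DEGREE  3              /* degree of the random polynomial (assumption)  */
#define PRIME   2147483647ULL  /* prime 2^31-1 used as the field size           */
#define MODULES 64             /* number of memory modules (P in the paper)     */

static uint64_t coeff[DEGREE + 1];

/* Draw the polynomial coefficients once, at program start. */
static void choose_polynomial(unsigned seed)
{
    srand(seed);
    for (int i = 0; i <= DEGREE; i++)
        coeff[i] = (uint64_t)rand() % PRIME;
}

/* h(x) = (c_k x^k + ... + c_1 x + c_0 mod PRIME) mod MODULES, via Horner's rule. */
static unsigned module_of(uint64_t address)
{
    uint64_t x = address % PRIME, h = 0;
    for (int i = DEGREE; i >= 0; i--)
        h = (h * x + coeff[i]) % PRIME;   /* fits in 64 bits since PRIME < 2^31  */
    return (unsigned)(h % MODULES);
}

int main(void)
{
    choose_polynomial(12345u);
    for (uint64_t a = 0; a < 8; a++)      /* consecutive addresses scatter       */
        printf("address %llu -> module %u\n", (unsigned long long)a, module_of(a));
    return 0;
}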

2.3 Concurrent memory access and multiprefix operations
Concurrent reads and writes to memory locations can be implemented using step caches. For a concurrent read, all threads participating in the access get the same result. In the case of a concurrent write, the data of an arbitrary thread participating in the write will be written to the target location. Step caches are associative memory buffers in which data stays valid only until the end of the ongoing step of multithreaded execution [Forsell05]. The main contribution of step caches to concurrent accesses is that they filter out, step-wise, everything but the first reference for each referenced memory location. This reduces the number of requests per location to P, allowing them to be processed sequentially on a single-ported memory module assuming Tp ≥ P. Step caches operate similarly to ordinary caches with a few notable exceptions: each time a multithreaded processor refers to the shared data memory, a step cache search is performed.
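A simplified software model of this step-wise filtering is sketched below in C (illustrative only; the real step cache is a limited-associativity hardware buffer of Tp lines, here modeled as a fully associative array): a reference is forwarded to the memory system only if its address has not yet been seen during the current step, and advancing the step counter implicitly invalidates all old entries.

#include <stdio.h>

#define LINES 512   /* step cache lines, Tp in the paper (assumption) */

typedef struct {
    unsigned long addr;   /* cached address               */
    unsigned long step;   /* step in which it was cached  */
    int           valid;
} line_t;

static line_t cache[LINES];
static unsigned long current_step = 1;

/* Returns 1 if the reference must go out to the memory system (first reference
 * to addr in this step), 0 if it is filtered out by the step cache.            */
static int step_cache_filter(unsigned long addr)
{
    int free_slot = -1;
    for (int i = 0; i < LINES; i++) {
        if (cache[i].valid && cache[i].step == current_step) {
            if (cache[i].addr == addr)
                return 0;                 /* already referenced in this step */
        } else if (free_slot < 0) {
            free_slot = i;                /* stale or unused line, reusable  */
        }
    }
    if (free_slot < 0)                    /* cannot happen with at most LINES  */
        return 1;                         /* distinct references per step      */
    cache[free_slot].addr  = addr;        /* record the first reference        */
    cache[free_slot].step  = current_step;
    cache[free_slot].valid = 1;
    return 1;
}

int main(void)
{
    unsigned long refs[] = { 10, 42, 10, 10, 7, 42 };
    for (int i = 0; i < 6; i++)
        printf("addr %lu -> %s\n", refs[i],
               step_cache_filter(refs[i]) ? "sent to memory" : "filtered");
    current_step++;   /* a new step implicitly invalidates all old entries */
    printf("new step: addr 10 -> %s\n",
           step_cache_filter(10) ? "sent to memory" : "filtered");
    return 0;
}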
Scratchpads are addressable memory buffers that are used to store memory access data to keep the associativity of step caches limited in implementing multioperations and thread bunches with the help of step caches, and minimal on-core and off-core ALUs that take care of the actual intra-processor and inter-processor computation for multioperations [Forsell06a] (see Figure 1). Scratchpads are coupled with step caches into so-called scratchpad step cache units. A scratchpad step cache unit
consists of a Tp-line scratchpad, a Tp-line step cache, and a simple multioperation ALU for executing incoming concurrent references, multioperations and arbitrary ordered multiprefixes sequentially.

Figure 3. Active memory unit. (A fast SRAM bank or a register file, a multiplexer, and an ALU processing incoming data, operation and address fields and producing replies.)
Multioperations can be implemented as two consecutive single-step operations. During the first step, a starting operation (BMPxx for arbitrary ordered multiprefix operations) executes a processor-wise multioperation against a step cache location without making any reference to the external memory system (see Figure 2). During the second step, an ending operation (EMPxx for arbitrary ordered multiprefix operations) performs the rest of the multioperation so that the first reference to a previously initialized memory location triggers an external memory reference using the processor-wise multioperation result as an operand. The external memory references that are targeted to the same location are processed in the active memory units of the corresponding memory module according to the type of the multioperation. An active memory unit consists of a simple ALU and a fetcher (see Figure 3). In the case of arbitrary ordered multiprefixes the reply data is sent back to the scratchpads of the participating processors. The consecutive references are completed by applying the reply data against the step-cached reply data.
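The division of work between the BMPxx and EMPxx steps can be illustrated with the following C sketch (a functional analogue under simplifying assumptions: P and TP are small constants, the active memory unit is a plain variable, and the non-deterministic arrival order at the memory is emulated by a fixed permutation). Each processor first forms its thread-wise prefixes, then one reference per processor is combined at the memory in arrival order, which is exactly why the result is a multiprefix of arbitrary order.

#include <stdio.h>

#define P  4   /* processors (assumption)            */
#define TP 3   /* threads per processor (assumption) */

int main(void)
{
    int contrib[P][TP], local_prefix[P][TP], local_sum[P], offset[P];
    int memory = 0;                     /* target memory location              */
    int arrival[P] = { 2, 0, 3, 1 };    /* emulated non-deterministic arrival  */

    for (int p = 0; p < P; p++)         /* each thread contributes 1           */
        for (int t = 0; t < TP; t++)
            contrib[p][t] = 1;

    /* Step 1 (BMPADD): processor-wise multiprefix in the step cache/scratchpad */
    for (int p = 0; p < P; p++) {
        int sum = 0;
        for (int t = 0; t < TP; t++) {
            local_prefix[p][t] = sum;   /* exclusive prefix within processor   */
            sum += contrib[p][t];
        }
        local_sum[p] = sum;
    }

    /* Step 2 (EMPADD): one external reference per processor, processed at the
     * active memory unit in arrival order - this is where ordering is lost.   */
    for (int i = 0; i < P; i++) {
        int p = arrival[i];
        offset[p] = memory;             /* old value returned as the reply     */
        memory   += local_sum[p];
    }

    /* EMPADD continued: spread the processor-wise offset to the threads.      */
    for (int p = 0; p < P; p++)
        for (int t = 0; t < TP; t++)
            printf("proc %d thread %d -> prefix %d\n",
                   p, t, offset[p] + local_prefix[p][t]);
    printf("final memory value %d\n", memory);
    return 0;
}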

Figure 4. Multiprefix using the proposed three-level multiprefix array technique that preserves the ordering of multiprefix operations. (1. Determine intra-processor multiprefixes with the BMPxx instruction. 2. Send processor-wise results to the modules, where they are stored to multiprefix arrays for in-order processing; the first reference of the SMPxx instruction triggers an external memory reference, and ordering is preserved here by storing the references, which arrive in non-deterministic order, according to the IDs of the sending processors. 3. Compute in-order inter-processor multiprefixes, one result per processor only, on the multiprefix arrays of the active memory units and send the results back to the scratchpad step cache units, from which they are spread back to the threads and the final in-order multiprefixes are computed with the OMPxx instruction; the processor-wise offset is computed to the thread-wise results, and threads that have already used their execution slots are updated at the end of the memory reply pipeline segment.)

Figure 5. Active memory unit for the proposed in-order multiprefix solution. (In addition to the baseline fast memory bank, multiplexer and ALU, it contains registers MD and CD, a second ALU, control logic, and two single-ported multiprefix arrays: port 1 stores incoming multiprefix messages to the array being loaded, while port 2 loads previously stored messages from the array being processed for in-order processing.)

3. Ordered multioperations
The fastest known implementation of concurrent memory access and arbitrary ordered multiprefixes, called here the baseline solution, fails to preserve the ordering of the participating threads in multiprefixes and therefore introduces a logarithmic slowdown for ordered multiprefixes [Forsell06a]. This is because memory references arrive at the destination module in an order that depends on the distance of the target memory module from the source processors, the traffic situation, and the relative timing of the threads with respect to each other. In order to retain the high performance provided by this baseline solution in concurrent memory access and to restore the ordering of references participating in multiprefix operations for an arbitrary number of threads, we propose using a three-step algorithm, named here the multiprefix array technique, operating identically to the baseline solution during step one, adding an array to the active memory unit for storing and ordering references according to their source processor IDs during step two, and control logic for processing the references in order during step three (see Figure 4).
To allow for overlapped processing of this kind of dual-instruction solution, an additional array and storage for keeping the target memory value are needed so that one is dedicated to processing (processing array, cumulative data register) while the other is filled with references (loading array, data register). Thus, a modified active memory unit consists of an ALU and a multiplexer like the baseline active memory unit but adds two more multiplexers, control logic, another ALU, two registers, and two multiprefix arrays (see Figure 5).

#include "e+.h"
// Baseline version, T=O(log N) in e
#define size32768
int source_[size];
int main()
{
int i;
for_ (i=1, i<_number_of_threads, i<<=1,
if ( _thread_id-i>=0 )
source_[_thread_id] += source_[_thread_id-i];
);
}
; Proposed version in MBTAC assembler
; R1 = address of _thread_id, _thread_id, _thread_id<<2
; R2 = Prefix intermediate result, prefix final result
; R3 = address of _source_
OP0 _sum_
OP1
1
BMPADD0
OP0 _sum_
SMPADD0
OP0 _sum_
OP1
__thread_id ADD0
OP0 _source_ LD0
R1
WB3
SHL0 R1,O0
WB1
OP0 2
ST0
R2,A0
ADD0 R1,R3

O1,O0
R2,O0
O1,R32
O0
A0

#include "e+.h"
// Proposed version, T=O(1) in e
#define
size 32768
int sum_=0;
int source_[size];
int main()
{
int p;
prefix(p,MPADD,&sum_,1);
source_[_thread_id]=p;
}

WB2

M0

OMPADD0
WB1

R2,O0
M0

WB1

A0

WB2

M0

; Step 1
; Step 2
; Step 3

Figure 6. Ordered multiprefix add as e (for the baseline and proposed solutions) and assembler programs (for the proposed solution).

The modified active memory unit works like the baseline unit for loads and stores. For multiprefixes it stores the incoming references, which arrive in a semi-random order, to the loading multiprefix array addressed by the referencing processor ID, sets the corresponding in-use bit of the array element, and fetches the value of the target memory location to register MD (SMPxx instructions). During the next step, the control logic first switches the arrays so that the current processing array becomes the loading array and vice versa, clears the in-use bits of the new loading array in parallel, advances the memory data from register MD to register CD, processes the now ordered references against this data, skipping the unused elements and sending the old values of register CD back to the issuing processors running OMPxx instructions, stores the results back to register CD, and finally stores the final result back to the memory.
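The behaviour described above can be sketched in C as follows (a software analogue written for illustration only; PROCS, the array sizes and the message format are assumptions): SMPxx-style messages are dropped into the loading array under their source processor ID, the arrays are then swapped, and the control logic walks the processing array in increasing processor-ID order against register CD, so the replies come out in processor order regardless of the arrival order.

#include <stdio.h>

#define PROCS 4   /* number of processors / array elements (assumption) */

typedef struct {
    int value;    /* processor-wise multiprefix result carried by the message */
    int in_use;   /* in-use bit of the array element                          */
} slot_t;

static slot_t arrays[2][PROCS];   /* double-buffered multiprefix arrays        */
static int    loading = 0;        /* index of the array currently being loaded */
static int    MD, CD;             /* memory data and cumulative data registers */
static int    memory = 100;       /* target memory location                    */

/* SMPxx arrival: store the message under its source processor ID and fetch
 * the target memory value to MD (ordering of the calls does not matter).      */
static void smp_arrive(int proc_id, int value)
{
    arrays[loading][proc_id].value  = value;
    arrays[loading][proc_id].in_use = 1;
    MD = memory;
}

/* Next step: swap the arrays and process the stored references in ID order.   */
static void process_in_order(void)
{
    int processing = loading;
    loading = 1 - loading;                       /* switch the arrays           */
    for (int p = 0; p < PROCS; p++)              /* clear new loading array     */
        arrays[loading][p].in_use = 0;
    CD = MD;                                     /* advance MD to CD            */
    for (int p = 0; p < PROCS; p++) {            /* now ordered by processor ID */
        if (!arrays[processing][p].in_use)
            continue;                            /* skip unused elements        */
        printf("reply to processor %d (OMPxx): %d\n", p, CD);
        CD += arrays[processing][p].value;
    }
    memory = CD;                                 /* final result back to memory */
}

int main(void)
{
    smp_arrive(2, 30);    /* references arrive in a semi-random order */
    smp_arrive(0, 10);
    smp_arrive(3, 40);
    process_in_order();   /* replies nevertheless go out in ID order  */
    printf("memory location now %d\n", memory);
    return 0;
}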
The proposed multiprefix array solution allows for a single multiprefix computation per memory module (or processor) per step, but two overlapping multiprefixes can share the same module: if they are targeted to the very same memory location, the control logic stores the final value also to register MD. While this may sound like quite a limited number of concurrent prefixes, it should be noted that the best speedups are gained in cases in which as many threads as possible are participating in a multiprefix, the extreme case reducing to a single multiprefix in which all the threads are participating.
This kind of active memory unit can be used to realize the full multiprefix concurrent read concurrent write (MCRCW) PRAM model. A potential problem with multioperations in the baseline solution is that the active memory units in practice need both read and write accesses to the memory location for each participating thread. This can easily lead to limiting the number of memory locations available for multioperations, adding fast caches to memory modules, or doubling the speed of the memory. The proposed solution eliminates these problems for ordered multiprefixes by reading the memory only once per incoming reference and storing the result only once at the end of the multiprefix computation. Also, the arrays are single-ported, decreasing the silicon area needed for them and making the power consumption potentially modest.
From the programming point of view, using the new fast ordered multiprefix operations of an MCRCW-enabled ESM is simple: a programmer just needs to apply MCRCW-aware primitives to get up to a logarithmic performance boost. Figure 6 shows a logarithmic prefix algorithm for the baseline solution and a constant-time prefix employing the proposed solution for full MCRCW. It shows the baseline and MCRCW versions of the prefix benchmark as e-programs and the MCRCW version as a program in our CMP framework assembler.

4. Evaluation
In order to evaluate the performance and estimate the silicon area and power consumption of the proposed multiprefix array
technique, we applied it to the ECLIPSE CMP framework being
developed at VTT [Forsell02, Forsell10].

4.1 Preliminary performance simulations


We measured the performance of the multiprefix solutions by
simulating the execution of five prefix problems that can be used as primitives of parallel computing (see Table 1) on six CMP configurations having 4 to 64 512-threaded MBTAC processor cores (see Table 2).
The benchmark programs were compiled with the e-compiler, ec, with the O2 and ilp optimizations on. The resulting programs were simulated with the IPSMSim simulator, modified for the proposed solution. The results of the simulations are shown in
Figure 7.
In order to roughly compare the proposed solution to the alternative solutions, we determined the number of steps needed for exclusive and concurrent memory accesses, associative multioperations and both arbitrary ordered and fully ordered multiprefixes in the baseline solution, the proposed solution, and a combining solution that guarantees ordering but always introduces a sorting phase (see Table 3).

4.2 Silicon area and power consumption


We estimated the silicon area, power consumption, and maximum clock frequency of unoptimized ESM CMPs with 4, 16 and
64 processors using the recently proposed performance-area-power model of ESM CMPs [Forsell08], assuming high-performance 65 nm silicon technology, minimum global wiring pitch for interconnects, and 1 MB of data SRAM and 8 kiloinstructions of totally uncompressed program SRAM per processor. The model features over 100 parameters and determines the number of gates

----------------------------------------------------------------------------------------------------------
Benchmark   Tbase   Pbase   Wbase     Tprop   Pprop   Wprop   N   Description
----------------------------------------------------------------------------------------------------------
aprefix     1       N       N         N       1       1       T   Arbitrary ordered multiprefix of a table of N integers
prefix-x    log N   N       N log N   N       1       1       T   Ordered multiprefix of a table of N/x integers (x=1, 4, 16, 64 concurrent multiprefix computations)
----------------------------------------------------------------------------------------------------------
Table 1. Benchmarks used in evaluation.

-----------------------------------------------------------------------------------------------
Configuration                            E4        E16       E64       C4        C16       C64
-----------------------------------------------------------------------------------------------
Multiprefix processing machinery         baseline  baseline  baseline  proposed  proposed  proposed
Number of processors              P      4         16        64        4         16        64
Number of functional units        F      4         4         4         4         4         4
Number of threads per processor   Tp     512       512       512       512       512       512
Total number of threads (tests)   T      2k        8k        32k       2k        8k        32k
Number of switches                S      4         16        256       4         16        256
Size of data memory (MB)          Sm     4         16        64        4         16        64
-----------------------------------------------------------------------------------------------
Table 2. CMP configurations used in the evaluation.

p
prefix-Q
refix-Q

p
prefix-P
refix-P

aprefix
aprefix

35,0
30,0
25,0
20,0
15,0
10,0
5,0
0,0

p
prefix-1
refix-1

p
prefix-Q
refix-Q

p
prefix-P
refix-P

C64

C4
C16
C64
aprefix
a
prefix

p
prefix-1
refix-1

p
prefix-Q
refix-Q

p
prefix-P
refix-P

Figure 7. Execution time (top left), overhead with respect to an ideal machine with the same instruction set (top right), and achieved speedup for the aprefix, prefix-1, prefix-Q and prefix-P benchmarks on the E4-E64 and C4-C64 configurations.
---------------------------------------------------------------------------------------------------------
Operation                                     Combining solution  Baseline solution  Multiprefix arrays
                                              [Ranade91]          [Forsell06a]       <This paper>
---------------------------------------------------------------------------------------------------------
Exclusive load                                2-3                 1                  1
Exclusive store                               2-3                 1                  1
Concurrent load                               2-3                 1                  1
Concurrent store                              2-3                 1                  1
Multioperation (Tpt ≤ T^0.5)                  2-3                 1                  1
Arbitrary ordered multiprefix (Tpt ≤ T^0.5)   2-3                 1                  1
Ordered multiprefix (Tpt ≤ T^0.5)             2-3                 O(log Tpt)         2 if Ncmp ≤ P, O(log Tpt) if Ncmp > P
Multioperation (Tpt > T^0.5)                  2-3                 2                  2
Arbitrary ordered multiprefix (Tpt > T^0.5)   2-3                 2                  2
Ordered multiprefix (Tpt > T^0.5)             2-3                 O(log Tpt)         3 if Ncmp ≤ P, O(log Tpt) if Ncmp > P
---------------------------------------------------------------------------------------------------------
Table 3. Execution time of memory-related operations in the evaluated solutions in steps. Multioperations are associative cumulative operations like multiprefixes, but no return values are sent back to the processors; only the content of the target memory location is altered. (Ncmp = number of concurrent multiprefixes, Tpt = number of participating threads.)

by summing the gate counts of the elements together, determines the area by multiplying the gate counts in the different categories with the typical gate size in each category, assuming a typical overhead, and adding the area occupied by the interconnect wiring. The clock cycle duration is predicted based on a wire delay estimate obtained using a parasitic capacitance model of parallel wires and the length of interconnect wiring calculated from the dimensions of a processor storage module. The power consumption is determined in the same manner as the silicon area by employing typical dynamic and static power consumption per gate, taking the predicted clock cycle into account and adding the power consumption of the interconnection network wiring. The results given by the model are shown in Figure 8.

4.3 Discussion
As expected CMPs using the proposed multiprefix array solution executed ordered prefix programs much faster than the
baseline CMPs if the number of concurrent multiprefixes does
not exceed the number of memory modules (processors). The

Figure 8. Silicon area (left) and power consumption (right) estimates, broken down into processor, memory and communication network components, for non-optimized implementations of the CMPs using high-performance 65 nm silicon technology with minimum global wiring pitch for interconnects, implying a clock frequency of 1.29 GHz, and assuming 1 MB of data SRAM and 8 kiloinstructions of program SRAM per processor for determining the clock frequency.

The individual speedups ranged from 16.8 to 31.0, while the average speedups were 19.0, 20.0 and 22.8 for C4, C16 and C64, respectively. The results are better than the rough logarithmic speedup predictions of 11, 13 and 15 for C4, C16 and C64, respectively, which ignore the effect of constant factors and the fact that not every part of the programs is necessarily sped up. The fact that the overheads of multiprefixes were very small with respect to an ideal PRAM with a similar instruction set confirms that the efficiency of the proposed technique is high. Finally, the silicon area and power figures as well as the maximum clock frequency of an unoptimized design, 220 mm2 and 250 W at 1.29 GHz for a 16-core CMP, do not differ radically from those of typical commercial CMPs with a similar number of registers, while the area and power overheads of the proposed MCRCW implementation with respect to the baseline are less than 1.2%.

5. Conclusions
We have described an architectural technique supporting a limited number of concurrent ordered multiprefix operations in concurrent memory access-aware ESM CMPs. The solution is based on adding special multiprefix arrays to active memory units. According to our evaluations, the technique indeed provides high speedups with respect to the baseline CMPs while keeping the silicon area and power consumption overheads very low. The measured average speedups ranged from 19.0 on a 4-processor chip to 22.8 on a 64-processor chip. While the proposed technique supports up to P ordered simultaneous multiprefix operations, the baseline solution supports P or more simultaneous prefixes only as long as the threads participating in a single multiprefix belong to the same processor.
Our future work includes investigating whether more than one concurrent ordered multiprefix per memory module could be issued, whether the relatively high buffering requirements could be decreased, and whether there exists a way to make concurrent access even faster than in the proposed solution. Finally, we aim to investigate memory module level caching solutions to make all memory locations equal with respect to multioperations.

Acknowledgements
This work was supported by the REPLICA frontier research
project of VTT and grant 128733 of the Academy of Finland.

References
[Beck97] A. Beck, High Throughput Computing: An Interview with
Miron Livny, 1997. HPCWire.

[Dietzfelbinger94] M. Dietzfelbinger et. al.: Dynamic Perfect Hashing:


Upper and Lower Bounds, SIAM Journal on Computing, Vol. 23, No. 4
1994, pp. 738-761.
[Forsell94] M. Forsell, Are Multiport Memories Physically Feasible?,
Computer Architecture News 22, 4 (September 1994), 47-54.
[Forsell97] M. Forsell, MTACA Multithreaded VLIW Architecture for
PRAM Simulation, Journal of Universal Computer Science 3, 9 (1997),
1037-1055.
[Forsell02] M. Forsell, A Scalable High-Performance Computing
Solution for Network on Chips, IEEE Micro 22, 5 (September-October
2002), 46-55.
[Forsell05] M. Forsell, Step Cachesa Novel Approach to Concurrent
Memory Access on Shared Memory MP-SOCs, Proc. 23th IEEE
NORCHIP, November 21-22, 2005, Oulu, Finland, 74-77.
[Forsell06a] M. Forsell, Realizing Multioperations for Step Cached MPSOCs, Proc. SOC06, November 14-16, 2006, Tampere, Finland.
[Forsell06b] M. Forsell, Reducing the associativity and size of step
caches in CRCW operation, In the Proceeding of 8th Workshop on
Advances in Parallel and Distributed Computational Models (in conjunction with the 20th IEEE International Parallel and Distributed
Processing Symposium, IPDPS 06), April 25, 2006, Rhodes, Greece.
[Forsell08] M. Forsell and J. Roivainen, Performance, Area and Power
Trade-Offs in Mesh-Based Emulated Shared Memory CMP
Architectures, In the Proceedings of the 2008 International Conference
on Parallel and Distributed Processing Techniques and Applications
(PDPTA08), July 14-17, 2008, Las Vegas, USA, 471-477.
[Forsell10] M. Forsell, TOTAL ECLIPSEAn Efficient Architectural
Realization of the Parallel Random Access Machine, In Parallel and
Distributed Computing Edited by Alberto Ros, IN-TECH, Vienna, 2010,
39-64. (ISBN 978-953-307-057-5)
[ITRS09] International Technology Roadmap for Semiconductors,
Semiconductor Industry Association, 2009; http://public.itrs.net/.
[Jaja92] J. Jaja: Introduction to Parallel Algorithms, Addison-Wesley,
Reading, 1992.
[Leppnen96] V. Leppnen, Studies on the realization of PRAM,
Dissertation 3, Turku Centre for Computer Science, University of Turku,
Turku, 1996.
[Leppnen98] V. Leppnen, Balanced PRAM Simulations via Moving
Threads and Hashing, Journal of Universal Computer Science 4, 8
(1998), 675-689.
[Pamunuwa03] D. Pamunuwa, L-R. Zheng and H. Tenhunen,
Maximizing Throughput Over Parallel Wire Structures in the Deep
Submicrometer Regime, IEEE Transactions on Very Large Scale
Integration (VLSI) Systems 11, 2 (April 2003), 224-243.
[Ranade91] A. Ranade, How to Emulate Shared Memory, Journal of
Computer and System Sciences 42, (1991), 307--326.
[Schwarz80] J. T. Schwarz, Ultracomputers, ACM Transactions on
Programming Languages and Systems 2, 4 (1980), 484-521.
[Valiant90] L. G. Valiant, A Bridging Model for Parallel Computation,
Communications of the ACM 33, 8 (1990), 103-111.
[Valiant90] L. G. Valiant, A Bridging Model for Parallel Computation,
Communications of the ACM 33, 8 (1990), 103-111.
[Vishkin11] U. Vishkin, Using Simple Abstraction to Reinvent
Computing for Parallelism, Communications of the ACM 54, 1 (January
2011), 75-85.
