1. Introduction
Shared memory emulation [Ranade91] is a promising technique for addressing the programmability and performance scalability concerns of chip multiprocessors (CMP), because it provides implied synchrony in the execution of machine instructions, an efficient latency-hiding technique, and enough effective bandwidth to route all memory references even under the heaviest random and concurrent access workloads. Synchronous execution is considered to make programming easier because a programmer does not need to synchronize threads of execution explicitly after each global memory access but can rely on the hardware to take care of that automatically. The latency hiding used in shared memory emulation builds on the high-throughput computing scheme [Beck97], in which other threads are executed while a thread refers to the global shared memory. Since the throughput computing scheme exploits the parallel slackness extracted from the available thread-level parallelism, it is considered to provide remarkably better scalability than traditional symmetric multiprocessors and non-uniform memory access systems, which rely on snooping or directory-based cache coherence mechanisms and therefore suffer from limited bandwidth or directory access delays and heavy coherence traffic.
We have proposed an architectural solution to support concurrent memory access and multioperations on emulated shared memory CMPs with the help of active memory units attached to
memory modules [Forsell06a]. While the solution indeed provides faster memory access than other known solutions, with minor silicon area and power consumption overheads, the results of multiprefixes are unfortunately not in order, forcing one to use a relatively slow logarithmic multiprefix algorithm for ordered multiprefixes. In this paper we propose an architectural technique for supporting a limited number of concurrent ordered multiprefix operations in emulated shared memory CMPs. The solution is based on adding special multiprefix arrays to active memory units. Performance, silicon area, and power consumption evaluations are given.
Figure 1. Emulated shared memory system supporting concurrent memory access, multioperations and arbitrarily ordered multiprefixes.
2.1 Principles

A typical scalable architecture for emulating shared memory on a silicon platform consists of a set of processor cores connected to a distributed shared memory via a physically scalable high-bandwidth interconnection network (see Figure 1). The main idea is to provide each processor core with a set of threads that are executed efficiently in an interleaved manner to hide the latency of the network. As a thread makes a memory reference, the core switches to another thread, which can issue its own memory request, and so on. No memory delay occurs provided that the reply to a thread's memory reference arrives at the processor core before the thread is put back into execution [Ranade91]. This requires that the bandwidth of the network is high enough and that hot spots can be avoided in the pipelined memory access traffic. Synchronicity between consecutive instructions can be guaranteed by using an elastic synchronization wave between the steps [Leppänen96].
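To illustrate the principle, here is a minimal C sketch (ours, not from the paper; the thread count and latency values are illustrative) of a core that switches threads in round-robin order every cycle: as long as the number of threads per core is at least the memory round-trip latency, each thread's reply has arrived by the time the thread is scheduled again, so no cycle is wasted.

#include <stdio.h>

#define THREADS 8   /* threads per core (parallel slackness), illustrative */
#define LATENCY 6   /* memory round-trip latency in cycles, illustrative   */
#define CYCLES  64  /* length of the simulation                            */

int main(void)
{
    int reply_at[THREADS];           /* cycle when each thread's reply arrives */
    for (int t = 0; t < THREADS; t++) reply_at[t] = -1;

    int stalls = 0;
    for (int cycle = 0; cycle < CYCLES; cycle++) {
        int t = cycle % THREADS;     /* round-robin interleaving */
        if (reply_at[t] > cycle) {
            stalls++;                /* reply not back yet: the core must wait */
        } else {
            /* execute one instruction that issues a new memory reference */
            reply_at[t] = cycle + LATENCY;
        }
    }
    /* With THREADS >= LATENCY the stall count stays at zero. */
    printf("stalls: %d\n", stalls);
    return 0;
}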
Figure 2. Multiprefix using the two-level approach that does not preserve the ordering of multiprefix operations. (In the figure: intra-processor multiprefixes are first determined with BMPxx instructions in the scratchpad step cache units; the first EMPxx reference triggers an external memory reference, and ordering is lost there since the memory references arrive at the active memory unit in non-deterministic order.)
Figure 4. Multiprefix using the proposed three-level multiprefix array technique that preserves the ordering of multiprefix operations. (In the figure: intra-processor multiprefixes are first determined with BMPxx instructions in the scratchpad step cache units; the first SMPxx reference triggers an external memory reference, and ordering is preserved by storing the references, which arrive in non-deterministic order, according to the IDs of the sending processors; OMPxx instructions then compute the processor-wise offsets into the thread-wise results, and threads that have already used their execution slots are updated at the end of the memory reply pipeline segment.)
Figure 5. Active memory unit for the proposed in-order multiprefix solution, with two multiprefix arrays: port 1 of the array being loaded stores incoming multiprefix messages for in-order processing, while port 2 of the array being processed loads the previously stored messages.
3. Ordered multioperations

The fastest known implementation of concurrent memory access and arbitrarily ordered multiprefixes, called here the baseline solution, fails to preserve the ordering of the threads participating in multiprefixes and therefore introduces a logarithmic slowdown for ordered multiprefixes [Forsell06a]. This is because memory references arrive at the destination module in an order that depends on the distance of the target memory module from the source processors, the traffic situation, and the relative timing of the threads with respect to each other. In order to retain the high performance of this baseline solution in concurrent memory access and to restore the ordering of the references participating in multiprefix operations for an arbitrary number of threads, we propose using a three-step algorithm, named here the multiprefix array algorithm (see Figures 4 and 6).
#include "e+.h"
// Baseline version, T=O(log N) in e
#define size 32768
int source_[size];
int main()
{
  int i;
  for_ (i=1, i<_number_of_threads, i<<=1,
    if ( _thread_id-i>=0 )
      source_[_thread_id] += source_[_thread_id-i];
  );
}

#include "e+.h"
// Proposed version, T=O(1) in e
#define size 32768
int sum_=0;
int source_[size];
int main()
{
  int p;
  prefix(p,MPADD,&sum_,1);
  source_[_thread_id]=p;
}

; Proposed version in MBTAC assembler
; R1 = address of _thread_id, _thread_id, _thread_id<<2
; R2 = prefix intermediate result, prefix final result
; R3 = address of _source_
; Step 1: BMPADD0 determines the intra-processor multiprefix of _sum_
; Step 2: SMPADD0 sends the processor-wise sum to the active memory unit
; Step 3: OMPADD0 receives the processor-wise offset, and the final
;         prefix in R2 is stored to _source_[_thread_id]

Figure 6. Ordered multiprefix add as e programs (for the baseline and proposed solutions) and as an assembler program (for the proposed solution).
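To clarify the three steps of Figures 4 and 6, the following sequential C sketch models their net effect (our illustration of the data flow, not the hardware; P, TP, and the function name are ours): threads are grouped by processor, each group first forms its intra-processor prefix, the processor-wise sums are then combined in processor-ID order at the memory module, and finally each group adds its processor-wise offset to its thread-wise intermediate results.

#include <stdio.h>

#define P  4                       /* processors, illustrative            */
#define TP 3                       /* threads per processor, illustrative */

/* Ordered multiprefix-add of in[] into out[]; returns the total, which
 * the active memory unit would store back to the target location.       */
static int ordered_multiprefix_add(int base, int in[P][TP], int out[P][TP])
{
    int psum[P];                             /* processor-wise sums        */

    /* Step 1 (BMPADD): intra-processor prefix in the step cache units.   */
    for (int p = 0; p < P; p++) {
        int acc = 0;
        for (int t = 0; t < TP; t++) {
            out[p][t] = acc;                 /* thread-wise intermediate   */
            acc += in[p][t];
        }
        psum[p] = acc;
    }

    /* Step 2 (SMPADD): the per-processor sums are stored into the
     * multiprefix array slot given by the processor ID, restoring the
     * ordering lost in the network.
     * Step 3 (OMPADD): the array is scanned in processor-ID order against
     * the target location, returning each processor its ordered offset.  */
    int acc = base;
    for (int p = 0; p < P; p++) {
        int offset = acc;
        acc += psum[p];
        for (int t = 0; t < TP; t++)
            out[p][t] += offset;             /* add processor-wise offset  */
    }
    return acc;                              /* final value of the location */
}

int main(void)
{
    int in[P][TP], out[P][TP];
    for (int p = 0; p < P; p++)
        for (int t = 0; t < TP; t++) in[p][t] = 1;  /* prefix of all ones */

    int total = ordered_multiprefix_add(0, in, out);
    for (int p = 0; p < P; p++)
        for (int t = 0; t < TP; t++)
            printf("thread %d: %d\n", p * TP + t, out[p][t]);
    printf("total: %d\n", total);            /* thread k gets rank k       */
    return 0;
}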
The modified active memory unit works like the baseline unit for loads and stores. For multiprefixes, it stores the incoming references, which arrive in a semi-random order, into the loading multiprefix array at the index given by the referencing processor's ID, sets the corresponding in-use bit of the array element, and fetches the value of the target memory location into register MD (SMPxx instructions). During the next step, the control logic first switches the arrays so that the current processing array becomes the loading array and vice versa, clears the in-use bits of the new loading array in parallel, advances the memory data from register MD to register CD, processes the now ordered references against this data, skipping the unused elements and sending the old values of register CD back to the issuing processors running OMPxx instructions, stores the results back to register CD, and finally stores the final result back to the memory.
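The following C sketch models that double-buffering cycle (our illustration; the array size, names, and the printf stand-ins for reply messages are ours): SMPxx arrivals are stored by processor ID into the loading array, and at the step boundary the arrays swap roles and the completed array is scanned strictly in processor-ID order against the value held in CD, skipping elements whose in-use bit is clear.

#include <stdbool.h>
#include <stdio.h>

#define P 4                         /* processors / array elements, illustrative */

typedef struct {
    int  value[P];                  /* stored multiprefix operands  */
    bool in_use[P];                 /* in-use bit per array element */
} mp_array;

static mp_array arrays[2];          /* Array 1 and Array 2 of Figure 5  */
static int load_idx = 0;            /* which array is currently loading */

/* Step N: an SMPxx reference arrives in semi-random order; store it
 * by the referencing processor's ID into the loading array.          */
static void smp_arrive(int proc_id, int operand)
{
    arrays[load_idx].value[proc_id]  = operand;
    arrays[load_idx].in_use[proc_id] = true;
}

/* Step N+1: swap the arrays, then process the completed one in
 * processor-ID order against the memory value, sending each
 * processor the old value of CD (its ordered offset).                */
static int step_boundary(int md /* value fetched from the target location */)
{
    load_idx ^= 1;                              /* processing <-> loading */
    mp_array *proc = &arrays[load_idx ^ 1];     /* the array just filled  */
    int cd = md;                                /* register CD            */
    for (int p = 0; p < P; p++) {
        if (!proc->in_use[p]) continue;         /* skip unused elements   */
        printf("processor %d gets offset %d\n", p, cd); /* OMPxx reply    */
        cd += proc->value[p];
        proc->in_use[p] = false;    /* cleared so it can load next step   */
    }
    return cd;                                  /* stored back to memory  */
}

int main(void)
{
    /* references arrive in non-deterministic order...                   */
    smp_arrive(2, 30); smp_arrive(0, 10); smp_arrive(3, 40); smp_arrive(1, 20);
    /* ...but are processed in processor-ID order at the step boundary.  */
    printf("final value: %d\n", step_boundary(0));  /* prints 100        */
    return 0;
}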
The proposed multiprefix array solution allows for a single multiprefix computation per memory module (or processor) per step, but two overlapping multiprefixes can share the same module: if they target the very same memory location, the control logic stores the final value also to register MD. While this may seem a rather limited number of concurrent prefixes, it should be noted that the best speedups are gained when as many threads as possible participate in a multiprefix, the extreme case reducing to a single multiprefix in which all the threads participate.
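A small C model of the programmer-visible semantics (ours; the grouping by thread ID modulo X is merely an illustrative way to form X concurrent multiprefixes): X ordered multiprefix-adds can proceed in the same step as long as their X target locations reside in different memory modules, and each participating thread receives the old value of its target location, i.e. its rank within its group.

#include <stdio.h>

#define T 16                        /* threads, illustrative              */
#define X 4                         /* concurrent multiprefixes (modules) */

int main(void)
{
    int rank[T];                    /* result each thread would receive   */
    int sums[X] = {0};              /* one target location per module     */

    /* Each thread joins the multiprefix of group (thread_id % X); with
     * the X target locations in different memory modules, the X ordered
     * multiprefixes proceed concurrently, one per module per step.       */
    for (int t = 0; t < T; t++) {   /* thread order = preserved ordering  */
        int g = t % X;
        rank[t] = sums[g];          /* old value of the location = rank   */
        sums[g] += 1;               /* MPADD of 1                         */
    }

    for (int t = 0; t < T; t++)
        printf("thread %2d -> group %d, rank %d\n", t, t % X, rank[t]);
    return 0;
}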
Active memory units of this kind can be used to realize the full multiprefix concurrent read concurrent write (MCRCW) PRAM model. A potential problem with multioperations in the baseline solution is that the active memory units in practice need both read and write accesses to the memory locations for each participating thread. This can easily lead to limiting the number of memory locations available for multioperations, adding fast caches to the memory modules, or doubling the speed of the memory. The proposed solution eliminates these problems for ordered multiprefixes by reading the memory only once per incoming reference and storing the result only once at the end of the multiprefix computation. In addition, the arrays are single-ported, decreasing the silicon area needed for them and keeping their power consumption potentially modest.
From the programming point of view, using the new fast ordered multiprefix operations of an MCRCW-enabled ESM is simple: a programmer needs only to apply MCRCW-aware primitives to get up to a logarithmic performance boost. Figure 6 shows an ordered multiprefix add written in e for the baseline and proposed solutions, along with the assembler program of the proposed solution.
4. Evaluation

In order to evaluate the performance and to estimate the silicon area and power consumption of the proposed multiprefix array technique, we applied it to the ECLIPSE CMP framework being developed at VTT [Forsell02, Forsell10].
Benchmark  Tbase   Pbase  Wbase    Tprop  Pprop  Wprop  N  Description
aprefix    1       N      N        1      N      N      T  Arbitrary ordered multiprefix of a table of N integers
prefix-x   log N   N      N log N  1      N      N      T  Ordered multiprefix of a table of N/x integers (x = 1, 4, 16, 64 concurrent multiprefix computations)

Table 1. Benchmarks used in the evaluation.
Configuration                         E4        E16       E64       C4        C16       C64
Multiprefix processing machinery      baseline  baseline  baseline  proposed  proposed  proposed
Number of processors P                4         16        64        4         16        64
Number of functional units F          4         4         4         4         4         4
Number of threads per processor Tp    512       512       512       512       512       512
Total number of threads (in tests) T  2k        8k        32k       2k        8k        32k
Number of switches S                  4         16        256       4         16        256
Size of data memory (MB) Sm           4         16        64        4         16        64

Table 2. CMP configurations used in the evaluation.
Figure 7. Execution time in clock cycles (top left), overhead with respect to an ideal machine with the same instruction set (top right), and achieved speedup (bottom) for the aprefix, prefix-1, prefix-Q, and prefix-P benchmarks on the evaluated configurations.
Operation                                     Combining solution  Baseline solution  Multiprefix arrays
Definition                                    [Ranade91]          [Forsell06a]       <this paper>
Exclusive load                                2-3                 1                  1
Exclusive store                               2-3                 1                  1
Concurrent load                               2-3                 1                  1
Concurrent store                              2-3                 1                  1
Multioperation (Tpt ≤ T^0.5)                  2-3                 1                  1
Arbitrary ordered multiprefix (Tpt ≤ T^0.5)   2-3                 1                  1
Ordered multiprefix (Tpt ≤ T^0.5)             2-3                 O(log Tpt)         2 if Ncmp ≤ P, O(log Tpt) if Ncmp > P
Multioperation (Tpt > T^0.5)                  2-3                 2                  2
Arbitrary ordered multiprefix (Tpt > T^0.5)   2-3                 2                  2
Ordered multiprefix (Tpt > T^0.5)             2-3                 O(log Tpt)         3 if Ncmp ≤ P, O(log Tpt) if Ncmp > P

Table 3. Execution time in steps of memory-related operations in the evaluated solutions. Multioperations are associative cumulative operations like multiprefixes, but no return values are sent back to the processors; only the content of the target memory location is altered. (Ncmp = number of concurrent multiprefixes, Tpt = number of participating threads)
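To put Table 3 in concrete terms (our arithmetic, using the C64 configuration of Table 2): with all T = 32k threads participating in a single ordered multiprefix,

    Tpt = T = 32768 > T^0.5 ≈ 181,

so the baseline needs on the order of log2 Tpt = 15 steps, whereas the proposed solution needs 3 steps (Ncmp = 1 ≤ P).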
per gate, taking the predicted clock cycle into account and
adding the power consumption of the interconnection network
wiring. The results given by the model are shown in Figure 8.
4.3 Discussion

As expected, CMPs using the proposed multiprefix array solution executed ordered prefix programs much faster than the baseline CMPs when the number of concurrent multiprefixes does not exceed the number of memory modules (processors).
Figure 8. Silicon area in mm^2 (left) and power consumption in W (right) estimates, broken down into processor, memory, and communication parts, for non-optimized implementations of the CMPs using high-performance 65 nm silicon technology with minimum global wiring pitch for the interconnects, implying a clock frequency of 1.29 GHz; 1 MB of data SRAM and 8 kiloinstructions of program SRAM per processor were assumed in determining the clock frequency.
5. Conclusions

We have described an architectural technique supporting a limited number of concurrent ordered multiprefix operations in concurrent memory access-aware ESM CMPs. The solution is based on adding special multiprefix arrays to active memory units. According to our evaluations, the technique indeed provides high speedups with respect to the baseline CMPs while keeping the silicon area and power consumption overheads very low. The measured average speedups ranged from 19.0 on a 4-processor chip to 22.8 on a 64-processor chip. While the proposed technique supports up to P simultaneous ordered multiprefix operations, the baseline solution supports P or more simultaneous prefixes as long as the threads participating in a single multiprefix belong to the same processor.
Our future work includes investigating whether more than one concurrent ordered multiprefix per memory module could be issued, whether the relatively high buffering requirements could be decreased, and whether there exists a way to make concurrent access even faster than in the proposed solution. Finally, we aim to investigate memory module level caching solutions to make all memory locations equal with respect to multioperations.
Acknowledgements
This work was supported by the REPLICA frontier research
project of VTT and grant 128733 of the Academy of Finland.
References

[Beck97] A. Beck, High Throughput Computing: An Interview with Miron Livny, HPCWire, 1997.