e-mail: mueller@informatik.hu-berlin.de
Abstract

This paper contributes a study on integrating communication middleware into a multi-threading environment. It addresses problems rooted in blocking and provides solutions to build asynchronous communication mechanisms on top of synchronous ones. The paper details different strategies for handling multi-threaded concurrency in both an event-signalling framework and a polling fashion, depending on the underlying message layer. It motivates the advantages of a single reception point for prioritized multi-threading and discusses the potential for zero-copy overhead on message reception. Implementation details are given for a case study with DSM-Threads, a distributed execution environment with per-node multi-threading, adapted to use Madeleine as an abstraction layer for communication and BIP on the lower level to utilize Myrinet. The study underlines strengths and weaknesses of the layered components and gives general guidelines for approaching similar efforts of communication abstraction.
1 Introduction

In recent years, multi-threading has received increased attention as a means to hide latencies and exploit shared-memory multi-processors (SMPs), which led to standardizations such as POSIX Threads (Pthreads) [10] and the incorporation of threading into main-stream languages such as Java. At the same time, network communication has seen remarkable improvements resulting in higher-bandwidth and lower-latency connections, e.g., via Myrinet, SCI and Gigabit Ethernet. These trends have changed the field of supercomputing. Applications formerly dedicated to supercomputing have been restructured to execute in unison within a cluster of dedicated workstations. The advent of cluster computing emphasizes the need to combine the trends of multi-threading and high-efficiency communication. In particular, processor speeds are still advancing faster than improvements in the networking area. Hence, communication is likely to remain a bottleneck in distributed computing. Multi-threading may hide latencies imposed by communication. In addition, asynchronous communication allows computation to progress without waiting for message reception.
This paper addresses these trends by describing different approaches to combine multi-threading, distributed execution and message-passing communication within existing frameworks. A case study details the efforts of adapting a distributed shared memory framework, DSM-Threads [7], to adhere to the communication abstraction provided by Madeleine [1] and BIP [9] for Myrinet. Several issues are raised, ranging from conflicts in control for distributed execution over different message reception paradigms to methods of integrating thread-safe message passing for asynchronous communication. Calls are considered thread-safe (or MT-safe: they block the calling thread, not the process).
DSM-Threads is a runtime system to support distributed threads with a distributed shared virtual memory. Applications that adhere to the Pthreads standard rely on shared memory and thus experience their best performance on SMPs. The programming model of these applications, however, contains inherent parallelism that is not only limited to SMPs but can readily be exploited on a distributed system with shared virtual memory, such as DSM-Threads. Hence, DSM-Threads supports an API which strongly resembles that of POSIX Threads. It supports scalability through a variety of synchronization and memory coherence protocols that strictly follow a decentralized approach, as opposed to a client-server paradigm where a server may present a bottleneck. Communication relies on message passing with point-to-point connections. It tolerates asynchronous communication on sends and allows out-of-order message passing. In addition, nodes in the distributed environment may themselves be multi-threaded.
Madeleine is a framework for message passing that supports TCP/IP, BIP, VIA and SCI. Besides its task to abstract from actual network interfaces, it also provides distributed execution and supports multi-threading over a non-standard thread package, Marcel. BIP is a fast interface to the Myrinet network architecture that supports various methods for communication.
2 Overview

This section gives an overview of the different software components, starting with drivers for networking on the lowest level, through various levels of middleware for communication abstraction and concurrent execution, up to a distributed runtime system. Figure 1 depicts the components for our sample study, but the approach can be generalized to arbitrary systems with the same design goals where different middleware components may be used.

Figure 1. Layered components: DSM-Threads on top of Madeleine over BIP/TCP, with Pthreads serving as a shallow binding for Marcel.

On the top level, DSM-Threads provides a distributed runtime system. DSM-Threads also happens to support distributed shared memory, but this is not a requirement for this study, i.e., we mostly abstract from the DSM features in the following. The distributed runtime requires an underlying environment for per-node multi-threading (Pthreads) and a message-passing framework. Since message passing has to be mapped onto a variety of network architectures, which are subject to constant changes and improvements in bandwidth and latency, it is imperative to choose a portable and extensible approach for supporting these network architectures. In this study, Madeleine was chosen as an intermediate layer since it supports a variety of network architectures and standards, such as Ethernet via TCP, Myrinet via BIP, SCI and VIA. Our study mostly emphasizes the usage of BIP and TCP in this context. Madeleine also provides a common interface to the upper layers, which enhances their portability.
Modifications within DSM-Threads included the handling of communication, which was mapped onto the Madeleine API, both on the sending and the receiving side. Table 1 depicts the mapping from a TCP-oriented interface to the corresponding Madeleine routines. Madeleine also supports incremental sends and receives of messages, which can be used for zero-copy overhead by directly specifying the origin and destination within memory for message sends and receptions, respectively. This aspect will be detailed later on. Madeleine is thread safe in the sense that access ...
Table 1. Mapping from a TCP-oriented interface to Madeleine routines.

TCP/IP-based     Madeleine
accept           mad_receive
read             mad_unpack_byte
close (read)     mad_recvbuf_receive
open             mad_sendbuf_init
write            mad_pack_byte
close (write)    mad_sendbuf_send
Figure 2. Distributed execution models of DSM-Threads: master and slave processes across nodes 1-3, each node running an initial thread T0 between dsm_init and dsm_exit together with a communication server (CS), and function threads Tfct started via dsm_thread_create/pthread_create.
... lower-level components. Hence, it was decided to adjust DSM-Threads to adhere to the distributed execution model of Madeleine and BIP. Finally, the number of nodes for distributed execution is statically regulated, i.e., remote processes are created at initiation time. No additional nodes can be added later on.
Figure 2(b) depicts the modified DSM-Threads model for distributed execution. It still provides the same API to the user, i.e., the user is given the option to distribute remote execution explicitly, or the runtime handles this task implicitly. However, specifications of target nodes are only treated as hints. At initiation time, the DSM runtime system instructs Madeleine (and thereby BIP) to create processes on a static set of nodes. During runtime, a user request for remote execution simply results in communication with a target node to create a new thread on this node within the existing process. Hence, it is no longer possible to spawn multiple processes per node. Instead, such user requests result in clustering of a set of threads within the dedicated single process on this node. There are no impacts on the semantics of the execution model since DSM-Threads assumes a shared-memory programming paradigm. But there may be lost opportunities to exploit SMPs on the OS level when a Pthreads implementation does not already support SMPs.
4 Communication

This section contrasts different approaches to realize message-based communication with regard to several constraints. The latency and bandwidth limitations of today's systems still pose a problem for contemporary frameworks that support distributed execution. Although considerable advances both in latency and throughput have been made, the network generally remains the bottleneck in distributed environments. Hiding latencies by multi-threading may help but cannot eliminate the problem. Another option to improve the situation is to improve the responsiveness of a system, i.e., to ensure that when an important message is received, it will be handled right away.
4.1 Direct and Indirect Communication
Figure: (a) direct reception of messages on distinct ports/channels, where threads T1 and T2 in processes P1 and P2 each receive on their own points of communication (POC); (b) indirect reception, where a communication server (CS) per process receives all messages into a temporary buffer for a page, either through a registered handler (registerHandler(ReceiveMessage), receive(buffer)) or by polling (test != 0), and a worker transfers the data to the actual page in memory.
Table: Mapping of Marcel routines onto Pthreads (shallow binding).

Marcel Routines          Pthreads
lock_task                pthread_mutex_lock
unlock_task              pthread_mutex_unlock
marcel_key_create        pthread_key_create
marcel_setspecific       pthread_setspecific
marcel_getspecific       pthread_getspecific
marcel_mutex_init        pthread_mutex_init
marcel_mutex_lock        pthread_mutex_lock
marcel_mutex_unlock      pthread_mutex_unlock
marcel_givehandback      sched_yield
tmalloc                  adopted from Marcel
tfree                    adopted from Marcel
marcel_select            select
marcel_sem_init          sem_init (uses lock/cond)
marcel_sem_P             sem_p (uses lock/cond)
marcel_sem_V             sem_v (uses lock/cond)

Data Types
marcel_mutex_t           pthread_mutex_t
marcel_mutexattr_t       pthread_mutexattr_t
marcel_key_t             pthread_key_t
Figure: REQUEST/ACK/DATA hand-shake between sender and receiver, variants (a) and (b): the sender thread T issues a REQUEST and waits on a condition variable (T: wait(c)); the communication servers (CS) on both sides exchange the ACK, the sender's CS signals the condition (CS: signal(c)), T continues, and the DATA transfer proceeds.
... but the server has to remember pending partial operations. Demultiplexing of messages to the correct thread is realized through separate condition variables for each thread. Hence, signaling a specific condition only awakes one selected thread.
The second extension addresses the hand-shaking mechanism for credit handling to ensure that the limited number of buffers within BIP does not overflow. In BIP, each receiver possesses a limited number of buffers per sender for storing short messages. In the beginning, a sender possesses all credits for a receiver, as depicted in Figure 8(a) with Bsnd and Brcv credits for sends to B and receives from B, respectively. The latter denotes the number of already used credits. When node A sends a message to B, the credits on A for B will be decremented, as seen in Figure 8(b-d). A send without proper credits simply results in blocking since it must be assumed that the receiver does not have an empty slot for short messages. In addition, used credits will be returned as piggybacks on every message, as seen in Figure 8(c). This informs the other node that a slot has become available for another message.
Sending messages in DSM-Threads was modified to the extent that (in the absence of credits) messages are stored in a message queue. This effectively results in asynchronous sends for threads on the DSM level. The message queue is later handled by workers when new credits are received. The worker is informed by the communication server about the arrival of new credits. Depending on the number of received credits, a subset of the queued messages may then be sent. Notice that DSM-Threads uses asynchronous communication protocols in general for synchronization and consistency handling on the DSM level, which facilitates credit hand-shaking in the described manner.
6 Related Work

... [5]. They investigate performance issues and mechanisms for matching polling rates and message arrival rates. They conclude that signals are more appropriate for coarse parallelism while polling excels under finer-granular parallelism. They also suggest combining both approaches depending on the current state of the scheduler, i.e., they suggest using polling when no threads are ready. Our work differs in that we aim to minimize modifications to existing components. Neither the thread scheduling (Pthreads), which may even be inside the operating system, nor the communication layer (BIP) was modified in our work.

Maquelin et al. also suggest combining polling and interrupt handling for communication, but on a hardware level [6]. Interrupts are only generated if the network interface realizes that polling may not result in effective responsiveness. Results show that this approach outperforms traditional polling in terms of responsiveness while achieving comparable performance overhead. At the same time, this overhead is lower than that of traditional interrupt handling. In contrast, our work was restricted to software approaches of dealing with communication issues.

Itzkovitz et al. discuss the merits of interrupts and polling on Windows NT for Millipede [4]. They suggest that the combination of polling, multi-threading and asynchronous communication may achieve neither the best responsiveness, nor the best utilization, nor the lowest number of context switches. They enhanced Fast Messages to implement signal notification. Our work focuses on portability and extensibility with regard to existing communication components. Hence, we did not enhance BIP by signal notification. Future work may include such an approach to assess whether the results from NT can be generalized to other environments.

7 Conclusion

We presented our experience on integrating communication middleware into a multi-threading environment ...
Figure 8. Credit Mechanism: send/receive credit counters (Asnd/Arcv on node A, Bsnd/Brcv on node B) in states (a)-(d); consumed credits travel back to the peer piggybacked on messages.
References

[1] Luc Bougé, Jean-François Méhaut, and Raymond Namyst. Madeleine: an efficient and portable communication interface for multithreaded environments. In Proc. 1998 Int. Conf. on Parallel Architectures and Compilation Techniques (PACT '98), pages 240-247, ENST, Paris, France, October 1998. IFIP WG 10.3 and IEEE.

[2] M. Eberl, W. Karl, M. Leberecht, and M. Schulz. Eine Software-Infrastruktur für Nachrichtenaustausch und gemeinsamen Speicher auf SCI-basierten PC-Clustern. In Cluster Computing Workshop, 1999.

[3] B. Herland and M. Eberl. A common design of PVM and MPI using the SCI interconnect. Final ...

[4] ...

[5] ...

[6] ...

[7] ...

... Conference on Parallel and Distributed Processing Techniques and Applications, pages 315-324, April 1998.

[10] Technical Committee on Operating Systems and Application Environments of the IEEE. Portable Operating System Interface (POSIX). ANSI/IEEE Std 1003.1, 1995 Edition, including 1003.1c: Amendment 2: Threads Extension [C Language], 1996.