Simulating the Brain
Quiz demonstrating the amount of computing power needed for some problems. This means we need to be able to efficiently use distributed memory.
A Basic Model of Distributed Memory
To design algorithms for a supercomputer, a machine model is necessary.
Private Memory Model: a system of individual nodes with their own memory that other nodes cannot directly access.
To get information from another node's private memory, message passing must be used. The message will have to find a path from the source to the destination.
Rules of the Model:
Rule 0: don't talk about the model. You must internalize these rules.
Rule 1: the network is fully connected; there is always a path from any node to any other node in the network.
Rule 2: the links are bidirectional. The links can carry a message between nodes in both directions at the same time.
Rule 3: the nodes can send and receive one message at a time. This means if a node wants to send messages to several other nodes, it must do it one at a time.
Rule 4: the cost to send and receive n words. A message sent from one node to another could go by different paths. For now we will say the cost of communication is not affected by the path length (later in the lesson this will be discussed). This rule only applies when there are no messages competing for links. The equation does not account for the path length:
T_msg(n) = α + βn
Rule 5: when messages are trying to go over the same link at the same time, this is called congestion.
k-way congestion reduces bandwidth, where k = the number of messages competing for the same link:
T_msg(n) = α + βnk
This means that two messages (going in the same direction) that are competing for the same link will have a time that is closer to the serial time than to the parallel time for the two messages.
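As a rough sketch, Rules 4 and 5 can be written as a single cost function. The function name and the numeric values of α and β below are illustrative assumptions, not from the lecture:

```python
# Hypothetical example values for the model parameters (assumed).
ALPHA = 1.0e-6   # alpha: fixed overhead per message, in seconds
BETA = 1.0e-9    # beta: time per word transferred, in seconds

def t_msg(n, k=1, alpha=ALPHA, beta=BETA):
    """Rules 4 and 5: time for an n-word message with k-way congestion.
    With k = 1 (no congestion) this is the plain alpha-beta cost."""
    return alpha + beta * n * k

# Two messages competing for one link (k = 2) each see roughly twice
# the bandwidth term, pushing the total toward the serial time.
one_alone = t_msg(10_000)
one_congested = t_msg(10_000, k=2)
```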
MEMORIZE THESE RULES!
Pipelined Message Delivery
Assume there is a message that must traverse P nodes and it is one word long, where t is the time to cross one link. The message starts at node 0 and needs to go to node P − 1.
So:
n = 1: α + t(P − 1)
Now, let's send a message two words long. The words are pipelined through the path.
So:
n = 2: α + t(P − 1) + t
(the added t is for the second word exiting the pipeline)
So for a message of n words:
α + t(P − 1) + t(n − 1)
Simplify to get the time for a message of size n:
α + t(P − 2) + tn
Consider:
t(P − 2) to be the wire delay of the message traveling across the network.
tn is the time related to the message size.
α is the software overhead for preparing messages for delivery.
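The simplification above can be checked numerically; the two function names below are illustrative:

```python
def t_pipelined(n, P, alpha, t):
    """Time for an n-word message pipelined across P nodes (P - 1 links):
    alpha + t*(P - 1) for the first word to exit the pipeline, plus t
    for each of the remaining n - 1 words."""
    return alpha + t * (P - 1) + t * (n - 1)

def t_simplified(n, P, alpha, t):
    """The algebraically identical simplified form: alpha + t*(P - 2) + t*n."""
    return alpha + t * (P - 2) + t * n
```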
Getting the Feel for the Alpha-Beta Model
τ = compute [time/operation] (alongside α, the per-message latency, and β, the per-word transfer time from the rules above)
Computation requires much less time than communication, so try to have as little communication as possible.
Also, you can send ~1000 bytes in the same amount of time that it takes to prepare a message. It is better to send fewer, larger messages than frequent short ones.
Applying the Rules
If two messages sent at the same time do NOT overlap, then the transmission time is not affected. But if the two messages overlap (travel in the same direction) at any link, then this overlap must be serialized and the transmission time is longer.
Collective Operations
All-to-One Reduce: all nodes participate to produce a final result on one node.
1. Have all odd-numbered nodes send their data to the next-lower even-numbered node. Only one word is sent in each message, so the communication time is α + β.
2. After the first step only even-ranked nodes are left. Pair these survivors up and repeat the pattern: the higher-ranked node of each pair sends to the lower-ranked node. This communication time is also α + β.
3. Repeat until only one result is left. The total communication time is:
(# of steps) × (α + β)
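Since the number of active nodes halves each step, the number of steps is ⌈log₂ P⌉. A small cost sketch (the function name is illustrative):

```python
import math

def tree_reduce_time(P, alpha, beta):
    """One word per message; the active nodes halve each round, so
    there are ceil(log2(P)) rounds, each costing alpha + beta."""
    steps = math.ceil(math.log2(P))
    return steps * (alpha + beta)
```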
Point-to-Point
1. Assume an SPMD style.
2. Assume the pseudocode algorithm has access to the variables RANK (the unique ID of the process) and P (the number of processes). In practice there may be more than one process assigned to a node.
3. A primitive called
handle ← sendAsync(buf[1:n] → dest)
It does an asynchronous send using a buffer of size n and a destination for the message. A return from sendAsync means a message has been sent, not that it has been received. To find out what has happened, test the handle that is returned.
4. Asynchronous receive. This is posted by the destination.
handle ← recvAsync(buf[1:n] ← source)
A return from recvAsync does not by itself mean the message has arrived. To find out if the data is available, the handle must be tested.
5. Blocking. This primitive will pause until the corresponding operations are complete. The primitive is:
wait(handle1, handle2, …) or wait(*)
Point-to-Point Completion Semantics
wait(handle1, …): it is safe to use the buffer when the primitive returns.
recvAsync: a return (from wait) means the message was delivered.
sendAsync: the return does not tell you very much. This is done to allow the programmer to decide what to do about a sent message: wait for the message to be received, or make a copy and continue to work.
When writing algorithms, remember: every send must have a matching receive.
Send and Receive in Action
sendAsync and recvAsync can get trapped in a deadlock depending upon how the wait is implemented.
All-to-One Reduce Pseudocode
let s = local value
bitmask ← 1
while bitmask < P do
    PARTNER ← RANK ^ bitmask
    if RANK & bitmask then
        sendAsync(s → PARTNER)
        wait(*)
        break    // once sent, the process can drop out
    else if PARTNER < P then
        recvAsync(t ← PARTNER)
        wait(*)
        s ← s + t
    bitmask ← (bitmask << 1)
if RANK = 0 then print(s)
Note: only senders drop out.
Senders have a 1 at the bitmask position.
RANK(receiver) < RANK(sender) < P
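A sequential Python simulation of the pseudocode above (no real message passing: each round's data motion is replayed rank by rank, an illustrative stand-in for sendAsync/recvAsync):

```python
def all_to_one_reduce(values):
    """Simulate the bitmask reduction: values[i] is rank i's local
    value; the accumulated sum ends up at rank 0."""
    P = len(values)
    s = list(values)           # each rank's running partial result
    active = [True] * P
    bitmask = 1
    while bitmask < P:
        for rank in range(P):
            if not active[rank]:
                continue
            partner = rank ^ bitmask
            if rank & bitmask:
                active[rank] = False    # sender drops out
            elif partner < P:
                s[rank] += s[partner]   # receiver accumulates
        bitmask <<= 1
    return s[0]
```

For example, all_to_one_reduce(list(range(8))) returns 28, the sum of 0 through 7. The partner-out-of-range check lets it handle a non-power-of-two P as well.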
Vector Reductions
In the vector version:
sendAsync(s[1:n] → PARTNER)
or
sendAsync(s[:])
Vector Reductions Quiz
Time to do a vector reduction = (α + βn) log P
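The quiz formula as a sketch in code (the function name is illustrative):

```python
import math

def vector_reduce_time(P, n, alpha, beta):
    """Tree reduction on n-word vectors: log2(P) rounds, each round
    sending one n-word message at cost alpha + beta*n."""
    return (alpha + beta * n) * math.log2(P)
```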
More Collectives
All-to-One Reduce and One-to-All Broadcast are duals of one another.
One-to-All Broadcast: one node has information and all other nodes get a copy of it.
Scatter: one process starts with the data, then sends a piece of it to every other process.
The dual of a scatter is gather.
All-Gather: each process has a piece of the data. The data is gathered and every process gets a copy of all the data.
The dual of all-gather is reduce-scatter.
Reduce-Scatter: the processes contain data that is reduced to one vector. This vector has pieces that are then scattered to all the processes.
A Pseudocode API for Collectives
All-to-One Reduce: each process needs to call reduce(A_local[1:n], root).
A collective is an operation that must be executed by all processes.
The following is invalid because not all processes make the call, and it results in a deadlock:
if RANK = 0 then
    reduce(X[:], 0)
else
    : // twiddle thumbs
broadcast(A_local[1:n], root): when the broadcast is complete, every process will have the same data in its buffer.
gather(In[1:m], Out[1:m][1:P], root): every process has a little bit of data. All the data is to be collected onto one process.
NOTE: m is used instead of n. The reason for this is n = mP, where mP is the size of the combined output.
scatter(In[1:m][1:P], root, Out[1:m]): the input is size m × P.
allGather(In[1:m], Out[1:m][1:P]): there is no root rank.
reduceScatter(In[1:m][1:P], Out[1:m])
reshape(A[1:m][1:n]) → Â[1:mn]: goes from 2D to 1D.
reshape(A[1:mn]) → Â[1:m][1:n]: goes from 1D to 2D.
That is, it converts between a logical 2D representation and a 1D representation.
All-Gather from Building Blocks
All-Gather pseudocode:
gather(In, Out, root)
broadcast(reshape(Out), root)
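A sequential sketch of this composition, with plain Python lists standing in for the buffers (names are illustrative):

```python
def all_gather(pieces):
    """pieces[i] is rank i's local list of m words. Emulates
    gather(In, Out, root) followed by broadcast(reshape(Out), root):
    every rank ends with the flattened concatenation of all pieces."""
    P = len(pieces)
    # gather: the root collects every rank's piece into Out[1:m][1:P]
    out_at_root = [word for piece in pieces for word in piece]
    # broadcast of the reshaped (2D -> 1D) buffer: all ranks get a copy
    return [list(out_at_root) for _ in range(P)]
```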
Collectives: Lower Bounds
What should the cost goals be?
The tree-based algorithm may be sending too much data, by a factor of P.
All-Gather Quiz
If two primitives attain the lower bounds, will the combination of the two also achieve the lower bounds? The combination of optimal primitives will produce optimal results.
Implement Scatter Quiz
Recall that a process can only do one simultaneous send and receive. If the root is doing all of the sends, then they execute sequentially.
Implementing Scatter and Gather
Goal: find an α log P algorithm.
A different way to implement scatter: instead of one process doing all the scattering, divide and conquer the problem.
Divide the network into two halves and send half the data to the other half of the network. Now each half can do the same thing: divide its part of the network and its data.
What is the cost of the new method?
The new method has retained the lower bound with respect to both latency and bandwidth.
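The cost can be read off a recurrence: each level ships half the remaining data across one link, then recurses on half the network. A sketch assuming P is a power of two (the function name is illustrative):

```python
def scatter_time(P, n, alpha, beta):
    """Divide-and-conquer scatter: send half the n words to the other
    half of the network (alpha + beta*n/2), then recurse with P/2
    nodes and n/2 words. This totals alpha*log2(P) + beta*n*(P-1)/P,
    matching both the latency and bandwidth lower bounds."""
    if P == 1:
        return 0.0
    return alpha + beta * n / 2 + scatter_time(P // 2, n / 2, alpha, beta)
```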
When to Use Tree-Based Reduction
When is the tree-based algorithm still a good algorithm?
Ask yourself the question: when does the alpha term dominate the beta term?
When: α ≥ βn, i.e. n ≤ α/β.
When: n is small.
What's Wrong with Tree-Based Algorithms?
There is too much redundant communication.
Bucketing Algorithms for Collectives
Using a gather-broadcast scheme can be communication intensive.
Instead, have each process send to a neighbor. All processes can do this in parallel.
This will result in P − 1 communication steps.
T(n = mP) = (α + βn/P)(P − 1) ≈ αP + βn
The βn term is optimal.
The αP term is suboptimal.
Overall this algorithm is good.
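In code (illustrative name), the bucketed cost and its approximation:

```python
def bucketed_all_gather_time(P, n, alpha, beta):
    """Bucketing (ring) all-gather: P - 1 steps, each forwarding an
    m = n/P word bucket to a neighbor, all processes in parallel.
    Expanding gives roughly alpha*P + beta*n."""
    return (alpha + beta * n / P) * (P - 1)
```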
A Bandwidth-Optimal Broadcast
To devise a bandwidth-optimal broadcast, use the bandwidth-friendly primitives: first scatter, then all-gather. Reshapes are also necessary.
All-Reduce
After an all-reduce, all processes have a copy of the (reduced) data.
What should be combined to make a bandwidth-optimal algorithm?
Use reduce-scatter and all-gather.
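Assuming each phase runs as a bucketed exchange of P − 1 steps with n/P-word messages (the same assumption as the bucketed all-gather above), a cost sketch for the combined algorithm:

```python
def all_reduce_time(P, n, alpha, beta):
    """Bandwidth-optimal all-reduce: a reduce-scatter followed by an
    all-gather, each costing about (alpha + beta*n/P)*(P - 1), for a
    total of roughly 2*alpha*P + 2*beta*n."""
    phase = (alpha + beta * n / P) * (P - 1)
    return 2 * phase
```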