
Lesson 2-1: Basic Model

Simulating the Brain
A quiz demonstrating the amount of computing power needed for some problems. This means we need to be able to use distributed memory efficiently.

A Basic Model of Distributed Memory
To design algorithms for a supercomputer, a machine model is necessary.

Private Memory Model: a system is viewed as individual nodes, each with its own memory that other nodes cannot directly access.

To get information from another node's private memory, message passing must be used. The message will have to find a path from the source to the destination.

Rules of the Model:
Rule 0: don't talk about the model. You must internalize these rules.

Rule 1: the network is fully connected; there is always a path from any node to any other node in the network.

Rule 2: the links are bidirectional. A link can carry a message between nodes in both directions at the same time.

Rule 3: a node can perform one send and one receive at a time. This means that if a node wants to send messages to several other nodes, it must send them one at a time.

Rule 4: the cost to send and receive n words. A message from one node to another could travel by different paths. For now, we will say the cost of communication is not affected by the path length. (This will be discussed later in the lesson.)

This rule only applies when there are no messages competing for links.

$T_{msg}(n) = \alpha + \beta n$

$\alpha$ = latency, in units of time; it is a fixed cost no matter how large the message.
$\beta$ = inverse bandwidth, in units of time/word.
$n$ = number of words.

This equation does not account for the path length.
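As a small worked example of Rule 4, the cost formula can be written as a function. The parameter values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def t_msg(n, alpha, beta):
    """Time to send n words when no messages compete for links."""
    return alpha + beta * n

# Hypothetical machine: alpha = 1000 time units, beta = 1 time unit per word.
ALPHA, BETA = 1000, 1

print(t_msg(1, ALPHA, BETA))     # one word: latency dominates
print(t_msg(1000, ALPHA, BETA))  # 1000 words: latency and bandwidth terms are equal
```

With these assumed values a one-word message costs almost as much as the latency alone, which is the reason the rest of the lesson pays so much attention to the $\alpha$ term.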

Rule 5: when messages try to go over the same link at the same time, this is called congestion.
k-way congestion reduces bandwidth.

$k$ = the number of messages competing for the same link
$T_{msg}(n) = \alpha + \beta n k$

This means that two messages (going in the same direction) that are competing for the same link will have a time that is closer to the serial time than to the parallel time for the two messages.
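The claim that a congested transfer lands closer to the serial time than to the parallel time can be checked numerically. This is a minimal sketch with hypothetical values for alpha, beta and n; "serial" here means sending the two messages one after the other.

```python
def t_msg_congested(n, alpha, beta, k=1):
    """Time to send n words when k messages compete for the same link."""
    return alpha + beta * n * k

# Hypothetical values, chosen so the bandwidth term dominates.
alpha, beta, n = 10, 1, 1000

parallel = t_msg_congested(n, alpha, beta, k=1)   # no competition
congested = t_msg_congested(n, alpha, beta, k=2)  # two messages share a link
serial = 2 * parallel                             # one message after the other

print(parallel, congested, serial)
```

Only one latency term is saved relative to the serial case, so for large messages the congested time is nearly the serial time.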

MEMORIZE THESE RULES!

Pipelined Message Delivery
Assume there is a message, one word long, that must traverse P nodes. The message is at node 0 and needs to go to node P-1.
So:
n = 1: $a + t(P - 1)$

Now, let's send a message two words long. The words are pipelined through the path.
So:
n = 2: $a + t(P - 1) + t$

(the added t is for the second word exiting the pipeline)

So for a message of n words:

$a + t(P - 1) + t(n - 1)$

Simplify to:
Time for a message of size n:
$a + t(P - 2) + tn$

Consider:

$t(P - 2)$ to be the wire delay of the message traveling across the network.

$tn$ is the time related to the message size.

$a$ is the software overhead for preparing messages for delivery.

$a$ is much larger than $t(P - 2)$, so they can be treated as one constant.

So $a + t(P - 2)$ corresponds to $\alpha$, and $t$ corresponds to $\beta$ (equivalently, the bandwidth $1/\beta$ is proportional to $1/t$).
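The pipelined formula can be checked with a small simulation. This sketch assumes each of the P-1 links carries one word at a time and takes t time units per word, and that the first word is ready after the overhead a; the function name is hypothetical.

```python
def pipelined_time(n, P, a, t):
    """Simulate n words crossing the P-1 links between node 0 and node P-1."""
    link_free = [a] * (P - 1)  # earliest time each link can accept a word
    done = a
    for _ in range(n):
        ready = a  # every word is available at the source after overhead a
        for hop in range(P - 1):
            start = max(ready, link_free[hop])  # wait for the word ahead of us
            ready = start + t
            link_free[hop] = ready
        done = ready  # arrival time of this word at node P-1
    return done

# Compare against the closed form a + t(P - 2) + t*n for one case.
print(pipelined_time(n=5, P=4, a=10, t=1), 10 + 1 * (4 - 2) + 1 * 5)
```

The simulation and the closed form agree for P of at least 2 and n of at least 1, which is the range the derivation covers.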

Getting the Feel for the Alpha-Beta Model

$\tau$ = compute cost [time/operation]

Computation requires much less time than communication, so try to have as little communication as possible.

Also, you can send ~1000 bytes in the same amount of time that it takes to prepare a message. It is better to send fewer, larger messages than frequent short ones.
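To see why fewer, larger messages are better, compare the cost of moving the same data in one message versus many. The values of alpha and beta are hypothetical, and the sketch assumes the messages are sent one after another with no congestion.

```python
def total_time(total_words, num_messages, alpha, beta):
    """Cost of sending total_words split across num_messages sequential messages."""
    # Every message pays the latency alpha; the bandwidth cost depends only
    # on the total number of words.
    return num_messages * alpha + beta * total_words

ALPHA, BETA = 1000, 1  # hypothetical: preparing a message costs as much as 1000 words

one_big = total_time(10_000, 1, ALPHA, BETA)
many_small = total_time(10_000, 100, ALPHA, BETA)
print(one_big, many_small)
```

Under these assumptions the 100 small messages take ten times longer than the single large one, and all of the difference is latency.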

Applying the Rules
If two messages sent at the same time do NOT overlap, then the transmission time is not affected. But if the two messages overlap (travel in the same direction) at any link, then this overlap must be serialized and the transmission time is longer.

Collective Operations

All-to-One Reduce: all nodes participate to produce a final result on one node.
1. Have all odd-numbered nodes send their data to the next lower even-numbered node. Only one word is sent in each message, so the communication time is $\alpha + \beta$.
2. After the first step only the even-ranked nodes are left. Pair these up in the same way: in each pair, the higher-numbered node sends to the lower-numbered node. This communication time is also $\alpha + \beta$.
3. Repeat until only one result is left. The communication time is:
   (# of steps) * ($\alpha + \beta$)
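The three steps above can be sketched as a loop that pairs up the surviving values and counts the steps. Addition is assumed as the reduction operator, and the function name is hypothetical.

```python
def tree_reduce_steps(values):
    """Pairwise-combine values until one remains; return (result, steps)."""
    vals = list(values)
    steps = 0
    while len(vals) > 1:
        survivors = []
        for i in range(0, len(vals), 2):
            if i + 1 < len(vals):
                # The higher-positioned node sends to the one just below it.
                survivors.append(vals[i] + vals[i + 1])
            else:
                # An unpaired node simply carries its value forward.
                survivors.append(vals[i])
        vals = survivors
        steps += 1
    return vals[0], steps

print(tree_reduce_steps(range(8)))
```

Each step halves the number of surviving nodes, so the step count grows like log P rather than P.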

Point-to-Point
1. Assume an SPMD style.
2. Assume the pseudocode algorithm has access to the variables rank (rank is the id of the process, and it is unique) and p (p = number of processes). In practice there may be more than one process assigned to a node.
3. Asynchronous send. The primitive is:
   handle ← sendAsync(buf[1:n] → dest)
   It does an asynchronous send using a buffer of size n and a destination for the message. A return from sendAsync means the send has been started, not that the message has been received. To find out what has happened, test the handle that is returned.
4. Asynchronous receive. This is posted by the destination.
   handle ← recvAsync(buf[1:n] ← source)
   A return from recvAsync means the receive has been posted. To find out if the data is available, the handle must be tested.
5. Blocking. This primitive will pause until the corresponding operations are complete. The primitive is:
   wait(handle1, handle2, ...) or wait(*)

Point-to-Point Completion Semantics
wait(handle1, ...): it is safe to use the buffer when the primitive returns.

recvAsync: when wait returns for its handle, the message was delivered.
sendAsync: when wait returns for its handle, that does not tell you very much. This is done to give the programmer the ability to decide what to do about a sent message: wait for the message to be received, or make a copy and continue to work.

When writing algorithms, remember: every send must have a matching receive.

Send and Receive in Action
sendAsync and recvAsync can get trapped in a deadlock, depending upon how the wait is implemented.

All-to-One Reduce Pseudocode

let s = local value
bitmask ← 1
while bitmask < P do
    PARTNER ← RANK ^ bitmask
    if RANK & bitmask then
        sendAsync(s → PARTNER)
        wait(*)
        break // once sent, the process can drop out
    else if (PARTNER < P)
        recvAsync(t ← PARTNER)
        wait(*)
        s ← s + t
    bitmask ← (bitmask << 1)
if RANK = 0 then print(s)

Note: Only senders drop out.
Senders have a 1 at the bitmask position.

RANK(Receiver) < RANK(Sender) < P
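The pseudocode above can be exercised without a real network by simulating every rank in one process. This is a sketch under that assumption: the sends and receives are replaced by direct updates of a shared list, and the function name is hypothetical.

```python
def all_to_one_reduce(local_values):
    """Simulate the bitmask reduce; returns the value rank 0 would print."""
    P = len(local_values)
    s = list(local_values)   # s[rank] is the running value held by that rank
    active = [True] * P      # ranks that have not yet sent and dropped out
    bitmask = 1
    while bitmask < P:
        for rank in range(P):
            if active[rank] and (rank & bitmask):
                # Sender: its partner has a 0 at the bitmask position, so
                # partner < rank < P and the partner is still active.
                partner = rank ^ bitmask
                s[partner] += s[rank]   # stands in for send + matching receive
                active[rank] = False    # once sent, the process drops out
        bitmask <<= 1
    return s[0]

print(all_to_one_reduce([1, 2, 3, 4, 5]))
```

The check `PARTNER < P` in the pseudocode matters only for receivers; in the simulation it is implicit because only ranks that exist can act as senders.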

Vector Reductions

In the vector version:
sendAsync(S[1:n] → PARTNER)
or
sendAsync(S[:])

Vector Reductions Quiz
Time to do a vector reduction = $(\alpha + \beta n) \log P$

More Collectives
All-to-One Reduce and One-to-All Broadcast: these are duals of one another.

One-to-All Broadcast: one node has information and all other nodes get a copy of it.

Scatter: one process starts with the data, then sends a piece of it to each of the other processes.
The dual of a scatter is gather.

All-gather: each process has a piece of the data. The data is gathered and every process gets a copy of all the data.
The dual of all-gather is reduce-scatter.

Reduce-scatter: the processes contain data that is reduced to one vector. This vector has pieces that are then scattered to all the processes.

A Pseudocode API for Collectives

All-to-One Reduce: each process needs to call reduce(Alocal[1:n], root).
A collective is an operation that must be executed by all processes.

The following is invalid because not every process calls reduce, and it results in a deadlock.

if RANK = 0 then
    reduce(X[:], 0)
else
    // twiddle thumbs

broadcast(Alocal[1:n], root): when the broadcast is complete, every process will have the same data in its buffer.

gather(In[1:m], Out[1:m][1:P], root): every process has a little bit of data. All the data is to be collected onto one process.

NOTE: m is used instead of n. The reason for this is n = mP, where mP is the size of the combined output.

scatter(In[1:m][1:P], root, Out[1:m]): the input is of size m x P.

allGather(In[1:m], Out[1:m][1:P]): there is no root rank.

reduceScatter(In[1:m][1:P], Out[1:m])

reshape(A[1:m][1:n]) → Â[1:m*n] goes from 2D to 1D
reshape(Â[1:m*n]) → A[1:m][1:n] goes from 1D to 2D

Reshape only changes the logical view of the data, between a 2D representation and a 1D representation.
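As an illustration of reshape, the two directions can be sketched with plain Python lists. The row-major layout and the names `flatten` and `unflatten` are assumptions made for this example; the notes do not specify a layout.

```python
def flatten(A):
    """2D -> 1D: concatenate the rows of A (row-major order assumed)."""
    return [x for row in A for x in row]

def unflatten(flat, m, n):
    """1D -> 2D: view a list of m*n items as m rows of n items."""
    assert len(flat) == m * n
    return [flat[i * n:(i + 1) * n] for i in range(m)]

A = [[1, 2, 3], [4, 5, 6]]
flat = flatten(A)
print(flat)
print(unflatten(flat, 2, 3))
```

The two functions are inverses of each other, which is what lets the collectives below treat a gathered 2D buffer as a single 1D message.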

All-Gather from Building Blocks
All-Gather Pseudocode:

gather(In, Out, root)
broadcast(reshape(Out), root)
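The two-line pseudocode above can be traced with a single-process simulation in which each rank's buffer is a Python list. The helper names mirror the pseudocode, but their list-based signatures are assumptions of this sketch.

```python
def gather(inputs, root):
    """The root collects one piece from every rank (index = rank).
    root is unused in this single-process simulation; kept to mirror the pseudocode."""
    return [list(piece) for piece in inputs]

def broadcast(buf, P):
    """Every one of the P ranks ends up with its own copy of buf."""
    return [list(buf) for _ in range(P)]

def all_gather(inputs, root=0):
    out = gather(inputs, root)                  # 2D: P pieces of m words
    flat = [x for piece in out for x in piece]  # reshape 2D -> 1D
    return broadcast(flat, len(inputs))

print(all_gather([[1, 2], [3, 4], [5, 6]]))
```

Every rank ends with the same concatenated data, which is the defining property of all-gather.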

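Collectives Lower Bounds

What should the cost goals be?

reduce has a cost of $T(n) = (\alpha + \beta n) \log P$. Is this good or bad?

The tree-based algorithm may be sending too much data: its bandwidth term is $\beta n \log P$, a factor of $\log P$ larger than a term proportional to $\beta n$.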

All-Gather Quiz
If two primitives attain the lower bounds, will the combination of the two also achieve the lower bounds? The combination of optimal primitives will produce optimal results.

Implement Scatter Quiz
Recall that a process can only do one send and one receive at the same time. If the root is doing all of the sends, then they execute sequentially.

Implementing Scatter and Gather
Goal: find an $\alpha \log P$ algorithm.

A different way to implement scatter: instead of one process doing all the scattering, divide and conquer the problem.

Divide the network into two and send half the data to the other half of the network. Now each half can do the same thing, dividing its part of the network and its data.

What is the cost of the new method?
The new method attains the lower bound with respect to both latency and bandwidth.

$T(n) = \alpha \log P + \beta n \frac{P - 1}{P}$
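The divide-and-conquer scatter can be simulated to confirm both terms of the cost. This sketch assumes rank 0 is the root and that the rank receiving a half is the lowest rank in that half; it counts rounds and the words the root sends.

```python
def scatter_halving(pieces):
    """pieces[i] holds the words destined for rank i.
    Returns (delivered, rounds, words_sent_by_root)."""
    P = len(pieces)
    held = {0: (0, P)}  # rank -> half-open range of piece indices it holds
    rounds = 0
    words_sent_by_root = 0
    while any(hi - lo > 1 for lo, hi in held.values()):
        new_held = {}
        for rank, (lo, hi) in held.items():
            if hi - lo > 1:
                mid = lo + (hi - lo + 1) // 2
                new_held[rank] = (lo, mid)  # keep the lower half
                new_held[mid] = (mid, hi)   # send the upper half to rank mid
                if rank == 0:
                    words_sent_by_root += sum(len(pieces[i]) for i in range(mid, hi))
            else:
                new_held[rank] = (lo, hi)
        held = new_held
        rounds += 1
    delivered = [pieces[lo] for _, (lo, _hi) in sorted(held.items())]
    return delivered, rounds, words_sent_by_root

pieces = [[r, r] for r in range(8)]  # P = 8, m = 2, so n = 16
print(scatter_halving(pieces)[1:])
```

For P = 8 and m = 2 the root sends 14 words over 3 rounds, matching $n(P-1)/P$ words and $\log P$ messages.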

When to Use Tree-Based Reduction
When is the tree-based algorithm still a good algorithm?
Ask yourself the question: when does the alpha term dominate the beta term?
When: $\alpha \gg \beta n$
When: n is small.
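A quick numerical comparison shows where the tree-based cost stops being competitive. The sketch compares $(\alpha + \beta n) \log P$ with a cost of the form $\alpha \log P + \beta n (P-1)/P$; the parameter values are hypothetical, and log P is rounded up to a whole number of steps.

```python
def t_tree(n, P, alpha, beta):
    """Tree-based cost: every one of the log P steps moves all n words."""
    steps = (P - 1).bit_length()  # ceil(log2 P) for P >= 1
    return (alpha + beta * n) * steps

def t_bandwidth_friendly(n, P, alpha, beta):
    """Cost with the same latency term but only n(P-1)/P words moved in total."""
    steps = (P - 1).bit_length()
    return alpha * steps + beta * n * (P - 1) / P

ALPHA, BETA, P = 1000, 1, 8  # hypothetical machine parameters
for n in (8, 80_000):
    print(n, t_tree(n, P, ALPHA, BETA), t_bandwidth_friendly(n, P, ALPHA, BETA))
```

With these assumed values the two costs differ by under 1% at n = 8 but by more than a factor of 3 at n = 80,000.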

What's Wrong with Tree-Based Algorithms?
There is too much redundant communication.

Bucketing Algorithms for Collectives
Using a gather-broadcast scheme can be communication intensive.
Instead, have each process send to a neighbor. All processes can do this in parallel.

This will result in P-1 communication steps.
$T(n = mP) = (\alpha + \beta n/P)(P - 1) \approx \alpha P + \beta n$

$\beta n$ is optimal
$\alpha P$ is suboptimal

Overall, this algorithm is good.
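The neighbor-to-neighbor scheme can be sketched as a ring all-gather simulated in one process. The ring ordering (rank r sends to rank r+1, wrapping around) and the choice of which piece to forward at each step are assumptions of this example.

```python
def ring_all_gather(pieces):
    """pieces[r] is rank r's own piece. Returns (buffers, steps)."""
    P = len(pieces)
    bufs = [[None] * P for _ in range(P)]
    for r in range(P):
        bufs[r][r] = pieces[r]
    for step in range(P - 1):
        # All ranks send in parallel: rank r forwards the piece it obtained
        # most recently, which is piece (r - step) mod P.
        sends = [((r - step) % P, bufs[r][(r - step) % P]) for r in range(P)]
        for r in range(P):
            idx, piece = sends[(r - 1) % P]  # receive from the left neighbor
            bufs[r][idx] = piece
    return bufs, P - 1

bufs, steps = ring_all_gather(["a", "b", "c", "d"])
print(steps, bufs[0])
```

Each step moves only one piece of m = n/P words per rank, which is where the $(\alpha + \beta n/P)(P - 1)$ cost comes from.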

A Bandwidth-Optimal Broadcast
To devise a bandwidth-optimal broadcast, use the bandwidth-friendly primitives.
First scatter, then all-gather. Reshapes are also necessary.
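The scatter-then-all-gather recipe can be traced on plain lists. This is a single-process sketch that assumes the message length is divisible by P; the function name is hypothetical, and the communication itself is replaced by list copies.

```python
def broadcast_via_scatter_allgather(data, P):
    """Return the buffer each of the P ranks holds after the broadcast."""
    assert len(data) % P == 0, "sketch assumes len(data) is divisible by P"
    m = len(data) // P
    # Reshape 1D -> 2D, then scatter: rank r receives chunk r.
    chunks = [data[r * m:(r + 1) * m] for r in range(P)]
    # All-gather: every rank collects every chunk.
    gathered = [[list(c) for c in chunks] for _ in range(P)]
    # Reshape 2D -> 1D on every rank.
    return [[x for c in g for x in c] for g in gathered]

print(broadcast_via_scatter_allgather([1, 2, 3, 4, 5, 6], 3))
```

The reshapes are what let the root's single buffer be treated as P pieces for the scatter and as one message again at the end.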

All-Reduce
All processes end up with a copy of the reduced data.
What should be combined to make a bandwidth-optimal algorithm?

Use reduce-scatter and all-gather.
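The combination can be sketched on lists in the same single-process style. Addition is assumed as the reduction operator, the vector length is assumed divisible by P, and the function name is hypothetical.

```python
def all_reduce(vectors):
    """vectors[r] is rank r's input vector; returns each rank's final buffer."""
    P = len(vectors)
    n = len(vectors[0])
    assert n % P == 0, "sketch assumes n is divisible by P"
    m = n // P
    # Reduce-scatter: rank r ends up owning the reduced slice r.
    owned = [
        [sum(v[i] for v in vectors) for i in range(r * m, (r + 1) * m)]
        for r in range(P)
    ]
    # All-gather: every rank collects all reduced slices.
    full = [x for piece in owned for x in piece]
    return [list(full) for _ in range(P)]

print(all_reduce([[1, 2], [3, 4]]))
```

Every rank finishes with the full elementwise sum, while each word of the result is reduced only once.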
