
High Performance Computing

By: Charles Severance


Online: < http://cnx.org/content/col11136/1.4/ >

CONNEXIONS
Rice University, Houston, Texas

This selection and arrangement of content as a collection is copyrighted by Charles Severance. It is licensed under the Creative Commons Attribution 3.0 license (http://creativecommons.org/licenses/by/3.0/).
Collection structure revised: February 11, 2010
PDF generated: March 21, 2010
For copyright and attribution information for the modules contained in this collection, see p. 271.

Table of Contents
Introduction to the Connexions Edition ................................. 1
Introduction to High Performance Computing ............................. 3
1 Modern Computer Architectures
  1.1 Memory ........................................................... 7
  1.2 Floating-Point Numbers .......................................... 29
2 Programming and Tuning Software
  2.1 What a Compiler Does ............................................ 47
  2.2 Timing and Profiling ............................................ 62
  2.3 Eliminating Clutter ............................................. 85
  2.4 Loop Optimizations ............................................. 101
3 Shared-Memory Parallel Processors
  3.1 Understanding Parallelism ...................................... 123
  3.2 Shared-Memory Multiprocessors .................................. 145
  3.3 Programming Shared-Memory Multiprocessors ...................... 170
4 Scalable Parallel Processing
  4.1 Language Support for Performance ............................... 191
  4.2 Message-Passing Environments ................................... 213
5 Appendixes
  5.1 Appendix C: High Performance Microprocessors ................... 239
  5.2 Appendix B: Looking at Assembly Language ....................... 256
Index .................................................................. 267
Attributions ........................................................... 271


Introduction to the Connexions Edition



The purpose of this book has always been to teach new programmers and scientists about the basics of High Performance Computing. Too many parallel and high performance computing books focus on the architecture, theory and computer science surrounding HPC. I wanted this book to speak to the practicing Chemistry student, Physicist, or Biologist who need to write and run their programs as part of their research. I was using the first edition of the book written by Kevin Dowd in 1996 when I found out that the book was going out of print. I immediately sent an angry letter to O'Reilly customer support imploring them to keep the book going as it was the only book of its kind in the marketplace. That complaint letter triggered several conversations which led to me becoming the author of the second edition. In true "open-source" fashion - since I complained about it - I got to fix it. During Fall 1997, while I was using the book to teach my HPC course, I re-wrote the book one chapter at a time, fueled by multiple late-night lattes and the fear of not having anything ready for the week's lecture. The second edition came out in July 1998, and was pretty well received. I got many good comments from teachers and scientists who felt that the book did a good job of teaching the practitioner - which made me very happy.

In 1998, this book was published at a crossroads in the history of High Performance Computing. In the late 1990's there was still a question as to whether the large vector supercomputers with their specialized memory systems could resist the assault from the increasing clock rates of the microprocessors. Also in the later 1990's there was a question whether the fast, expensive, and power-hungry RISC architectures would win over the commodity Intel microprocessors and commodity memory technologies. By 2003, the market had decided that the commodity microprocessor was king - its performance and the performance of commodity memory subsystems kept increasing so rapidly. By 2006, the Intel architecture had eliminated all the RISC architecture processors by greatly increasing clock rate and truly winning the increasingly important Floating Point Operations per Watt competition. Once users figured out how to effectively use loosely coupled processors, overall cost and improving energy consumption of commodity microprocessors became overriding factors in the marketplace.

These changes led to the book becoming less and less relevant to the common use cases in the HPC field and led to the book going out of print - much to the chagrin of its small but devoted fan base. I was reduced to buying used copies of the book from Amazon in order to have a few copies lying around the office to give as gifts to unsuspecting visitors. Thanks to the forward-looking approach of O'Reilly and Associates to use Founder's Copyright and releasing out-of-print books under Creative Commons Attribution, this book once again rises from the ashes like the proverbial Phoenix. By bringing this book to Connexions and publishing it under a Creative Commons Attribution license we are insuring that the book is never again obsolete. We can take the core elements of the book which are still relevant and a new community of authors can add to and adapt the book as needed over time. Publishing through Connexions also keeps the cost of printed books very low and so it will be a wise choice as a textbook for college courses in High Performance Computing. The Creative Commons Licensing and the ability to print locally can make this book available in any country and any school in the world.
1 This content is available online at <http://cnx.org/content/m32709/1.1/>.

Like Wikipedia, those of us who use the book can become the volunteers who will help improve the book and become co-authors of the book.

I need to thank Kevin Dowd who wrote the first edition and graciously let me alter it from cover to cover in the second edition. Mike Loukides of O'Reilly was the editor of both the first and second editions and we talk from time to time about a possible future edition of the book. Mike was also instrumental in helping to release the book from O'Reilly under Creative Commons Attribution. The team at Connexions has been wonderful to work with. We share a passion for High Performance Computing and new forms of publishing so that the knowledge reaches as many people as possible. I want to thank Jan Odegard and Kathi Fletcher for encouraging, supporting and helping me through the re-publishing process. Daniel Williamson did an amazing job of converting the materials from the O'Reilly formats to the Connexions formats.

I truly look forward to seeing how far this book will go now that we can have an unlimited number of co-authors to invest in and then use the book. I look forward to working with you all.

Charles Severance - November 12, 2009

Introduction to High Performance Computing



Why Worry About Performance?


Over the last decade, the definition of what is called high performance computing has changed dramatically. In 1988, an article appeared in the Wall Street Journal titled "Attack of the Killer Micros" that described how computing systems made up of many small inexpensive processors would soon make large supercomputers obsolete. At that time, a personal computer costing $3000 could perform 0.25 million floating-point operations per second, a workstation costing $20,000 could perform 3 million floating-point operations, and a supercomputer costing $3 million could perform 100 million floating-point operations per second. Therefore, why couldn't we simply connect 400 personal computers together to achieve the same performance of a supercomputer for $1.2 million?

This vision has come true in some ways, but not in the way the original proponents of the "killer micro" theory envisioned. Instead, the microprocessor performance has relentlessly gained on the supercomputer performance. This has occurred for two reasons. First, there was much more technology headroom for improving performance in the personal computer area, whereas the supercomputers of the late 1980s were pushing the performance envelope. Also, once the supercomputer companies broke through some technical barrier, the microprocessor companies could quickly adopt the successful elements of the supercomputer designs a few short years later. The second and perhaps more important factor was the emergence of a thriving personal and business computer market with ever-increasing performance demands. Computer usage such as 3D graphics, graphical user interfaces, multimedia, and games were the driving factors in this market. With such a large market, available research dollars poured into developing inexpensive high performance processors for the home market. The result of this trend toward faster smaller computers is directly evident as former supercomputer manufacturers are being purchased by workstation companies (Silicon Graphics purchased Cray, and Hewlett-Packard purchased Convex in 1996).

As a result, nearly every person with computer access has some high performance processing. As the peak speeds of these new personal computers increase, these computers encounter all the performance challenges typically found on supercomputers. While not all users of personal workstations need to know the intimate details of high performance computing, those who program these systems for maximum performance will benefit from an understanding of the strengths and weaknesses of these newest high performance systems.

Scope of High Performance Computing

High performance computing runs a broad range of systems, from our desktop computers through large parallel processing systems. Because most high performance systems are based on reduced instruction set computer (RISC) processors, many techniques learned on one type of system transfer to the other systems.

2 This content is available online at <http://cnx.org/content/m32676/1.2/>.

High performance RISC processors are designed to be easily inserted into a multiple-processor system with 2 to 64 CPUs accessing a single memory using symmetric multi processing (SMP). Programming multiple processors to solve a single problem adds its own set of additional challenges for the programmer. The programmer must be aware of how multiple processors operate together, and how work can be efficiently divided among those processors.

Even though each processor is very powerful, and small numbers of processors can be put into a single enclosure, often there will be applications that are so large they need to span multiple enclosures. In order to cooperate to solve the larger application, these enclosures are linked with a high-speed network to function as a network of workstations (NOW). A NOW can be used individually through a batch queuing system or can be used as a large multicomputer using a message passing tool such as parallel virtual machine (PVM) or message-passing interface (MPI).

For the largest problems with more data interactions and those users with compute budgets in the millions of dollars, there is still the top end of the high performance computing spectrum, the scalable parallel processing systems with hundreds to thousands of processors. These systems come in two flavors. One type is programmed using message passing. Instead of using a standard local area network, these systems are connected using a proprietary, scalable, high-bandwidth, low-latency interconnect (how is that for marketing speak?). Because of the high performance interconnect, these systems can scale to the thousands of processors while keeping the time spent (wasted?) performing overhead communications to a minimum.

The second type of large parallel processing system is the scalable non-uniform memory access (NUMA) systems. These systems also use a high performance interconnect to connect the processors, but instead of exchanging messages, these systems use the interconnect to implement a distributed shared memory that can be accessed from any processor using a load/store paradigm. This is similar to programming SMP systems except that some areas of memory have slower access than others.

Studying High Performance Computing


The study of high performance computing is an excellent chance to revisit computer architecture. Once we set out on the quest to wring the last bit of performance from our computer systems, we become more motivated to fully understand the aspects of computer architecture that have a direct impact on the system's performance.

Throughout all of computer history, salespeople have told us that their compiler will solve all of our problems, and that the compiler writers can get the absolute best performance from their hardware. This claim has never been, and probably never will be, completely true. The ability of the compiler to deliver the peak performance available in the hardware improves with each succeeding generation of hardware and software. However, as we move up the hierarchy of high performance computing architectures we can depend on the compiler less and less, and programmers must take responsibility for the performance of their code.

In the single processor and SMP systems with few CPUs, one of our goals as programmers should be to stay out of the way of the compiler. Often constructs used to improve performance on a particular architecture limit our ability to achieve performance on another architecture. Further, these brilliant (read: obtuse) hand optimizations often confuse a compiler, limiting its ability to automatically transform our code to take advantage of the particular strengths of the computer architecture. As programmers, it is important to know how the compiler works so we can know when to help it out and when to leave it alone. We also must be aware that as compilers improve (never as much as salespeople claim) it's best to leave more and more to the compiler.

As we move up the hierarchy of high performance computers, we need to learn new techniques to map our programs onto these architectures, including language extensions, library calls, and compiler directives. As we use these features, our programs become less portable. Also, using these higher-level constructs, we must not make modifications that result in poor performance on the individual RISC microprocessors that often make up the parallel processing systems.

Measuring Performance
When a computer is being purchased for computationally intensive applications, it is important to determine how well the system will actually perform this function. One way to choose among a set of competing systems is to have each vendor loan you a system for a period of time to test your applications. At the end of the evaluation period, you could send back the systems that did not make the grade and pay for your favorite system. Unfortunately, most vendors won't lend you a system for such an extended period of time unless there is some assurance you will eventually purchase the system.

More often we evaluate the system's potential performance using benchmarks. There are industry benchmarks and your own locally developed benchmarks. Both types of benchmarks require some careful thought and planning for them to be an effective tool in determining the best system for your application.

The Next Step


Quite aside from economics, computer performance is a fascinating and challenging subject. Computer architecture is interesting in its own right and a topic that any computer professional should be comfortable with. Getting the last bit of performance out of an important application can be a stimulating exercise, in addition to an economic necessity. There are probably a few people who simply enjoy matching wits with a clever computer architecture.

What do you need to get into the game?

• A basic understanding of modern computer architecture. You don't need an advanced degree in computer engineering, but you do need to understand the basic terminology.
• A basic understanding of benchmarking, or performance measurement, so you can quantify your own successes and failures and use that information to improve the performance of your application.
This book is intended to be an easily understood introduction and overview of high performance computing. It is an interesting field, and one that will become more important as we make even greater demands on our most common personal computers. In the high performance computer field, there is always a tradeoff between the single CPU performance and the performance of a multiple processor system. Multiple processor systems are generally more expensive and difficult to program (unless you have this book).

Some people claim we eventually will have single CPUs so fast we won't need to understand any type of advanced architectures that require some skill to program. So far in this field of computing, even as the performance of a single inexpensive microprocessor has increased over a thousandfold, there seems to be no less interest in lashing a thousand of these processors together to get a millionfold increase in power. The cheaper the building blocks of high performance computing become, the greater the benefit for using many processors. If at some point in the future, we have a single processor that is faster than any of the 512-processor scalable systems of today, think how much we could do when we connect 512 of those new processors together in a single system.

That's what this book is all about. If you're interested, read on!

Chapter 1
Modern Computer Architectures

1.1 Memory
1.1.1 Introduction1
1.1.1.1 Memory
Let's say that you are fast asleep some night and begin dreaming. In your dream, you have a time machine and a few 500-MHz four-way superscalar processors. You turn the time machine back to 1981. Once you arrive back in time, you go out and purchase an IBM PC with an Intel 8088 microprocessor running at 4.77 MHz. For much of the rest of the night, you toss and turn as you try to adapt the 500-MHz processor to the Intel 8088 socket using a soldering iron and Swiss Army knife. Just before you wake up, the new computer finally works, and you turn it on to run the Linpack2 benchmark and issue a press release. Would you expect this to turn out to be a dream or a nightmare? Chances are good that it would turn out to be a nightmare, just like the previous night where you went back to the Middle Ages and put a jet engine on a horse. (You have got to stop eating double pepperoni pizzas so late at night.)

Even if you can speed up the computational aspects of a processor infinitely fast, you still must load and store the data and instructions to and from a memory. Today's processors continue to creep ever closer to infinitely fast processing. Memory performance is increasing at a much slower rate (it will take longer for memory to become infinitely fast). Many of the interesting problems in high performance computing use a large amount of memory. As computers are getting faster, the size of problems they tend to operate on also goes up. The trouble is that when you want to solve these problems at high speeds, you need a memory system that is large, yet at the same time fast, a big challenge. Possible approaches include the following:

• Every memory system component can be made individually fast enough to respond to every memory access request.
• Slow memory can be accessed in a round-robin fashion (hopefully) to give the effect of a faster memory system.
• The memory system design can be made wide so that each transfer contains many bytes of information.
• The system can be divided into faster and slower portions and arranged so that the fast portion is used more often than the slow one.
Again, economics are the dominant force in the computer business. A cheap, statistically optimized memory system will be a better seller than a prohibitively expensive, blazingly fast one, so the first choice is not much of a choice at all. But these choices, used in combination, can attain a good fraction of the performance you would get if every component were fast.
1 This content is available online at <http://cnx.org/content/m32733/1.2/>.
2 See Chapter 15, Using Published Benchmarks, for details on the Linpack benchmark.


Chances are very good that your high performance workstation incorporates several or all of them.

Once the memory system has been decided upon, there are things we can do in software to see that it is used efficiently. A compiler that has some knowledge of the way memory is arranged and the details of the caches can optimize their use to some extent. The other place for optimizations is in user applications, as we'll see later in the book. A good pattern of memory access will work with, rather than against, the components of the system.

In this chapter we discuss how the pieces of a memory system work. We look at how patterns of data and instruction access factor into your overall runtime, especially as CPU speeds increase. We also talk a bit about the performance implications of running in a virtual memory environment.

1.1.2 Memory Technology3

Almost all fast memories used today are semiconductor-based.4 They come in two flavors: dynamic random access memory (DRAM) and static random access memory (SRAM). The term random means that you can address memory locations in any order. This is to distinguish random access from serial memories, where you have to step through all intervening locations to get to the particular one you are interested in. An example of a storage medium that is not random is magnetic tape. The terms dynamic and static have to do with the technology used in the design of the memory cells. DRAMs are charge-based devices, where each bit is represented by an electrical charge stored in a very small capacitor. The charge can leak away in a short amount of time, so the system has to be continually refreshed to prevent data from being lost. The act of reading a bit in DRAM also discharges the bit, requiring that it be refreshed. It's not possible to read the memory bit in the DRAM while it's being refreshed.

SRAM is based on gates, and each bit is stored in four to six connected transistors. SRAM memories retain their data as long as they have power, without the need for any form of data refresh.

DRAM offers the best price/performance, as well as highest density of memory cells per chip. This means lower cost, less board space, less power, and less heat. On the other hand, some applications such as cache and video memory require higher speed, to which SRAM is better suited. Currently, you can choose between SRAM and DRAM at slower speeds, down to about 50 nanoseconds (ns). SRAM has access times down to about 7 ns at higher cost, heat, power, and board space.

In addition to the basic technology to store a single bit of data, memory performance is limited by the practical considerations of the on-chip wiring layout and the external pins on the chip that communicate the address and data information between the memory and the processor.

1.1.2.1 Access Time

The amount of time it takes to read or write a memory location is called the memory access time. A related quantity is the memory cycle time. Whereas the access time says how quickly you can reference a memory location, cycle time describes how often you can repeat references. They sound like the same thing, but they're not. For instance, if you ask for data from DRAM chips with a 50-ns access time, it may be 100 ns before you can ask for more data from the same chips. This is because the chips must internally recover from the previous access. Also, when you are retrieving data sequentially from DRAM chips, some technologies have improved performance. On these chips, data immediately following the previously accessed data may be accessed as quickly as 10 ns.

Access and cycle times for commodity DRAMs are shorter than they were just a few years ago, meaning that it is possible to build faster memory systems. But CPU clock speeds have increased too. The home computer market makes a good study. In the early 1980s, the access time of commodity DRAM (200 ns) was shorter than the clock cycle (4.77 MHz = 210 ns) of the IBM PC XT. This meant that DRAM could be connected directly to the CPU without worrying about over running the memory system.

3 This content is available online at <http://cnx.org/content/m32716/1.2/>.
4 Magnetic core memory is still used in applications where radiation hardness (resistance to changes caused by ionizing radiation) is important.

Faster XT and AT models were introduced in the mid-1980s with CPUs that clocked more quickly than the access times of available commodity memory. Faster memory was available for a price, but vendors punted by selling computers with wait states added to the memory access cycle. Wait states are artificial delays that slow down references so that memory appears to match the speed of a faster CPU, at a penalty. However, the technique of adding wait states begins to significantly impact performance around 25 to 33 MHz. Today, CPU speeds are even farther ahead of DRAM speeds.

The clock time for commodity home computers has gone from 210 ns for the XT to around 3 ns for a 300-MHz Pentium-II, but the access time for commodity DRAM has decreased disproportionately less, from 200 ns to around 50 ns. Processor performance doubles every 18 months, while memory performance doubles roughly every seven years.

The CPU/memory speed gap is even larger in workstations. Some models clock at intervals as short as 1.6 ns. How do vendors make up the difference between CPU speeds and memory speeds? The memory in the Cray-1 supercomputer used SRAM that was capable of keeping up with the 12.5-ns clock cycle. Using SRAM for its main memory system was one of the reasons that most Cray systems needed liquid cooling.

Unfortunately, it's not practical for a moderately priced system to rely exclusively on SRAM for storage. It's also not practical to manufacture inexpensive systems with enough storage using exclusively SRAM.

The solution is a hierarchy of memories using processor registers, one to three levels of SRAM cache, DRAM main memory, and virtual memory stored on media such as disk. At each point in the memory hierarchy, tricks are employed to make the best use of the available technology. For the remainder of this chapter, we will examine the memory hierarchy and its impact on performance.

In a sense, with today's high performance microprocessor performing computations so quickly, the task of the high performance programmer becomes the careful management of the memory hierarchy. In some sense it's a useful intellectual exercise to view the simple computations such as addition and multiplication as infinitely fast in order to get the programmer to focus on the impact of memory operations on the overall performance of the program.

1.1.3 Registers5
At least the top layer of the memory hierarchy, the CPU registers, operate as fast as the rest of the processor. The goal is to keep operands in the registers as much as possible. This is especially important for intermediate values used in a long computation such as:

X = G * 2.41 + A / W - W * M
While computing the value of A divided by W, we must store the result of multiplying G by 2.41. It would be a shame to have to store this intermediate result in memory and then reload it a few instructions later. On any modern processor with moderate optimization, the intermediate result is stored in a register. Also, the value W is used in two computations, and so it can be loaded once and used twice to eliminate a wasted load.

Compilers have been very good at detecting these types of optimizations and efficiently making use of the available registers since the 1970s. Adding more registers to the processor has some performance benefit. It's not practical to add enough registers to the processor to store the entire problem data. So we must still use the slower memory technology.
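As a small, self-contained illustration, the expression above can be dropped into a FORTRAN program; the input values below are made up for the example, and the comments describe what a register-allocating compiler will typically do, not what any particular compiler is guaranteed to do.

      PROGRAM REGS
      REAL X, G, A, W, M
      G = 1.0
      A = 2.0
      W = 4.0
      M = 0.5
C     A register-allocating compiler will normally keep the product
C     G * 2.41 in a register instead of storing and reloading it,
C     and will load W once, reusing it for both A / W and W * M.
      X = G * 2.41 + A / W - W * M
      PRINT *, X
      END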

1.1.4 Caches6
Once we go beyond the registers in the memory hierarchy, we encounter caches. Caches are small amounts of SRAM that store a subset of the contents of the memory. The hope is that the cache will have the right subset of main memory at the right time.
5 This content is available online at <http://cnx.org/content/m32681/1.2/>.
6 This content is available online at <http://cnx.org/content/m32725/1.2/>.



The actual cache architecture has had to change as the cycle time of the processors has improved. The processors are so fast that off-chip SRAM chips are not even fast enough. This has led to a multilevel cache approach with one, or even two, levels of cache implemented as part of the processor. Table 1.1 shows the approximate speed of accessing the memory hierarchy on a 500-MHz DEC 21164 Alpha.

Registers       2 ns
L1 On-Chip      4 ns
L2 On-Chip      5 ns
L3 Off-Chip     30 ns
Memory          220 ns

Table 1.1: Memory Access Speed on a DEC 21164 Alpha

When every reference can be found in a cache, you say that you have a 100% hit rate. Generally, a hit rate of 90% or better is considered good for a level-one (L1) cache. In level-two (L2) cache, a hit rate of above 50% is considered acceptable. Below that, application performance can drop off steeply.

One can characterize the average read performance of the memory hierarchy by examining the probability that a particular load will be satisfied at a particular level of the hierarchy. For example, assume a memory architecture with an L1 cache speed of 10 ns, L2 speed of 30 ns, and memory speed of 300 ns. If a memory reference were satisfied from L1 cache 75% of the time, L2 cache 20% of the time, and main memory 5% of the time, the average memory performance would be:

(0.75 * 10) + (0.20 * 30) + (0.05 * 300) = 28.5 ns

You can easily see why it's important to have an L1 cache hit rate of 90% or higher.
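The weighted average above is easy to reproduce in a few lines of FORTRAN; the hit fractions and latencies below are simply the assumed values from the example, not measurements of any real machine.

      PROGRAM AVGMEM
      REAL HITL1, HITL2, HITMEM, TL1, TL2, TMEM, TAVG
C     Assumed hit fractions and access times from the example above.
      HITL1  = 0.75
      HITL2  = 0.20
      HITMEM = 0.05
      TL1  = 10.0
      TL2  = 30.0
      TMEM = 300.0
C     Each level's latency is weighted by the fraction of loads
C     satisfied at that level.
      TAVG = HITL1*TL1 + HITL2*TL2 + HITMEM*TMEM
      PRINT *, 'Average access time (ns): ', TAVG
      END

Running it prints 28.5 ns, matching the figure above.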


Given that a cache holds only a subset of the main memory at any time, it's important to keep an index of which areas of the main memory are currently stored in the cache. To reduce the amount of space that must be dedicated to tracking which memory areas are in cache, the cache is divided into a number of equal-sized slots known as lines. Each line contains some number of sequential main memory locations, generally four to sixteen integers or real numbers. Whereas the data within a line comes from the same part of memory, other lines can contain data that is far separated within your program, or perhaps data from somebody else's program, as in Figure 1.1 (Cache lines can come from different parts of memory). When you ask for something from memory, the computer checks to see if the data is available within one of these cache lines. If it is, the data is returned with a minimal delay. If it's not, your program may be delayed while a new line is fetched from main memory. Of course, if a new line is brought in, another has to be thrown out. If you're lucky, it won't be the one containing the data you are just about to need.


Figure 1.1: Cache lines can come from different parts of memory

On multiprocessors (computers with several CPUs), written data must be returned to main memory so the rest of the processors can see it, or all other processors must be made aware of local cache activity. Perhaps they need to be told to invalidate old lines containing the previous value of the written variable so that they don't accidentally use stale data. This is known as maintaining coherency between the different caches. The problem can become very complex in a multiprocessor system.7

Caches are effective because programs often exhibit characteristics that help keep the hit rate high. These characteristics are called spatial and temporal locality of reference; programs often make use of instructions and data that are near to other instructions and data, both in space and time. When a cache line is retrieved from main memory, it contains not only the information that caused the cache miss, but also some neighboring information. Chances are good that the next time your program needs data, it will be in the cache line just fetched or another one recently fetched.

Caches work best when a program is reading sequentially through the memory. Assume a program is reading 32-bit integers with a cache line size of 256 bits. When the program references the first word in the cache line, it waits while the cache line is loaded from main memory. Then the next seven references to memory are satisfied quickly from the cache. This is called unit stride because the address of each successive data element is incremented by one and all the data retrieved into the cache is used. The following loop is a unit-stride loop:

      DO I=1,1000000
        SUM = SUM + A(I)
      END DO


When a program accesses a large data structure using non-unit stride, performance suffers because data is loaded into cache that is not used. For example:
7 Section 3.2.1 describes cache coherency in more detail.




      DO I=1,1000000, 8
        SUM = SUM + A(I)
      END DO

This code would experience the same number of cache misses as the previous loop, and the same amount of data would be loaded into the cache. However, the program needs only one of the eight 32-bit words loaded into cache. Even though this program performs one-eighth the additions of the previous loop, its elapsed time is roughly the same as the previous loop because the memory operations dominate performance.

While this example may seem a bit contrived, there are several situations in which non-unit strides occur quite often. First, when a FORTRAN two-dimensional array is stored in memory, successive elements in the first column are stored sequentially followed by the elements of the second column. If the array is processed with the row iteration as the inner loop, it produces a unit-stride reference pattern as follows:

      REAL*4 A(200,200)
      DO J = 1,200
        DO I = 1,200
          SUM = SUM + A(I,J)
        END DO
      END DO


Interestingly, a FORTRAN programmer would most likely write the loop (in alphabetical order) as follows, producing a non-unit stride of 800 bytes between successive load operations:

      REAL*4 A(200,200)
      DO I = 1,200
        DO J = 1,200
          SUM = SUM + A(I,J)
        END DO
      END DO


Because of this, some compilers can detect this suboptimal loop order and reverse the order of the loops to make best use of the memory system. As we will see in Section 1.2.1, however, this code transformation may produce different results, and so you may have to give the compiler permission to interchange these loops in this particular example (or, after reading this book, you could just code it properly in the first place).

Loops do not always step through memory with a fixed stride. Consider, for example, a loop that walks through a linked list:

      while ( ptr != NULL )
         ptr = ptr->next;


The next element that is retrieved is based on the contents of the current element. This type of loop bounces all around memory in no particular pattern. This is called pointer chasing and there are no good ways to improve the performance of this code.

A third pattern often found in certain types of codes is called gather (or scatter) and occurs in loops such as:


      SUM = SUM + A( IND(I) )
where the IND array contains offsets into the A array. Again, like the linked list, the exact pattern of memory references is known only at runtime when the values stored in the IND array are known. Some special-purpose systems have special hardware support to accelerate this particular operation.

1.1.5 Cache Organization8

The process of pairing memory locations with cache lines is called mapping. Of course, given that a cache is smaller than main memory, you have to share the same cache lines for different memory locations. In caches, each cache line has a record of the memory address (called the tag) it represents and perhaps when it was last used. The tag is used to track which area of memory is stored in a particular cache line.

The way memory locations (tags) are mapped to cache lines can have a beneficial effect on the way your program runs, because if two heavily used memory locations map onto the same cache line, the miss rate will be higher than you would like it to be. Caches can be organized in one of several ways: direct mapped, fully associative, and set associative.

1.1.5.1 Direct-Mapped Cache


Direct mapping, as shown in Figure 1.2 (Many memory addresses map to the same cache line), is the simplest algorithm for deciding how memory maps onto the cache. Say, for example, that your computer has a 4-KB cache. In a direct mapped scheme, memory location 0 maps into cache location 0, as do memory locations 4K, 8K, 12K, etc. In other words, memory maps onto the cache size. Another way to think about it is to imagine a metal spring with a chalk line marked down the side. Every time around the spring, you encounter the chalk line at the same place modulo the circumference of the spring. If the spring is very long, the chalk line crosses many coils, the analog being a large memory with many locations mapping into the same cache line.

Problems occur when alternating runtime memory references in a direct-mapped cache point to the same cache line. Each reference causes a cache miss and replaces the entry just replaced, causing a lot of overhead. The popular word for this is thrashing. When there is a lot of thrashing, a cache can be more of a liability than an asset because each cache miss requires that a cache line be refilled, an operation that moves more data than merely satisfying the reference directly from main memory.

It is easy to construct a pathological case that causes thrashing in a 4-KB direct-mapped cache:

8 This content is available online at <http://cnx.org/content/m32722/1.2/>.




Figure 1.2: Many memory addresses map to the same cache line

      REAL*4 A(1024), B(1024)
      COMMON /STUFF/ A,B
      DO I=1,1024
        A(I) = A(I) * B(I)
      END DO
      END
The arrays A and B both take up exactly 4 KB of storage, and their inclusion together in COMMON assures that the arrays start exactly 4 KB apart in memory. In a 4-KB direct mapped cache, the same line that is used for A(1) is used for B(1), and likewise for A(2) and B(2), etc., so alternating references cause repeated cache misses. To fix it, you could either adjust the size of the array A, or put some other variables into COMMON, between them. For this reason one should generally avoid array dimensions that are close to powers of two.
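To see why A(I) and B(I) collide, it helps to compute which line of a direct-mapped cache a given byte address falls in. The sketch below assumes a 32-byte cache line (a size chosen only for illustration; the discussion above does not specify one), so a 4-KB cache has 128 lines.

      PROGRAM DMAP
      INTEGER LSIZE, NLINES, ADDRA, ADDRB
      PARAMETER (LSIZE = 32, NLINES = 128)
C     Byte offsets of A(1) and B(1), which sit exactly 4 KB apart.
      ADDRA = 0
      ADDRB = 4096
C     Line index = (byte address / line size) modulo number of lines.
      PRINT *, 'A(1) maps to line ', MOD(ADDRA / LSIZE, NLINES)
      PRINT *, 'B(1) maps to line ', MOD(ADDRB / LSIZE, NLINES)
      END

Both references land on line 0, so the loop above alternates between two memory locations that want the same cache line.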

1.1.5.2 Fully Associative Cache

At the other extreme from a direct mapped cache is a fully associative cache, where any memory location can be mapped into any cache line, regardless of memory address. Fully associative caches get their name from the type of memory used to construct them: associative memory. Associative memory is like regular memory, except that each memory cell knows something about the data it contains.

When the processor goes looking for a piece of data, the cache lines are asked all at once whether any of them has it. The cache line containing the data holds up its hand and says "I have it"; if none of them do, there is a cache miss. It then becomes a question of which cache line will be replaced with the new data. Rather than map memory locations to cache lines via an algorithm, like a direct-mapped cache, the memory system can ask the fully associative cache lines to choose among themselves which memory locations they will represent. Usually the least recently used line is the one that gets overwritten with new data. The assumption is that if the data hasn't been used in quite a while, it is least likely to be used in the future.

Fully associative caches have superior utilization when compared to direct mapped caches. It's difficult to find real-world examples of programs that will cause thrashing in a fully associative cache. The expense of fully associative caches is very high, in terms of size, price, and speed. The associative caches that do exist tend to be small.

1.1.5.3 Set-Associative Cache


Now imagine that you have two direct mapped caches sitting side by side in a single cache unit as shown in Figure 1.3 (Two-way set-associative cache). Each memory location corresponds to a particular cache line in each of the two direct-mapped caches. The one you choose to replace during a cache miss is subject to a decision about whose line was used last, the same way the decision was made in a fully associative cache except that now there are only two choices. This is called a set-associative cache. Set-associative caches generally come in two and four separate banks of cache. These are called two-way and four-way set associative caches, respectively. Of course, there are benefits and drawbacks to each type of cache. A set-associative cache is more immune to cache thrashing than a direct-mapped cache of the same size, because for each mapping of a memory address into a cache line, there are two or more choices where it can go. The beauty of a direct-mapped cache, however, is that it's easy to implement and, if made large enough, will perform roughly as well as a set-associative design. Your machine may contain multiple caches for several different purposes. Here's a little program for causing thrashing in a 4-KB two-way set-associative cache:

      REAL*4 A(1024), B(1024), C(1024)
      COMMON /STUFF/ A,B,C
      DO I=1,1024
        A(I) = A(I) * B(I) + C(I)
      END DO
      END
Like the previous cache thrasher program, this forces repeated accesses to the same cache lines, except that now there are three variables contending for the same mapping in the chosen set instead of two. Again, the way to fix it would be to change the size of the arrays or insert something in between them, in COMMON. By the way, if you accidentally arranged a program to thrash like this, it would be hard for you to detect it, aside from a feeling that the program runs a little slow. Few vendors provide tools for measuring cache misses.




Figure 1.3: Two-way set-associative cache

1.1.5.4 Instruction Cache


So far we have glossed over the two kinds of information you would expect to find in a cache between main memory and the CPU: instructions and data. But if you think about it, the demand for data is separate from the demand for instructions. In superscalar processors, for example, it's possible to execute an instruction that causes a data cache miss alongside other instructions that require no data from cache at all, i.e., they operate on registers. It doesn't seem fair that a cache miss on a data reference in one instruction should keep you from fetching other instructions because the cache is tied up. Furthermore, a cache depends on locality of reference between bits of data and other bits of data or instructions and other instructions, but what kind of interplay is there between instructions and data? It would seem possible for instructions to bump perfectly useful data from cache, or vice versa, with complete disregard for locality of reference.

Many designs from the 1980s used a single cache for both instructions and data. But newer designs are employing what is known as the Harvard Memory Architecture, where the demand for data is segregated from the demand for instructions. Main memory is still a single large pool, but these processors have separate data and instruction caches, possibly of different designs. By providing two independent sources for data and instructions, the aggregate rate of information coming from memory is increased, and interference between the two types of memory references is minimized. Also, instructions generally have an extremely high level of locality of reference because of the sequential nature of most programs. Because the instruction caches don't have to be particularly large to be effective, a typical architecture is to have separate L1 caches for instructions and data and to have a combined L2 cache. For example, the IBM/Motorola PowerPC 604e has separate 32-K four-way set-associative L1 caches for instruction and data and a combined L2 cache.

1.1.6 Virtual Memory9


Virtual memory decouples the addresses used by the program (virtual addresses) from the actual addresses where the data is stored in memory (physical addresses). Your program sees its address space starting at 0 and working its way up to some large number, but the actual physical addresses assigned can be very different. It gives a degree of flexibility by allowing all processes to believe they have the entire memory system to themselves. Another trait of virtual memory systems is that they divide your program's memory up into pages (chunks). Page sizes vary from 512 bytes to 1 MB or larger, depending on the machine. Pages don't have to be allocated contiguously, though your program sees them that way. By being separated into pages, programs are easier to arrange in memory, or move portions out to disk.

1.1.6.1 Page Tables


Say that your program asks for a variable stored at location 1000. In a virtual memory machine, there is no direct correspondence between your program's idea of where location 1000 is and the physical memory systems' idea. To find where your variable is actually stored, the location has to be translated from a virtual to a physical address. The map containing such translations is called a page table. Each process has several page tables associated with it, corresponding to different regions, such as program text and data segments.

To understand how address translation works, imagine the following scenario: at some point, your program asks for data from location 1000. Figure 1.4 (Virtual-to-physical address mapping) shows the steps required to complete the retrieval of this data. By choosing location 1000, you have identified which region the memory reference falls in, and this identifies which page table is involved. Location 1000 then helps the processor choose an entry within the table. For instance, if the page size is 512 bytes, 1000 falls within the second page (pages range from addresses 0-511, 512-1023, 1024-1535, etc.). Therefore, the second table entry should hold the address of the page housing the value at location 1000.
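The translation in this example is just integer division and remainder. Here is a minimal sketch using the 512-byte page size from the text; numbering the pages from zero is an assumption made for the illustration.

      PROGRAM VADDR
      INTEGER ADDR, PSIZE, IPAGE, IOFF
      PARAMETER (PSIZE = 512)
      ADDR = 1000
C     Page number and byte offset within that page.
      IPAGE = ADDR / PSIZE
      IOFF  = MOD(ADDR, PSIZE)
      PRINT *, 'virtual address', ADDR
      PRINT *, 'page           ', IPAGE
      PRINT *, 'offset         ', IOFF
      END

Location 1000 falls in page 1 (the second page) at offset 488; the page table entry supplies the physical address of that page, and the offset is carried over unchanged.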

9 This content is available online at <http://cnx.org/content/m32728/1.2/>.




Figure 1.4: Virtual-to-physical address mapping

The operating system stores the page-table addresses virtually, so it's going to take a virtual-to-physical translation to locate the table in memory. One more virtual-to-physical translation, and we finally have the true address of location 1000. The memory reference can complete, and the processor can return to executing your program.

1.1.6.2 Translation Lookaside Buffer


As you can see, address translation through a page table is pretty complicated. It required two table lookups (maybe three) to locate our data. If every memory reference was that complicated, virtual memory computers would be horrible performers. Fortunately, locality of reference causes virtual address translations to group together; a program may repeat the same virtual page mapping millions of times a second. And where we have repeated use of the same data, we can apply a cache.

All modern virtual memory machines have a special cache called a translation lookaside buffer (TLB) for virtual-to-physical-memory-address translation. The two inputs to the TLB are an integer that identifies the program making the memory request and the virtual page requested. From the output pops a pointer to the physical page number. Virtual address in; physical address out. TLB lookups occur in parallel with instruction execution, so if the address data is in the TLB, memory references proceed quickly.

Like other kinds of caches, the TLB is limited in size. It doesn't contain enough entries to handle all the possible virtual-to-physical-address translations for all the programs that might run on your computer. Larger pools of address translations are kept out in memory, in the page tables. If your program asks for a virtual-to-physical-address translation, and the entry doesn't exist in the TLB, you suffer a TLB miss. The information needed may have to be generated (a new page may need to be created), or it may have to be retrieved from the page table.

The TLB is good for the same reason that other types of caches are good: it reduces the cost of memory references. But like other caches, there are pathological cases where the TLB can fail to deliver value. The easiest case to construct is one where every memory reference your program makes causes a TLB miss:

      REAL X(10000000)
      COMMON X
      DO I=0,9999
        DO J=1,10000000,10000
          SUM = SUM + X(J+I)
        END DO
      END DO


Assume that the TLB page size for your computer is less than 40 KB. Every time through the inner loop in the above example code, the program asks for data that is 4 bytes*10,000 = 40,000 bytes away from the last reference. That is, each reference falls on a different memory page. This causes 1000 TLB misses in the inner loop, taken 1001 times, for a total of at least one million TLB misses. To add insult to injury, each reference is guaranteed to cause a data cache miss as well. Admittedly, no one would start with a loop like the one above. But presuming that the loop was any good to you at all, the restructured version in the code below would cruise through memory like a warm knife through butter:

      REAL X(10000000)
      COMMON X
      DO I=1,10000000
        SUM = SUM + X(I)
      END DO


The revised loop has unit stride, and TLB misses occur only every so often. Usually it is not necessary to explicitly tune programs to make good use of the TLB. Once a program is tuned to be cache-friendly, it nearly always is tuned to be TLB friendly.

Because there is a performance benefit to keeping the TLB very small, the TLB entry often contains a length field. A single TLB entry can be over a megabyte in length and can be used to translate addresses stored in multiple virtual memory pages.

1.1.6.3 Page Faults


A page table entry also contains other information about the page it represents, including flags to tell whether the translation is valid, whether the associated page can be modified, and some information describing how new pages should be initialized. References to pages that aren't marked valid are called page faults.

Taking a worst-case scenario, say that your program asks for a variable from a particular memory location. The processor goes to look for it in the cache and finds it isn't there (cache miss), which means it must be loaded from memory. Next it goes to the TLB to find the physical location of the data in memory and finds there is no TLB entry (a TLB miss). Then it tries consulting the page table (and refilling the TLB), but finds that either there is no entry for your particular page or that the memory page has been shipped to disk (both are page faults). Each step of the memory hierarchy has shrugged off your request. A new page will have to be created in memory and possibly, depending on the circumstances, refilled from disk.

Although they take a lot of time, page faults aren't errors. Even under optimal conditions every program suffers some number of page faults. Writing a variable for the first time or calling a subroutine that has never been called can cause a page fault.



This may be surprising if you have never thought about it before. The illusion is that your entire program is present in memory from the start, but some portions may never be loaded. There is no reason to make space for a page whose data is never referenced or whose instructions are never executed. Only those pages that are required to run the job get created or pulled in from the disk.10

The pool of physical memory pages is limited because physical memory is limited, so on a machine where many programs are lobbying for space, there will be a higher number of page faults. This is because physical memory pages are continually being recycled for other purposes. However, when you have the machine to yourself, and memory is less in demand, allocated pages tend to stick around for a while. In short, you can expect fewer page faults on a quiet machine. One trick to remember if you ever end up working for a computer vendor: always run short benchmarks twice. On some systems, the number of page faults will go down. This is because the second run finds pages left in memory by the first, and you won't have to pay for page faults again.11

Paging space (swap space) on the disk is the last and slowest piece of the memory hierarchy for most machines. In the worst-case scenario we saw how a memory reference could be pushed down to slower and slower performance media before finally being satisfied. If you step back, you can view the disk paging space as having the same relationship to main memory as main memory has to cache. The same kinds of optimizations apply too, and locality of reference is important. You can run programs that are larger than the main memory system of your machine, but sometimes at greatly decreased performance. When we look at memory optimizations in Section 2.4.1, we will concentrate on keeping the activity in the fastest parts of the memory system and avoiding the slow parts.

1.1.7 Improving Memory Performance12


Given the importance, in the area of high performance computing, of the performance of a computer's memory subsystem, many techniques have been used to improve the performance of the memory systems of computers. The two attributes of memory system performance are generally bandwidth and latency. Some memory system design changes improve one at the expense of the other, and other improvements positively impact both bandwidth and latency.

Bandwidth generally focuses on the best possible steady-state transfer rate of a memory system. Usually this is measured while running a long unit-stride loop reading or reading and writing memory.13 Latency is a measure of the worst-case performance of a memory system as it moves a small amount of data such as a 32- or 64-bit word between the processor and memory. Both are important because they are an important part of most high performance applications.

Because memory systems are divided into components, there are different bandwidth and latency figures between different components as shown in Figure 1.5 (Simple Memory System). The bandwidth rate between a cache and the CPU will be higher than the bandwidth between main memory and the cache, for instance. There may be several caches and paths to memory as well. Usually, the peak memory bandwidth quoted by vendors is the speed between the data cache and the processor.

In the rest of this section, we look at techniques to improve latency, bandwidth, or both.
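A rough bandwidth measurement of the kind described above can be made by timing a long unit-stride read loop. The sketch below assumes the Fortran 95 CPU_TIME intrinsic and an 8-byte REAL*8 element size; treat it as an illustration rather than a careful benchmark (the STREAM benchmark mentioned in the footnote is the more rigorous approach).

      PROGRAM MEMBW
      INTEGER N, I
      PARAMETER (N = 10000000)
      REAL*8 A(N), SUM
      REAL T0, T1
      DO I = 1, N
         A(I) = 1.0D0
      END DO
      CALL CPU_TIME(T0)
C     Long unit-stride read of the whole array.
      SUM = 0.0D0
      DO I = 1, N
         SUM = SUM + A(I)
      END DO
      CALL CPU_TIME(T1)
C     Bytes read divided by elapsed time, in megabytes per second.
      PRINT *, 'Read MB/s: ', (8.0 * N) / ((T1 - T0) * 1.0E6)
C     Print SUM so the compiler cannot discard the loop.
      PRINT *, SUM
      END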

1.1.7.1 Large Caches


As we mentioned at the start of this chapter, the disparity between CPU speeds and memory is growing. If you look closely, you can see vendors innovating in several ways. Some workstations are being offered with 4-MB data caches! This is larger than the main memory systems of machines just a few years ago. With a large enough cache, a small (or even moderately large) data set can fit completely inside and get incredibly good performance. Watch out for this when you are testing new hardware. When your program grows too large for the cache, the performance may drop off considerably, perhaps by a factor of 10 or more, depending on the memory access patterns. Interestingly, an increase in cache size on the part of vendors can render a benchmark obsolete.

10 The term for this is demand paging.
11 Text pages are identified by the disk device and block number from which they came.
12 This content is available online at <http://cnx.org/content/m32736/1.2/>.
13 See the STREAM section in Chapter 15 for measures of memory bandwidth.

Figure 1.5: Simple Memory System

Up to 1992, the Linpack 100×100 benchmark was probably the single most-respected benchmark to determine the average performance across a wide range of applications. In 1992, IBM introduced the IBM RS-6000 which had a cache large enough to contain the entire 100×100 matrix for the duration of the benchmark. For the first time, a workstation had performance on this benchmark on the same order of supercomputers. In a sense, with the entire data structure in a SRAM cache, the RS-6000 was operating like a Cray vector supercomputer. The problem was that the Cray could maintain and improve the performance for a 120×120 matrix, whereas the RS-6000 suffered a significant performance loss at this increased matrix size. Soon, all the other workstation vendors introduced similarly large caches, and the 100×100 Linpack benchmark ceased to be useful as an indicator of average application performance.

1.1.7.2 Wider Memory Systems


Consider what happens when a cache line is refilled from memory: consecutive memory locations from main memory are read to fill consecutive locations within the cache line. The number of bytes transferred depends on how big the line is, anywhere from 16 bytes to 256 bytes or more. We want the refill to proceed quickly because an instruction is stalled in the pipeline, or perhaps the processor is waiting for more instructions. In Figure 1.6 (Narrow memory system), if we have two DRAM chips that provide us with 4 bits of data every 100 ns (remember cycle time), a cache fill of a 16-byte line takes 1600 ns.




Figure 1.6: Narrow memory system

One way to make the cache-line fill operation faster is to widen the memory system as shown in Figure 1.7 (Wide memory system). Instead of having two rows of DRAMs, we create multiple rows of DRAMs. Now on every 100-ns cycle, we get 32 contiguous bits, and our cache-line fills are four times faster.

Figure 1.7: Wide memory system

We can improve the performance of a memory system by increasing the width of the memory system up to the length of the cache line, at which time we can fill the entire line in a single memory cycle. On the SGI Power Challenge series of systems, the memory width is 256 bits. The downside of a wider memory system is that DRAMs must be added in multiples. In many modern workstations and personal computers, memory is expanded in the form of single inline memory modules (SIMMs). SIMMs currently are either 30-, 72-, or 168-pin modules, each of which is made up of several DRAM chips ready to be installed into a memory sub-system.
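The 1600-ns figure from Figure 1.6 and the fourfold speedup from Figure 1.7 both follow from the same arithmetic: the fill time is the number of transfers needed for a line multiplied by the memory cycle time. A small sketch using the 16-byte line and 100-ns cycle from the example above:

      PROGRAM FILLT
      INTEGER LBITS, CYCLE
      PARAMETER (LBITS = 128, CYCLE = 100)
C     A 16-byte (128-bit) cache line refilled over a narrow (8-bit)
C     and a wide (32-bit) memory path.
      PRINT *, ' 8-bit path:', (LBITS /  8) * CYCLE, ' ns'
      PRINT *, '32-bit path:', (LBITS / 32) * CYCLE, ' ns'
      END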

1.1.7.3 Bypassing Cache


It's interesting that we have spent nearly an entire chapter on how great a cache is for high performance computers, and now we are going to bypass the cache to improve performance. As mentioned earlier, some types of processing result in non-unit strides (or bouncing around) through memory. These types of memory reference patterns bring out the worst-case behavior in cache-based architectures. It is these reference patterns that see improved performance by bypassing the cache. Inability to support these types of computations remains an area where traditional supercomputers can significantly outperform high-speed RISC processors. For this reason, RISC processors that are serious about number crunching may have special instructions that bypass data cache memory; the data are transferred directly between the processor and the main memory system.14

In Figure 1.8 (Bypassing cache) we have four banks of SIMMs that can do cache fills at 128 bits per 100 ns memory cycle. Remember that the data is available after 50 ns but we can't get more data until the DRAMs refresh 50-60 ns later. However, if we are doing 32-bit non-unit-stride loads and have the capability to bypass cache, each load will be satisfied from one of the four SIMMs in 50 ns. While that SIMM refreshed, another load can occur from any of the other three SIMMs in 50 ns. In a random mix of non-unit loads there is a 75% chance that the next load will fall on a fresh DRAM. If the load falls on a bank while it is refreshing, it simply has to wait until the refresh completes.

A further advantage of bypassing cache is that the data doesn't need to be moved through the SRAM cache. This operation can add from 10-50 ns to the load time for a single word. This also avoids invalidating the contents of an entire cache line in the cache.

Adding cache bypass, increasing memory-system widths, and adding banks increases the cost of a memory system. Computer-system vendors make an economic choice as to how many of these techniques they need to apply to get sufficient performance for their particular processor and system. Hence, as processor speed increases, vendors must add more of these memory system features to their commodity systems to maintain a balance between processor and memory-system speed.

14 By the way, most machines have uncached memory spaces for process synchronization and I/O device registers. However, memory references to these locations bypass the cache because of the address chosen, not necessarily because of the instruction chosen.




Figure 1.8: Bypassing cache

1.1.7.4 Interleaved and Pipelined Memory Systems


Vector supercomputers, such as the CRAY Y/MP and the Convex C3, are machines that depend on multibanked memory systems for performance. The C3, in particular, has a memory system with up to 256-way interleaving. Each interleave (or bank) is 64 bits wide. This is an expensive memory system to build, but it has some very nice performance characteristics. Having a large number of banks helps to reduce the chances of repeated access to the same memory bank. If you do hit the same bank twice in a row, however, the penalty is a delay of nearly 300 ns, a long time for a machine with a clock speed of 16 ns. So when things go well, they go very well.

However, having a large number of banks alone is not sufficient to feed a 16-ns processor using 50 ns DRAM. In addition to interleaving, the memory subsystem also needs to be pipelined. That is, the CPU must begin the second, third, and fourth load before the CPU has received the results of the first load as shown in Figure 1.9 (Multibanked memory system). Then each time it receives the results from bank n, it must start the load from bank n+4 to keep the pipeline fed. This way, after a brief startup delay, loads complete every 16 ns and so the memory system appears to operate at the clock rate of the CPU. This pipelined memory approach is facilitated by the 128-element vector registers in the C3 processor.

Using gather/scatter hardware, non-unit-stride operations can also be pipelined. The only difference for non-unit-stride operations is that the banks are not accessed in sequential order. With a random pattern of memory references, it's possible to reaccess a memory bank before it has completely refreshed from a previous access. This is called a bank stall.

PS

Multibanked memory system

Figure 1.9

hi'erent ess ptterns re sujet to nk stlls of vrying severityF por instneD esses to every fourth word in n eightEnk memory system would lso e sujet to nk stllsD though the reovery would our soonerF eferenes to every seond word might not experiene nk stlls t llY eh nk my hve reovered y the time its next referene omes roundY it depends on the reltive speeds of the proessor nd memory systemF srregulr ess ptterns re sure to enounter some nk stllsF sn ddition to the nk stll hzrdD singleEword referenes mde diretly to multinked memory system rry greter lteny thn those of @suessfullyA hed memory essesF his is euse referenes re going out to memory tht is slower thn heD nd there my e dditionl ddress trnsltion steps s wellF roweverD nked memory referenes re pipelinedF es long s referenes re strted well enough in dvneD severl pipelinedD multinked referenes n e in )ight t one timeD giving you good throughputF he ghgEPHS system performed vetor opertions in memoryEtoEmemory fshion using set of expliit memory pipelinesF his system hd superior performne for very long unitEstride vetor omputtionsF e single instrution ould perform TSDHHH omputtions using three memory pipesF

1.1.7.5 Software Managed Caches


rere9s n interesting thoughtX if vetor proessor n pln fr enough in dvne to strt memory pipeD why n9t sg proessor strt heE(ll efore it relly needs the dt in those sme situtionsc sn wyD this is priming the he to hide the lteny of the heE(llF sf this ould e done fr enough in dvneD it would pper tht ll memory referenes would operte t the speed of the heF his onept is lled prefething nd it is supported using speil prefeth instrution ville on mny sg proessorsF e prefeth instrution opertes just like stndrd lod instrutionD exept tht the proessor doesn9t wit for the he to (ll efore the instrution ompletesF he ide is to prefeth fr enough hed of the omputtion to hve the dt redy in he y the time the tul omputtion oursF he following is n exmple of how this might e usedX

PT

CHAPTER 1. MODERN COMPUTER ARCHITECTURES


hy saIDIHHHHHHDV ipigr@e@sCVAA hy taHDU wawCe@sCtA ixh hy ixh hy

his is not the tul pyexF refething is usully done in the ssemly ode generted y the ompiler when it detets tht you re stepping through the rry using (xed strideF he ompiler typilly estimte how fr hed you should e prefethingF sn the ove exmpleD if the heE(lls were prtiulrly slowD the vlue V in sCV ould e hnged to IT or QP while the other vlues hnged ordinglyF sn proessor tht ould only issue one instrution per yleD there might e no pyk to prefeth instrutionY it would tke up vlule time in the instrution strem in exhnge for n unertin ene(tF yn superslr proessorD howeverD he hint ould e mixed in with the rest of the instrution strem nd issued longside otherD rel instrutionsF sf it sved your progrm from su'ering extr he missesD it would e worth hvingF

1.1.7.6 Post-RISC Eects on Memory References


wemory opertions typilly ess the memory during the exeute phse of the pipeline on sg proessorF yn the postEsg proessorD things re no di'erent thn on sg proessor exept tht mny lods n e hlf (nished t ny given momentF yn some urrent proessorsD up to PV memory opertions my e tive with IH witing for o'Ehip memory to rriveF his is n exellent wy to ompenste for slow memory lteny ompred to the g speedF gonsider the following loopX

vyyX

vyehs vyehs vyeh sxg yi sxg gywei fv

TDIHHHH SDH IDP@SA I IDQ@SA S SDT vyy

et the stertions et the index vrile vod vlue from memory edd one to I tore the inremented vlue k to memory edd one to S ghek for loop termintion frnh if S < T k to vyy

sn this exmpleD ssume tht it tke SH yles to ess memoryF hen the fethG deode puts the (rst lod into the instrution reorder u'er @sfAD the lod strts on the next yle nd then is suspended in the exeute phseF roweverD the rest of the instrutions re in the sfF he sxg I must wit for the lod nd the yi must lso witF roweverD y using renme registerD the sxg SD gyweiD nd fv n ll e omputedD nd the fethGdeode goes up to the top of the loop nd sends nother lod into the sf for the next memory lotion tht will hve to witF his looping ontinues until out IH itertions of the loop re in the sfF hen the (rst lod tully shows up from memory nd the sxg I nd yi from the (rst itertion egins exeutingF yf ourse the store tkes whileD ut out tht time the seond lod (nishesD so there is more work to do nd so on. . . vike mny spets of omputingD the postEsg rhitetureD with its outEofEorder nd speultive exeuE tionD optimizes memory referenesF he postEsg proessor dynmilly unrolls loops t exeution time to ompenste for memory susystem delyF essuming pipelined multinked memory system tht n hve multiple memory opertions strted efore ny omplete @the r eEVHHH n hve IH o'E hip memory opE ertions in )ight t one timeAD the proessor ontinues to dispth memory opertions until those opertions egin to ompleteF

PU nlike vetor proessor or prefeth instrutionD the postEsg proessor does not need to ntiipte the preise pttern of memory referenes so it n refully ontrol the memory susystemF es resultD the postEsg proessor n hieve pek performne in frEwider rnge of ode sequenes thn either vetor proessors or inEorder sg proessors with prefeth pilityF his impliit tolerne to memory lteny mkes the postEsg proessors idel for use in the slle shredEmemory proessors of the futureD where the memory hierrhy will eome even more omplex thn urrent proessors with three levels of he nd min memoryF nfortuntelyD the one ode segment tht doesn9t ene(t signi(ntly from the postEsg rhiteture is the linkedElist trverslF his is euse the next ddress is never known until the previous lod is ompleted so ll lods re fundmentlly serilizedF

1.1.7.7 Dynamic RAM Technology Trends


wuh of the tehniques in this setion hve foused on how to del with the imperfetions of the dynmi ew hip @lthough when your lok rte hits QHH!THH wrz or Q!P nsD even ew strts to look pretty slowAF st9s ler tht the demnd for more nd more ew will ontinue to inreseD nd gigits nd more hew will (t on single hipF feuse of thisD signi(nt work is underwy to mke new superEhews fster nd more tuned to the extremely fst proessors of the present nd the futureF ome of the tehnologies re reltively strightforwrdD nd others require mjor redesign of the wy tht proessors nd memories re mnufturedF ome hew improvements inludeX

pst hew sves time y llowing mode in whih the entire ddress doesn9t hve to e reE loked into the hip for eh memory opertionF snstedD there is n ssumption tht the memory will e essed sequentilly @s in heEline (llAD nd only the lowEorder its of the ddress re loked in for suessive reds or writesF is modi(tion to output u'ering on pge mode ew tht llows it to operte roughly twie s quikly for opertions other thn refreshF is synhronized using n externl lok tht llows the he nd the hew to oordinte their opertionsF elsoD hew n pipeline the retrievl of multiple memory its to improve overll throughputF is proprietry tehnology ple of SHH wfGse dt trnsferF ewf uses signi(nt logi within the hip nd opertes t higher power levels thn typil hewF omines ew he on the sme hip s the hewF his tightly ouples the ew nd hew nd provides performne similr to ew devies with ll the limittions of ny he rhitetureF yne dvntge of the ghew pproh is tht the mount of he is inresed s the mount of hew is inresedF elso when deling with memory systems with lrge numer of interlevesD eh interleve hs its own ew to redue ltenyD ssuming the dt requested ws in the ewF en even more dvned pproh is to integrte the proessorD ewD nd hew onto single hip loked t sy S qrzD ontining IPV wf of dtF nderstndlyD there is wide rnge of tehnil prolems to solve efore this type of omponent is widely ville for 6PHH " ut it9s not out of the questionF he mnufturing proesses for hew nd proessors re lredy eginning to onverge in some wys @ewfAF he iggest performne prolem when we hve this type of system will eD ht to do if you need ITH wfc

page mode

pst pge mode hew ixtended dt out ew @ihy ewA ynhronous hew @hewA ewf ghed hew @ghewA

EDO RAM Synchronous DRAM RAMBUS Cached DRAM

PV

CHAPTER 1. MODERN COMPUTER ARCHITECTURES the

1.1.8 Closing Notes15


hey sy tht the omputer of the future will e good memory system tht just hppens to hve g tthedF es high performne miroproessor systems tke over s high performne omputing enginesD the prolem of heEsed memory system tht uses hew for min memory must e solvedF here re mny rhiteture nd tehnology e'orts underwy to trnsform worksttion nd personl omputer memories to e s ple s superomputer memoriesF es g speed inreses fster thn memory speedD you will need the tehniques in this ookF elsoD s you move into multiple proessorsD memory prolems don9t get etterY usully they get worseF ith mny hungry proessors lwys redy for more dtD memory susystem n eome extremely strinedF ith just little skillD we n often restruture memory esses so tht they ply to your memory system9s strengths insted of its weknessesF

1.1.9 Exercises16
Exercise 1.1
he following ode segment trverses pointer hinX

while @@p a @hr BA BpA 3a xvvAY


row will suh ode intert with the he if ll the referenes fll within smll portion of memE oryc row will the ode intert with the he if referenes re strethed ross mny megytesc

Exercise 1.2 Exercise 1.3

row would the ode in ixerise IFI ehve on multinked memory system tht hs no hec

e long time goD people regulrly wrote selfEmodifying ode " progrms tht wrote into instrution memory nd hnged their own ehviorF ht would e the implitions of selfEmodifying ode on mhine with rrvrd memory rhiteturec

Exercise 1.4

essume memory rhiteture with n vI he speed of IH nsD vP speed of QH nsD nd memory speed of PHH nsF gompre the verge memory system performne with @IA vI VH7D vP IH7D nd memory IH7Y nd @PA vI VS7 nd memory IS7F

Exercise 1.5

yn omputer systemD run loops tht proess rrys of vrying length from IT to IT millionX

ee@sA a ee@sA C Q
row does the numer of dditions per seond hnge s the rry length hngesc ixperiment with ievBRD ievBVD sxiqiBRD nd sxiqiBVF hih hs more signi(nt impt on performneX lrger rry elements or integer versus )otingEpointc ry this on rnge of di'erent omputersF

Exercise 1.6

grete twoEdimensionl rry of IHPRIHPRF voop through the rry with rows s the inner loop nd then gin with olumns s the inner loopF erform simple opertion on eh elementF ho the loops perform di'erentlyc hyc ixperiment with di'erent dimensions for the rry nd see the performne imptF

Exercise 1.7

rite progrm tht repetedly exeutes timed loops of di'erent sizes to determine the he size for your systemF
15 This 16 This
content is available online at <http://cnx.org/content/m32690/1.2/>. content is available online at <http://cnx.org/content/m32698/1.2/>.

PW

1.2 Floating-Point Numbers


1.2.1 Introduction17
yften when we wnt to mke point tht nothing is sredD we syD one plus one does not equl twoF his is designed to shok us nd ttk our fundmentl ssumptions out the nture of the universeF ellD in this hpter on )otingE point numersD we will lern tht  0.1 + 0.1 does not lwys equl 0.2 when we use )otingEpoint numers for omputtionsF sn this hpter we explore the limittions of )otingEpoint numers nd how you s progrmmer n write ode to minimize the e'et of these limittionsF his hpter is just rief introdution to signi(nt (eld of mthemtis lled F

numerical analysis

1.2.2 Reality18
he rel world is full of rel numersF untities suh s distnesD veloitiesD mssesD nglesD nd other quntities re ll rel numersF19 e wonderful property of rel numers is tht they hve unlimited uryF por exmpleD when onsidering the rtio of the irumferene of irle to its dimeterD we rrive t vlue of QFIRISWPFFFF he deiml vlue for does not terminteF feuse rel numers hve unlimited uryD even though we n9t write it downD is still rel numerF ome rel numers re rtionl numers euse they n e represented s the rtio of two integersD suh s IGQF xot ll rel numers re rtionl numersF xot surprisinglyD those rel numers tht ren9t rtionl numers re lled irrtionlF ou proly would not wnt to strt n rgument with n irrtionl numer unless you hve lot of free time on your hndsF nfortuntelyD on piee of pperD or in omputerD we don9t hve enough spe to keep writing the digits of F o wht do we doc e deide tht we only need so muh ury nd round rel numers to ertin numer of digitsF por exmpleD if we deide on four digits of uryD our pproximtion of is QFIRPF ome stte legislture ttempted to pss lw tht ws to e threeF hile this is often ited s evidene for the s of governmentl entitiesD perhps the legislture ws just suggesting tht we only need one digit of ury for F erhps they foresw the need to sve preious memory spe on omputers when representing rel numersF

pi pi

pi

pi

pi

pi

1.2.3 Representation20
qiven tht we nnot perfetly represent rel numers on digitl omputersD we must ome up with ompromise tht llows us to pproximte rel numersF21 here re numer of di'erent wys tht hve een used to represent rel numersF he hllenge in seleting representtion is the trdeEo' etween spe nd ury nd the trdeo' etween speed nd uryF sn the (eld of high performne omputing we generlly expet our proessors to produe )otingE point result every THHEwrz lok yleF st is pretty ler tht in most pplitions we ren9t willing to drop this y ftor of IHH just for little more uryF fefore we disuss the formt used y most high performne omputersD we disuss some lterntive @leit slowerA tehniques for representing rel numersF
17 This content is available online at <http://cnx.org/content/m32739/1.2/>. 18 This content is available online at <http://cnx.org/content/m32741/1.2/>. 19 In high performance computing we often simulate the real world, so it is somewhat ironic that we use simulated real numbers
(oating-point) in those simulations of the real world.

20 This content is available online 21 Interestingly, analog computers

at <http://cnx.org/content/m32772/1.2/>. have an easier time representing real numbers. Imagine a water- adding analog computer

which consists of two glasses of water and an empty glass. The amount of water in the two glasses are perfectly represented real numbers. By pouring the two glasses into a third, we are adding the two real numbers perfectly (unless we spill some), and we wind up with a real number amount of water in the third glass. The problem with analog computers is knowing just how much water is in the glasses when we are all done. It is also problematic to perform 600 million additions per second using this technique without getting pretty wet. Try to resist the temptation to start an argument over whether quantum mechanics would cause the real numbers to be rational numbers. And don't point out the fact that even digital computers are really analog computers at their core. I am trying to keep the focus on oating-point values, and you keep drifting away!

QH

CHAPTER 1. MODERN COMPUTER ARCHITECTURES

1.2.3.1 Binary Coded Decimal


sn the erliest omputersD one tehnique ws to use inry oded deiml @fghAF sn fghD eh seEIH digit ws stored in four itsF xumers ould e ritrrily long with s muh preision s there ws memoryX

IPQFRS HHHI HHIH HHII HIHH HIHI


his formt llows the progrmmer to hoose the preision required for eh vrileF nfortuntelyD it is di0ult to uild extremely highEspeed hrdwre to perform rithmeti opertions on these numersF feuse eh numer my e fr longer thn QP or TR itsD they did not (t niely in registerF wuh of the )otingE point opertions for fgh were done using loops in miroodeF iven with the )exiility of ury on fgh representtionD there ws still need to round rel numers to (t into limited mount of speF enother limittion of the fgh pproh is tht we store vlue from H!W in fourEit (eldF his (eld is ple of storing vlues from H!IS so some of the spe is wstedF

1.2.3.2 Rational Numbers


yne intriguing method of storing rel numers is to store them s rtionl numersF o rie)y review mthemtisD rtionl numers re the suset of rel numers tht n e expressed s rtio of integer numersF por exmpleD PPGU nd IGP re rtionl numersF ome rtionl numersD suh s IGP nd IGIHD hve perfet representtion s seEIH deimlsD nd othersD suh s IGQ nd PPGUD n only e expressed s in(niteElength seEIH deimlsF hen using rtionl numersD eh rel numer is stored s two integer numers representing the numertor nd denomintorF he si frtionl rithmeti opertions re used for dditionD sutrtionD multiplitionD nd divisionD s shown in pigure IFIH @tionl numer mthemtisAF

Rational number mathematics

Figure 1.10

he limittion tht ours when using rtionl numers to represent rel numers is tht the size of the numertors nd denomintors tends to growF por eh dditionD ommon denomintor must e foundF o

QI keep the numers from eoming extremely lrgeD during eh opertionD it is importnt to (nd the @qghA to redue frtions to their most ompt representtionF hen the vlues grow nd there re no ommon divisorsD either the lrge integer vlues must e stored using dynmi memory or some form of pproximtion must e usedD thus losing the primry dvntge of rtionl numersF por mthemtil pkges suh s wple or wthemti tht need to produe ext results on smller dt setsD the use of rtionl numers to represent rel numers is t times useful tehniqueF he perforE mne nd storge ost is less signi(nt thn the need to produe ext results in some instnesF

common divisor

greatest

1.2.3.3 Fixed Point


sf the desired numer of deiml ples is known in dvneD it9s possile to use (xedEpoint representtionF sing this tehniqueD eh rel numer is stored s sled integerF his solves the prolem tht seEIH frtions suh s HFI or HFHI nnot e perfetly represented s seEP frtionF sf you multiply IIHFUU y IHH nd store it s sled integer IIHUUD you n perfetly represent the seEIH frtionl prt @HFUUAF his pproh n e used for vlues suh s moneyD where the numer of digits pst the deiml point is smll nd knownF roweverD just euse ll numers n e urtely represented it doesn9t men there re not errors with this formtF hen multiplying (xedEpoint numer y frtionD you get digits tht n9t e represented in (xedEpoint formtD so some form of rounding must e usedF por exmpleD if you hve 6IPSFVU in the nk t R7 interestD your interest mount would e 6SFHQRVF roweverD euse your nk lne only hs two digits of uryD they only give you 6SFHQD resulting in lne of 6IQHFWHF yf ourse you proly hve herd mny stories of progrmmers getting rih depositing mny of the remining HFHHRV mounts into their own ountF wy guess is tht nks hve proly (gured tht one out y nowD nd the nk keeps the money for itselfF fut it does mke one wonder if they round or trunte in this type of lultionF22

1.2.3.4 Mantissa/Exponent
he )otingEpoint formt tht is most prevlent in high performne omputing is vrition on sienti( nottionF sn sienti( nottion the rel numer is represented using mntissD seD nd exponentX TFHP IH23 F he mntiss typilly hs some (xed numer of ples of uryF he mntiss n e represented in se PD se ITD or fghF here is generlly limited rnge of exponentsD nd the exponent n e expressed s power of PD IHD or ITF he primry dvntge of this representtion is tht it provides wide overll rnge of vlues while using (xedElength storge representtionF he primry limittion of this formt is tht the di'erene etween two suessive vlues is not uniformF por exmpleD ssume tht you n represent three seEIH digitsD nd your exponent n rnge from !IH to IHF por numers lose to zeroD the distne etween suessive numers is very smllF por the numer 1.72 1010 D the next lrger numer is 1.73 1010 F he distne etween these two lose smll numers is HFHHHHHHHHHHHIF por the numer 6.33 1010 D the next lrger numer is 6.34 1010 F he distne etween these lose lrge numers is IHH millionF sn pigure IFII @histne etween suessive )otingEpoint numersAD we use two seEP digits with n exponent rnging from !I to IF
22 Perhaps
banks round this instead of truncating, knowing that they will always make it up in teller machine fees.

QP

CHAPTER 1. MODERN COMPUTER ARCHITECTURES


Distance between successive oating-point numbers

Figure 1.11

here re multiple equivlent representtions of numer when using sienti( nottionX 6.00 105 0.60 106 0.06 107 fy onventionD we shift the mntiss @djust the exponentA until there is extly one nonzero digit to the left of the deiml pointF hen numer is expressed this wyD it is sid to e normlizedF sn the ove listD only TFHH IH5 is normlizedF pigure IFIP @xormlized )otingEpoint numersA shows how some of the )otingEpoint numers from pigure IFII @histne etween suessive )otingEpoint numersA re not normlizedF hile the mntissGexponent hs een the dominnt )otingEpoint pproh for high performne omE putingD there were wide vriety of spei( formts in use y omputer vendorsF ristorillyD eh omputer vendor hd their own prtiulr formt for )otingEpoint numersF feuse of thisD progrm exeuted on severl di'erent rnds of omputer would generlly produe di'erent nswersF his invrily led to heted disussions out whih system provided the right nswer nd whih system@sA were generting meningless resultsF23

Normalized oating-point numbers

Figure 1.12

23 Interestingly,

there was an easy answer to the question for many programmers. Generally they trusted the results from the

computer they used to debug the code and dismissed the results from other computers as garbage.

QQ hen storing )otingEpoint numers in digitl omputersD typilly the mntiss is normlizedD nd then the mntiss nd exponent re onverted to seEP nd pked into QPE or TREit wordF sf more its were lloted to the exponentD the overll rnge of the formt would e inresedD nd the numer of digits of ury would e deresedF elso the se of the exponent ould e seEP or seEITF sing IT s the se for the exponent inreses the overll rnge of exponentsD ut euse normliztion must our on fourEit oundriesD the ville digits of ury re redued on the vergeF vter we will see how the siii USR stndrd for )otingEpoint formt represents numersF

1.2.4 Eects of Floating-Point Representation24


yne prolem with the mntissGseGexponent representtion is tht not ll seEIH numers n e exE pressed perfetly s seEP numerF por exmpleD IGP nd HFPS n e represented perfetly s seEP vluesD while IGQ nd HFI produe in(nitely repeting seEP deimlsF hese vlues must e rounded to e stored in the )otingEpoint formtF ith su0ient digits of preisionD this generlly is not prolem for omputtionsF roweverD it does led to some nomlies where lgeri rules do not pper to pplyF gonsider the following exmpleX

ievBR D a HFI a H hy saIDIH a C ixhhy sp @ FiF IFH A rix sx BD9elger is truth9 ivi sx BD9xot here9 ixhsp sx BDIFHE ixh
et (rst glneD this ppers simple enoughF wthemtis tells us ten times HFI should e oneF nfortuntelyD euse HFI nnot e represented extly s seEP deimlD it must e roundedF st ends up eing rounded down to the lst itF hen ten of these slightly smller numers re dded togetherD it does not quite dd up to IFHF hen nd re ievBRD the di'erene is out IH-7 D nd when they re ievBVD the di'erene is out IH-16 F yne possile method for ompring omputed vlues to onstnts is to sutrt the vlues nd test to see how lose the two vlues eomeF por exmpleD one n rewrite the test in the ove ode to eX

sp @ ef@IFHEAFvF IiETA rix sx BD9glose enough for government work9 ivi sx BD9xot even lose9 ixhsp
24 This
content is available online at <http://cnx.org/content/m32755/1.2/>.

QR

CHAPTER 1. MODERN COMPUTER ARCHITECTURES

he type of the vriles in question nd the expeted error in the omputtion tht produes determines the pproprite vlue used to delre tht two vlues re lose enough to e delred equlF enother re where inext representtion eomes prolem is the ft tht lgeri inverses do not hold with ll )otingEpoint numersF por exmpleD using ievBRD the vlue @IFHGA B does not evlute to IFH for IQS vlues of from one to IHHHF his n e prolem when omputing the inverse of mtrix using vEdeompositionF vEdeomposition repetedly does divisionD multiplitionD dditionD nd sutrtionF sf you do the strightforwrd vEdeomposition on mtrix with integer oe0ients tht hs n integer solutionD there is pretty good hne you won9t get the ext solution when you run your lgorithmF hisussing tehniques for improving the ury of mtrix inverse omputtion is est left to numeril nlysis textF

1.2.5 More Algebra That Doesn't Work25


hile the exmples in the proeeding setion foused on the limittions of multiplition nd divisionD ddition nd sutrtion re notD y ny mensD perfetF feuse of the limittion of the numer of digits of preisionD ertin dditions or sutrtions hve no e'etF gonsider the following exmple using ievBR with U digits of preisionX

a IFPSiV a C UFSiEQ sp @ FiF A rix sx BD9em s nuts or whtc9 ixhsp


hile oth of these numers re preisely representle in )otingEpointD dding them is prolemtiF rior to dding these numers togetherD their deiml points must e ligned s in pigure IFIQ @pigure RERX voss of ury while ligning deiml pointsAF

Figure 4-4: Loss of accuracy while aligning decimal points

Figure 1.13

25 This

content is available online at <http://cnx.org/content/m32754/1.2/>.

QS nfortuntelyD while we hve omputed the ext resultD it nnot (t k into ievBR vrile @U digits of uryA without trunting the HFHHUSF o fter the dditionD the vlue in is extly IFPSiVF iven sdderD the ddition ould e performed of timesD nd the vlue for would still e IFPSiVF feuse of the limittion on preisionD not ll lgeri lws pply ll the timeF por instneD the nswer you otin from C will e the sme s CD s per the ommuttive lw for dditionF hihever opernd you pik (rstD the opertion yields the sme resultY they re mthemtilly equivlentF st lso mens tht you n hoose either of the following two forms nd get the sme nswerX

millions

@ C A C @ C A C
roweverD this is not equivlentX

@ C A C
he third version isn9t equivlent to the (rst two euse the order of the lultions hs hngedF eginD the rerrngement is equivlent lgerillyD ut not omputtionllyF fy hnging the order of the lultionsD we hve tken dvntge of the ssoitivity of the opertionsY we hve mde n of the originl odeF o understnd why the order of the lultions mttersD imgine tht your omputer n perform rithmeti signi(nt to only (ve deiml plesF elso ssume tht the vlues of D D nd re FHHHHSD FHHHHSD nd IFHHHHD respetivelyF his mens thtX

associative transformation

@ C A C a FHHHHS C FHHHHS C IFHHHH a FHHHI C IFHHHH


utX

a IFHHHI

@ C A C a FHHHHS C IFHHHH C FHHHHS a IFHHHH C FHHHHS

a IFHHHH

he two versions give slightly di'erent nswersF hen dding CCD the sum of the smller numers ws insigni(nt when dded to the lrger numerF fut when omputing CCD we dd the two smll numers (rstD nd their omined sum is lrge enough to in)uene the (nl nswerF por this resonD ompilers tht rerrnge opertions for the ske of performne generlly only do so fter the user hs requested optimiztions eyond the defultsF por these resonsD the pyex lnguge is very strit out the ext order of evlution of exE pressionsF o e omplintD the ompiler must ensure tht the opertions our extly s you express themF26
26 Often
even if you didn't mean it.

QT

CHAPTER 1. MODERN COMPUTER ARCHITECTURES even

por uernighn nd ithie gD the opertor preedene rules re di'erentF elthough the preedenes etween opertors re honored @iFeFD B omes efore CD nd evlution generlly ours left to right for opertors of equl preedeneAD the ompiler is llowed to tret few ommuttive opertions @CD BD 8D nd |A s if they were fully ssoitiveD if they re prenthesizedF por instneD you might tell the g ompilerX

a x C @y C zAY
roweverD the g ompiler is free to ignore youD nd omine D D nd in ny order it plesesF xow rmed with this knowledgeD view the following hrmlessElooking ode segmentX

ievBR wDe@IHHHHHHA w a HFH hy saIDIHHHHHH w a w C e@sA ixhhy


fegins to look like nightmre witing to hppenF he ury of this sum depends of the reltive mgnitudes nd order of the vlues in the rry eF sf we sort the rry from smllest to lrgest nd then perform the dditionsD we hve more urte vlueF here re other lgorithms for omputing the sum of n rry tht redue the error without requiring full sort of the dtF gonsult good textook on numeril nlysis for the detils on these lgorithmsF sf the rnge of mgnitudes of the vlues in the rry is reltively smllD the strightE forwrd omputtion of the sum is proly su0ientF

1.2.6 Improving Accuracy Using Guard Digits27


sn this setion we explore tehnique to improve the preision of )otingEpoint omputtions without using dditionl storge spe for the )otingEpoint numersF gonsider the following exmple of seEIH system with (ve digits of ury performing the following sutrtionX

IHFHHI E WFWWWQ a HFHHIU


ell of these vlues n e perfetly represented using our )otingEpoint formtF roweverD if we only hve (ve digits of preision ville while ligning the deiml points during the omputtionD the results end up with signi(nt error s shown in pigure IFIR @xeed for gurd digits AF
27 This
content is available online at <http://cnx.org/content/m32744/1.2/>.

QU

Need for guard digits

Figure 1.14

o perform this omputtion nd round it orretlyD we do not need to inrese the numer of signi(nt digits for vluesF e doD howeverD need dditionl digits of preision while performing the omputtionF he solution is to dd extr whih re mintined during the interim steps of the ompuE ttionF sn our seD if we mintined six digits of ury while ligning operndsD nd rounded efore normlizing nd ssigning the (nl vlueD we would get the proper resultF he gurd digits only need to e present s prt of the )otingEpoint exeution unit in the gF st is not neessry to dd gurd digits to the registers or to the vlues stored in memoryF st is not neessry to hve n extremely lrge numer of gurd digitsF et some pointD the di'erene in the mgnitude etween the opernds eomes so gret tht lost digits do not 'et the ddition or rounding resultsF

stored

guard digits

1.2.7 History of IEEE Floating-Point Format28


1.2.7.1 History of IEEE Floating-Point Format
rior to the sg miroproessor revolutionD eh vendor hd their own )otingE point formts sed on their designers9 views of the reltive importne of rnge versus ury nd speed versus uryF st ws not unommon for one vendor to refully nlyze the limittions of nother vendor9s )otingEpoint formt nd use this informtion to onvine users tht theirs ws the only urte )otingE point implementtionF sn relity none of the formts ws perfetF he formts were simply imperfet in di'erent wysF huring the IWVHs the snstitute for iletril nd iletronis ingineers @siiiA produed stndrd for the )otingEpoint formtF he title of the stndrd is siii USREIWVS tndrd for finry plotingEoint erithmetiF his stndrd provided the preise de(nition of )otingEpoint formt nd desried the opertions on )otingEpoint vluesF feuse siii USR ws developed fter vriety of )otingEpoint formts hd een in use for quite some timeD the siii USR working group hd the ene(t of exmining the existing )otingEpoint designs nd tking the strong pointsD nd voiding the mistkes in existing designsF he siii USR spei(tion hd its eginnings in the design of the sntel iVHVU )otingEpoint oproessorF he iVHVU )otingEpoint formt improved on the hig e )otingEpoint formt y dding numer of signi(nt feturesF he ner universl doption of siii USR )otingEpoint formt hs ourred over IHEyer time periodF he high performne omputing vendors of the mid IWVHs @gry sfwD higD nd gontrol htA hd their own proprietry )otingEpoint formts tht they hd to ontinue supporting euse of their instlled user seF hey relly hd no hoie ut to ontinue to support their existing formtsF sn the mid to lte IWVHs the primry systems tht supported the siii formt were sg worksttions nd some oproessors
28 This
content is available online at <http://cnx.org/content/m32770/1.2/>.

QV

CHAPTER 1. MODERN COMPUTER ARCHITECTURES

for miroproessorsF feuse the designers of these systems hd no need to protet proprietry )otingE point formtD they redily dopted the siii formtF es sg proessors moved from generlEpurpose integer omputing to high performne )otingEpoint omputingD the g designers found wys to mke siii )otingEpoint opertions operte very quiklyF sn IH yersD the siii USR hs gone from stndrd for )otingEpoint oproessors to the dominnt )otingEpoint stndrd for ll omputersF feuse of this stndrdD weD the usersD re the ene(iries of portle )otingEpoint environmentF

1.2.7.2 IEEE Floating-Point Standard


he siii USR stndrd spei(ed numer of di'erent detils of )otingEpoint opertionsD inludingX torge formts reise spei(tions of the results of opertions peil vlues pei(ed runtime ehvior on illegl opertions

peifying the )otingEpoint formt to this level of detil insures tht when omputer system is omplint with the stndrdD users n expet repetle exeution from one hrdwre pltform to nother when opertions re exeuted in the sme orderF

1.2.7.3 IEEE Storage Format


he two most ommon siii )otingEpoint formts in use re QPE nd TREit numersF le IFPX rmeters of siii QPE nd TREfit pormts gives the generl prmeters of these dt typesF

Parameters of IEEE 32- and 64-Bit Formats


siiiUS ingle houle houleEixtended pyex ievBR ievBV ievBIH g )ot doule long doule fits QP TR ixponent fits V II wntiss fits PR SQ

>aVH

>aIS

>aTR

Table 1.2

sn pyexD the QPEit formt is usully lled ievD nd the TREit formt is usully lled hyfviF roweverD some pyex ompilers doule the sizes for these dt typesF por tht resonD it is sfest to delre your pyex vriles s ievBR or ievBVF he douleEextended formt is not s well supported in ompilers nd hrdwre s the singleE nd douleEpreision formtsF he it rrngement for the single nd doule formts re shown in pigure IFIS @siiiUSR )otingEpoint formtsAF fsed on the storge lyouts in le IFPX rmeters of siii QPE nd TREfit pormtsD we n derive the rnges nd ury of these formtsD s shown in le IFQF

QW

IEEE754 oating-point formats

Figure 1.15

siiiUSR ingle houle ixtended houle

winimum xormlized xumer IFPiEQV PFPiEQHV QFRiERWQP

vrgest pinite xumer QFR iCQV IFV iCQHV IFP iCRWQP

fseEIH eury TEW digits ISEIU digits IVEPI digits

Table 1.3

X nge nd eury of siii QPE nd TREfit pormts

1.2.7.3.1 Converting from Base-10 to IEEE Internal Format


e now exmine how QPEit )otingEpoint numer is storedF he highEorder it is the sign of the numerF xumers re stored in signEmgnitude formt @iFeFD not P9s E omplementAF he exponent is stored in the VEit (eld ised y dding IPU to the exponentF his results in n exponent rnging from EIPT through CIPUF he mntiss is onverted into seEP nd normlized so tht there is one nonzero digit to the left of the inry pleD djusting the exponent s neessryF he digits to the right of the inry point re then stored in the lowEorder PQ its of the wordF feuse ll numers re normlizedD there is no need to store the leding IF his gives free extr it of preisionF feuse this it is droppedD it9s no longer proper to refer to the stored vlue s the mntissF sn siii prlneD this mntiss minus its leding digit is lled the F pigure IFIT @gonverting from seEIH to siii QPEit formtA shows n exmple onversion from seEIH to siii QPEit formtF

signicand

RH

CHAPTER 1. MODERN COMPUTER ARCHITECTURES


Converting from base-10 to IEEE 32-bit format

Figure 1.16

he TREit formt is similrD exept the exponent is II its longD ised y dding IHPQ to the exponentD nd the signi(nd is SR its longF

1.2.8 IEEE Operations29


he siii stndrd spei(es how omputtions re to e performed on )otingE point vlues on the following opertionsX eddition utrtion wultiplition hivision qure root eminder @moduloA gonversion toGfrom integer gonversion toGfrom printed seEIH

hese opertions re spei(ed in mhineEindependent mnnerD giving )exiility to the g designers to implement the opertions s e0iently s possile while mintining ompline with the stndrdF huring opertionsD the siii stndrd requires the mintenne of two gurd digits nd stiky it for intermedite vluesF he gurd digits ove nd the stiky it re used to indite if ny of the its eyond the seond gurd digit is nonzeroF
29 This
content is available online at <http://cnx.org/content/m32756/1.2/>.

RI

Computation using guard and sticky bits

Figure 1.17

sn pigure IFIU @gomputtion using gurd nd stiky itsAD we hve (ve its of norml preisionD two gurd digitsD nd stiky itF qurd its simply operte s norml its " s if the signi(nd were PS itsF qurd its prtiipte in rounding s the extended opernds re ddedF he stiky it is set to I if ny of the its eyond the gurd its is nonzero in either operndF30 yne the extended sum is omputedD it is rounded so tht the vlue stored in memory is the losest possile vlue to the extended sum inluding the gurd digitsF le IFR shows ll eight possile vlues of the two gurd digits nd the stiky it nd the resulting stored vlue with n explntion s to whyF
30 If
you are somewhat hardware-inclined and you think about it for a moment, you will soon come up with a way to properly You just have to keep track as things get

maintain the sticky bit without ever computing the full innite precision sum. shifted around.

RP

CHAPTER 1. MODERN COMPUTER ARCHITECTURES


ixtended um IFHIHH HHH IFHIHH HHI IFHIHH HIH IFHIHH HII IFHIHH IHH IFHIHH IHI IFHIHH IIH IFHIHH III tored lue IFHIHH IFHIHH IFHIHH IFHIHH IFHIHH IFHIHI IFHIHI IFHIHI hy runted sed on gurd digits runted sed on gurd digits ounded down sed on gurd digits ounded down sed on gurd digits ounded down sed on stiky it ounded up sed on stiky it ounded up sed on gurd digits ounded up sed on gurd digits

Table 1.4

X ixtended ums nd heir tored lues

he (rst priority is to hek the gurd digitsF xever forget tht the stiky it is just hintD not rel digitF o if we n mke deision without looking t the stiky itD tht is goodF he only deision we re mking is to round the lst storle it up or downF hen tht stored vlue is retrieved for the next omputtionD its gurd digits re set to zerosF st is sometimes helpful to think of the stored vlue s hving the gurd digitsD ut set to zeroF wo gurd digits nd the stiky it in the siii formt insures tht opertions yield the sme rounding s if the intermedite result were omputed using unlimited preision nd then rounded to (t within the limits of preision of the (nl omputed vlueF et this pointD you might e skingD hy do s re out this minutiec et some levelD unless you re hrdwre designerD you don9t reF fut when you exmine detils like thisD you n e ssured of one thingX when they developed the siii )otingEpoint stndrdD they looked t the detils refullyF he gol ws to produe the most urte possile )otingEpoint stndrd within the onstrints of (xedElength QPE or TREit formtF feuse they did suh good joD it9s one less thing you hve to worry outF fesidesD this stu' mkes gret exm questionsF

very

1.2.9 Special Values31


sn ddition to speifying the results of opertions on numeri dtD the siii stndrd lso spei(es the preise ehvior on unde(ned opertions suh s dividing y zeroF hese results re indited using severl speil vluesF hese vlues re it ptterns tht re stored in vriles tht re heked efore opertions re performedF he siii opertions re ll de(ned on these speil vlues in ddition to the norml numeri vluesF le IFSX peil lues for n siii QPEfit xumer summrizes the speil vlues for QPEit siii )otingEpoint numerF

Special Values for an IEEE 32-Bit Number


peil lue C or ! H henormlized numer xx @xot xumerA C or ! sn(nity
31 This

ixponent HHHHHHHH HHHHHHHH IIIIIIII IIIIIIII


Table 1.5

igni(nd H nonzero nonzero H

content is available online at <http://cnx.org/content/m32758/1.2/>.

RQ he vlue of the exponent nd signi(nd determines whih type of speil vlue this prtiulr )otingE point numer representsF ero is designed suh tht integer zero nd )otingEpoint zero re the sme it ptternF henormlized numers n our t some point s numer ontinues to get smllerD nd the exponent hs rehed the minimum vlueF e ould delre tht minimum to e the smllest representle vlueF roweverD with denormlized vluesD we n ontinue y setting the exponent its to zero nd shifting the signi(nd its to the rightD (rst dding the leding I tht ws droppedD then ontinuing to dd leding zeros to indite even smller vluesF et some point the lst nonzero digit is shifted o' to the rightD nd the vlue eomes zeroF his pproh is lled where the vlue keeps pprohing zero nd then eventully eomes zeroF xot ll implementtions support denormlized numers in hrdwreY they might trp to softwre routine to hndle these numers t signi(nt performne ostF et the top end of the ised exponent vlueD n exponent of ll Is n represent the @xxA vlue or in(nityF sn(nity ours in omputtions roughly ording to the priniples of mthemtisF sf you ontinue to inrese the mgnitude of numer eyond the rnge of the )otingEpoint formtD one the rnge hs een exeededD the vlue eomes in(nityF yne vlue is in(nityD further dditions won9t inrese itD nd sutrtions won9t derese itF ou n lso produe the vlue in(nity y dividing nonzero vlue y zeroF sf you divide nonzero vlue y in(nityD you get zero s resultF he xx vlue indites numer tht is not mthemtilly de(nedF ou n generte xx y dividing zero y zeroD dividing in(nity y in(nityD or tking the squre root of EIF he di'erene etween in(nity nd xx is tht the xx vlue hs nonzero signi(ndF he xx vlue is very stikyF eny opertion tht hs xx s one of its inputs lwys produes xx resultF

gradual underow

Not a Number

1.2.10 Exceptions and Traps32


sn ddition to de(ning the results of omputtions tht ren9t mthemtilly de(nedD the siii stndrd provides progrmmers with the ility to detet when these speil vlues re eing produedF his wyD progrmmers n write their ode without dding extensive sp tests throughout the ode heking for the mgnitude of vluesF snsted they n register trp hndler for n event suh s under)ow nd hndle the event when it oursF he exeptions de(ned y the siii stndrd inludeX

yver)ow to in(nity nder)ow to zero hivision y zero snvlid opertion snext opertion

eording to the stndrdD these trps re under the ontrol of the userF sn most sesD the ompiler runtime lirry mnges these trps under the diretion from the user through ompiler )gs or runtime lirry llsF rps generlly hve signi(nt overhed ompred to single )otingEpoint instrutionD nd if progrm is ontinully exeuting trp odeD it n signi(ntly impt performneF sn some ses it9s pproprite to ignore trps on ertin opertionsF e ommonly ignored trp is the under)ow trpF sn mny itertive progrmsD it9s quite nturl for vlue to keep reduing to the point where it disppersF hepending on the pplitionD this my or my not e n error sitution so this exeption n e sfely ignoredF sf you run progrm nd then it termintesD you see messge suh sX

yverflow hndler lled IHDHHHDHHH times


st proly mens tht you need to (gure out why your ode is exeeding the rnge of the )otingEpoint formtF st proly lso mens tht your ode is exeuting more slowly euse it is spending too muh time in its error hndlersF
32 This
content is available online at <http://cnx.org/content/m32760/1.2/>.

RR

CHAPTER 1. MODERN COMPUTER ARCHITECTURES

1.2.11 Compiler Issues33


he siii USR )otingEpoint stndrd does good jo desriing how )otingE point opertions re to e performedF roweverD we generlly don9t write ssemly lnguge progrmsF hen we write in higherElevel lnguge suh s pyexD it9s sometimes di0ult to get the ompiler to generte the ssemly lnguge you need for your pplitionF he prolems fll into two tegoriesX

he ompiler is too onservtive in trying to generte siiiEomplint ode nd produes ode tht doesn9t operte t the pek speed of the proessorF yn some proessorsD to fully support grdul underE )owD extr instrutions must e generted for ertin instrutionsF sf your ode will never under)owD these instrutions re unneessry overhedF he optimizer tkes lierties rewriting your ode to improve its performneD eliminting some neesE sry stepsF por exmpleD if you hve the following odeX

a C SHH a E PHH
he optimizer my reple it with a C QHHF roweverD in the se of vlue for tht is lose to over)owD the two sequenes my not produe the sme resultF ometimes user prefers fst ode tht loosely onforms to the siii stndrdD nd t other times the user will e writing numeril lirry routine nd need totl ontrol over eh )otingEpoint opertionF gompilers hve hllenge supporting the needs of oth of these types of usersF feuse of the nture of the high performne omputing mrket nd enhmrksD often the fst nd loose pproh previls in mny ompilersF

1.2.12 Closing Notes34


hile this is reltively long hpter with lot of tehnil detilD it does not even egin to srth the surfe of the siii )otingEpoint formt or the entire (eld of numeril nlysisF e s progrmmers must e reful out the ury of our progrmsD lest the results eome meninglessF rere re few si rules to get you strtedX

vook for ompiler options tht relx or enfore strit siii ompline nd hoose the pproprite option for your progrmF ou my even wnt to hnge these options for di'erent portions of your progrmF se ievBV for omputtions unless you re sure ievBR hs su0ient preisionF qiven tht ievBR hs roughly U digits of preisionD if the ottom digits eome meningless due to rounding nd ompuE ttionsD you re in some dnger of seeing the e'et of the errors in your resultsF ievBV with IQ digits mkes this muh less likely to hppenF fe wre of the reltive mgnitude of numers when you re performing dditionsF hen summing up numersD if there is wide rngeD sum from smllest to lrgestF erform multiplitions efore divisions whenever possileF hen performing omprison with omputed vlueD hek to see if the vlues re lose rther thn identilF wke sure tht you re not performing ny unneessry type onversions during the ritil portions of your odeF
33 This 34 This
content is available online at <http://cnx.org/content/m32762/1.2/>. content is available online at <http://cnx.org/content/m32768/1.2/>.

RS en exellent referene on )otingEpoint issues nd the siii formt is ht ivery gomputer ientist hould unow eout plotingEoint erithmetiD written y hvid qoldergD in egw gomputing urveys mgzine @wrh IWWIAF his rtile gives exmples of the most ommon prolems with )otingEpoint nd outlines the solutionsF st lso overs the siii )otingEpoint formt very thoroughlyF s lso reommend you onsult hrF illim uhn9s home pge @httpXGGwwwFsFerkeleyFeduGwkhnG35 A for some exellent mterils on the siii formt nd hllenges using )otingEpoint rithmetiF hrF uhn ws one of the originl designers of the sntel iVHVU nd the siii USR )otingEpoint formtF

1.2.13 Exercises36
Exercise 1.8
un the following ode to ount the numer of inverses tht re not perfetly urteX

ievBR DD sxiqi s s a H hy aIFHDIHHHFHDIFH a IFH G a B sp @ FxiF IFH A rix s a s C I ixhsp ixhhy sx BD9pound 9Ds ixh

ghnge the type of the vriles to ievBV nd repetF wke sure to keep the optimiztion t su0iently low level @EHHA to keep the ompiler from eliminting the omputtionsF

Exercise 1.9

Exercise 1.10

rite progrm to determine the numer of digits of preision for ievBR nd ievBVF

Exercise 1.11 Exercise 1.12

rite progrm to demonstrte how summing n rry forwrd to kwrd nd kwrd to forwrd n yield di'erent resultF essuming your ompiler supports vrying levels of siii omplineD tke signi(nt omputE tionl ode nd test its overll performne under the vrious siii ompline optionsF ho the results of the progrm hngec

35 http://www.cs.berkeley.edu/wkahan/ 36 This content is available online at <http://cnx.org/content/m32765/1.2/>.

RT

CHAPTER 1. MODERN COMPUTER ARCHITECTURES

Chapter 2
Programming and Tuning Software

2.1 What a Compiler Does


2.1.1 Introduction1
2.1.1.1 What a Compiler Does
he gol of n is the e0ient trnsltion of higherElevel lnguge into the fstest possiE le mhine lnguge tht urtely represents the highElevel lnguge soureF ht mkes representtion good isX it gives the orret nswersD nd it exeutes quiklyF xturllyD it mkes no di'erene how fst progrm runs if it doesn9t produe the right nswersF2 fut given n expression of progrm tht exeutes orretlyD n optimizing ompiler looks for wys to stremline itF es (rst utD this usully mens simplifying the odeD throwing out extrneous instrutionsD nd shring intermedite results etween sttementsF wore dvned optimiztions seek to restruture the progrm nd my tully mke the ode grow in sizeD though the numer of instrutions exeuted will @hopefullyA shrinkF hen it omes to (nlly generting mhine lngugeD the ompiler hs to know out the registers nd rules for issuing instrutionsF por performneD it needs to understnd the osts of those instrutions nd the ltenies of mhine resouresD suh s the pipelinesF his is espeilly true for proessors tht n exeute more thn one instrution t timeF st tkes lned instrution mix " the right proportion of )otingEpointD (xed pointD memory nd rnh opertionsD etF " to keep the mhine usyF snitilly ompilers were tools tht llowed us to write in something more redle thn ssemly lngugeF ody they order on rti(il intelligene s they tke our highElevel soure ode nd trnslte it into highly optimized mhine lnguge ross wide vriety of singleE nd multipleEproessor rhiteturesF sn the re of high performne omputingD the ompiler t times hs greter impt on the performne of our progrm thn either the proessor or memory rhitetureF hroughout the history of high performne omputingD if we re not stis(ed with the performne of our progrm written in highElevel lngugeD we will gldly rewrite ll or prt of the progrm in ssemly lngugeF hnkfullyD tody9s ompilers usully mke tht step unneessryF sn this hpter we over the si opertion of optimizing ompilersF sn lter hpter we will over the tehniques used to nlyze nd ompile progrms for dvned rhitetures suh s prllel or vetor proessing systemsF e strt our look t ompilers exmining how the reltionship etween progrmmers nd their ompilers hs hnged over timeF
1 This content is available online at <http://cnx.org/content/m33690/1.2/>. 2 However, you can sometimes trade accuracy for speed.

optimizing compiler

RU

RV

CHAPTER 2. PROGRAMMING AND TUNING SOFTWARE

2.1.2 History of Compilers3


sf you hve een in high performne omputing sine its eginning in the IWSHsD you hve progrmmed in severl lnguges during tht timeF huring the IWSHs nd erly IWTHsD you progrmmed in ssemly lngugeF he onstrint on memory nd slow lok rtes mde every instrution preiousF ith smll memoriesD overll progrm size ws typilly smllD so ssemly lnguge ws su0ientF owrd the end of the IWTHsD progrmmers egn writing more of their ode in highElevel lnguge suh s pyexF riting in highElevel lnguge mde your work muh more portleD relileD nd mintinleF qiven the inresing speed nd pity of omputersD the ost of using highElevel lnguge ws something most progrmmers were willing to eptF sn the IWUHs if progrm spent prtiulrly lrge mount of time in prtiulr routineD or the routine ws prt of the operting system or it ws ommonly used lirryD most likely it ws written in ssemly lngugeF huring the lte IWUHs nd erly IWVHsD ontinued to improve to the point tht ll ut the most ritil portions of generlEpurpose progrms were written in highElevel lngugesF yn the vergeD the ompilers generte etter ode thn most ssemly lnguge progrmmersF his ws often euse ompiler ould mke etter use of hrdwre resoures suh s registersF sn proessor with IT registersD progrmmer might dopt onvention regrding the use of registers to help keep trk of wht vlue is in wht registerF e ompiler n use eh register s muh s it likes euse it n preisely trk when register is ville for nother useF roweverD during tht timeD high performne omputer rhiteture ws lso evolvingF gry eserh ws developing vetor proessors t the very top end of the omputing spetrumF gompilers were not quite redy to determine when these new vetor instrutions ould e usedF rogrmmers were fored to write ssemly lnguge or rete highly hndEtuned pyex tht lled the pproprite vetor routines in their odeF sn senseD vetor proessors turned k the lok when it me to trusting the ompiler for whileF rogrmmers never lpsed ompletely into ssemly lngugeD ut some of their pyex strted looking rther unEpyex likeF es the vetor omputers mturedD their ompilers eme inresingly le to detet when vetoriztion ould e performedF et some pointD the ompilers gin eme etter thn progrmmers on these rhiteturesF hese new ompilers redued the need for extensive diretives or lnguge extensionsF4 he sg revolution led to n inresing dependene on the ompilerF rogrmming erly sg proesE sors suh s the sntel iVTH ws pinful ompred to gsg proessorsF utle di'erenes in the wy progrm ws oded in mhine lnguge ould hve signi(nt impt on the overll performne of the progrmF por exmpleD progrmmer might hve to ount the instrution yles etween lod instrution nd the use of the results of the lod in omputtionl instrutionF es superslr proessors were developedD ertin pirs of instrutions ould e issued simultneouslyD nd others hd to e issued serillyF feuse there were lrge numer of di'erent sg proessors produedD progrmmers did not hve time to lern the nunes of wringing the lst it of performne out of eh proessorF st ws muh esier to lok the proessor designer nd the ompiler writer together @hopefully they work for the sme ompnyA nd hve them hsh out the est wy to generte the mhine odeF hen everyone would use the ompiler nd get ode tht mde resonly good use of the hrdwreF he ompiler eme n importnt tool in the proessor design yleF roessor designers hd muh greter )exiility in the types of hnges they ould mkeF por exmpleD it would e good design in the next revision of proessor to exeute existing odes IH7 slower thn new revisionD ut y reompiling the odeD it would perform TS7 fsterF yf ourse it ws 
importnt to tully provide tht ompiler when the new proessor ws shipped nd hve the ompiler give tht level of performne ross wide rnge of odes rther thn just one prtiulr enhmrk suiteF

optimizing compilers

3 This content is 4 The Livermore

available online at <http://cnx.org/content/m33686/1.2/>. Loops was a benchmark that specically tested the capability of a compiler to eectively optimize a set of

loops. In addition to being a performance benchmark, it was also a compiler benchmark.

RW

2.1.3 Which Language To Optimize5


It has been said, "I don't know what language they will be using to program high performance computers 10 years from now, but we do know it will be called FORTRAN." At the risk of inciting outright warfare, we need to discuss the strengths and weaknesses of languages that are used for high performance computing. Most computer scientists (not computational scientists) train on a steady diet of C, C++,6 or some other language focused on data structures or objects. When students encounter high performance computing for the first time, there is an immediate desire to keep programming in their favorite language. However, to get the peak performance across a wide range of architectures, FORTRAN is the only practical language.

When students ask why this is, usually the first answer is, "Because it has always been that way." In one way this is correct. Physicists, mechanical engineers, chemists, structural engineers, and meteorologists do most programming on high performance computers. FORTRAN is the language of those fields. (When was the last time a computer science student wrote a properly working program that computed for a week?) So naturally the high performance computer vendors put more effort into making FORTRAN work well on their architectures.

This is not the only reason that FORTRAN is a better language, however. There are some fundamental elements that make C, C++, or any data structures-oriented language unsuitable for high performance programming. In a word, that problem is pointers. Pointers (or addresses) are the way good computer scientists construct linked lists, binary trees, binomial queues, and all those nifty data structures. The problem with pointers is that the effect of a pointer operation is known only at execution time when the value of the pointer is loaded from memory. Once an optimizing compiler sees a pointer, all bets are off. It cannot make any assumptions about the effect of a pointer operation at compile time. It must generate conservative (less optimized) code that simply does exactly the same operation in machine code that the high-level language described.

While the lack of pointers in FORTRAN is a boon to optimization, it seriously limits the programmer's ability to create data structures. In some applications, especially highly scalable network-based applications, the use of good data structures can significantly improve the overall performance of the application. To solve this, in the FORTRAN 90 specification, pointers have been added to FORTRAN. In some ways, this was an attempt by the FORTRAN community to keep programmers from beginning to use C in their applications for the data structure areas of their applications. If programmers begin to use pointers throughout their codes, their FORTRAN programs will suffer from the same problems that inhibit optimization in C programs. In a sense FORTRAN has given up its primary advantage over C by trying to be more like C. The debate over pointers is one reason that the adoption rate of FORTRAN 90 somewhat slowed. Many programmers prefer to do their data structure, communications, and other bookkeeping work in C, while doing the computations in FORTRAN 77.

FORTRAN 90 also has strengths and weaknesses when compared to FORTRAN 77 on high performance computing platforms. FORTRAN 90 has a strong advantage over FORTRAN 77 in the area of improved semantics that enable more opportunities for advanced optimizations. This advantage is especially true on distributed memory systems on which data decomposition is a significant factor. (See Section 4.1.1.) However, until FORTRAN 90 becomes popular, vendors won't be motivated to squeeze the last bit of performance out of FORTRAN 90.

So while FORTRAN 77 continues to be the mainstream language for high performance computing for the near future, other languages, like C and FORTRAN 90, have their limited and potentially increasing roles to play. In some ways the strongest potential challenger to FORTRAN in the long run may come in the form of a numerical tool set such as Matlab. However, packages such as Matlab have their own set of optimization challenges that must be overcome before they topple FORTRAN 77's domination.
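To make the pointer problem concrete, here is a small C sketch of our own (the routine and variable names are invented for illustration, not taken from any particular code). Because the two pointer arguments might refer to overlapping memory, the compiler must assume the worst and generate conservative code:

   /* Because out and in are pointers, the compiler must assume they
      may alias (point into the same array).  A store through out[i]
      could change a later in[i], so the compiler cannot freely keep
      values in registers, reorder the operations, or vectorize.     */
   void scale_copy(double *out, double *in, double scale, int n)
   {
       int i;
       for (i = 0; i < n; i++)
           out[i] = scale * in[i];
   }

The equivalent FORTRAN subroutine takes two array arguments, and the language's argument rules let the compiler assume they do not overlap, so it is free to optimize aggressively.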

5 This content is available online at <http://cnx.org/content/m33687/1.2/>.
6 Just for the record, both the authors of this book are quite accomplished in C, C++, and FORTRAN, so they have no preconceived notions.



2.1.4 Optimizing Compiler Tour7


We will start by taking a walk through an optimizing compiler to see one at work. We think it's interesting, and if you can empathize with the compiler, you will be a better programmer; you will know what the compiler wants from you, and what it can do on its own.

2.1.4.1 Compilation Process

Figure 2.1: Basic compiler processes

The compilation process is typically broken down into a number of identifiable steps, as shown in Figure 2.1 (Basic compiler processes). While not all compilers are implemented in exactly this way, it helps to understand the different functions a compiler must perform:

1. A precompiler or preprocessor phase is where some simple textual manipulation of the source code is performed. The preprocessing step can be processing of include files and making simple string substitutions throughout the code.
2. The lexical analysis phase is where the incoming source statements are decomposed into tokens such as variables, constants, comments, or language elements.
3. The parsing phase is where the input is checked for syntax, and the compiler translates the incoming program into an intermediate language that is ready for optimization.
7 This content is available online at <http://cnx.org/content/m33694/1.2/>.

4. One or more optimization passes are performed on the intermediate language.
5. An object code generator translates the intermediate language into assembly code, taking into consideration the particular architectural details of the processor in question.

As compilers become more and more sophisticated in order to wring the last bit of performance from the processor, some of these steps (especially the optimization and code-generation steps) become more and more blurred. In this chapter, we focus on the traditional optimizing compiler, and in later chapters we will look more closely at how modern compilers do more sophisticated optimizations.

2.1.4.2 Intermediate Language Representation


Because we are most interested in the optimization of our program, we start our discussion at the output of the parse phase of the compiler. The parse phase output is in the form of an intermediate language (IL) that is somewhere between a high-level language and assembly language. The intermediate language expresses the same calculations that were in the original program, in a form the compiler can manipulate more easily. Furthermore, instructions that aren't present in the source, such as address expressions for array references, become visible along with the rest of the program, making them subject to optimizations too.

How would an intermediate language look? In terms of complexity, it's similar to assembly code but not so simple that the definitions8 and uses of variables are lost. We'll need definition and use information to analyze the flow of data through the program. Typically, calculations are expressed as a stream of quadruples - statements with exactly one operator, (up to) two operands, and a result.9 Presuming that anything in the original source program can be recast in terms of quadruples, we have a usable intermediate language. To give you an idea of how this works, we're going to rewrite the statement below as a series of four quadruples:

      A = -B + C * D / E
Taken all at once, this statement has four operators and four operands: /, *, +, and - (negate), and B, C, D, and E. This is clearly too much to fit into one quadruple. We need a form with exactly one operator and, at most, two operands per statement. The recast version that follows manages to do this, employing temporary variables to hold the intermediate results:

      T1 = D / E
      T2 = C * T1
      T3 = -B
      A  = T3 + T2

A workable intermediate language would, of course, need some other features, like pointers. We're going to suggest that we create our own intermediate language to investigate how optimizations work. To begin, we need to establish a few rules:

- Instructions consist of one opcode, two operands, and a result. Depending on the instruction, the operands may be empty.
- Assignments are of the form X := Y op Z, meaning X gets the result of op applied to Y and Z.
- All memory references are explicit loads from or stores to temporaries tn.
- Logical values used in branches are calculated separately from the actual branch.

8 By definitions, we mean the assignment of values: not declarations.
9 More generally, code can be cast as n-tuples. It depends on the level of the intermediate language.




- Jumps go to absolute addresses.

If we were building a compiler, we'd need to be a little more specific. For our purposes, this will do. Consider the following bit of C code:

   while (j < n) {
      k = k + j * 2;
      m = j * 2;
      j++;
   }


This loop translates into the intermediate language representation shown here:

   A:: t1  := j
       t2  := n
       t3  := t1 < t2
       jmp (B) t3
       jmp (C) TRUE

   B:: t4  := k
       t5  := j
       t6  := t5 * 2
       t7  := t4 + t6
       k   := t7
       t8  := j
       t9  := t8 * 2
       m   := t9
       t10 := j
       t11 := t10 + 1
       j   := t11
       jmp (A) TRUE

   C::

Each C source line is represented by several IL statements. On many RISC processors, our IL code is so close to machine language that we could turn it directly into object code.10 Often the lowest optimization level does a literal translation from the intermediate language to machine code. When this is done, the code generally is very large and performs very poorly. Looking at it, you can see places to save a few instructions. For instance, j gets loaded into temporaries in four places; surely we can reduce that. We have to do some analysis and make some optimizations.

2.1.4.3 Basic Blocks

After generating our intermediate language, we want to cut it into basic blocks. These are code sequences that start with an instruction that either follows a branch or is itself a target for a branch. Put another way, each basic block has one entrance (at the top) and one exit (at the bottom). Figure 2.2 (Intermediate language divided into basic blocks) represents our IL code as a group of three basic blocks. Basic blocks make code easier to analyze. By restricting flow of control within a basic block from top to bottom and eliminating all the branches, we can be sure that if the first statement gets executed, the second one does too, and so on. Of course, the branches haven't disappeared, but we have forced them outside the blocks in the form of the connecting arrows - the flow graph.

10 See Section 5.2.1 for some examples of machine code translated directly from intermediate language.

Figure 2.2: Intermediate language divided into basic blocks

We are now free to extract information from the blocks themselves. For instance, we can say with certainty which variables a given block uses and which variables it defines (sets the value of). We might not be able to do that if the block contained a branch. We can also gather the same kind of information about the calculations it performs. After we have analyzed the blocks so that we know what goes in and what comes out, we can modify them to improve performance and just worry about the interaction between blocks.

2.1.5 Optimization Levels11


There are a wide variety of optimization techniques, and they are not all applicable in all situations. So the user is typically given some choices as to whether or not particular optimizations are performed. Often this is expressed in the form of an optimization level that is specified on the compiler as a command-line option such as -O3. The different levels of optimization controlled by a compiler flag may include the following:

11 This content is available online at <http://cnx.org/content/m33692/1.2/>.

No optimization: Generates machine code directly from the intermediate language, which can be very large and slow code. The primary uses of no optimization are for debuggers and establishing the correct program output. Because every operation is done precisely as the user specified, it must be right.

Basic optimizations: Similar to those described in this chapter. They generally work to minimize the intermediate language and generate fast compact code.

Interprocedural analysis: Looks beyond the boundaries of a single routine for optimization opportunities. This optimization level might include extending a basic optimization such as copy propagation across multiple routines. Another result of this technique is procedure inlining where it will improve performance.

Runtime profile analysis: It is possible to use runtime profiles to help the compiler generate improved code based on its knowledge of the patterns of runtime execution gathered from profile information.

Floating-point optimizations: The IEEE floating-point standard (IEEE 754) specifies precisely how floating-point operations are performed and the precise side effects of these operations. The compiler may identify certain algebraic transformations that increase the speed of the program (such as replacing a division with a reciprocal and a multiplication) but might change the output results from the unoptimized code.

Data flow analysis: Identifies potential parallelism between instructions, blocks, or even successive loop iterations.

Advanced optimization: May include automatic vectorization, parallelization, or data decomposition on advanced architecture computers.

These optimizations might be controlled by several different compiler options. It often takes some time to figure out the best combination of compiler flags for a particular code or set of codes. In some cases, programmers compile different routines using different optimization settings for best overall performance.

2.1.6 Classical Optimizations12


Once the intermediate language is broken into basic blocks, there are a number of optimizations that can be performed on the code in these blocks. Some optimizations are very simple and affect a few tuples within a basic block. Other optimizations move code from one basic block to another without altering the program results. For example, it is often valuable to move a computation from the body of a loop to the code immediately preceding the loop.

In this section, we are going to list classical optimizations by name and tell you what they are for. We're not suggesting that you make the changes; most compilers since the mid-1980s automatically perform these optimizations at all but their lowest optimization levels. As we said at the start of the chapter, if you understand what the compiler can (and can't) do, you will become a better programmer because you will be able to play to the compiler's strengths.

2.1.6.1 Copy Propagation


To start, let's look at a technique for untangling calculations. Take a look at the following segment of code: notice the two computations involving X.

      X = Y
      Z = 1.0 + X
12 This content is available online at <http://cnx.org/content/m33696/1.2/>.


As written, the second statement requires the results of the first before it can proceed - you need to calculate X. Unnecessary dependencies could translate into a delay at runtime.13 With a little bit of rearrangement we can make the second statement independent of the first, by propagating a copy of Y. The new calculation for Z uses the value of Y directly:

      X = Y
      Z = 1.0 + Y

Notice that we left the first statement, X = Y, intact. You may ask, "Why keep it?" The problem is that we can't tell whether the value of X is needed elsewhere. That is something for another analysis to decide. If it turns out that no other statement needs the new value of X, the assignment is eliminated later by dead code removal.

2.1.6.2 Constant Folding


A clever compiler can find constants throughout your program. Some of these are obvious constants like those defined in parameter statements. Others are less obvious, such as local variables that are never redefined. When you combine them in a calculation, you get a constant expression. The little program below has two constants, I and K:

      PROGRAM MAIN
      INTEGER I,K
      PARAMETER (I = 100)
      K = 200
      J = I + K
      END


Because I and K are constant individually, the combination I+K is constant, which means that J is a constant too. The compiler reduces constant expressions like I+K into constants with a technique called constant folding.

How does constant folding work? You can see that it is possible to examine every path along which a given variable could be defined en route to a particular basic block. If you discover that all paths lead back to the same value, that is a constant; you can replace all references to that variable with that constant. This replacement has a ripple-through effect. If the compiler finds itself looking at an expression that is made up solely of constants, it can evaluate the expression at compile time and replace it with a constant. After several iterations, the compiler will have located most of the expressions that are candidates for constant folding.

A programmer can sometimes improve performance by making the compiler aware of the constant values in your application. For example, in the following code segment:

      X = X * Y
13 This code is an example of a flow dependence. I describe dependencies in detail in Section 3.1.1.



the compiler may generate quite different runtime code if it knew that Y was 0, 1, 2, or 175.32. If it does not know the value for Y, it must generate the most conservative (not necessarily the fastest) code sequence. A programmer can communicate these values through the use of the PARAMETER statement in FORTRAN. By the use of a parameter statement, the compiler knows the values for these constants at runtime. Another example we have seen is:

      DO I = 1,10000
        DO J=1,IDIM
          .....
        ENDDO
      ENDDO


After looking at the code, it's clear that IDIM was either 1, 2, or 3, depending on the data set in use. Clearly if the compiler knew that IDIM was 1, it could generate much simpler and faster code.
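A rough C analogy (our own illustration, not from the original text): when the trip count of the inner loop is a compile-time constant, the compiler can fold the arithmetic, unroll the loop completely, or otherwise simplify; when the bound is a runtime variable, it must generate general-purpose code.

   #include <stdio.h>

   #define IDIM 1               /* known at compile time */

   double sum_known(int n, double a[][IDIM])
   {
       double s = 0.0;
       for (int i = 0; i < n; i++)
           for (int j = 0; j < IDIM; j++)   /* constant trip count */
               s += a[i][j];
       return s;
   }

   double sum_unknown(int n, int idim, double *a)
   {
       double s = 0.0;
       for (int i = 0; i < n; i++)
           for (int j = 0; j < idim; j++)   /* idim unknown until runtime */
               s += a[i * idim + j];
       return s;
   }

   int main(void)
   {
       double a[4][IDIM] = { {1.0}, {2.0}, {3.0}, {4.0} };
       printf("%f %f\n", sum_known(4, a), sum_unknown(4, IDIM, &a[0][0]));
       return 0;
   }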

2.1.6.3 Dead Code Removal

Programs often contain sections of dead code that have no effect on the answers and can be removed. Occasionally, dead code is written into the program by the author, but a more common source is the compiler itself; many optimizations produce dead code that needs to be swept up afterwards. Dead code comes in two types:

- Instructions that are unreachable
- Instructions that produce results that are never used
You can easily write some unreachable code into a program by directing the flow of control around it - permanently. If the compiler can tell it's unreachable, it will eliminate it. For example, it's impossible to reach the statement I = 4 in this program:

      PROGRAM MAIN
      I = 2
      WRITE (*,*) I
      STOP
      I = 4
      WRITE (*,*) I
      END


The compiler throws out everything after the STOP statement and probably gives you a warning. Unreachable code produced by the compiler during optimization will be quietly whisked away.

Computations with local variables can produce results that are never used. By analyzing a variable's definitions and uses, the compiler can see whether any other part of the routine references it. Of course the compiler can't tell the ultimate fate of variables that are passed between routines, external or common, so those computations are always kept (as long as they are reachable).14 In the following program, computations involving k contribute nothing to the final answer and are good candidates for dead code elimination:

14 If a compiler does sufficient interprocedural analysis, it can even optimize variables across routine boundaries. Interprocedural analysis can be the bane of benchmark codes trying to time a computation without using the results of the computation.


   main ()
   {
      int i,k;
      i = k = 1;
      i += 1;
      k += 2;
      printf ("%d\n",i);
   }


Dead code elimination has often produced some amazing benchmark results from poorly written benchmarks. See [reference lost in conversion] for an example of this type of code.
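Here is a minimal C sketch (our own, not from the text) of the benchmark pitfall mentioned in the footnote above: if the result of the timed loop is never used, the whole computation is dead and an optimizer may delete it, so the "benchmark" ends up timing nothing. Printing (or otherwise using) the result keeps the work live:

   #include <stdio.h>

   int main(void)
   {
       double sum = 0.0;
       int i;

       /* the work we believe we are measuring */
       for (i = 0; i < 10000000; i++)
           sum += i * 0.5;

       /* without this line, sum is never used and the loop above is a
          candidate for dead code elimination at higher optimization levels */
       printf("checksum = %f\n", sum);
       return 0;
   }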

2.1.6.4 Strength Reduction


Operations or expressions have time costs associated with them. Sometimes it's possible to replace a more expensive calculation with a cheaper one. We call this strength reduction. The following code fragment contains two expensive operations:

      REAL X,Y
      Y = X**2
      J = K*2


For the exponentiation operation on the first line, the compiler generally makes an embedded mathematical subroutine library call. In the library routine, X is converted to a logarithm, multiplied, then converted back. Overall, raising X to a power is expensive - taking perhaps hundreds of machine cycles. The key is to notice that X is being raised to a small integer power. A much cheaper alternative would be to express it as X*X, and pay only the cost of multiplication. The second statement shows integer multiplication of a variable K by 2. Adding K+K yields the same answer, but takes less time.

There are many opportunities for compiler-generated strength reductions; these are just a couple of them. We will see an important special case when we look at induction variable simplification. Another example of a strength reduction is replacing multiplications by integer powers of two by logical shifts.
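For example, a compiler may turn an integer multiplication by a power of two into a shift. A small C illustration of the idea (ours; you normally leave this to the compiler):

   #include <stdio.h>

   int main(void)
   {
       int k = 7;                /* nonnegative, so the shift is well defined */

       int a = k * 8;            /* what the source says             */
       int b = k << 3;           /* the cheaper equivalent: 8 = 2**3 */

       printf("%d %d\n", a, b);  /* both print 56 */
       return 0;
   }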

2.1.6.5 Variable Renaming


In Section 5.1.1, we talked about register renaming. Some processors can make runtime decisions to replace all references to register 1 with register 2, for instance, to eliminate bottlenecks. Register renaming keeps instructions that are recycling the same registers for different purposes from having to wait until previous instructions have finished with them. The same situation can occur in programs - the same variable (i.e., memory location) can be recycled for two unrelated purposes. For example, see the variable x in the following fragment:

   x = y * z;
   q = r + x + x;
   x = a + b;

When the compiler recognizes that a variable is being recycled, and that its current and former uses are independent, it can substitute a new variable to keep the calculations separate:

   x0 = y * z;
   q  = r + x0 + x0;
   x  = a + b;

Variable renaming is an important technique because it clarifies that calculations are independent of each other, which increases the number of things that can be done in parallel.

2.1.6.6 Common Subexpression Elimination


Subexpressions are pieces of expressions. For instance, A+B is a subexpression of C*(A+B). If A+B appears in several places, like it does below, we call it a common subexpression:

      D = C * (A + B)
      E = (A + B)/2.
Rather than calculate A + B twice, the compiler can generate a temporary variable and use it wherever A + B is required:

      TEMP = A + B
      D = C * TEMP
      E = TEMP/2.


Different compilers go to different lengths to find common subexpressions. Most pairs, such as A+B, are recognized. Some can recognize reuse of intrinsics, such as SIN(X). Don't expect the compiler to go too far though. Subexpressions like A+B+C are not computationally equivalent to reassociated forms like B+C+A, even though they are algebraically the same. In order to provide predictable results on computations, FORTRAN must either perform operations in the order specified by the user or reorder them in a way to guarantee exactly the same result. Sometimes the user doesn't care which way A+B+C associates, but the compiler cannot assume the user does not care.

Address calculations provide a particularly rich opportunity for common subexpression elimination. You don't see the calculations in the source code; they're generated by the compiler. For instance, a reference to an array element A(I,J) may translate into an intermediate language expression such as:

      address(A) + (I-1)*sizeof_datatype(A)
                 + (J-1)*sizeof_datatype(A) * column_dimension(A)

If A(I,J) is used more than once, we have multiple copies of the same address computation. Common subexpression elimination will (hopefully) discover the redundant computations and group them together.

2.1.6.7 Loop-Invariant Code Motion


Loops are where many high performance computing programs spend a majority of their time. The compiler looks for every opportunity to move calculations out of a loop body and into the surrounding code. Expressions that don't change after the loop is entered (loop-invariant expressions) are prime targets. The following loop has two loop-invariant expressions:

      DO I=1,N
        A(I) = B(I) + C * D
        E = G(K)
      ENDDO


Below, we have modified the expressions to show how they can be moved to the outside:

      TEMP = C * D
      DO I=1,N
        A(I) = B(I) + TEMP
      ENDDO
      E = G(K)


It is possible to move code before or after the loop body. As with common subexpression elimination, address arithmetic is a particularly important target for loop-invariant code motion. Slowly changing portions of index calculations can be pushed into the suburbs, to be executed only when needed.

2.1.6.8 Induction Variable Simplication

Loops can contain what are called induction variables. Their value changes as a linear function of the loop iteration count. For example, K is an induction variable in the following loop. Its value is tied to the loop index:

      DO I=1,N
        K = I*4 + M
        ...
      ENDDO

Induction variable simplification replaces calculations for variables like K with simpler ones. Given a starting point and the expression's first derivative, you can arrive at K's value for the nth iteration by stepping through the n-1 intervening iterations:




      K = M
      DO I=1,N
        K = K + 4
        ...
      ENDDO

The two forms of the loop aren't equivalent; the second won't give you the value of K given any value of I. Because you can't jump into the middle of the loop on the nth iteration, K always takes on the same values it would have if we had kept the original expression.

Induction variable simplification probably wouldn't be a very important optimization, except that array address calculations look very much like the calculation for K in the example above. For instance, the address calculation for A(I) within a loop iterating on the variable I looks like this:

      address = base_address(A) + (I-1) * sizeof_datatype(A)


Performing all that math is unnecessary. The compiler can create a new induction variable for references to A and simplify the address calculations:

      outside the loop...
         address = base_address(A) - (1 * sizeof_datatype(A))
      inside the loop...
         address = address + sizeof_datatype(A)

Induction variable simplification is especially useful on processors that can automatically increment a register each time it is used as a pointer for a memory reference. While stepping through a loop, the memory reference and the address arithmetic can both be squeezed into a single instruction - a great savings.
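In C terms, the effect is roughly what a programmer gets by replacing an indexed loop with a walking pointer; a sketch of the idea (our own, not literal compiler output), where the pointer itself becomes the induction variable:

   /* indexed form: each trip implies base + i*sizeof(double) arithmetic */
   double sum_indexed(double *a, int n)
   {
       double s = 0.0;
       for (int i = 0; i < n; i++)
           s += a[i];
       return s;
   }

   /* induction-variable form: the address is bumped by sizeof(double)
      each trip, with no multiplication inside the loop                 */
   double sum_pointer(double *a, int n)
   {
       double s = 0.0;
       for (double *p = a; p < a + n; p++)
           s += *p;
       return s;
   }

Modern compilers perform this transformation on the indexed form automatically, so the clearer indexed version is usually the one to write.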

2.1.6.9 Object Code Generation


Precompilation, lexical analysis, parsing, and many optimization techniques are somewhat portable, but code generation is very specific to the target processor. In some ways this phase is where compilers earn their keep on single-processor RISC systems.

Anything that isn't handled in hardware has to be addressed in software. That means if the processor can't resolve resource conflicts, such as overuse of a register or pipeline, then the compiler is going to have to take care of it. Allowing the compiler to take care of it isn't necessarily a bad thing - it's a design decision. A complicated compiler and simple, fast hardware might be cost effective for certain applications. Two processors at opposite ends of this spectrum are the MIPS R2000 and the HP PA-8000. The first depends heavily on the compiler to schedule instructions and fairly distribute resources. The second manages both things at runtime, though both depend on the compiler to provide a balanced instruction mix.

In all computers, register selection is a challenge because, in spite of their numbers, registers are precious. You want to be sure that the most active variables become register resident at the expense of others. On machines without register renaming (see Section 5.1.1), you have to be sure that the compiler doesn't try to recycle registers too quickly, otherwise the processor has to delay computations as it waits for one to be freed.

Some instructions in the repertoire also save your compiler from having to issue others. Examples are auto-increment for registers being used as array indices or conditional assignments in lieu of branches. These both save the processor from extra calculations and make the instruction stream more compact.

Lastly, there are opportunities for increased parallelism. Programmers generally think serially, specifying steps in logical succession. Unfortunately, serial source code makes serial object code. A compiler that hopes to efficiently use the parallelism of the processor will have to be able to move instructions around and find operations that can be issued side by side. This is one of the biggest challenges for compiler writers today. As superscalar and very long instruction word (VLIW) designs become capable of executing more instructions per clock cycle, the compiler will have to dig deeper for operations that can execute at the same time.
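As a small illustration of the "conditional assignment in lieu of a branch" idea, compare the two C routines below (a sketch of ours; whether the compiler actually emits a conditional-move or select instruction depends on the target processor and optimization level):

   /* branching form: the processor must predict which way the test goes */
   int max_branch(int a, int b)
   {
       if (a > b)
           return a;
       return b;
   }

   /* branch-free form: on machines with conditional-move instructions the
      compiler can turn this (or the version above) into straight-line code */
   int max_select(int a, int b)
   {
       return (a > b) ? a : b;
   }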

2.1.7 Closing Notes15


This chapter has been a basic introduction into how an optimizing compiler operates. However, this is not the last we will talk about compilers. In order to perform the automatic vectorization, parallelization, and data decomposition, compilers must further analyze the source code. As we encounter these topics, we will discuss the compiler impacts and how programmers can best interact with compilers.

For single-processor modern RISC architectures, compilers usually generate better code than most assembly language programmers. Instead of compensating for a simplistic compiler by adding hand optimizations, we as programmers must keep our programs simple so as not to confuse the compiler. By understanding the patterns that compilers are quite capable of optimizing, we can focus on writing straightforward programs that are portable and understandable.

2.1.8 Exercises16
Exercise 2.1
Does your compiler recognize dead code in the program below? How can you be sure? Does the compiler give you a warning?

   main()
   {
      int k=1;
      if (k == 0)
         printf ("This statement is never executed.\n");
   }

Exercise 2.2
Compile the following code and execute it under various optimization levels. Try to guess the different types of optimizations that are being performed to improve the performance as the optimization is increased.

      REAL*8 A(1000000)
      DO I=1,1000000
        A(I) = 3.1415927
      ENDDO
      DO I=1,1000000
        A(I) = A(I) * SIN(A(I)) + COS(A(I))
      ENDDO
      PRINT *,"All Done"

15 This content is available online at <http://cnx.org/content/m33699/1.2/>.
16 This content is available online at <http://cnx.org/content/m33700/1.2/>.

Exercise 2.3

Take the following code segment and compile it at various optimization levels. Look at the generated assembly language code (-S option on some compilers) and find the effects of each optimization level on the machine language. Time the program to see the performance at the different optimization levels. If you have access to multiple architectures, look at the code generated using the same optimization levels on different architectures.

      REAL*8 A(1000000)
      COMMON/BLK/A
      .... Call Time
      DO I=1,1000000
        A(I) = A(I) + 1.234
      ENDDO
      .... Call Time
      END

Why is it necessary to put the array into a common block?

2.2 Timing and Profiling


2.2.1 Introduction17
Perhaps getting your code to produce the right answers is enough. After all, if you only plan to use the program once in a while, or if it only takes a few minutes to run, execution time isn't going to matter that much. But it might not always be that way. Typically, people start taking interest in the runtime of their programs for two reasons:

- The workload has increased.
- They are considering a new machine.


It's clear why you might care about the performance of your program if the workload increases. Trying to cram 25 hours of computing time into a 24-hour day is an administrative nightmare. But why should people who are considering a new machine care about the runtime? After all, the new machine is presumably faster than the old one, so everything should take less time. The reason is that when people are evaluating new machines, they need a basis of comparison - a benchmark. People often use familiar programs as benchmarks. It makes sense: you want a benchmark to be representative of the kind of work you do, and nothing is more representative of the work you do than the work you do!

Benchmarking sounds easy enough, provided you have timing tools. And you already know the meaning of time.18 You just want to be sure that what those tools are reporting is the same as what you think you're getting; especially if you have never used the tools before.

17 This content is available online at <http://cnx.org/content/m33704/1.2/>.
18 Time is money.

To illustrate, imagine if someone took your watch and replaced it with another that expressed time in some funny units or three overlapping sets of hands. It would be very confusing; you might have a problem reading it at all. You would also be justifiably nervous about conducting your affairs by a watch you don't understand. UNIX timing tools are like the six-handed watch, reporting three different kinds of time measurements. They aren't giving conflicting information - they just present more information than you can jam into a single number. Again, the trick is learning to read the watch. That's what the first part of this chapter is about. We'll investigate the different types of measurements that determine how a program is doing.

If you plan to tune a program, you need more than timing information. Where is time being spent - in a single loop, subroutine call overhead, or with memory problems? For tuners, the latter sections of this chapter discuss how to profile code at the procedural and statement levels. We also discuss what profiles mean and how they predict the approach you have to take when, and if, you decide to tweak the code for performance, and what your chances for success will be.

2.2.2 Timing19
We assume that your program runs correctly. It would be rather ridiculous to time a program that's not running right, though this doesn't mean it doesn't happen. Depending on what you are doing, you may be interested in knowing how much time is spent overall, or you may be looking at just a portion of the program. We show you how to time the whole program first, and then talk about timing individual loops or subroutines.

2.2.2.1 Timing a Whole Program


Under UNIX, you can time program execution by placing the time command before everything else you normally type on the command line. When the program finishes, a timing summary is produced. For instance, if your program is called foo, you can time its execution by typing time foo. If you are using the C shell or Korn shell, time is one of the shell's built-in commands. With a Bourne shell, time is a separate command executable in /bin. In any case, the following information appears at the end of the run:

- User time
- System time
- Elapsed time


These timing figures are easier to understand with a little background. As your program runs, it switches back and forth between two fundamentally different modes: user mode and kernel mode. The normal operating state is user mode. It is in user mode that the instructions the compiler generated on your behalf get executed, in addition to any subroutine library calls linked with your program.20 It might be enough to run in user mode forever, except that programs generally need other services, such as I/O, and these require the intervention of the operating system - the kernel. A kernel service request made by your program, or perhaps an event from outside your program, causes a switch from user mode into kernel mode.

Time spent executing in the two modes is accounted for separately. The user time figure describes time spent in user mode. Similarly, system time is a measure of the time spent in kernel mode. As far as user time goes, each program on the machine is accounted for separately. That is, you won't be charged for activity in somebody else's application. System time accounting works the same way, for the most part; however, you can, in some instances, be charged for some system services performed on other people's behalf, in addition to your own. Incorrect charging occurs because your program may be executing at the moment some outside activity causes an interrupt. This seems unfair, but take consolation in the fact that it works both ways: other users may be charged for your system activity too, for the same reason.

19 This content is available online at <http://cnx.org/content/m33706/1.2/>.
20 Cache miss time is buried in here too.


Taken together, user time and system time are called CPU time. Generally, the user time is far greater than the system time. You would expect this because most applications only occasionally ask for system services. In fact, a disproportionately large system time probably indicates some trouble. For instance, programs that are repeatedly generating exception conditions, such as page faults, misaligned memory references, or floating-point exceptions, use an inordinate amount of system time. Time spent doing things like seeking on disk, rewinding tape, or waiting for characters at the terminal doesn't show up in CPU time. That's because these activities don't require the CPU; the CPU is free to go off and execute other programs.

The third piece of information (corresponding to the third set of hands on the watch), elapsed time, is a measure of the actual (wall clock) time that has passed since the program was started. For programs that spend most of their time computing, the elapsed time should be close to the CPU time. Reasons why elapsed time might be greater are:

- You are timesharing the machine with other active programs.21
- Your application performs a lot of I/O.
- Your application requires more memory bandwidth than is available on the machine.
- Your program was paging or swapped.

People often record the CPU time and use it as an estimate for elapsed time. Using CPU time is okay on a single CPU machine, provided you have seen the program run when the machine was quiet and noticed the two numbers were very close together. But for multiprocessors, the total CPU time can be far different from the elapsed time. Whenever there is a doubt, wait until you have the machine to yourself and time your program then, using elapsed time. It is very important to produce timing results that can be verified using another run when the results are being used to make important purchasing decisions.

If you are running on a Berkeley UNIX derivative, the C shell's built-in time command can report a number of other useful statistics. The default form of the output is shown in Figure 2.3 (The built-in csh time function). Check with your manual page for more possibilities. In addition to figures for CPU and elapsed time, the csh time command produces information about CPU utilization, page faults, swaps, blocked I/O operations (usually disk activity), and some measures of how much physical memory our program occupied when it ran. We describe each of them in turn.

2.2.2.1.1 Percent utilization

Percent utilization corresponds to the ratio of elapsed time to CPU time. As we mentioned above, there can be a number of reasons why the CPU utilization wouldn't be 100% or mighty close. You can often get a hint from the other fields as to whether it is a problem with your program or whether you were sharing the machine when you ran it.

2.2.2.1.2 Average real memory utilization

The two measurements of average real memory utilization shown in Figure 2.3 (The built-in csh time function) characterize the program's resource requirements as it ran. The first measurement, shared-memory space, accounts for the average amount of real memory taken by your program's text segment - the portion that holds the machine instructions. It is called shared because several concurrently running copies of a program can share the same text segment (to save memory). Years ago, it was possible for the text segment to consume a significant portion of the memory system, but these days, with memory sizes starting around 32 MB, you have to compile a pretty huge source program and use every bit of it to create a shared-memory usage figure big enough to cause concern. The shared-memory space requirement is usually quite low relative to the amount of memory available on your machine.

21 The uptime command gives you a rough indication of the other activity on your machine. The last three fields tell the average number of processes ready to run during the last 1, 5, and 15 minutes, respectively.


Figure 2.3: The built-in csh time function

The second average memory utilization measurement, unshared-memory space, describes the average storage dedicated to your program's data structures as it ran. This storage includes saved local variables and COMMON for FORTRAN, and static and external variables for C. We stress the word real here and above because these numbers talk about physical memory usage, taken over time. It may be that you have allocated arrays with 1 trillion elements (virtual space), but if your program only crawls into a corner of that space, your runtime memory requirements will be pretty low.

What the unshared-memory space measurement doesn't tell you, unfortunately, is your program's demand for memory at its greediest. An application that requires 100 MB 1/10th of the time and 1 KB the rest of the time appears to need only 10 MB on average - not a revealing picture of the program's memory requirements.

2.2.2.1.3 Blocked I/O operations


The two figures for blocked I/O operations primarily describe disk usage, though tape devices and some other peripherals may also be used with blocked I/O. Character I/O operations, such as terminal input and output, do not appear here. A large number of blocked I/O operations could explain a lower-than-expected CPU utilization.

2.2.2.1.4 Page faults and swaps

An unusually high number of page faults or any swaps probably indicates a system choked for memory, which would also explain longer-than-expected elapsed time. It may be that other programs are competing for the same space. And don't forget that even under optimal conditions, every program suffers some number of page faults, as explained in Section 1.1.1. Techniques for minimizing page faults are described in Section 2.4.1.



2.2.2.2 Timing a Portion of the Program


For some benchmarking or tuning efforts, measurements taken on the outside of the program tell you everything you need to know. But if you are trying to isolate performance figures for individual loops or portions of the code, you may want to include timing routines on the inside too. The basic technique is simple enough:

1. Record the time before you start doing X.
2. Do X.
3. Record the time at completion of X.
4. Subtract the start time from the completion time.

If, for instance, X's primary job is to calculate particle positions, divide by the total time to obtain a number for particle positions/second. You have to be careful though; too many calls to the timing routines, and the observer becomes part of the experiment. The timing routines take time too, and their very presence can increase instruction cache miss or paging. Furthermore, you want X to take a significant amount of time so that the measurements are meaningful. Paying attention to the time between timer calls is really important because the clock used by the timing functions has limited resolution. An event that occurs within a fraction of a second is hard to measure with any accuracy.
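A minimal C sketch of the four steps (our own example; the work inside the timed region is just a stand-in), using the gettimeofday() call that also appears later in this section:

   #include <stdio.h>
   #include <sys/time.h>

   int main(void)
   {
       struct timeval t0, t1;
       double sum = 0.0, elapsed;
       int i, n = 10000000;

       gettimeofday(&t0, NULL);            /* 1. record the start time      */
       for (i = 0; i < n; i++)             /* 2. do X                       */
           sum += i * 0.5;
       gettimeofday(&t1, NULL);            /* 3. record the completion time */

       /* 4. subtract start from completion, seconds and microseconds */
       elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1.0e-6;

       printf("sum=%f  %d iterations in %f seconds (%.0f per second)\n",
              sum, n, elapsed, n / elapsed);
       return 0;
   }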

2.2.2.3 Getting Time Information


In this section, we discuss methods for getting various timer values during the execution of your program. For FORTRAN programs, a library timing function found on many machines is called etime, which takes a two-element REAL*4 array as an argument and fills the slots with the user CPU time and system CPU time, respectively. The value returned by the function is the sum of the two. Here's how etime is often used:

      real*4 tarray(2), etime
      real*4 start, finish

      start = etime(tarray)
      finish = etime(tarray)

      write (*,*) 'CPU time: ', finish - start
Not every vendor supplies an etime function; in fact, one doesn't provide a timing routine for FORTRAN at all. Try it first. If it shows up as an undefined symbol when the program is linked, you can use the following C routine. It provides the same functionality as etime:

   #include <sys/times.h>
   #define TICKS 100.

   float etime (parts)
   struct {
           float user;
           float system;
   } *parts;
   {
           struct tms local;
           times (&local);
           parts->user = (float) local.tms_utime/TICKS;
           parts->system = (float) local.tms_stime/TICKS;
           return (parts->user + parts->system);
   }

There are a couple of things you might have to tweak to make it work. First of all, linking C routines with FORTRAN routines on your computer may require you to add an underscore (_) after the function name. This changes the entry to float etime_ (parts). Furthermore, you might have to adjust the TICKS parameter. We assumed that the system clock had a resolution of 1/100 of a second (true for the Hewlett-Packard machines that this version of etime was written for). 1/60 is very common. On an RS-6000 the number would be 1000. You may find the value in a file named /usr/include/sys/param.h on your machine, or you can determine it empirically.

A C routine for retrieving the wall time using gettimeofday is shown below. It is suitable for use with either C or FORTRAN programs as it uses call-by-value parameter passing:

   #include <stdio.h>
   #include <stdlib.h>
   #include <sys/time.h>

   void hpcwall(double *retval)
   {
      static long zsec = 0;
      static long zusec = 0;
      double esec;
      struct timeval tp;
      struct timezone tzp;

      gettimeofday(&tp, &tzp);

      if ( zsec == 0 ) zsec = tp.tv_sec;
      if ( zusec == 0 ) zusec = tp.tv_usec;

      *retval = (tp.tv_sec - zsec) + (tp.tv_usec - zusec ) * 0.000001 ;
   }

   void hpcwall_(double *retval) { hpcwall(retval); } /* Other convention */
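Incidentally, rather than hard-coding TICKS in the etime replacement above, on POSIX systems you can ask the operating system for the clock-tick rate at runtime; a small sketch (our addition, not part of the original routine):

   #include <stdio.h>
   #include <unistd.h>

   int main(void)
   {
       /* clock ticks per second used by times(); this is the value
          that the TICKS macro above approximates                    */
       long ticks = sysconf(_SC_CLK_TCK);
       printf("clock ticks per second: %ld\n", ticks);
       return 0;
   }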


Given that you will often need both CPU and wall time, and you will be continually computing the difference between successive calls to these routines, you may want to write a routine to return the elapsed wall and CPU time upon each call as follows:

      SUBROUTINE HPCTIM(WTIME,CTIME)
      IMPLICIT NONE
      REAL WTIME,CTIME
      COMMON/HPCTIMC/CBEGIN,WBEGIN
      REAL*8 CBEGIN,CEND,WBEGIN,WEND
      REAL ETIME,CSCRATCH(2)
*
      CALL HPCWALL(WEND)
      CEND=ETIME(CSCRATCH)
*
      WTIME = WEND - WBEGIN
      CTIME = CEND - CBEGIN
*
      WBEGIN = WEND
      CBEGIN = CEND
      END

2.2.2.4 Using Timing Information


You can get a lot of information from the timing facilities on a UNIX machine. Not only can you tell how long it takes to perform a given job, but you can also get hints about whether the machine is operating efficiently, or whether there is some other problem that needs to be factored in, such as inadequate memory. Once the program is running with all anomalies explained away, you can record the time as a baseline. If you are tuning, the baseline will be a reference with which you can tell how much (or little) tuning has improved things. If you are benchmarking, you can use the baseline to judge how much overall incremental performance a new machine will give you. But remember to watch the other figures - paging, CPU utilization, etc. These may differ from machine to machine for reasons unrelated to raw CPU performance. You want to be sure you are getting the full picture.

2.2.3 Subroutine Profiling22


Sometimes you want more detail than the overall timing of the application. But you don't have time to modify the code to insert several hundred etime calls into your code. Profiles are also very useful when you have been handed a strange 20,000-line application program and told to figure out how it works and then improve its performance.

Most compilers provide a facility to automatically insert timing calls into your code at the entry and exit of each routine at compile time. While your program runs, the entry and exit times are recorded and then dumped into a file. A separate utility summarizes the execution patterns and produces a report that shows the percentage of the time spent in each of your routines and the library routines.

The profile gives you a sense of the shape of the execution profile. That is, you can see that 10% of the time is spent in subroutine A, 5% in subroutine B, etc. Naturally, if you add all of the routines together they should account for 100% of the overall time spent. From these percentages you can construct a picture - a profile - of how execution is distributed when the program runs. Though not representative of any particular profiling tool, the histograms in Figure 2.4 (Sharp profile - dominated by routine 1) and Figure 2.5 (Flat profile - no routine predominates) depict these percentages, sorted from left to right, with each vertical column representing a different routine. They help illustrate different profile shapes.

22 This content is available online at <http://cnx.org/content/m33713/1.2/>.


Figure 2.4: Sharp profile - dominated by routine 1

A sharp profile says that most of the time is spent in one or two procedures, and if you want to improve the program's performance you should focus your efforts on tuning those procedures. A minor optimization in a heavily executed line of code can sometimes have a great effect on the overall runtime, given the right opportunity. A flat profile,23 on the other hand, tells you that the runtime is spread across many routines, and effort spent optimizing any one or two will have little benefit in speeding up the program. Of course, there are also programs whose execution profile falls somewhere in the middle.

23 The term flat profile is a little overloaded. We are using it to describe a profile that shows an even distribution of time throughout the program. You will also see the label flat profile used to draw a distinction from a call graph profile, as described below.




Figure 2.5: Flat profile - no routine predominates

We cannot predict with absolute certainty what you are likely to find when you profile your programs, but there are some general trends. For instance, engineering and scientific codes built around matrix solutions often exhibit very sharp profiles. The runtime is dominated by the work performed in a handful of routines. To tune the code, you need to focus your efforts on those routines to make them more efficient. It may involve restructuring loops to expose parallelism, providing hints to the compiler, or rearranging memory references. In any case, the challenge is tangible; you can see the problems you have to fix.

There are limits to how much tuning one or two routines will improve your runtime, of course. An often quoted rule of thumb is Amdahl's Law, derived from remarks made in 1967 by one of the designers of the IBM 360 series, and founder of Amdahl Computer, Gene Amdahl. Strictly speaking, his remarks were about the performance potential of parallel computers, but people have adapted Amdahl's Law to describe other things too. For our purposes, it goes like this: Say you have a program with two parts, one that can be optimized so that it goes infinitely fast and another that can't be optimized at all. Even if the optimizable portion makes up 50% of the initial runtime, at best you will be able to cut the total runtime in half. That is, your runtime will eventually be dominated by the portion that can't be optimized. This puts an upper limit on your expectations when tuning.

Even given the finite return on effort suggested by Amdahl's Law, tuning a program with a sharp profile can be rewarding. Programs with flat profiles are much more difficult to tune. These are often system codes, nonnumeric applications, and varieties of numerical codes without matrix solutions. It takes a global tuning approach to reduce, to any justifiable degree, the runtime of a program with a flat profile. For instance, you can sometimes optimize instruction cache usage, which is complicated because of the program's equal distribution of activity among a large number of routines. It can also help to reduce subroutine call overhead by folding callees into callers. Occasionally, you can find a memory reference problem that is endemic to the whole program - and one that can be fixed all at once.
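Stated as a formula (our summary of the rule of thumb described above): if a fraction p of the runtime can be sped up by a factor s, the overall speedup is bounded by 1 / ((1 - p) + p/s). A tiny C illustration:

   #include <stdio.h>

   /* overall speedup when a fraction p of the runtime is accelerated
      by a factor s (Amdahl's Law)                                     */
   double amdahl(double p, double s)
   {
       return 1.0 / ((1.0 - p) + p / s);
   }

   int main(void)
   {
       /* even if half the program becomes essentially infinitely fast,
          the best overall gain is a factor of two                      */
       printf("p=0.50, s=1e9 -> speedup %.2f\n", amdahl(0.50, 1.0e9));
       printf("p=0.90, s=10  -> speedup %.2f\n", amdahl(0.90, 10.0));
       return 0;
   }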

When you look at a profile, you might find an unusually large percentage of time spent in the library routines such as log, exp, or sin. Often these functions are done in software routines rather than inline. You may be able to rewrite your code to eliminate some of these operations. Another important pattern to look for is when a routine takes far longer than you expect. Unexpected execution time may indicate you are accessing memory in a pattern that is bad for performance or that some aspect of the code cannot be optimized properly.

In any case, to get a profile, you need a profiler. One or two come standard with the software development environments on all UNIX machines. We discuss two of them: prof and gprof. In addition, we mention a few line-by-line profilers. Subroutine profilers can give you a general overall view of where time is being spent. You probably should start with prof, if you have it (most machines do). Otherwise, use gprof. After that, you can move to a line-by-line profiler if you need to know which statements take the most time.

2.2.3.1 prof

prof is the most common of the UNIX profiling tools. In a sense, it is an extension of the compiler, linker, and object libraries, plus a few extra utilities, so it is hard to look at any one thing and say "this profiles your code." prof works by periodically sampling the program counter as your application runs. To enable profiling, you must recompile and relink using the -p flag. For example, if your program has two modules, stuff.c and junk.c, you need to compile and link according to the following code:

   % cc stuff.c -p -O -c
   % cc junk.c -p -O -c
   % cc stuff.o junk.o -p -o stuff

This creates a stuff binary that is ready for profiling. You don't need to do anything special to run it. Just treat it normally by entering stuff. Because runtime statistics are being gathered, it takes a little longer than usual to execute.24 At completion, there is a new file called mon.out in the directory where you ran it. This file contains the history of stuff in binary form, so you can't look at it directly. Use the prof utility to read mon.out and create a profile of stuff. By default, the information is written to your screen on standard output, though you can easily redirect it to a file:

   % prof stuff > stuff.prof


To explore how the prof command works, we have created the following ridiculous little application, loops.c. It contains a main routine and three subroutines for which you can predict the time distribution just by looking at the code.

   main () {
      int l;
      for (l=0;l<1000;l++) {
         if (l == 2*(l/2)) foo ();
         bar();
         baz();
      }
   }
   foo (){
      int j;
      for (j=0;j<200;j++);
   }
   bar () {
      int i;
      for (i=0;i<200;i++);
   }
   baz () {
      int k;
      for (k=0;k<300;k++);
   }

24 Remember: code with profiling enabled takes longer to run. You should recompile and relink the whole thing without the -p flag when you have finished profiling.
Again, you need to compile and link loops with the -p flag, run the program, and then run the prof utility to extract a profile, as follows:

   % cc loops.c -p -o loops
   % ./loops
   % prof loops > loops.prof


The following example shows what loops.prof should look like. There are six columns.

   %Time  Seconds  Cumsecs  #Calls  msec/call  Name
    56.8     0.50     0.50    1000     0.500   baz
    27.3     0.24     0.74    1000     0.240   bar
    15.9     0.14     0.88     500     0.28    foo
     0.0     0.00     0.88       1     0.      creat
     0.0     0.00     0.88       2     0.      profil
     0.0     0.00     0.88       1     0.      main
     0.0     0.00     0.88       3     0.      getenv
     0.0     0.00     0.88       1     0.      strcpy
     0.0     0.00     0.88       1     0.      write
The columns can be described as follows:

%Time - Percentage of CPU time consumed by this routine
Seconds - CPU time consumed by this routine
Cumsecs - A running total of time consumed by this and all preceding routines in the list
#Calls - The number of times this particular routine was called
msec/call - Seconds divided by number of calls giving the average length of time taken by each invocation of the routine
Name - The name of this routine
The top three routines listed are from loops.c itself. You can see an entry for the main routine more than halfway down the list. Depending on the vendor, the names of the routines may contain leading or trailing underscores, and there will always be some routines listed you don't recognize. These are contributions from the C library and possibly the FORTRAN libraries, if you are using FORTRAN. Profiling also introduces some overhead into the run, and often shows up as one or two subroutines in the prof output. In this case, the entry for profil represents code inserted by the linker for collecting runtime profiling data.

If it was our intention to tune loops, we would consider a profile like the one in the figure above to be a fairly good sign. The lead routine takes 50% of the runtime, so at least there is a chance we could do something with it that would have a significant impact on the overall runtime. (Of course with a program as trivial as loops, there is plenty we can do, since loops does nothing.)

2.2.3.2 gprof
Just as it's important to know how time is distributed when your program runs, it's also valuable to be able to tell who called who in the list of routines. Imagine, for instance, if something labeled _exp showed up high in the list in the prof output. You might say: "Hmmm, I don't remember calling anything named exp(). I wonder where that came from." A call tree helps you find it.

Subroutines and functions can be thought of as members of a family tree. The top of the tree, or root, is actually a routine that precedes the main routine you coded for the application. It calls your main routine, which in turn calls others, and so on, all the way down to the leaf nodes of the tree. This tree is properly known as a call graph.25 The relationship between routines and nodes in the graph is one of parents and children. Nodes separated by more than one hop are referred to as ancestors and descendants.

Figure 2.6 (Simple call graph) graphically depicts the kind of call graph you might see in a small application. main is the parent or ancestor of most of the rest of the routines. G has two parents, E and C. Another routine, A, doesn't appear to have any ancestors or descendants at all. This problem can happen when routines are not compiled with profiling enabled, or when they aren't invoked with a subroutine call - such as would be the case if A were an exception handler.

The UNIX profiler that can extract this kind of information is called gprof. It replicates the abilities of prof, plus it gives a call graph profile so you can see who calls whom, and how often. The call graph profile is handy if you are trying to figure out how a piece of code works or where an unknown routine came from, or if you are looking for candidates for subroutine inlining.

To use call graph profiling you need go through the same steps as with prof, except that a -pg flag is substituted for the -p flag.26 Additionally, when it comes time to produce the actual profile, you use the gprof utility instead of prof. One other difference is that the name of the statistics file is gmon.out instead of mon.out:
   % cc -pg stuff.c -c
   % cc stuff.o -pg -o stuff
   % stuff
   % gprof stuff > stuff.gprof

25 It doesn't have to be a tree. Any subroutine can have more than one parent. Furthermore, recursive subroutine calls introduce cycles into the graph, in which a child calls one of its parents.
26 On HP machines, the flag is -G.




Figure 2.6: Simple call graph

The output from gprof is divided into three sections:

- Call graph profile
- Timing profile
- Index

The first section textually maps out the call graph. The second section lists routines, the percentage of time devoted to each, the number of calls, etc. (similar to prof). The third section is a cross reference so that you can locate routines by number, rather than by name. This section is especially useful for large applications because routines are sorted based on the amount of time they use, and it can be difficult to locate a particular routine by scanning for its name.

Let's invent another trivial application to illustrate how gprof works. Figure 2.7 (FORTRAN example) shows a short piece of FORTRAN code, along with a diagram of how the routines are connected together. Subroutines A and B are both called by MAIN, and, in turn, each calls C. The following example shows a section of the output from gprof's call graph profile:27

27 In the interest of conserving space, we clipped out the section most relevant to our discussion and included it in this example. There was a lot more to it, including calls of setup and system routines, the likes of which you will see when you run gprof.


Figure 2.7: FORTRAN example

                                      called/total      parents
   index  %time    self  descendants  called+self   name       index
                                      called/total      children
   ....

                   0.00      8.08       1/1          _main [2]
   [3]     99.9    0.00      8.08       1         _MAIN_ [3]
                   3.23      1.62       1/1          _b_ [4]
                   1.62      1.62       1/1          _a_ [5]

   -----------------------------------------------

                   3.23      1.62       1/1          _MAIN_ [3]
   [4]     59.9    3.23      1.62       1         _b_ [4]
                   1.62      0.00       1/2          _c_ [6]

   -----------------------------------------------

                   1.62      1.62       1/1          _MAIN_ [3]
   [5]     40.0    1.62      1.62       1         _a_ [5]
                   1.62      0.00       1/2          _c_ [6]

   -----------------------------------------------

                   1.62      0.00       1/2          _b_ [4]
                   1.62      0.00       1/2          _a_ [5]
   [6]     39.9    3.23      0.00       2         _c_ [6]

   -----------------------------------------------

QWFW

Sandwiched between each set of dashed lines is information describing a given routine and its relationship to parents and children. It is easy to tell which routine the block represents because the name is shifted farther to the left than the others. Parents are listed above, children below. As with prof, underscores are tacked onto the labels.28 A description of each of the columns follows:

index - You will notice that each routine name is associated with a number in brackets ([n]). This is a cross-reference for locating the routine elsewhere in the profile. If, for example, you were looking at the block describing _MAIN_ and wanted to know more about one of its children, say _a_, you could find it by scanning down the left side of the page for its index, [5].

%time - The meaning of the %time field is a little different than it was for prof. In this case it describes the percentage of time spent in this routine plus the time spent in all of its children. It gives you a quick way to determine where the busiest sections of the call graph can be found.

self - Listed in seconds, the self column has different meanings for parents, the routine in question, and its children. Starting with the middle entry - the routine itself - the self figure shows how much overall time was dedicated to the routine. In the case of C, for instance, this amounts to 3.23 seconds. Each self column entry shows the amount of time that can be attributed to calls from the parents. If you look at routine C, for example, you will see that it consumed a total time of 3.23 seconds. But note that it had two parents: 1.62 seconds of the time was attributable to calls from A, and 1.62 seconds to B. For the children, the self figure shows how much time was spent executing each child due to calls from this routine. The children may have consumed more time overall, but the only time accounted for is time attributable to calls from this routine. For example, C accumulated 3.23 seconds overall, but if you look at the block describing B, you see C listed as a child with only 1.62 seconds. That's the total time spent executing C on behalf of B.

descendants - As with the self column, figures in the descendants column have different meanings for the routine, its parents, and children. For the routine itself, it shows the number of seconds spent in all of its descendants. For the routine's parents, the descendants figure describes how much time spent in the routine can be traced back to calls by each parent. Looking at routine C again, you can see that of its total time, 3.23 seconds, 1.62 seconds were attributable to each of its two parents, A and B. For the children, the descendants column shows how much of the child's time can be attributed to calls from this routine. The child may have accumulated more time overall, but the only time displayed is time associated with calls from this routine.

calls - The calls column shows the number of times each routine was invoked, as well as the distribution of those calls associated with both parents and children. Starting with the routine itself, the figure in the calls column shows the total number of entries into the routine. In situations where the routine called itself, you will also see +n immediately appended, showing that additional calls were made recursively. Parent and child figures are expressed as ratios. For the parents, the ratio m/n says "of the n times the routine was called, m of those calls came from this parent." For the child, it says "of the n times this child was called, m of those calls came from this routine."

28 You may have noticed that there are two main routines: _MAIN_ and _main. In a FORTRAN program, _MAIN_ is the actual FORTRAN main routine. It's called as a subroutine by _main, provided from a system library at link time. When you're profiling C code, you won't see _MAIN_.

2.2.3.3 gprof's Flat Profile

As we mentioned previously, gprof also produces a timing profile (also called a "flat" profile, just to confuse things) similar to the one produced by prof. A few of the fields are different from prof, and there is some extra information, so it will help if we explain it briefly. The following example shows the first few lines from a gprof flat profile for stuff. You will recognize the top three routines from the original program. The others are library functions included at link-time.

    %    cumulative    self              self     total
   time     seconds   seconds    calls  ms/call  ms/call  name
   39.9        3.23      3.23        2  1615.07  1615.07  _c_ [6]
   39.9        6.46      3.23        1  3230.14  4845.20  _b_ [4]
   20.0        8.08      1.62        1  1620.07  3235.14  _a_ [5]
    0.1        8.09      0.01        3     3.33     3.33  _ioctl [9]
    0.0        8.09      0.00       64     0.00     0.00  _.rem [12]
    0.0        8.09      0.00       64     0.00     0.00  _f_clos [177]
    0.0        8.09      0.00       20     0.00     0.00  _sigblock [178]
   ...          ...       ...      ...      ...      ...  ......

Here's what each column means:

%time - Again, we see a field that describes the runtime for each routine as a percentage of the overall time taken by the program. As you might expect, all the entries in this column should total 100% (nearly).
cumulative seconds - For any given routine, the column called "cumulative seconds" tallies a running sum of the time taken by all the preceding routines plus its own time. As you scan towards the bottom, the numbers asymptotically approach the total runtime for the program.
self seconds - Each routine's individual contribution to the runtime.
calls - The number of times this particular routine was called.
self ms/call - Seconds spent inside the routine, divided by the number of calls. This gives the average length of time taken by each invocation of the routine. The figure is presented in milliseconds.
total ms/call - Seconds spent inside the routine plus its descendants, divided by the number of calls.
name - The name of the routine. Notice that the cross-reference number appears here too.

2.2.3.4 Accumulating the Results of Several gprof Runs


st is possile to umulte sttistis from multiple runs so tht you n get piture of how progrm is doing with vriety of dt setsF por instneD sy tht you wnted to pro(le n pplition " ll it " with three di'erent sets of input dtF ou ould perform the runs seprtelyD sving the (les s you goD nd then omine the results into single pro(le t the endX

gmon.out

bar

UV

CHAPTER 2. PROGRAMMING AND TUNING SOFTWARE


7 7 7 7 7 7 7 fUU Epg rFf Eo r r < dtIFinput mv gmonFout gmonFI r < dtPFinput mv gmonFout gmonFP r < dtQFinput gprof r Es gmonFI gmonFP gmonFout > gprofFsummryFout

sn the exmple pro(leD eh run long the wy retes new (le tht we renmed to mke room for the next oneF et the endD omines the inforE mtion from eh of the dt (les to produe summry pro(le of in the (le F edditionlly @you don9t see it hereAD retes (le nmed tht ontins the merged dt from the originl three dt (lesF hs the sme formt s D so you n use it s input for other merged pro(les down the rodF sn formD the output from merged pro(le looks extly the sme s for n individul runF here re ouple of interesting things you will noteD howeverF por one thingD the min routine ppers to hve een invoked more thn one " one time for eh runD in ftF purthermoreD depending on the pplitionD multiple runs tend to either smooth the ontour of the pro(le or exggerte its feturesF ou n imgine how this might hppenF sf single routine is onsistently lled while others ome nd go s the input dt hngesD it tkes on inresing importne in your tuning e'ortsF

bar gmon.sum gmon.out

gprof gprof.summary.out

gmon.out

gprof gmon.sum

2.2.3.5 A Few Words About Accuracy


por proessors running t THH wrz nd moreD the time etween TH rz nd IHH rz smples is veritle eternityF purthermoreD you n experiene quntiztion errors when the smpling frequeny is (xedD s is true of stedy IGIHHth or IGTHth of seond smplesF o tke n exggerted exmpleD ssume tht the timeline in pigure PFV @untiztion errors in pro(lingA shows lternting lls to two suroutinesD fe nd pyyF he tik mrks represent the smple points for pro(lingF

Quantization errors in proling

Figure 2.8

fe nd pyy tke turns runningF sn ftD fe tkes more time thn pyyF fut euse the smpling intervl losely mthes the frequeny t whih the two suroutines lternteD we get quntizing errorX most of the

UW smples hppen to e tken while pyy is runningF hereforeD the pro(le tells us tht pyy took more g time thn feF e hve desried the tried nd true xs suroutine pro(lers tht hve een ville for yersF sn mny sesD vendors hve muh etter tools ville for the sking or for feeF sf you re doing some serious tuningD sk your vendor representtive to look into other tools for youF

2.2.4 Basic Block Prolers29


here re severl good resons to desire (ner level of detil thn you n see with suroutine pro(lerF por humns trying to understnd how suroutine or funtion is usedD pro(ler tht tells whih lines of soure ode were tully exeutedD nd how oftenD is invluleY few lues out where to fous your tuning e'orts n sve you timeF elsoD suh pro(ler sves you from disovering tht prtiulrly lever optimiztion mkes no di'erene euse you put it in setion of ode tht never gets exeutedF es prt of n overll strtegyD suroutine pro(le n diret you to hndful of routines tht ount 30 for most of the runtimeD ut it tkes to get you to the ssoited soure ode linesF fsi lok pro(lers n lso provide ompilers with informtion they need to perform their own optiE miztionsF wost ompilers work in the drkF hey n restruture nd unroll loopsD ut they nnot tell when it will py o'F orse yetD mispled optimiztions often hve n dverse e'et of slowing down the ode3 his n e the result of dded instrution he urdenD wsted tests introdued y the ompilerD or inorret ssumptions out whih wy rnh would go t runtimeF sf the ompiler n utomtilly interpret the results of si lok pro(leD or if you n supply the ompiler with hintsD it often mens redued runE time with little e'ort on your prtF here re severl si lok pro(lers in the worldF he losest thing to stndrdD D is shipped with un worksttionsY it9s stndrd euse the instlled se is so igF yn wsEsed worksttionsD suh s those from ilion qrphis nd higD the pro(ler @pkged s n extension to A is lled F e explin rie)y how to run eh pro(ler using resonle set of swithesF ou n onsult your mnul pges for other optionsF

basic block proler

prof

tcov

pixie

2.2.4.1 tcov

tcovD ville on un worksttions nd other eg mhines tht run unyD gives exeution sttistis tht desrie the numer of times eh soure sttement ws exeutedF st is very esy to useF essume for illustrtion tht we hve soure progrm lled foo.cF he following steps rete si lok pro(leX
7 E fooF Eo foo 7 foo 7 tov fooF
he E option tells the ompiler to inlude the neessry support for F31 everl (les re reted in the proessF yne lled umultes history of the exeE ution frequenies within the progrm F ht isD old dt is updted with new dt eh time is runD so you n get n overll piture of wht hppens inside D given vriety of dt setsF tust rememer to len out the old dt if you wnt to strt overF he pro(le itself goes into (le lled F vet9s look t n illustrtionF felow is short g progrm tht performs ule sort of IH integersX

foo

foo.d

foo.tcov

foo

tcov

foo

29 This content is available online at <http://cnx.org/content/m33710/1.2/>. 30 A basic block is a section of code with only one entrance and one exit. If you
of a basic block is explained in detail in Section 2.1.1

know how many times the block was entered,

you know how many times each of the statements in the block was executed, which gives you a line-by-line prole. The concept

31 On

Sun Solaris systems, the

xa

option is used.

VH

CHAPTER 2. PROGRAMMING AND TUNING SOFTWARE


int n a {PQDIPDRQDPDWVDUVDPDSIDUUDV}Y min @A { int iD jD ktempY for @iaIHY i>HY iEEA { for @jaHY j<iY jCCA { if @nj < njCIA { ktemp a njCID njCI a njD nj a ktempY } } } }

tcov produes si lok pro(le tht ontins exeution ounts for eh soure lineD plus some summry
sttistis @not shownAX

int n a {PQDIPDRQDPDWVDUVDPDSIDUUDV}Y min @A I E> { int iD jD ktempY IH E> for @iaIHY i>HY iEEA { IHD SS E> for @jaHY j<iY jCCA { SS E> if @nj < njCIA { PQ E> ktemp a njCID njCI a njD nj a ktempY } } } I E> }
he numers to the left tell you the numer of times eh lok ws enteredF por instneD you n see tht the routine ws entered just oneD nd tht the highest ount ours t the test nj < njCIF shows more thn one ount on line in ples where the ompiler hs reted more thn one lokF

tcov

2.2.4.2 pixie

is little di'erent from F ther thn reporting the numer of times eh soure line ws exeutedD pixie reports the numer of mhine lok yles devoted to exeuting eh lineF sn theoryD you ould use this to lulte the mount of time spent per sttementD lthough nomlies like he misses re not representedF works y pixifying n exeutle (le tht hs een ompiled nd linked in the norml wyF felow we run on to rete new exeutle lled X

pixie

tcov

pixie pixie foo

foo.pixie

VI

7 7 7 7

fooF Eo foo pixie foo fooFpixie prof Epixie foo

elso reted ws (le nmed D whih ontins ddresses for the si loks within F hen the new progrmD D is runD it retes (le lled D ontining exeution ounts for the si loks whose ddresses re stored in F dt umultes from run to runF he sttistis re retrieved using nd speil pixie )gF 9s defult output omes in three setions nd showsX

pixie

foo.pixie prof

foo.Addrs foo.Counts foo.Addrs pixie

foo

gyles per routine roedure invotion ounts gyles per si line


felowD we hve listed the output of the third setion for the ule sortX

proedure @fileA min @fooFA lenup @flsufFA flose @flsufFA flose @flsufFA lenup @flsufFA flose @flsufFA min @fooFA min @fooFA FFFF

line U SW VI WR SR UT IH V FF

ytes RR PH PH PH PH IT PR QT FF

yles THS SHH SHH SHH SHH RHH PWV PHU FF

7 IPFII IHFHI IHFHI IHFHI IHFHI VFHI SFWU RFIR FFF

um 7 IPFII PPFIQ QPFIR RPFIS SPFIT THFIU TTFIR UHFPV FFF

rere you n see three entries for the min routine from D plus numer of system lirry routinesF he entries show the ssoited line numer nd the numer of mhine yles dedited to exeuting tht line s the progrm rnF por instneD line U of took THS yles @IP7 of the runtimeAF

foo.c

foo.c

2.2.5 Virtual Memory32


sn ddition to the negtive performne impt due to he missesD the virtul memory system n lso slow your progrm down if it is too lrge to (t in the memory of the system or is ompeting with other lrge jos for sre memory resouresF nder most xs implementtionsD the operting system utomtilly pges piees of progrm tht re too lrge for the ville memory out to the swp reF he progrm won9t e tossed out ompletelyY tht only hppens when memory gets extremely tightD or when your progrm hs een intive for whileF therD individul pges re pled in the swp re for lter retrievlF pirst of llD you need to e wre tht this is hppening if you don9t lredy know out itF eondD if it is hppeningD the memory ess ptterns re ritilF hen referenes re too widely stteredD your runtime will e ompletely dominted y disk sGyF sf you pln in dvneD you n mke virtul memory system work for you when your progrm is too lrge for the physil memory on the mhineF he tehniques re extly the sme s those for tuning
32 This
content is available online at <http://cnx.org/content/m33712/1.2/>.

VP

CHAPTER 2. PROGRAMMING AND TUNING SOFTWARE

softwreEmnged outEofEore solutionD or loop nestsF he proess of loking memory referenes so tht dt onsumed in neighorhoods uses igger portion of eh virtul memory pge efore rotting it out to disk to mke room for notherF33

2.2.5.1 Gauging the Size of Your Program and the Machine's Memory
row n you tell if you re running outEofEorec here re wys to hek for pgE ing on the mhineD ut perhps the most strightforwrd hek is to ompre the size of your progrm ginst the mount of ville memoryF ou do this with the ommndX

size

7 size myprogrm
yn ystem xs mhineD the output looks something like thisX

SQVUP C SQRTH C IHHIHUUP a IHIIVIHR


yn ferkeley xs derivtive you see something like thisX

text SQVUP

dt SQRTH

ss IHHIHUUP

hex WTQdV

deiml IHIIVIHR

he (rst three (elds desrie the mount of memory required for three di'erent portions of your progrmF he (rstD textD ounts for the mhine instrutions tht mke up your progrmF he seondD dtD inludes initilized vlues in your proE grm suh s the ontents of dt sttementsD ommon loksD externlsD hrter stringsD etF he third omponentD ssD @lok strted y symolAD is usully the lrgestF st desries n uninitilized dt re in your progrmF his re would e mde of ommon loks tht re not set y lok dtF he lst (eld is totl for ll three setions dded togetherD in ytesF34 xextD you need to know how muh memory you hve in your systemF nfortuntelyD there isn9t stndrd xs ommnd for thisF yn the GTHHHD tells youF yn n qs mhineD does itF wny ystem xs implementtions hve n ommndF yn ny ferkeley derivtiveD you n typeX

/etc/lscfg /etc/memsize

/etc/hinv

7 ps ux
33 We examine the techniques for blocking in Chapter 8. 34 Warning: The size command won't give you the full picture
in COMMON.

if your program allocates memory dynamically, or keeps data

on the stack. This area is especially important for C programs and FORTRAN programs that create large arrays that are not

VQ his ommnd gives you listing of ll the proesses running on the mhineF pind the proess with the lrgest vlue in the 7wiwF hivide the vlue in the (eld y the perentge of memory used to get rough (gure for how muh memory your mhine hsX

memory a G@7wiwGIHHA
por instneD if the lrgest proess shows S7 memory usge nd resident set size @A of VRH ufD your mhine hs VRHHHHG@SGIHHA a IT wf of memoryF35 sf the nswer from the size ommnd shows totl tht is nywhere ner the mount of memory you hveD you stnd good hne of pging when you run " espeilly if you re doing other things on the mhine t the sme timeF

2.2.5.2 Checking for Page Faults


our system9s performne monitoring tools tell you if progrms re pgingF ome pging is yuY pge fults nd pgeEins our nturlly s progrms runF elsoD e reful if you re ompeting for system resoures long with other usersF he piE ture you get won9t e the sme s when you hve the omputer to yourselfF o hek for pging tivity on ferkeley xs derivtiveD use the ommndF gommonly people invoke it with time inrement so tht it reports pging t regulr intervlsX

vmstat

7 vmstt S
his ommnd produes output every (ve seondsF

vots of vlule informtion is produedF por our purposesD the importnt (elds re vm or active virtual memoryD the fre or free real memoryD nd the pi nd po numers showing pging tivityF hen the fre

pros r w H H H H H H H H H H H H

memory vm fre re t VPR PISTV H H VRH PISHV H H VRT PIRTH H H WIV PIRRR H H

pi H H H H

pge po fr H H H H H H H H

de sr H H H H H H H H

disk fults pu sH dI dP dQ in sy s us sy id H H H H PH QU IQ H I WV I H H H PSI IVT IST H IH WH P H H H PRV IRW ISP I W VW R H H H PSV IRQ ISP P IH VW

(gure drops to ner zeroD nd the po (eld shows lot of tivityD it9s n indition tht the memory system is overworkedF yn ys mhineD pging tivity n e seen with the sr ommndX

7 sr Er S S
his ommnd shows you the mount of free memory nd swp spe presently villeF sf the free memory (gure is lowD you n ssume tht your progrm is pgingX
35 You
could also reboot the machine! It will tell you how much memory is available when it comes up.

VR

CHAPTER 2. PROGRAMMING AND TUNING SOFTWARE


t epr IV PHXRPXIW r freemem freeswp RHQP VPIRR

es we mentioned erlierD if you must run jo lrger thn the size of the memory on your mhineD the sme sort of dvie tht pplied to onserving he tivity pplies to pging tivityF36 ry to minimize the stride in your odeD nd where you n9tD loking memory referenes helps whole lotF e note on memory performne monitoring toolsX you should hek with your worksttion vendor to see wht they hve ville eyond or F here my e muh more sophistited @nd often grphilA tools tht n help you understnd how your progrm is using memoryF

vmstat sar

2.2.6 Closing Notes37


e hve seen some of the tools for timing nd pro(lingF iven though it seems like we overed lotD there re other kinds of pro(les we would like to e le to over " he miss mesurementsD runtime dependeny nlysisD )op mesurementsD nd so onF hese pro(les re good when you re looking for prtiulr nomliesD suh s he miss or )otingEpoint pipeline utiliztionF ro(lers for these quntities exist for some mhinesD ut they ren9t widely distriutedF yne thing to keep in mindX when you pro(le ode you sometimes get very limited view of the wy progrm is usedF his is espeilly true if it n perform mny types of nlyses for mny di'erent sets of input dtF orking with just one or two pro(les n give you distorted piture of how the ode opertes overllF smgine the following senrioX someone invites you to tke your very (rst ride in n utomoileF ou get in the pssenger9s set with sketh pd nd penD nd reord everything tht hppensF our oservtions inlude some of the followingX

he rdio is lwys onF he windshield wipers re never usedF he r moves only in forwrd diretionF
he dnger is thtD given this limited view of the wy r is opertedD you might wnt to disonnet the rdio9s onGo' knoD remove the windshield wipersD nd eliminte the reverse gerF his would ome s rel surprise to the next person who tries to k the r out on riny dy3 he point is tht unless you re reful to gther dt for of usesD you my not relly hve piture of how the progrm opertesF e single pro(le is (ne for tuning enhmrkD ut you my miss some importnt detils on multipurpose pplitionF orse yetD if you optimize it for one se nd ripple it for notherD you my do fr more hrm thn goodF ro(lingD s we sw in this hpterD is pretty mehnilF uning requires insightF st9s only fir to wrn you tht it isn9t lwys rewrdingF ometimes you pour your soul into lever modi(tion tht tully inreses the runtimeF ergh3 ht went wrongc ou9ll need to depend on your pro(ling tools to nswer thtF

all kinds

2.2.7 Exercises38
Exercise 2.4
36 By

ro(le the following progrm using gprofF ss there ny wy to tell how muh of the time spent in routine ws due to reursive llsc
the way, are you getting the message Out of memory? If you are running content is available online at <http://cnx.org/content/m33714/1.2/>. content is available online at <http://cnx.org/content/m33718/1.2/>.

csh, try typing unlimit to see if the message

goes away. Otherwise, it may mean that you don't have enough swap space available to run the job.

37 This 38 This

VS

min@A { int iD naIHY for @iaHY i<IHHHY iCCA { @nAY @nAY } } @nA int nY { if @n > HA { @nEIAY @nEIAY } } @nA int nY { @nAY }

Exercise 2.5 Exercise 2.6

ro(le n engineering ode @)otingEpoint intensiveA with full optimiztion on nd o'F row does the pro(le hngec gn you explin the hngec rite progrm to determine the overhed of the getrusge nd the etime llsF yther thn onsuming proessor timeD how n mking system ll to hek the time too often lter the pplition performnec

2.3 Eliminating Clutter


2.3.1 Introduction39
e hve looked t ode from the ompiler9s point of view nd t how to pro(le ode to (nd the troule spotsF his is good informtionD ut if you re disstis(ed with ode9s performneD you might still e wondering wht to do out itF yne possiility is tht your ode is too otuse for the ompiler to optimize properlyF ixess odeD too muh modulriztionD or even previous optimiztionErelted improvements n lutter up your ode nd onfuse the ompilersF glutter is nything tht ontriutes to the runtime without ontriuting to the nswerF st omes in two formsX

Things that contribute to overhead


39 This

uroutine llsD indiret memory referenesD tests within loopsD wordy testsD type onversionsD vriles preserved unneessrily
content is available online at <http://cnx.org/content/m33720/1.2/>.

VT

CHAPTER 2. PROGRAMMING AND TUNING SOFTWARE fences

uroutine llsD indiret memory referenesD tests within loopsD miguous pointers st9s not mistke tht some of the sme items pper in oth listsF uroutine lls or ifEsttements within loops n oth ite nd srth you y tking too muh time nd y reting " ples in the progrm where instrutions tht pper efore n9t e sfely intermixed with instrutions tht pper fterD t lest not without gret del of reF he gol of this hpter is to show you how to eliminte lutterD so you n restruture wht9s left over for the fstest exeutionF e sve few spei( topis tht might (t hereD espeilly those regrding memory referenesD for lter hpters where they re treted s sujets y themselvesF fefore we strtD we9ll remind youX s you look for wys to improve wht you hveD keep your eyes nd mind open to the possiility tht there might e fundmentlly etter wy to do something" more e0ient sorting tehniqueD rndom numer genertorD or solverF e di'erent lgorithm my uy you fr more speed thn tuningF elgorithms re eyond the sope of this ookD ut wht we re disussing here should help you reognize good odeD or help you to ode new lgorithm to get the est performneF

Things that restrict compiler exibility

2.3.2 Subroutine Calls40


e typil orportion is full of frightening exmples of overhedF y your deprtment hs prepred stk of pperwork to e ompleted y nother deprtmentF ht do you hve to do to trnsfer tht workc pirstD you hve to e sure tht your portion is ompletedY you n9t sk them to tke over if the mterils they need ren9t redyF xextD you need to pkge the mterils " dtD formsD hrge numersD nd the likeF end (nlly omes the o0il trnsferF pon reeiving wht you sentD the other deprtment hs to unpk itD do their joD repkge itD nd send it kF e lot of time gets wsted moving work etween deprtmentsF yf ourseD if the overhed is miniml ompred to the mount of useful work eing doneD it won9t e tht ig delF fut it might e more e0ient for smll jos to sty within one deprtmentF he sme is true of suroutine nd funtion llsF sf you only enter nd exit modules one in reltive whileD the overhed of sving registers nd prepring rgument lists won9t e signi(ntF roweverD if you re repetedly lling few smll suroutinesD the overhed n uoy them to the top of the pro(leF st might e etter if the work styed where it wsD in the lling routineF edditionllyD suroutine lls inhiit ompiler )exiilityF qiven the right opportunityD you9d like your ompiler to hve the freedom to intermix instrutions tht ren9t dependent upon eh otherF hese re found on either side of suroutine llD in the ller nd lleeF fut the opportunity is lost when the ompiler n9t peer into suroutines nd funtionsF snstrutions tht might overlp very niely hve to sty on their respetive sides of the rti(il feneF st helps if we illustrte the hllenge tht suroutine oundries present with n exggerted exmpleF he following loop runs very well on wide rnge of proessorsX

hy saIDx e@sA a e@sA C f@sA B g ixhhy


he ode elow performs the sme lultionsD ut look t wht we hve doneX

40 This

content is available online at <http://cnx.org/content/m33721/1.2/>.

VU

hy saIDx gevv wehh @e@sAD f@sAD gA ixhhy fysxi wehh @eDfDgA e a e C f B g ix ixh
ih itertion lls suroutine to do smll mount of work tht ws formerly within the loopF his is prtiulrly pinful exmple euse it involves )otingE point lultionsF he resulting loss of prllelismD oupled with the proedure ll overhedD might produe ode tht runs IHH times slowerF ememerD these opertions re pipelinedD nd it tkes ertin mount of windEup time efore the throughput rehes one opertion per lok yleF sf there re few )otingEpoint opertions to perform etween suroutine llsD the time spent winding up nd winding down pipelines (gures prominentlyF uroutine nd funtion lls omplite the ompiler9s ility to e0iently mnE ge gywwyx nd externl vrilesD delying until the lst possile moment tully storing them in memoryF he ompiler uses registers to hold the live vlues of mny vrilesF hen you mke llD the ompiler nnot tell whether the suroutine will e hnging vriles tht re delred s externl or gywwyxF hereforeD it9s fored to store ny modi(ed externl or gywwyx vriles k into memory so tht the llee n (nd themF vikewiseD fter the ll hs returnedD the sme vriles hve to e reloded into registers euse the ompiler n no longer trust the oldD registerEresident opiesF he penlty for sving nd restoring vriles n e sustntilD espeilly if you re using lots of themF st n lso e unwrrnted if vriles tht ought to e lol re spei(ed s externl or gywwyxD s in the following odeX

gywwyx GiviG u hy uaIDIHHH sp @u FiF IA gevv e ixhhy


sn this exmpleD u hs een delred s gywwyx vrileF st is used only s doEloop ounterD so there relly is no reson for it to e nything ut lolF roweverD euse it is in gywwyx lokD the ll to e fores the ompiler to store nd relod u eh itertionF his is euse the side e'ets of the ll re unknownF o frD it looks s if we re prepring se for huge min progrms without ny suroutines or funtions3 xot t llF wodulrity is importnt for keeping soure ode ompt nd understndleF end frnklyD the need for mintinility nd modulrity is lwys more importnt thn the need for performne improvementsF roweverD there re few pprohes for stremlining suroutine lls tht don9t require you to srp modulr oding tehniquesX mros nd proedure inliningF ememerD if the funtion or suroutine does resonle mount of workD proedure ll overhed isn9t going to mtter very muhF roweverD if one smll routine ppers s lef node in one of the usiest setions of the ll grphD you might wnt to think out inserting it in pproprite ples in the progrmF

small

2.3.2.1 Macros

Macros re little proedures tht re sustituted inline t ompile timeF

nlike suroutines or funtionsD whih re inluded one during the linkD mros re replited every ple they re usedF hen the ompiler mkes its (rst pss through your progrmD it looks for ptterns tht mth previous mro de(nitions nd expnds them inlineF sn ftD in lter stgesD the ompiler sees n expnded mro s soure odeF

VV

CHAPTER 2. PROGRAMMING AND TUNING SOFTWARE wros re prt of oth g nd pyex @lthough the pyex notion of mroD the statement functionD is reviled y the pyex ommunityD nd won9t survive muh longerAF por g progrmsD
41

mros re reted with 5define onstrutD s demonstrted hereX

5define verge@xDyA @@xCyAGPA min @A { flot q a IHHD p a SHY flot Y a verge@pDqAY printf @47f\n4DAY }
he (rst ompiltion step for g progrm is pss through the g preproessorD F his hppens utomtilly when you invoke the ompilerF expnds 5define sttements inlineD repling the pttern mthed y the mro de(nitionF sn the progrm oveD the sttementX

cpp

cpp

a verge@pDqAY
gets repled withX

a @@pCqAGPAY
ou hve to e reful how you de(ne the mro euse it literlly reples the pttern loted y por instneD if the mro de(nition sidX

cppF

5define multiply@DA @BA


nd you invoked it sX

a multiply@xCtDyCvAY
41 The
statement function has been eliminated in FORTRAN 90.

VW the resulting expnsion would e xCtByCv " proly not wht you intendedF sf you re g progrmmer you my e using mros without eing onsious of itF wny g heder (les @ A ontin mro de(nitionsF sn ftD some stndrd g lirry funtions re relly de(ned s mros in the heder (lesF por instneD the funtion n e linked in when you uild your progrmF sf you hve sttementX

.h

getchar

5inlude <stdioFh>
in your (leD is repled with mro de(nition t ompile timeD repling the g lirry funtionF ou n mke mros work for pyex progrms tooF42 por exmpleD pyex version of the g progrm ove might look like thisX

getchar cpp

5define eieq@DA @@CAGPA g yqew wesx iev eDD hee D GSHFDIHHFG e a eieq@DA si @BDBA e ixh
ithout little preprtionD the 5define sttement is rejeted y the pyex ompilerF he progrm (rst hs to e preproessed through to reple the use of eieq with its mro de(nitionF st mkes ompiltion twoEstep proedureD ut tht shouldn9t e too muh of urdenD espeilly if you re uilding your progrms under the ontrol of the utilityF e would lso suggest you store pyex progrms ontining diretives under to distinguish them from undorned pyexF tust e sure you mke your hnges only to the (les nd not to the output from F his is how you would preproess pyex (les y hndX

cpp .F

cpp make lename.F .F

cpp

7 GliGpp E < vergeFp > vergeFf 7 fUU vergeFf E


he pyex ompiler never sees the originl odeF snstedD the mro de(nition is sustituted inline s if you hd typed it yourselfX

yqew wesx iev eDD


programmers use the standard UNIX

42 Some

m4

preprocessor for FORTRAN

WH

CHAPTER 2. PROGRAMMING AND TUNING SOFTWARE


hee D GSHFDIHHFG e a @@CAGPA si @BDBA e ixh

fy the wyD some pyex ompilers reognize the extension lredyD mking the twoEstep proess unneessryF sf the ompiler sees the extension it invokes utomtillyD ompiles the outputD nd throws wy the intermedite (leF ry ompiling on your omputer to see if it worksF elsoD e wre tht mro expnsions my mke soure lines extend pst olumn UPD whih will proly mke your pyex ompiler omplin @or worseX it might pss unnotiedAF ome ompilers support input lines longer thn UP hrtersF yn the un ompilers the e option llows extended input lines up to IQP hrters longF

.f

.F

.F cpp .F

2.3.2.2 Procedure Inlining


wro de(nitions tend to e pretty shortD usully just single sttementF omeE times you hve slightly longer @ut not too longA its of ode tht might lso ene(t from eing opied inlineD rther thn lled s suroutine or funtionF eginD the reson for doing this is to eliminte proedure ll overhed nd expose prlE lelismF sf your ompiler is ple of suroutine nd funtion de(nitions into the modules tht ll themD then you hve very nturlD very portle wy to write modulr ode without su'ering the ost of suroutine llsF hepending on the vendorD you n sk the ompiler for proedure inlining yX

inlining

peifying whih routines should e inlined on the ompiler9s ommnd line utting inlining diretives into the soure progrm vetting the ompiler inline utomtilly
he diretives nd ompile line options re not stndrdD so you hve to hek your ompiler doumenttionF nfortuntelyD you my lern tht there is no suh feture @yetD lwys yetAD or tht it9s n expensive extrF he third form of inlining in the listD utomtiD is ville from just few vendorsF eutomti inlining depends on sophistited ompiler tht n view the de(nitions of severl modules t oneF here re some words of ution with regrd to proedure inliningF ou n esily do too muh of itF sf everything nd nything is ingested into the ody of its prentsD the resulting exeutle my e so lrge tht it repetedly spills out of the instrution he nd eomes net performne lossF yur dvie is tht you use the llerGllee informtion pro(lers give you nd mke some intelligent deisions out inliningD rther thn trying to inline every suroutine villeF eginD smll routines tht re lled often re generlly the est ndidtes for inliningF

2.3.3 Branches43
eople sometimes tke week to mke deisionD so we n9t fult omputer if it tkes few tens of nnoseondsF roweverD if n ifEsttement ppers in some hevily trveled setion of the odeD you might get tired of the delyF here re two si pprohes to reduing the impt of rnhesX

tremline themF wove them out to the omputtionl suursF rtiulrlyD get them out of loopsF
sn etion PFQFR we show you some esy wys to reorgnize onditionls so they exeute more quiklyF
43 This
content is available online at <http://cnx.org/content/m33722/1.2/>.

WI

2.3.4 Branches With Loops44


xumeril odes usully spend most of their time in loopsD so you don9t wnt nything inside loop tht doesn9t hve to e thereD espeilly n ifEsttementF xot only do ifEsttements gum up the works with extr instrutionsD they n fore strit order on the itertions of loopF yf ourseD you n9t lwys void onditionlsF ometimesD thoughD people ple them in loops to proess events tht ould hve een hndled outsideD or even ignoredF o tke you k few yersD the following ode shows loop with test for vlue lose to zeroX

eewii @wevv a IFiEPHA hy saIDx sp @ef@e@sAA FqiF wevvA rix f@sA a f@sA C e@sA B g ixhsp ixhhy
he ide ws tht if the multiplierD e@sAD were resonly smllD there would e no reson to perform the mth in the enter of the loopF feuse )otingEpoint opertions weren9t pipelined on mny mhinesD omprison nd rnh ws heperY the test would sve timeF yn n older gsg or erly sg proessorD omprison nd rnh is proly still svingsF fut on other rhiteturesD it osts lot less to just perform the mth nd skip the testF iliminting the rnh elimintes ontrol dependeny nd llows the ompiler to pipeline more rithmeti opertionsF yf ourseD the nswer ould hnge slightly if the test is elimintedF st then eomes question of whether the di'erene is signi(ntF rere9s nother exmple where rnh isn9t neessryF he loop (nds the solute vlue of eh element in n rryX

hy saIDx sp @e@sA FvF HFA e@sA a Ee@sA ixhhy


fut why perform the test t llc yn most mhinesD it9s quiker to perform the s@A opertion on every element of the rryF e do hve to give you wrningD thoughX if you re oding in gD the solute vlueD fs@AD is suroutine llF sn this prtiulr seD you re etter o' leving the onditionl in the loopF45 hen you n9t lwys throw out the onditionlD there re things you n do to minimize negtive performneF pirstD we hve to lern to reognize whih onditionls within loops n e restrutured nd whih nnotF gonditionls in loops fll into severl tegoriesX

voop invrint onditionls voop index dependent onditionls sndependent loop onditionls hependent loop onditionls edutions
at <http://cnx.org/content/m33723/1.2/>. a oating-point number starts with a sign bit. If the bit is 0, the number is positive. If

44 This content is available online 45 The machine representation of

it is 1, the number is negative. The fastest absolute value function is one that merely ands out the sign bit. See macros in

/usr/include/macros.h and /usr/include/math.h.

WP

CHAPTER 2. PROGRAMMING AND TUNING SOFTWARE


gonditionls tht trnsfer ontrol

vet9s look t these types in turnF

2.3.4.1 Loop Invariant Conditionals


he following loop ontins n

invariant testX

hy saIDu sp @x FiF HA rix e@sA a e@sA C f@sA B g ivi e@sA a HF ixhsp ixhhy
snvrint mens tht the outome is lwys the smeF egrdless of wht hppens to the vriles eD fD gD nd sD the vlue of x won9t hngeD so neither will the outome of the testF ou n rest the loop y mking the test outside nd repliting the loop ody twie " one for when the test is trueD nd one for when it is flseD s in the following exmpleX

sp @x FiF HA rix hy saIDu e@sA a e@sA C f@sA B g ixhhy ivi hy saIDu e@sA a H ixhhy ixhsp
he e'et on the runtime is drmtiF xot only hve we eliminted uEI opies of the testD we hve lso ssured tht the omputtions in the middle of the loop re not ontrolEdependent on the ifEsttementD nd re therefore muh esier for the ompiler to pipelineF e rememer helping someone optimize progrm with loops ontining similr onditionlsF hey were heking to see whether deug output should e printed eh itertion inside n otherwise highly optimizle loopF e n9t fult the person for not relizing how muh this slowed the progrm downF erformne wsn9t importnt t the timeF he progrmmer ws just trying to get the ode to produe good nswersF fut lter onD when performne mtteredD y lening up invrint onditionlsD we were le to speed up the progrm y ftor of IHHF

2.3.4.2 Loop Index Dependent Conditionals


por onditionlsD the test is true for ertin rnges of the loop index vrilesF st isn9t lwys true or lwys flseD like the onditionl we just looked tD ut it does hnge with preditle ptternD nd one tht we n use to our dvntgeF he following loop hs two index vrilesD s nd tF

loop index dependent

WQ

hy saIDx hy taIDx sp @t FvF sA e@tDsA a e@tDsA C f@tDsA B g ivi e@tDsA a HFH ixhsp ixhhy ixhhy
xotie how the ifEsttement prtitions the itertions into distint setsX those for whih it is true nd those for whih it is flseF ou n tke dvntge of the preditility of the test to restruture the loop into severl loops " eh ustomEmde for di'erent prtitionX

hy saIDx hy taIDsEI e@tDsA a e@tDsA C f@tDsA B g ixhhy hy tasDx e@tDsA a HFH ixhhy ixhhy
he new version will lmost lwys e fsterF e possile exeption is when x is smll vlueD like QD in whih se we hve reted more lutterF fut thenD the loop proly hs suh smll impt on the totl runtime tht it won9t mtter whih wy it9s odedF

2.3.4.3 Independent Loop Conditionals


st would e nie if you ould optimize every loop y prtitioning itF fut more often thn notD the onditionl doesn9t diretly depend on the vlue of the index vrilesF elthough n index vrile my e involved in ddressing n rryD it doesn9t rete reognizle pttern in dvne " t lest not one you n see when you re writing the progrmF rere9s suh loopX

hy saIDx hy taIDx sp @f@tDsA FqF IFHA e@tDsA a e@tDsA C f@tDsA B g ixhhy ixhhy
here is not muh you n do out this type of onditionlF fut euse every itertion is independentD the loop n e unrolled or n e performed in prllelF

WR

CHAPTER 2. PROGRAMMING AND TUNING SOFTWARE

2.3.4.4 Dependent Loop Conditionals


hen the onditionl is sed on vlue tht hnges with eh itertion of the loopD the ompiler hs no hoie ut to exeute the ode extly s writtenF por instneD the following loop hs n ifEsttement with uiltEin slr reursionX

hy saIDx sp @ FvF e@sAA a C f@sABPF ixhhy


ou n9t know whih wy the rnh will go for the next itertion until you re done with the urrent itertionF o reognize the dependenyD try to unroll the loop slightly y hndF sf you n9t strt the seond test until the (rst hs (nishedD you hve onditionlF ou my wnt to look t these types of loops to see if you n eliminte the itertionEtoEitertion vlueF

dependent loop

2.3.4.5 Reductions
ueep n eye out for loops in whih the ifEsttement is performing mx or min funtion on rryF his is D so lled euse it redues rry to slr result @the previous exmple ws redution tooD y the wyAF eginD we re getting little it hed of ourselvesD ut sine we re tlking out ifEsttements in loopsD s wnt to introdue trik for restruturing redutions mx nd min to expose more prllelismF he following loop serhes for the mximum vlueD zD in the rry y going through the elements one t timeX

reduction

for @iaHY i<nY iCCA z a i > z c i X zY


es writtenD it9s reursive like the loop from the previous setionF ou need the result of given itertion efore you n proeed to the nextF roweverD sine we re looking for the gretest element in the whole rryD nd sine tht will e the sme element @essentillyA no mtter how we go out looking for itD we n restruture the loop to hek severl elements t time @we ssume n is evenly divisile y P nd do not inlude the preonditioning loopAX

zH a HFY zI a HFY for @iaHY i< nEIY iCaPA { zH a zH < i c i X zHY zI a zI < iCI c iCI X zIY } z a zH < zI c zI X zHY
ho you see how the new loop lultes two new mximum vlues eh itertionc hese mximums re then ompred with one notherD nd the winner eomes the new o0il F st9s nlogous to plyEo'

max

WS rrngement in ingEong tournmentF heres the old loop ws more like two plyers ompeting t time while the rest st roundD the new loop runs severl mthes side y sideF sn generl this prtiulr optimiztion is not good one to ode y hndF yn prllel proessorsD the ompiler performs the redution in its own wyF sf you hndEode similr to this exmpleD you my indvertently limit the ompiler9s )exiility on prllel systemF

2.3.4.6 Conditionals That Transfer Control


vet9s step k seondF rve you notied similrity mong ll the loops so frc e hve looked only t prtiulr type of onditionlD " sed on the outome of the testD vrile gets ressignedF yf ourseD not every onditionl ends up in n ssignmentF ou n hve sttements tht trnsfer )ow of ontrolD suh s suroutine lls or goto sttementsF sn the following exmpleD the progrmmer is refully heking efore dividing y zeroF roweverD this test hs n extremely negtive impt on the performne euse it fores the itertions to e done preisely in orderX

conditional assignments

hy saIDx hy taIDx sp @f@tDsA FiF H A rix sx BDsDt y ixhsp e@tDsA a e@tDsA G f@tDsA ixhhy ixhhy
evoiding these tests is one of the resons tht the designers of the siii )otingE point stndrd dded the trp feture for opertions suh s dividing y zeroF hese trps llow the progrmmer in performneE ritil setion of the ode to hieve mximum performne yet still detet when n error oursF

2.3.5 Other Clutter46


glutter omes in mny formsF gonsider the previous setions s hving delt with lrge piees of junk you might (nd in the front hll losetX n ironing ordD hokey stiksD nd pool uesF xow we re down to the little thingsX widowed hekerD tennis llD nd ht noody ownsF e wnt to mention few of them hereF e pologize in dvne for hnging sujets lotD ut tht9s the nture of lening out loset3

2.3.5.1 Data Type Conversions


ttements tht ontin runtime type onversions su'er little performne penlty eh time the sttement is exeutedF sf the sttement is loted in portion of the progrm where there is lot of tivityD the totl penlty n e signi(ntF eople hve their resons for writing pplitions with mixed typingF yften it is mtter of sving memory speD memory ndwidthD or timeF sn the pstD for instneD douleEpreision lultions took twie s long s their singleEpreision ounterprtsD so if some of the lultions ould e rrnged to tke ple in single preisionD there ould e performne winF47 fut ny time sved y performing prt of the lultions in single preision nd prt in doule preision hs to e mesured ginst the dditionl
46 This content is available online at <http://cnx.org/content/m33724/1.2/>. 47 Nowadays, single-precision calculations can take longer than double-precision
calculations from register to register.

WT

CHAPTER 2. PROGRAMMING AND TUNING SOFTWARE

overhed used y the runtime type onversionsF sn the following odeD the ddition of e@sA to f@sA is X

mixed type

sxiqi xwivD s eewii @xwiv a IHHHA ievBV e@xwivA ievBR f@xwivA hy saIDxwiv e@sA a e@sA C f@sA ixhhy
sn eh itertionD f@sA hs to e promoted to doule preision efore the ddition n ourF ou don9t see the promotion in the soure odeD ut it9s thereD nd it tkes timeF g progrmmers ewreX in uernighn nd ithie @u8A gD ll )otingEpoint lultions in g progrms tke ple in doule preision " even if ll the vriles involved re delred s F st is possile for you to write whole uC pplition in one preisionD yet su'er the penlty of mny type onversionsF enother dt type!relted mistke is to use hrter opertions in sp testsF yn mny systemsD hrter opertions hve poorer performne thn integer opertions sine they my e done vi proedure llsF elsoD the optimizers my not look t ode using hrter vriles s good ndidte for optimiztionF por exmpleD the following odeX

oat

hy saIDIHHHH sp @ gre@sA FiF 99 A rix e@sA a e@sA C f@sABg ixhsp ixhhy


might e etter written using n integer vrile to indite whether or not omputtion should e performedX

hy saIDIHHHH sp @ spveq@sA FiF I A rix e@sA a e@sA C f@sABg ixhsp ixhhy


enother wy to write the odeD ssuming the spveq vrile ws H or ID would e s followsX

hy saIDIHHHH

WU

e@sA a e@sA C f@sABgBspveq@sA ixhhy


he lst pproh might tully perform slower on some omputer systems thn the pproh using the sp nd the integer vrileF

2.3.5.2 Doing Your Own Common Subexpression Elimination

o fr we hve given your ompiler the ene(t of the doutF " the ility of the ompiler to reognize repeted ptterns in the ode nd reple ll ut one with temporry vrile " proly works on your mhine for simple expressionsF sn the following lines of odeD most ompilers would reognize C s ommon suexpressionX

Common subexpression elimination

a C C d e a q C C
eomesX

temp a C a temp C d e a q C temp


ustituting for C elimintes some of the rithmetiF sf the expression is reused mny timesD the svings n e signi(ntF roweverD ompiler9s ility to reognize ommon suexpressions is limitedD espeilly when there re multiple omponentsD or their order is permutedF e ompiler might not reognize tht CC nd CC re equivlentF48 por importnt prts of the progrmD you might onsider doing ommon suexpression elimintion of omplited expressions y hndF his gurntees tht it gets doneF st ompromises euty somewhtD ut there re some situtions where it is worth itF rere9s nother exmple in whih the funtion sin is lled twie with the sme rgumentX

x a rBsin@ABos@AY y a rBsin@ABsin@AY z a rBos@AY


eomesX

temp a rBsin@AY
48 And
because of overow and round-o errors in oating-point, in some situations they might not be equivalent.

WV

CHAPTER 2. PROGRAMMING AND TUNING SOFTWARE


x a tempBos@AY y a tempBsin@AY z a rBos@AY

e hve repled one of the lls with temporry vrileF e greeD the svings for eliminting one trnsendentl funtion ll out of (ve won9t win you xoel prizeD ut it does ll ttention to n importnt pointX ompilers typilly do not perform ommon suexpression elimintion over suroutine or funtion llsF he ompiler n9t e sure tht the suroutine ll doesn9t hnge the stte of the rgument or some other vriles tht it n9t seeF he only time ompiler might eliminte ommon suexpressions ontining funtion lls is when they re intrinsisD s in pyexF his n e done euse the ompiler n ssume some things out their side e'etsF ouD on the other hndD n see into suroutinesD whih mens you re etter quli(ed thn the ompiler to group together ommon suexpressions involving suroutines or funtionsF

2.3.5.3 Doing Your Own Code Motion


ell of these optimiztions hve their iggest pyk within loops euse tht9s where ll of progrm9s tivity is onentrtedF yne of the est wys to ut down on runtime is to move unneessry or repeted @invrintA instrutions out of the min )ow of the ode nd into the suursF por loopsD it9s lled instrutions when they re pulled out from the top nd when they re pushed down elowF rere9s n exmpleX

sinking

hoisting

hy saIDx e@sA a e@sA G @B C BA ixhhy


eomesX

iw a I G @B C BA hy saIDx e@sA a e@sA B iw ixhhy


e hoisted n expensiveD invrint opertion out of the loop nd ssigned the result to temporry vrileF xotieD tooD tht we mde n lgeri simpli(tion when we exhnged division for multiplition y n inverseF he multiplition will exeute muh more quiklyF our ompiler might e smrt enough to mke these trnsformtions itselfD ssuming you hve instruted the ompiler tht these re legl trnsformtionsY ut without rwling through the ssemly lngugeD you n9t e positiveF yf ourseD if you rerrnge ode y hnd nd the runtime for the loop suddenly goes downD you will know tht the ompiler hs een sndgging ll longF ometimes you wnt to sink n opertion elow the loopF sullyD it9s some lultion performed eh itertion ut whose result is only needed for the lstF o illustrteD here9s sort of loop tht is di'erent from the ones we hve een looking tF st serhes for the (nl hrter in hrter stringX

WW

while @Bp 3a 9 9A a BpCCY


eomesX

while @BpCC 3a 9 9AY a B@pEIAY


he new version of the loop moves the ssignment of eyond the lst itertionF edmittedlyD this trnsforE mtion would e reh for ompiler nd the svings wouldn9t even e tht gretF fut it illustrtes the notion of sinking n opertion very wellF eginD hoisting or sinking instrutions to get them out of loops is something your ompiler should e ple of doingF fut often you n slightly restruture the lultions yourself when you move them to get n even greter ene(tF

2.3.5.4 Handling Array Elements in Loops


rere9s nother re where you would like to trust the ompiler to do the right thingF hen mking repeted use of n rry element within loopD you wnt to e hrged just one for loding it from memoryF ke the following loop s n exmpleF st reuses @sA twieX

hy saIDx yvh@sA a @sA @sAa @sA C sxg@sA ixhhy


sn relityD the steps tht go into retrieving @sA re just dditionl ommon suexE pressionsX n ddress lultion @possilyA nd memory lod opertionF ou n see tht the opertion is repeted y rewriting the loop slightlyX

hy saIDx iwa @sA yvh@sA a iw @sAa iw C sxg@sA ixhhy


pyex ompilers reognize tht the sme @sA is eing used twie nd tht it only needs to e loded oneD ut ompilers ren9t lwys so smrtF ou sometimes hve to rete temporry slr

should

IHH

CHAPTER 2. PROGRAMMING AND TUNING SOFTWARE

vrile to hold the vlue of n rry element over the ody of loopF his is prtiulrly true when there re suroutine lls or funtions in the loopD or when some of the vriles re externl or gywwyxF wke sure to mth the types etween the temporry vriles nd the other vrilesF ou don9t wnt to inur type onversion overhed just euse you re helping the ompilerF por g ompilersD the sme kind of indexed expresE sions re n even greter hllengeF gonsider this odeX

doin@int xoldDint xDint xinDint nA { for @iaHY i<nY iCCA { xoldi a xiY xia xi C xiniY } }
nless the ompiler n see the de(nitions of xD xinD nd xoldD it hs to ssume tht they re pointers leding k to the sme storgeD nd repet the lods nd storesF sn this seD introduing temporry vriles to hold the vlues xD xinD nd xold is n optimiztion the ompiler wsn9t free to mkeF snterestinglyD while putting slr temporries in the loop is useful for sg nd superslr mhinesD it doesn9t help ode tht runs on prllel hrdwreF e prllel ompiler looks for opportunities to eliminte the slrs orD t the very lestD to reple them with temporry vetorsF sf you run your ode on prllel mhine from time to timeD you might wnt to e reful out introduing slr temporry vriles into loopF e duious performne gin in one instne ould e rel performne loss in notherF

2.3.6 Closing Notes49


sn this hpterD we introdued tuning tehniques for eliminting progrm lutter " nything tht ontriutes to the runtime without ontriuting to the nswerF e sw mny exmples of tuning tehniques " enough tht you my e sking yourselfD ht9s leftc ellD s we will see in the upoming hptersD there re ouple of wys we n help the ompilerX

pind more prllelism se memory s e'etively s possile


ometimes this mens we mke hnges tht re not eutifulF roweverD they re often quikF

2.3.7 Exercises50
Exercise 2.7
row would you simplify the following loop onditionlc

hy saIDx e@sA a e@sA B f sp @s FiF xGPA e@sA a HF ixhhy


49 This 50 This
content is available online at <http://cnx.org/content/m33725/1.2/>. content is available online at <http://cnx.org/content/m33727/1.2/>.

IHI

Exercise 2.8

ime this loop on your omputerD oth with nd without the testF un it with three sets of dtX one with ll e@sAs less thn wevvD one with ll e@sAs greter thn wevvD nd one with n even splitF hen is it etter to leve the test in the loopD if everc

eewii @wevv a IFiEPHA hy saIDx sp @ef@e@sAA FqiF wevvA rix f@sA a f@sA C e@sA B g ixhsp ixhhy

Exercise 2.9

rite simple progrm tht lls simple suroutine in its inner loopF ime the progrm exeutionF hen tell the ompiler to inline the routine nd test the performne ginF pinllyD modify the ode to perform the opertions in the ody of the loop nd time the odeF hih option rn fsterc ou my hve to look t the generted mhine ode to (gure out whyF

2.4 Loop Optimizations


2.4.1 Introduction51
sn nerly ll high performne pplitionsD loops re where the mjority of the exeution time is spentF sn etion PFQFI we exmined wys in whih pplition developers introdued lutter into loopsD possily slowing those loops downF sn this hpter we fous on tehniques used to improve the performne of these lutterEfree loopsF ometimes the ompiler is lever enough to generte the fster versions of the loopsD nd other times we hve to do some rewriting of the loops ourselves to help the ompilerF st9s importnt to rememer tht one ompiler9s performne enhning modi(tions re nother omE piler9s lutterF hen you mke modi(tions in the nme of performne you must mke sure you9re helping y testing the performne with nd without the modi(tionsF elsoD when you move to nother rhiteture you need to mke sure tht ny modi(tions ren9t hindering performneF por this resonD you should hoose your performneErelted modi(tions wiselyF ou should lso keep the originl @simpleA version of the ode for testing on new rhiteturesF elso if the ene(t of the modi(tion is smllD you should proly keep the ode in its most simple nd ler formF e look t numer of di'erent loop optimiztion tehniquesD inludingX

voop unrolling xested loop optimiztion voop interhnge wemory referene optimiztion floking yutEofEore solutions

omedyD it my e possile for ompiler to perform ll these loop optimiztions utomtillyF ypilly loop unrolling is performed s prt of the norml ompiler optimiztionsF yther optimiztions my hve to
51 This
content is available online at <http://cnx.org/content/m33728/1.2/>.

IHP

CHAPTER 2. PROGRAMMING AND TUNING SOFTWARE

e triggered using expliit ompileEtime optionsF es you ontemplte mking mnul hngesD look refully t whih of these optimiztions n e done y the ompilerF elso run some tests to determine if the ompiler optimiztions re s good s hnd optimiztionsF

2.4.2 Operation Counting52


fefore you egin to rewrite loop ody or reorgnize the order of the loopsD you must hve some ide of wht the ody of the loop does for eh itertionF is the proess of surveying loop to understnd the opertion mixF ou need to ount the numer of lodsD storesD )otingEpointD integerD nd lirry lls per itertion of the loopF prom the ountD you n see how well the opertion mix of given loop mthes the pilities of the proessorF yf ourseD opertion ounting doesn9t gurntee tht the ompiler will generte n e0ient representtion of loopF53 fut it generlly provides enough insight to the loop to diret tuning e'ortsF fer in mind tht n instrution mix tht is lned for one mhine my e imlned for notherF roessors on the mrket tody n generlly issue some omintion of one to four opertions per lok yleF eddress rithmeti is often emedded in the instrutions tht referene memoryF feuse the ompiler n reple omplited loop ddress lultions with simple expressions @provided the pttern of ddresses is preditleAD you n often ignore ddress rithmeti when ounting opertionsF54 vet9s look t few loops nd see wht we n lern out the instrution mixX

Operation counting

hy saIDx e@sDtDuA a e@sDtDuA C f@tDsDuA ixhhy


his loop ontins one )otingEpoint ddition nd three memory referenes @two lods nd storeAF here re some omplited rry index expressionsD ut these will proly e simpli(ed y the ompiler nd exeuted in the sme yle s the memory nd )otingEpoint opertionsF por eh itertion of the loopD we must inrement the index vrile nd test to determine if the loop hs ompletedF e QXI rtio of memory referenes to )otingEpoint opertions suggests tht we n hope for no more thn IGQ pek )otingEpoint performne from the loop unless we hve more thn one pth to memoryF ht9s d newsD ut good informtionF he rtio tells us tht we ought to onsider memory referene optimiztions (rstF he loop elow ontins one )otingEpoint ddition nd two memory opertions " lod nd storeF ypernd f@tA is loopEinvrintD so its vlue only needs to e loded oneD upon entry to the loopX

hy saIDx e@sA a e@sA C f@tA ixhhy


eginD our )otingEpoint throughput is limitedD though not s severely s in the previous loopF he rtio of memory referenes to )otingEpoint opertions is PXIF
52 This content is available online at <http://cnx.org/content/m33729/1.2/>. 53 Take a look at the assembly language output to be sure, which may be going
listing on most machines, compile with the

54 The

S ag.

On an RS/6000, use the

qlist ag.

a bit overboard. To get an assembly language

compiler reduces the complexity of loop index expressions with a technique called

induction variable simplication.

See Section 2.1.1.

IHQ he next exmple shows loop with etter prospetsF st performs elementEwise multiplition of two vetors of omplex numers nd ssigns the results k to the (rstF here re six memory opertions @four lods nd two storesA nd six )otingEpoint opertions @two dditions nd four multiplitionsAX

for @iaHY i<nY iCCA { xri a xri B yri E xii B yiiY xii a xri B yii C xii B yriY }
st ppers tht this loop is roughly lned for proessor tht n perform the sme numer of memory opertions nd )otingEpoint opertions per yleF roweverD it might not eF wny proessors perform )otingEpoint multiply nd dd in single instrutionF sf the ompiler is good enough to reognize tht the multiplyEdd is ppropriteD this loop my lso e limited y memory referenesY eh itertion would e ompiled into two multiplitions nd two multiplyEddsF eginD opertion ounting is simple wy to estimte how well the requirements of loop will mp onto the pilities of the mhineF por mny loopsD you often (nd the performne of the loops dominted y memory referenesD s we hve seen in the lst three exmplesF his suggests tht memory referene tuning is very importntF

2.4.3 Basic Loop Unrolling55


he most si form of loop optimiztion is loop unrollingF st is so si tht most of tody9s ompilers do it utomtilly if it looks like there9s ene(tF here hs een gret del of lutter introdued into old dustyEdek pyex progrms in the nme of loop unrolling tht now serves only to onfuse nd misled tody9s ompilersF e9re not suggesting tht you unroll ny loops y hndF he purpose of this setion is twofoldF pirstD one you re fmilir with loop unrollingD you might reognize ode tht ws unrolled y progrmmer @not youA some time go nd simplify the odeF eondD you need to understnd the onepts of loop unrolling so tht when you look t generted mhine odeD you reognize unrolled loopsF he primry ene(t in loop unrolling is to perform more omputtions per itertionF et the end of eh itertionD the index vlue must e inrementedD testedD nd the ontrol is rnhed k to the top of the loop if the loop hs more itertions to proessF fy unrolling the loopD there re less loopEends per loop exeutionF nrolling lso redues the overll numer of rnhes signi(ntly nd gives the proessor more instrutions etween rnhes @iFeFD it inreses the size of the si loksAF por illustrtionD onsider the following loopF st hs single sttement wrpped in doEloopX

hy saIDx e@sA a e@sA C f@sA B g ixhhy


ou n unroll the loopD s we hve elowD giving you the sme opertions in fewer itertions with less loop overhedF ou n imgine how this would help on ny omputerF feuse the omputtions in one itertion do not depend on the omputtions in other itertionsD lultions from di'erent itertions n e exeuted togetherF yn superslr proessorD portions of these four sttements my tully exeute in prllelX
55 This
content is available online at <http://cnx.org/content/m33732/1.2/>.

IHR

CHAPTER 2. PROGRAMMING AND TUNING SOFTWARE


hy saIDxDR e@sA a e@sA C f@sA B g e@sCIA a e@sCIA C f@sCIA B g e@sCPA a e@sCPA C f@sCPA B g e@sCQA a e@sCQA C f@sCQA B g ixhhy

roweverD this loop is not the sme s the previous loopF he loop is unrolled four timesD ut wht if x is not divisile y Rc sf notD there will e oneD twoD or three spre itertions tht don9t get exeutedF o hndle these extr itertionsD we dd nother little loop to sok them upF he extr loop is lled X

exactly

preconditioning loop

ss a swyh @xDRA hy saIDss e@sA a e@sA C f@sA B g ixhhy hy saICssDxDR e@sA a e@sA C e@sCIA a e@sCIA e@sCPA a e@sCPA e@sCQA a e@sCQA ixhhy f@sA B g C f@sCIA B g C f@sCPA B g C f@sCQA B g

he numer of itertions needed in the preonditioning loop is the totl itertion ount modulo for this unrolling mountF sfD t runtimeD x turns out to e divisile y RD there re no spre itertionsD nd the preonditioning loop isn9t exeutedF peultive exeution in the postEsg rhiteture n redue or eliminte the need for unrolling loop tht will operte on vlues tht must e retrieved from min memoryF feuse the lod opertions tke suh long time reltive to the omputtionsD the loop is nturlly unrolledF hile the proessor is witing for the (rst lod to (nishD it my speultively exeute three to four itertions of the loop hed of the (rst lodD e'etively unrolling the loop in the snstrution eorder fu'erF

2.4.4 Qualifying Candidates for Loop Unrolling Up one level 56


essuming lrge vlue for xD the previous loop ws n idel ndidte for loop unrollingF he itertions ould e exeuted in ny orderD nd the loop innrds were smllF fut s you might suspetD this isn9t lwys the seY some kinds of loops n9t e unrolled so esilyF edditionllyD the wy loop is used when the progrm runs n disqulify it for loop unrollingD even if it looks promisingF sn this setion we re going to disuss few tegories of loops tht re generlly not prime ndidtes for unrollingD nd give you some ides of wht you n do out themF e tlked out severl of these in the previous hpter s wellD ut they re lso relevnt hereF
(This content is available online at <http://cnx.org/content/m33733/1.2/>.)


2.4.4.1 Loops with Low Trip Counts


To be effective, loop unrolling requires a fairly large number of iterations in the original loop. To understand why, picture what happens if the total iteration count is low, perhaps less than 10, or even less than 4. With a trip count this low, the preconditioning loop is doing a proportionately large amount of the work. It's not supposed to be that way. The preconditioning loop is supposed to catch the few leftover iterations missed by the unrolled, main loop. However, when the trip count is low, you make one or two passes through the unrolled loop, plus one or two passes through the preconditioning loop. In other words, you have more clutter; the loop shouldn't have been unrolled in the first place.
Probably the only time it makes sense to unroll a loop with a low trip count is when the number of iterations is constant and known at compile time. For instance, suppose you had the following loop:

PARAMETER (NITER = 3)
DO I=1,NITER
  A(I) = B(I) * C
ENDDO


Because NITER is hardwired to 3, you can safely unroll to a depth of 3 without worrying about a preconditioning loop. In fact, you can throw out the loop structure altogether and leave just the unrolled loop innards:

PARAMETER (NITER = 3)
A(1) = B(1) * C
A(2) = B(2) * C
A(3) = B(3) * C


Of course, if a loop's trip count is low, it probably won't contribute significantly to the overall runtime, unless you find such a loop at the center of a larger loop. Then you either want to unroll it completely or leave it alone.

2.4.4.2 Fat Loops


Loop unrolling helps performance because it fattens up a loop with more calculations per iteration. By the same token, if a particular loop is already fat, unrolling isn't going to help. The loop overhead is already spread over a fair number of instructions. In fact, unrolling a fat loop may even slow your program down because it increases the size of the text segment, placing an added burden on the memory system (we'll explain this in greater detail shortly). A good rule of thumb is to look elsewhere for performance when the loop innards exceed three or four statements.

2.4.4.3 Loops Containing Procedure Calls


As with fat loops, loops containing subroutine or function calls generally aren't good candidates for unrolling. There are several reasons. First, they often contain a fair number of instructions already. And if the subroutine being called is fat, it makes the loop that calls it fat as well. The size of the loop may not be apparent when you look at the loop; the function call can conceal many more instructions.


Second, when the calling routine and the subroutine are compiled separately, it's impossible for the compiler to intermix instructions. A loop that is unrolled into a series of function calls behaves much like the original loop, before unrolling.
Last, function call overhead is expensive. Registers have to be saved; argument lists have to be prepared. The time spent calling and returning from a subroutine can be much greater than that of the loop overhead. Unrolling to amortize the cost of the loop structure over several calls doesn't buy you enough to be worth the effort. The general rule when dealing with procedures is to first try to eliminate them in the "remove clutter" phase, and when this has been done, check to see if unrolling gives an additional performance improvement.

2.4.4.4 Loops with Branches in Them


In Section 2.3.1 we showed you how to eliminate certain types of branches, but of course, we couldn't get rid of them all. In cases of iteration-independent branches, there might be some benefit to loop unrolling. The IF test becomes part of the operations that must be counted to determine the value of loop unrolling. Below is a doubly nested loop. The inner loop tests the value of B(J,I):

DO I=1,N
  DO J=1,N
    IF (B(J,I) .GT. 1.0) A(J,I) = A(J,I) + B(J,I) * C
  ENDDO
ENDDO
Each iteration is independent of every other, so unrolling it won't be a problem. We'll just leave the outer loop undisturbed:

II = IMOD (N,4)
DO I=1,N
  DO J=1,II
    IF (B(J,I) .GT. 1.0)
 +    A(J,I) = A(J,I) + B(J,I) * C
  ENDDO
  DO J=II+1,N,4
    IF (B(J,I) .GT. 1.0)
 +    A(J,I) = A(J,I) + B(J,I) * C
    IF (B(J+1,I) .GT. 1.0)
 +    A(J+1,I) = A(J+1,I) + B(J+1,I) * C
    IF (B(J+2,I) .GT. 1.0)
 +    A(J+2,I) = A(J+2,I) + B(J+2,I) * C
    IF (B(J+3,I) .GT. 1.0)
 +    A(J+3,I) = A(J+3,I) + B(J+3,I) * C
  ENDDO
ENDDO
This approach works particularly well if the processor you are using supports conditional execution. As described earlier, conditional execution can replace a branch and an operation with a single conditionally executed assignment. On a superscalar processor with conditional execution, this unrolled loop executes quite nicely.
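As a rough illustration (not code from the original text), the C sketch below shows the kind of branch-free form a compiler with conditional moves or predicated instructions can produce from the IF test; the names and the 1.0 threshold simply mirror the FORTRAN example above.

/* Each element is updated only when b[i] exceeds the threshold; the
   ternary can compile to a conditional move rather than a branch. */
void masked_update(int n, double *a, const double *b, double c)
{
    int i;
    for (i = 0; i < n; i++)
        a[i] = (b[i] > 1.0) ? a[i] + b[i] * c : a[i];
}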

2.4.5 Nested Loops

When you embed loops within other loops, you create a loop nest. The loop or loops in the center are called the inner loops. The surrounding loops are called outer loops. Depending on the construction of the loop nest, we may have some flexibility in the ordering of the loops. At times, we can swap the outer and inner loops with great benefit. In the next sections we look at some common loop nestings and the optimizations that can be performed on these loop nests.
Often when we are working with nests of loops, we are working with multidimensional arrays. Computing in multidimensional arrays can lead to non-unit-stride memory access. Many of the optimizations we perform on loop nests are meant to improve the memory access patterns. First, we examine the computation-related optimizations followed by the memory optimizations.

2.4.5.1 Outer Loop Unrolling


If you are faced with a loop nest, one simple approach is to unroll the inner loop. Unrolling the innermost loop in a nest isn't any different from what we saw above. You just pretend the rest of the loop nest doesn't exist and approach it in the normal way. However, there are times when you want to apply loop unrolling not just to the inner loop, but to outer loops as well, or perhaps only to the outer loops. Here's a typical loop nest:

for (i=0; i<n; i++)
    for (j=0; j<n; j++)
        for (k=0; k<n; k++)
            a[i][j][k] = a[i][j][k] + b[i][j][k] * c;
To unroll an outer loop, you pick one of the outer loop index variables and replicate the innermost loop body so that several iterations are performed at the same time, just like we saw in Section 2.4.4. The difference is in the index variable for which you unroll. In the code below, we have unrolled the middle (j) loop twice:

for (i=0; i<n; i++)
    for (j=0; j<n; j+=2)
        for (k=0; k<n; k++) {
            a[i][j][k]   = a[i][j][k]   + b[i][j][k]   * c;
            a[i][j+1][k] = a[i][j+1][k] + b[i][j+1][k] * c;
        }
(This content is available online at <http://cnx.org/content/m33734/1.2/>.)

We left the k loop untouched; however, we could unroll that one, too. That would give us outer and inner loop unrolling at the same time:



for (i=0; i<n; i++)
    for (j=0; j<n; j+=2)
        for (k=0; k<n; k+=2) {
            a[i][j][k]     = a[i][j][k]     + b[i][j][k]     * c;
            a[i][j+1][k]   = a[i][j+1][k]   + b[i][j+1][k]   * c;
            a[i][j][k+1]   = a[i][j][k+1]   + b[i][j][k+1]   * c;
            a[i][j+1][k+1] = a[i][j+1][k+1] + b[i][j+1][k+1] * c;
        }

We could even unroll the i loop too, leaving eight copies of the loop innards. (Notice that we completely ignored preconditioning; in a real application, of course, we couldn't.)

2.4.5.2 Outer Loop Unrolling to Expose Computations


Say that you have a doubly nested loop and that the inner loop trip count is low, perhaps 4 or 5 on average. Inner loop unrolling doesn't make sense in this case because there won't be enough iterations to justify the cost of the preconditioning loop. However, you may be able to unroll an outer loop. Consider this loop, assuming that M is small and N is large:

DO I=1,N
  DO J=1,M
    A(J,I) = B(J,I) + C(J,I) * D
  ENDDO
ENDDO


Unrolling the I loop gives you lots of floating-point operations that can be overlapped:

II = IMOD (N,4)
DO I=1,II
  DO J=1,M
    A(J,I) = B(J,I) + C(J,I) * D
  ENDDO
ENDDO

DO I=II+1,N,4
  DO J=1,M
    A(J,I)   = B(J,I)   + C(J,I)   * D
    A(J,I+1) = B(J,I+1) + C(J,I+1) * D
    A(J,I+2) = B(J,I+2) + C(J,I+2) * D
    A(J,I+3) = B(J,I+3) + C(J,I+3) * D
  ENDDO
ENDDO

In this particular case, there is bad news to go with the good news: unrolling the outer loop causes strided memory references on A, B, and C. However, it probably won't be too much of a problem because the inner loop trip count is small, so it naturally groups references to conserve cache entries.
Outer loop unrolling can also be helpful when you have a nest with recursion in the inner loop, but not in the outer loops. In this next example, there is a first-order linear recursion in the inner loop:

DO J=1,M
  DO I=2,N
    A(I,J) = A(I,J) + A(I-1,J) * B
  ENDDO
ENDDO


Because of the recursion, we can't unroll the inner loop, but we can work on several copies of the outer loop at the same time. When unrolled, it looks like this:

JJ = IMOD (M,4)
DO J=1,JJ
  DO I=2,N
    A(I,J) = A(I,J) + A(I-1,J) * B
  ENDDO
ENDDO

DO J=1+JJ,M,4
  DO I=2,N
    A(I,J)   = A(I,J)   + A(I-1,J)   * B
    A(I,J+1) = A(I,J+1) + A(I-1,J+1) * B
    A(I,J+2) = A(I,J+2) + A(I-1,J+2) * B
    A(I,J+3) = A(I,J+3) + A(I-1,J+3) * B
  ENDDO
ENDDO

You can see the recursion still exists in the I loop, but we have succeeded in finding lots of work to do anyway. Sometimes the reason for unrolling the outer loop is to get a hold of much larger chunks of things that can be done in parallel. If the outer loop iterations are independent, and the inner loop trip count is high, then each outer loop iteration represents a significant, parallel chunk of work. On a single CPU that doesn't matter much, but on a tightly coupled multiprocessor, it can translate into a tremendous increase in speed.

2.4.6 Loop Interchange


Loop interchange is a technique for rearranging a loop nest so that the right stuff is at the center. What the right stuff is depends upon what you are trying to accomplish. In many situations, loop interchange also lets you swap high trip count loops for low trip count loops, so that activity gets pulled into the center of the loop nest. It's also good for improving memory access patterns.
(This content is available online at <http://cnx.org/content/m33736/1.2/>.)


2.4.6.1 Loop Interchange to Move Computations to the Center


When someone writes a program that represents some kind of real-world model, they often structure the code in terms of the model. This makes perfect sense. The computer is an analysis tool; you aren't writing the code on the computer's behalf. However, a model expressed naturally often works on one point in space at a time, which tends to give you insignificant inner loops, at least in terms of the trip count. For performance, you might want to interchange inner and outer loops to pull the activity into the center, where you can then do some unrolling. Let's illustrate with an example. Here's a loop where KDIM time-dependent quantities for points in a two-dimensional mesh are being updated:

PARAMETER (IDIM = 1000, JDIM = 1000, KDIM = 3)
...
DO I=1,IDIM
  DO J=1,JDIM
    DO K=1,KDIM
      D(K,J,I) = D(K,J,I) + V(K,J,I) * DT
    ENDDO
  ENDDO
ENDDO
In practice, KDIM is probably equal to 2 or 3, where J or I, representing the number of points, may be in the thousands. The way it is written, the inner loop has a very low trip count, making it a poor candidate for unrolling.
By interchanging the loops, you update one quantity at a time, across all of the points. For tuning purposes, this moves larger trip counts into the inner loop and allows you to do some strategic unrolling:

DO K=1,KDIM
  DO J=1,JDIM
    DO I=1,IDIM
      D(K,J,I) = D(K,J,I) + V(K,J,I) * DT
    ENDDO
  ENDDO
ENDDO


This example is straightforward; it's easy to see that there are no inter-iteration dependencies. But how can you tell, in general, when two loops can be interchanged? Interchanging loops might violate some dependency, or worse, only violate it occasionally, meaning you might not catch it when optimizing. Can we interchange the loops below?

DO I=1,N-1
  DO J=2,N
    A(I,J) = A(I+1,J-1) * B(I,J)
    C(I,J) = B(J,I)
  ENDDO
ENDDO
While it is possible to examine the loops by hand and determine the dependencies, it is much better if the compiler can make the determination. Very few single-processor compilers automatically perform loop interchange. However, the compilers for high-end vector and parallel computers generally interchange loops if there is some benefit and if interchanging the loops won't alter the program results.

2.4.7 Memory Access Patterns


The best pattern is the most straightforward: increasing and unit sequential. For an array with a single dimension, stepping through one element at a time will accomplish this. For multiply-dimensioned arrays, access is fastest if you iterate on the array subscript offering the smallest stride or step size. In FORTRAN programs, this is the leftmost subscript; in C, it is the rightmost. The FORTRAN loop below has unit stride, and therefore will run quickly:

DO J=1,N
  DO I=1,N
    A(I,J) = B(I,J) + C(I,J) * D
  ENDDO
ENDDO


In contrast, the next loop is slower because its stride is N (which, we assume, is greater than 1). As N increases from one to the length of the cache line (adjusting for the length of each element), the performance worsens. Once N is longer than the length of the cache line (again adjusted for element size), the performance won't decrease:

DO J=1,N
  DO I=1,N
    A(J,I) = B(J,I) + C(J,I) * D
  ENDDO
ENDDO


Here's a unit-stride loop like the previous one, but written in C:

for (i=0; i<n; i++)
    for (j=0; j<n; j++)
        a[i][j] = b[i][j] + c[i][j] * d;


(When the compiler performs automatic parallel optimization, it prefers to run the outermost loop in parallel to minimize overhead and unroll the innermost loop to make best use of a superscalar or vector processor. For this reason, the compiler needs to have some flexibility in ordering the loops in a loop nest.)
(This content is available online at <http://cnx.org/content/m33738/1.2/>.)

Unit stride gives you the best performance because it conserves cache entries. Recall how a data cache works. Your program makes a memory reference; if the data is in the cache, it gets returned immediately. If not, your program suffers a cache miss while a new cache line is fetched from main memory, replacing an old one. The line holds the values taken from a handful of neighboring memory locations, including the one that caused the cache miss. If you loaded a cache line, took one piece of data from it, and threw the rest away, you would be wasting a lot of time and memory bandwidth. However, if you brought a line into the cache and consumed everything in it, you would benefit from a large number of memory references for a small number of cache misses. This is exactly what you get when your program makes unit-stride memory references.
The worst-case patterns are those that jump through memory, especially a large amount of memory, and particularly those that do so without apparent rhyme or reason (viewed from the outside). On jobs that operate on very large data structures, you pay a penalty not only for cache misses, but for TLB misses too. It would be nice to be able to rein these jobs in so that they make better use of memory. Of course, you can't eliminate memory references; programs have to get to their data one way or another. The question is, then: how can we restructure memory access patterns for the best performance?
In the next few sections, we are going to look at some tricks for restructuring loops with strided, albeit predictable, access patterns. The tricks will be familiar; they are mostly loop optimizations from Section 2.3.1, used here for different reasons. The underlying goal is to minimize cache and TLB misses as much as possible. You will see that we can do quite a lot, although some of this is going to be ugly.
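To make the stride discussion concrete in C terms, here is a small sketch (the array names and the size 512 are arbitrary assumptions) contrasting a unit-stride traversal with a stride-N traversal of the same row-major arrays.

#define N 512                          /* illustrative size only */
double a[N][N], b[N][N], c[N][N], d;

/* Unit stride: the rightmost subscript varies fastest, so successive
   references fall in the same cache line. */
void unit_stride(void)
{
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] = b[i][j] + c[i][j] * d;
}

/* Stride of N doubles: each reference lands in a different cache line
   (and, for large N, possibly a different page). */
void strided(void)
{
    int i, j;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            a[i][j] = b[i][j] + c[i][j] * d;
}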

2.4.7.1 Loop Interchange to Ease Memory Access Patterns


Loop interchange is a good technique for lessening the impact of strided memory references. Let's revisit our FORTRAN loop with non-unit stride. The good news is that we can easily interchange the loops; each iteration is independent of every other:

DO J=1,N
  DO I=1,N
    A(J,I) = B(J,I) + C(J,I) * D
  ENDDO
ENDDO


After interchange, A, B, and C are referenced with the leftmost subscript varying most quickly. This modification can make an important difference in performance. We traded three N-strided memory references for unit strides:

DO I=1,N
  DO J=1,N
    A(J,I) = B(J,I) + C(J,I) * D
  ENDDO
ENDDO


(See Section 1.1.1.)
(The Translation Lookaside Buffer (TLB) is a cache of translations from virtual memory addresses to physical memory addresses. For more information, refer back to Section 1.1.1.)

2.4.7.2 Matrix Multiplication


Matrix multiplication is a common operation we can use to explore the options that are available in optimizing a loop nest. A programmer who has just finished reading a linear algebra textbook would probably write matrix multiply as it appears in the example below:

DO I=1,N
  DO J=1,N
    SUM = 0
    DO K=1,N
      SUM = SUM + A(I,K) * B(K,J)
    ENDDO
    C(I,J) = SUM
  ENDDO
ENDDO


The problem with this loop is that the A(I,K) will be non-unit stride. Each iteration in the inner loop consists of two loads (one non-unit stride), a multiplication, and an addition.
Given the nature of the matrix multiplication, it might appear that you can't eliminate the non-unit stride. However, with a simple rewrite of the loops all the memory accesses can be made unit stride:

DO J=1,N
  DO I=1,N
    C(I,J) = 0.0
  ENDDO
ENDDO

DO K=1,N
  DO J=1,N
    SCALE = B(K,J)
    DO I=1,N
      C(I,J) = C(I,J) + A(I,K) * SCALE
    ENDDO
  ENDDO
ENDDO
Now, the inner loop accesses memory using unit stride. Each iteration performs two loads, one store, a multiplication, and an addition. When comparing this to the previous loop, the non-unit stride loads have been eliminated, but there is an additional store operation. Assuming that we are operating on a cache-based system, and the matrix is larger than the cache, this extra store won't add much to the execution time. The store is to the location in C(I,J) that was used in the load. In most cases, the store is to a line that is already in the cache. The B(K,J) becomes a constant scaling factor within the inner loop.
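The same restructuring can be written in C; because C stores arrays in row-major order, the unit-stride form puts the loops in i-k-j order. This is only a sketch under those assumptions; the array names and the size are not from the text.

#define MATDIM 512                     /* illustrative size only */
double a[MATDIM][MATDIM], b[MATDIM][MATDIM], c[MATDIM][MATDIM];

void matmul_unit_stride(void)
{
    int i, j, k;
    for (i = 0; i < MATDIM; i++)       /* clear the result first */
        for (j = 0; j < MATDIM; j++)
            c[i][j] = 0.0;

    for (i = 0; i < MATDIM; i++)
        for (k = 0; k < MATDIM; k++) {
            double scale = a[i][k];    /* constant within the inner loop */
            for (j = 0; j < MATDIM; j++)
                c[i][j] = c[i][j] + b[k][j] * scale;   /* both unit stride */
        }
}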


2.4.8 When Interchange Won't Work


In the matrix multiplication code, we encountered a non-unit stride and were able to eliminate it with a quick interchange of the loops. Unfortunately, life is rarely this simple. Often you find some mix of variables with unit and non-unit strides, in which case interchanging the loops moves the damage around, but doesn't make it go away. The loop to perform a matrix transpose represents a simple example of this dilemma:

DO I=1,N
  DO J=1,M
    A(J,I) = B(I,J)
  ENDDO
ENDDO

DO 20 J=1,M
  DO 10 I=1,N
    A(J,I) = B(I,J)
  ENDDO
ENDDO

Whichever way you interchange them, you will break the memory access pattern for either A or B. Even more interesting, you have to make a choice between strided loads vs. strided stores: which will it be? We really need a general method for improving the memory access patterns for both A and B, not one or the other. We'll show you such a method in Section 2.4.9.

2.4.9 Blocking to Ease Memory Access Patterns

Blocking is another kind of memory reference optimization. As with loop interchange, the challenge is to retrieve as much data as possible with as few cache misses as possible. We'd like to rearrange the loop nest so that it works on data in little neighborhoods, rather than striding through memory like a man on stilts. Given the following vector sum, how can we rearrange the loop?

DO I=1,N
  DO J=1,N
    A(J,I) = A(J,I) + B(I,J)
  ENDDO
ENDDO


(This content is available online at <http://cnx.org/content/m33741/1.2/>.)
(I can't tell you which is the better way to cast it; it depends on the brand of computer. Some perform better with the loops left as they are, sometimes by more than a factor of two. Others perform better with them interchanged. The difference is in the way the processor handles updates of main memory from cache.)
(This content is available online at <http://cnx.org/content/m33756/1.2/>.)

This loop involves two vectors. One is referenced with unit stride, the other with a stride of N. We can interchange the loops, but one way or another we still have N-strided array references on either A or B, either of which is undesirable. The trick is to block references so that you grab a few elements of A, and then a few of B, and then a few of A, and so on, in neighborhoods. We make this happen by combining inner and outer loop unrolling:

DO I=1,N,2
  DO J=1,N,2
    A(J,I)     = A(J,I)     + B(I,J)
    A(J+1,I)   = A(J+1,I)   + B(I,J+1)
    A(J,I+1)   = A(J,I+1)   + B(I+1,J)
    A(J+1,I+1) = A(J+1,I+1) + B(I+1,J+1)
  ENDDO
ENDDO

Use your imagination so we can show why this helps. Usually, when we think of a two-dimensional array, we think of a rectangle or a square (see Figure 2.9 (Arrays A and B)). Remember, to make programming easier, the compiler provides the illusion that two-dimensional arrays A and B are rectangular plots of memory as in Figure 2.9 (Arrays A and B). Actually, memory is sequential storage. In FORTRAN, a two-dimensional array is constructed in memory by logically lining memory "strips" up against each other, like the pickets of a cedar fence. (It's the other way around in C: rows are stacked on top of one another.) Array storage starts at the upper left, proceeds down to the bottom, and then starts over at the top of the next column. Stepping through the array with unit stride traces out the shape of a backwards "N", repeated over and over, moving to the right.

Figure 2.9: Arrays A and B


Figure 2.10: How array elements are stored

Imagine that the thin horizontal lines of Figure 2.10 (How array elements are stored) cut memory storage into pieces the size of individual cache entries. Picture how the loop will traverse them. Because of their index expressions, references to A go from top to bottom (in the backwards "N" shape), consuming every bit of each cache line, but references to B dash off to the right, using one piece of each cache entry and discarding the rest (see Figure 2.11 (2×2 squares), top). This low usage of cache entries will result in a high number of cache misses.
If we could somehow rearrange the loop so that it consumed the arrays in small rectangles, rather than strips, we could conserve some of the cache entries that are being discarded. This is exactly what we accomplished by unrolling both the inner and outer loops, as in the following example. Array A is referenced in several strips side by side, from top to bottom, while B is referenced in several strips side by side, from left to right (see Figure 2.11 (2×2 squares), bottom). This improves cache performance and lowers runtime.
For really big problems, more than cache entries are at stake. On virtual memory machines, memory references have to be translated through a TLB. If you are dealing with large arrays, TLB misses, in addition to cache misses, are going to add to your runtime.

Figure 2.11: 2×2 squares

Here's something that may surprise you. In the code below, we rewrite this loop yet again, this time blocking references at two different levels: in 2×2 squares to save cache entries, and by cutting the original loop in two parts to save TLB entries:

DO I=1,N,2
  DO J=1,N/2,2
    A(J,I)     = A(J,I)     + B(I,J)
    A(J+1,I)   = A(J+1,I)   + B(I,J+1)
    A(J,I+1)   = A(J,I+1)   + B(I+1,J)
    A(J+1,I+1) = A(J+1,I+1) + B(I+1,J+1)
  ENDDO
ENDDO

DO I=1,N,2
  DO J=N/2+1,N,2
    A(J,I)     = A(J,I)     + B(I,J)
    A(J+1,I)   = A(J+1,I)   + B(I,J+1)
    A(J,I+1)   = A(J,I+1)   + B(I+1,J)
    A(J+1,I+1) = A(J+1,I+1) + B(I+1,J+1)
  ENDDO
ENDDO

You might guess that adding more loops would be the wrong thing to do. But if you work with a reasonably large value of N, say 512, you will see a significant increase in performance. This is because the two arrays A and B are each 256 K elements × 8 bytes = 2 MB when N is equal to 512, larger than can be handled by the TLBs and caches of most processors.
The two boxes in Figure 2.12 (Picture of unblocked versus blocked references) illustrate how the first few references to A and B look superimposed upon one another in the blocked and unblocked cases. Unblocked references to B zing off through memory, eating through cache and TLB entries. Blocked references are more sparing with the memory system.

Figure 2.12: Picture of unblocked versus blocked references

You can take blocking even further for larger problems. This code shows another method that limits the size of the inner loop and visits it repeatedly:


II = MOD (N,16)
JJ = MOD (N,4)

DO I=1,N
  DO J=1,JJ
    A(J,I) = A(J,I) + B(J,I)
  ENDDO
ENDDO

DO I=1,II
  DO J=JJ+1,N
    A(J,I) = A(J,I) + B(J,I)
    A(J,I) = A(J,I) + 1.0D0
  ENDDO
ENDDO

DO I=II+1,N,16
  DO J=JJ+1,N,4
    DO K=I,I+15
      A(J,K)   = A(J,K)   + B(K,J)
      A(J+1,K) = A(J+1,K) + B(K,J+1)
      A(J+2,K) = A(J+2,K) + B(K,J+2)
      A(J+3,K) = A(J+3,K) + B(K,J+3)
    ENDDO
  ENDDO
ENDDO

Where the inner I loop used to execute N iterations at a time, the new K loop executes only 16 iterations. This divides and conquers a large memory address space by cutting it into little pieces.
While these blocking techniques begin to have diminishing returns on single-processor systems, on large multiprocessor systems with nonuniform memory access (NUMA), there can be significant benefit in carefully arranging memory accesses to maximize reuse of both cache lines and main memory pages.
Again, the combined unrolling and blocking techniques we just showed you are for loops with mixed stride expressions. They work very well for loop nests like the one we have been looking at. However, if all array references are strided the same way, you will want to try loop unrolling or loop interchange first.
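A C rendering of the same idea may help. This is a minimal sketch in which the array names, the overall size, and the tile size of 16 are assumptions, and the size is taken to be a multiple of the tile so no cleanup loops are needed.

#define SIZE 512                       /* illustrative; assumed a multiple of TILE */
#define TILE 16
double a[SIZE][SIZE], b[SIZE][SIZE];

/* Blocked (tiled) version of the transpose-and-add loop: each TILE x TILE
   neighborhood of a and b is consumed before moving on, so the cache lines
   fetched for the strided operand get reused. */
void blocked_update(void)
{
    int ii, jj, i, j;
    for (ii = 0; ii < SIZE; ii += TILE)
        for (jj = 0; jj < SIZE; jj += TILE)
            for (i = ii; i < ii + TILE; i++)
                for (j = jj; j < jj + TILE; j++)
                    a[j][i] = a[j][i] + b[i][j];
}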

2.4.10 Programs That Require More Memory Than You Have


People occasionally have programs whose memory size requirements are so great that the data can't fit in memory all at once. At any time, some of the data has to reside outside of main memory on secondary (usually disk) storage. These out-of-core solutions fall into two categories:

- Software-managed, out-of-core solutions
- Virtual memory-managed, out-of-core solutions


With a software-managed approach, the programmer has recognized that the problem is too big and has modified the source code to move sections of the data out to disk for retrieval at a later time. The other method depends on the computer's memory system handling the secondary storage requirements on its own, sometimes at a great cost in runtime.
(This content is available online at <http://cnx.org/content/m33770/1.2/>.)

2.4.10.1 Software-Managed, Out-of-Core Solutions


Most codes with software-managed, out-of-core solutions have adjustments; you can tell the program how much memory it has to work with, and it takes care of the rest. It is important to make sure the adjustment is set correctly. Code that was tuned for a machine with limited memory could have been ported to another without taking into account the storage available. Perhaps the whole problem will fit easily.
If we are writing an out-of-core solution, the trick is to group memory references together so that they are localized. This usually occurs naturally as a side effect of partitioning, say, a matrix factorization into groups of columns. Blocking references the way we did in the previous section also corrals memory references together so you can treat them as memory pages. Knowing when to ship them off to disk entails being closely involved with what the program is doing.

2.4.11 Closing Notes


Loops are the heart of nearly all high performance programs. The first goal with loops is to express them as simply and clearly as possible (i.e., eliminate the clutter). Then, use the profiling and timing tools to figure out which routines and loops are taking the time. Once you find the loops that are using the most time, try to determine if the performance of the loops can be improved.
First try simple modifications to the loops that don't reduce the clarity of the code. You can also experiment with compiler options that control loop optimizations. Once you've exhausted the options of keeping the code looking clean, and if you still need more performance, resort to hand-modifying the code. Typically the loops that need a little hand-coaxing are loops that are making bad use of the memory architecture on a cache-based system. Hopefully the loops you end up changing are only a few of the overall loops in the program.
However, before going too far optimizing on a single processor machine, take a look at how the program executes on a parallel system. Sometimes the modifications that improve performance on a single-processor system confuse the parallel-processor compiler. The compilers on parallel and vector systems generally have more powerful optimization capabilities, as they must identify areas of your code that will execute well on their specialized hardware. These compilers have been interchanging and unrolling loops automatically for some time now.

2.4.12 Exercises
Exercise 2.10
Why is an unrolling amount of three or four iterations generally sufficient for simple vector loops on a RISC processor? What relationship does the unrolling amount have to floating-point pipeline depths?

Exercise 2.11

On a processor that can execute one floating-point multiply, one floating-point addition/subtraction, and one memory reference per cycle, what's the best performance you could expect from the following loop?

DO I = 1,10000
  A(I) = B(I) * C(I) - D(I) * E(I)
ENDDO


(This content is available online at <http://cnx.org/content/m33773/1.2/>.)
(This content is available online at <http://cnx.org/content/m33777/1.2/>.)

Exercise 2.12

Try unrolling, interchanging, or blocking the loop in subroutine BAZFAZ to increase the performance. What method or combination of methods works best? Look at the assembly language created by the compiler to see what its approach is at the highest level of optimization.

note: Compile the main routine and BAZFAZ separately; adjust NTIMES so that the untuned run takes about one minute; and use the compiler's default optimization level.

PROGRAM MAIN
IMPLICIT NONE
INTEGER M,N,I,J
PARAMETER (N = 512, M = 640, NTIMES = 500)
DOUBLE PRECISION Q(N,M), R(M,N)
C
DO I=1,M
  DO J=1,N
    Q(J,I) = 1.0D0
    R(I,J) = 1.0D0
  ENDDO
ENDDO
C
DO I=1,NTIMES
  CALL BAZFAZ (Q,R,N,M)
ENDDO
END

SUBROUTINE BAZFAZ (Q,R,N,M)
IMPLICIT NONE
INTEGER M,N,I,J
DOUBLE PRECISION Q(N,M), R(N,M)
C
DO I=1,N
  DO J=1,M
    Q(I,J) = Q(I,J) * R(J,I)
  ENDDO
ENDDO
END

Exercise 2.13

Code the matrix multiplication algorithm in the straightforward manner and compile it with various optimization levels. See if the compiler performs any type of loop interchange. Try the same experiment with the following code:

DO I=1,N
  DO J=1,N
    A(I,J) = A(I,J) + 1.3
  ENDDO
ENDDO
Do you see a difference in the compiler's ability to optimize these two loops? If you see a difference, explain it.

Exercise 2.14

Code the matrix multiplication algorithm both the ways shown in this chapter. Execute the program for a range of values for N. Graph the execution time divided by N^3 for values of N ranging from 50×50 to 500×500. Explain the performance you see.

Chapter 3
Shared-Memory Parallel Processors

3.1 Understanding Parallelism


3.1.1 Introduction
In a sense, we have been talking about parallelism from the beginning of the book. Instead of calling it parallelism, we have been using words like pipelined, superscalar, and compiler flexibility. As we move into programming on multiprocessors, we must increase our understanding of parallelism in order to understand how to effectively program these systems. In short, as we gain more parallel resources, we need to find more parallelism in our code.
When we talk of parallelism, we need to understand the concept of granularity. The granularity of parallelism indicates the size of the computations that are being performed at the same time between synchronizations. Some examples of parallelism in order of increasing grain size are:

- When performing a 32-bit integer addition, using a carry lookahead adder, you can partially add bits 0 and 1 at the same time as bits 2 and 3.
- On a pipelined processor, while decoding one instruction, you can fetch the next instruction.
- On a two-way superscalar processor, you can execute any combination of an integer and a floating-point instruction in a single cycle.
- On a multiprocessor, you can divide the iterations of a loop among the four processors of the system.
- You can split a large array across four workstations attached to a network. Each workstation can operate on its local information and then exchange boundary values at the end of each time step.

In this chapter, we start at instruction-level parallelism (pipelined and superscalar) and move toward thread-level parallelism, which is what we need for multiprocessor systems. It is important to note that the different levels of parallelism are generally not in conflict. Increasing thread parallelism at a coarser grain size often exposes more fine-grained parallelism. The following is a loop that has plenty of parallelism:
DO I=1,16000
  A(I) = B(I) * 3.14159
ENDDO


(This content is available online at <http://cnx.org/content/m32775/1.2/>.)

We have expressed the loop in a way that would imply that A(1) must be computed first, followed by A(2), and so on. However, once the loop was completed, it would not have mattered if A(16000) were computed first, followed by A(15999), and so on. The loop could have computed the even values of I and then computed the odd values of I. It would not even make a difference if all 16,000 of the iterations were computed simultaneously using a 16,000-way superscalar processor. If the compiler has flexibility in the order in which it can execute the instructions that make up your program, it can execute those instructions simultaneously when parallel hardware is available.
One technique that computer scientists use to formally analyze the potential parallelism in an algorithm is to characterize how quickly it would execute with an infinite-way superscalar processor.
Not all loops contain as much parallelism as this simple loop. We need to identify the things that limit the parallelism in our codes and remove them whenever possible. In previous chapters we have already looked at removing clutter and rewriting loops to simplify the body of the loop.
This chapter also supplements Section 2.1.1 in many ways. We looked at the mechanics of compiling code, all of which apply here, but we didn't answer all of the "whys." Basic block analysis techniques form the basis for the work the compiler does when looking for more parallelism. Looking at two pieces of data, instructions, or data and instructions, a modern compiler asks the question, "Do these things depend on each other?" The three possible answers are yes, no, and we don't know. The third answer is effectively the same as a yes, because a compiler has to be conservative whenever it is unsure whether it is safe to tweak the ordering of instructions.
Helping the compiler recognize parallelism is one of the basic approaches specialists take in tuning code. A slight rewording of a loop or some supplementary information supplied to the compiler can change a "we don't know" answer into an opportunity for parallelism. To be certain, there are other facets to tuning as well, such as optimizing memory access patterns so that they best suit the hardware, or recasting an algorithm. And there is no single best approach to every problem; any tuning effort has to be a combination of techniques.

3.1.2 Dependencies
Imagine a symphony orchestra where each musician plays without regard to the conductor or the other musicians. At the first tap of the conductor's baton, each musician goes through all of his or her sheet music. Some finish far ahead of others, leave the stage, and go home. The cacophony wouldn't resemble music (come to think of it, it would resemble experimental jazz) because it would be totally uncoordinated. Of course this isn't how music is played. A computer program, like a musical piece, is woven on a fabric that unfolds in time (though perhaps woven more loosely). Certain things must happen before or along with others, and there is a rate to the whole process.
With computer programs, whenever event A must occur before event B can, we say that B is dependent on A. We call the relationship between them a dependency. Sometimes dependencies exist because of calculations or memory operations; we call these data dependencies. Other times, we are waiting for a branch or do-loop exit to take place; this is called a control dependency. Each is present in every program to varying degrees. The goal is to eliminate as many dependencies as possible. Rearranging a program so that two chunks of the computation are less dependent exposes parallelism, or opportunities to do several things at once.

3.1.2.1 Control Dependencies


Just as variable assignments can depend on other assignments, a variable's value can also depend on the flow of control within the program. For instance, an assignment within an if-statement can occur only if the conditional evaluates to true. The same can be said of an assignment within a loop. If the loop is never entered, no statements inside the loop are executed.
(Interestingly, this is not as far-fetched as it might seem. On a single instruction multiple data (SIMD) computer such as the Connection CM-2 with 16,384 processors, it would take three instruction cycles to process this entire loop.)
(This content is available online at <http://cnx.org/content/m32777/1.2/>.)

When calculations occur as a consequence of the flow of control, we say there is a control dependency, as in the code below and shown graphically in Figure 3.1 (Control dependency). The assignment located inside the block-if may or may not be executed, depending on the outcome of the test X .NE. 0. In other words, the value of Y depends on the flow of control in the code around it. Again, this may sound to you like a concern for compiler designers, not programmers, and that's mostly true. But there are times when you might want to move control-dependent instructions around to get expensive calculations out of the way (provided your compiler isn't smart enough to do it for you).
For example, say that Figure 3.2 (A little section of your program) represents a little section of your program. Flow of control enters at the top and goes through two branch decisions. Furthermore, say that there is a square root operation at the entry point, and that the flow of control almost always goes from the top, down to the leg containing the statement A=0.0. This means that the results of the calculation A=SQRT(B) are almost always discarded because A gets a new value of 0.0 each time through. A square root operation is always "expensive" because it takes a lot of time to execute. The trouble is that you can't just get rid of it; occasionally it's needed. However, you could move it out of the way and continue to observe the control dependencies by making two copies of the square root operation along the less traveled branches, as shown in Figure 3.3 (Expensive operation moved so that it's rarely executed). This way the SQRT would execute only along those paths where it was actually needed.

Figure 3.1: Control dependency

Figure 3.2: A little section of your program

This kind of instruction scheduling will be appearing in compilers (and even hardware) more and more as time goes on. A variation on this technique is to calculate results that might be needed at times when there is a gap in the instruction stream (because of dependencies), thus using some spare cycles that might otherwise be wasted.

Figure 3.3: Expensive operation moved so that it's rarely executed
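A small C sketch of the transformation just described may be useful; the variable names and tests are hypothetical stand-ins for the quantities in the figures, not code from the original.

#include <math.h>

/* The square root is copied onto the two rarely taken paths, so it runs
   only when its result is actually used; the common path assigns 0.0. */
double rare_sqrt(double b, double x, double y, double *c)
{
    double a;
    if (x > 0.0) {
        if (y > 0.0) {
            a = 0.0;            /* the heavily traveled leg: no SQRT needed */
        } else {
            a = sqrt(b);        /* first copy of the expensive operation */
            *c = a + 1.0;
        }
    } else {
        a = sqrt(b);            /* second copy, on the other rare branch */
        *c = a - 1.0;
    }
    return a;
}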

3.1.2.2 Data Dependencies


A calculation that is in some way bound to a previous calculation is said to be data dependent upon that calculation. In the code below, the value of B is data dependent on the value of A. That's because you can't calculate B until the value of A is available:

A = X + Y + COS(Z)
B = A * C
This dependency is easy to recognize, but others are not so simple. At other times, you must be careful not to rewrite a variable with a new value before every other computation has finished using the old value. We can group all data dependencies into three categories: (1) flow dependencies, (2) antidependencies, and (3) output dependencies. Figure 3.4 (Types of data dependencies) contains some simple examples to demonstrate each type of dependency. In each example, we use an arrow that starts at the source of the dependency and ends at the statement that must be delayed by the dependency. The key problem in each of these dependencies is that the second statement can't execute until the first has completed. Obviously in the particular output dependency example, the first computation is dead code and can be eliminated unless there is some intervening code that needs the values. There are other techniques to eliminate either output or antidependencies. The following example contains a flow dependency followed by an output dependency.

Figure 3.4: Types of data dependencies

X = A / B
Y = X + 2.0
X = D - E
While we can't eliminate the flow dependency, the output dependency can be eliminated by using a scratch variable:

temp = A/B
Y = temp + 2.0
X = D - E


As the number of statements and the interactions between those statements increase, we need a better way to identify and process these dependencies. Figure 3.5 (Multiple dependencies) shows four statements with four dependencies.

Figure 3.5: Multiple dependencies

None of the second through fourth instructions can be started before the first instruction completes.

3.1.2.3 Forming a DAG

One method for analyzing a sequence of instructions is to organize it into a directed acyclic graph (DAG). Like the instructions it represents, a DAG describes all of the calculations and relationships between variables. The data flow within a DAG proceeds in one direction; most often a DAG is constructed from top to bottom. Identifiers and constants are placed at the "leaf" nodes, the ones on the top. Operations, possibly with variable names attached, make up the internal nodes. Variables appear in their final states at the bottom. The DAG's edges order the relationships between the variables and operations within it. All data flow proceeds from top to bottom.
To construct a DAG, the compiler takes each intermediate language tuple and maps it onto one or more nodes. For instance, those tuples that represent binary operations, such as addition (X=A+B), form a portion of the DAG with two inputs (A and B) bound together by an operation (+). The result of the operation may feed into yet other operations within the basic block (and the DAG) as shown in Figure 3.6 (A trivial data flow graph).
(A graph is a collection of nodes connected by edges. By directed, we mean that the edges can only be traversed in specified directions. The word acyclic means that there are no cycles in the graph; that is, you can't loop anywhere within it.)

Figure 3.6: A trivial data flow graph

For a basic block of code, we build our DAG in the order of the instructions. The DAG for the previous four instructions is shown in Figure 3.7 (A more complex data flow graph). This particular example has many dependencies, so there is not much opportunity for parallelism. Figure 3.8 (Extracting parallelism from a DAG) shows a more straightforward example of how constructing a DAG can identify parallelism. From this DAG, we can determine that instructions 1 and 2 can be executed in parallel. Because we see the computations that operate on the values A and B while processing instruction 4, we can eliminate a common subexpression during the construction of the DAG. If we can determine that only one of the variables computed here is used outside this small block of code, we can assume the computations producing the others are dead code.

Figure 3.7: A more complex data flow graph

By constructing the DAG, we take a sequence of instructions and determine which must be executed in a particular order and which can be executed in parallel. This type of data flow analysis is very important in the code-generation phase on superscalar processors. We have introduced the concept of dependencies and how to use data flow to find opportunities for parallelism in code sequences within a basic block. We can also use data flow analysis to identify dependencies, opportunities for parallelism, and dead code between basic blocks.

3.1.2.4 Uses and Definitions

As the DAG is constructed, the compiler can make lists of variable uses and definitions, as well as other information, and apply these to global optimizations across many basic blocks taken together. Looking at the DAG in Figure 3.8 (Extracting parallelism from a DAG), we can see which variables the block defines (C and D among them) and that the variables used are A and B. Considering many basic blocks at once, we can say how far a particular variable definition reaches, that is, where its value can be seen. From this we can recognize situations where calculations are being discarded, where two uses of a given variable are completely independent, or where we can overwrite register-resident values without saving them back to memory. We call this investigation data flow analysis.

Figure 3.8: Extracting parallelism from a DAG

To illustrate, suppose that we have the flow graph in Figure 3.9 (Flow graph for data flow analysis). Beside each basic block we've listed the variables it uses and the variables it defines. What can data flow analysis tell us?
Notice that a value for A is defined in one block but only used in a single block farther down. That means that A is dead upon exit from the block that uses it, or immediately upon taking the right-hand branch leaving the defining block; none of the other basic blocks uses the value of A. That tells us that any associated resources, such as a register, can be freed for other uses.
Looking at Figure 3.9 (Flow graph for data flow analysis) we can see that D is defined in one basic block, but never used. This means that the calculations defining D can be discarded.
Something interesting is happening with the variable G. Two different blocks both use it, but if you look closely you'll see that the two uses are distinct from one another, meaning that they can be treated as two independent variables.
A compiler featuring advanced instruction scheduling techniques might notice that only one block uses the value for E, and so move the calculations defining E out of the block that computes it and into the block where it is needed.

Figure 3.9: Flow graph for data flow analysis

In addition to gathering data about variables, the compiler can also keep information about subexpressions. Examining both together, it can recognize cases where redundant calculations are being made (across basic blocks), and substitute previously computed values in their place. If, for instance, the expression H*I appears in several blocks, it could be calculated just once in one of them and propagated to the others that use it.

3.1.3 Loops
Loops are the center of activity for many applications, so there is often a high payback for simplifying or moving calculations outside, into the computational suburbs. Early compilers for parallel architectures used pattern matching to identify the bounds of their loops. This limitation meant that a hand-constructed loop using if-statements and goto-statements would not be correctly identified as a loop. Because modern compilers use data flow graphs, it's practical to identify loops as a particular subset of nodes in the flow graph. To a data flow graph, a hand-constructed loop looks the same as a compiler-generated loop. Optimizations can therefore be applied to either type of loop.
Once we have identified the loops, we can apply the same kinds of data-flow analysis we applied above. Among the things we are looking for are calculations that are unchanging within the loop and variables that change in a predictable (linear) fashion from iteration to iteration.
(This content is available online at <http://cnx.org/content/m32784/1.2/>.)


How does the compiler identify a loop in the flow graph? Fundamentally, two conditions have to be met:

- A given node has to dominate all other nodes within the suspected loop. This means that all paths to any node in the loop have to pass through one particular node, the dominator. The dominator node forms the header at the top of the loop.
- There has to be a cycle in the graph. Given a dominator, if we can find a path back to it from one of the nodes it dominates, we have a loop. This path back is known as the back edge of the loop.

The flow graph in Figure 3.10 (Flowgraph with a loop in it) contains one loop and one red herring. You can see that node B dominates every node below it in the subset of the flow graph. That satisfies Condition 1 and makes it a candidate for a loop header. There is a path from E to B, and B dominates E, so that makes it a back edge, satisfying Condition 2. Therefore, the nodes B, C, D, and E form a loop. The loop goes through an array of linked list start pointers and traverses the lists to determine the total number of nodes in all lists. Letters to the extreme right correspond to the basic block numbers in the flow graph.

Figure 3.10: Flowgraph with a loop in it

At first glance, it appears that the nodes C and D form a loop too. The problem is that C doesn't dominate D (and vice versa), because entry to either can be made from B, so Condition 1 isn't satisfied.

Generally, the flow graphs that come from code segments written with even the weakest appreciation for structured design offer better loop candidates. After identifying a loop, the compiler can concentrate on that portion of the flow graph, looking for instructions to remove or push to the outside. Certain types of subexpressions, such as those found in array index expressions, can be simplified if they change in a predictable fashion from one iteration to the next.
In the continuing quest for parallelism, loops are generally our best sources for large amounts of parallelism. However, loops also provide new opportunities for those parallelism-killing dependencies.

3.1.4 Loop-Carried Dependencies


The notion of data dependence is particularly important when we look at loops, the hub of activity inside numerical applications. A well-designed loop can produce millions of operations that can all be performed in parallel. However, a single misplaced dependency in the loop can force it all to be run in serial. So the stakes are higher when looking for dependencies in loops.
Some constructs are completely independent, right out of the box. The question we want to ask is "Can two different iterations execute at the same time, or is there a data dependency between them?" Consider the following loop:

DO I=1,N
  A(I) = A(I) + B(I)
ENDDO


For any two values of I and K, can we calculate the value of A(I) and A(K) at the same time? Below, we have manually unrolled several iterations of the previous loop, so they can be executed together:

A(I)   = A(I)   + B(I)
A(I+1) = A(I+1) + B(I+1)
A(I+2) = A(I+2) + B(I+2)


You can see that none of the results are used as an operand for another calculation. For instance, the calculation for A(I+1) can occur at the same time as the calculation for A(I) because the calculations are independent; you don't need the results of the first to determine the second. In fact, mixing up the order of the calculations won't change the results in the least. Relaxing the serial order imposed on these calculations makes it possible to execute this loop very quickly on parallel hardware.

3.1.4.1 Flow Dependencies


For comparison, look at the next code fragment:

DO I=2,N
  A(I) = A(I-1) + B(I)
ENDDO


(This content is available online at <http://cnx.org/content/m32782/1.2/>.)

This loop has the regularity of the previous example, but one of the subscripts is changed. Again, it's useful to manually unroll the loop and look at several iterations together:

A(I)   = A(I-1) + B(I)
A(I+1) = A(I)   + B(I+1)
A(I+2) = A(I+1) + B(I+2)


In this case, there is a dependency problem. The value of A(I+1) depends on the value of A(I), the value of A(I+2) depends on A(I+1), and so on; every iteration depends on the result of a previous one. Dependencies that extend back to a previous calculation and perhaps a previous iteration (like this one) are called flow dependencies or backward dependencies. You often see such dependencies in applications that perform Gaussian elimination on certain types of matrices, or numerical solutions to systems of differential equations. However, it is impossible to run such a loop in parallel (as written); the processor must wait for intermediate results before it can proceed.
In some cases, flow dependencies are impossible to fix; calculations are so dependent upon one another that we have no choice but to wait for previous ones to complete. Other times, dependencies are a function of the way the calculations are expressed. For instance, the loop above can be changed to reduce the dependency. By replicating some of the arithmetic, we can make it so that the second and third iterations depend on the first, but not on one another. The operation count goes up (we have an extra addition that we didn't have before), but we have reduced the dependency between iterations:

DO I=2,N,2
  A(I)   = A(I-1) + B(I)
  A(I+1) = A(I-1) + B(I) + B(I+1)
ENDDO


The speed increase on a workstation won't be great (most machines run the recast loop more slowly). However, some parallel computers can trade off additional calculations for reduced dependency and chalk up a net win.

3.1.4.2 Antidependencies
It's a different story when there is a loop-carried antidependency, as in the code below:

DO I=1,N
  A(I) = B(I)   * E
  B(I) = A(I+2) * C
ENDDO


In this loop, there is an antidependency between the variable A(I) and the variable A(I+2). That is, you must be sure that the instruction that uses A(I+2) does so before the previous one redefines it. Clearly, this is not a problem if the loop is executed serially, but remember, we are looking for opportunities to overlap instructions. Again, it helps to pull the loop apart and look at several iterations together. We have recast the loop by making many copies of the first statement, followed by copies of the second:

A(I)   = B(I)   * E
A(I+1) = B(I+1) * E
A(I+2) = B(I+2) * E
...
B(I)   = A(I+2) * C   <- assignment makes use of the new
B(I+1) = A(I+3) * C      value of A(I+2): incorrect
B(I+2) = A(I+4) * C

The reference to A(I+2) needs to access an "old" value, rather than one of the new ones being calculated. If you perform all of the first statement followed by all of the second statement, the answers will be wrong. If you perform all of the second statement followed by all of the first statement, the answers will also be wrong. In a sense, to run the iterations in parallel, you must either save the A values to use for the second statement or store all of the B values in a temporary area until the loop completes.
We can also directly unroll the loop and find some parallelism:

1  A(I)   = B(I)   * E
2  B(I)   = A(I+2) * C   |
3  A(I+1) = B(I+1) * E   | Output dependency
4  B(I+1) = A(I+3) * C   |
5  A(I+2) = B(I+2) * E   |
6  B(I+2) = A(I+4) * C

Statements 1-4 could all be executed simultaneously. Once those statements completed execution, statements 5-8 could execute in parallel. Using this approach, there are sufficient intervening statements between the dependent statements that we can see some parallel performance improvement from a superscalar RISC processor.

3.1.4.3 Output Dependencies

The third class of data dependencies, output dependencies, is of particular interest to users of parallel computers, particularly multiprocessors. Output dependencies involve getting the right values to the right variables when all calculations have been completed. Otherwise, an output dependency is violated. The loop below assigns new values to two elements of the vector A with each iteration:

DO I=1,N
  A(I)   = C(I) * 2.
  A(I+2) = D(I) + E
ENDDO


As always, we won't have any problems if we execute the code sequentially. But if several iterations are performed together, and statements are reordered, then incorrect values can be assigned to the last elements of A. For example, in the naive vectorized equivalent below, A(I+2) takes the wrong value because the assignments occur out of order:

A(I)   = C(I)   * 2.
A(I+1) = C(I+1) * 2.
A(I+2) = C(I+2) * 2.
A(I+2) = D(I)   + E   <- Output dependency violated
A(I+3) = D(I+1) + E
A(I+4) = D(I+2) + E

Whether or not you have to worry about output dependencies depends on whether you are actually parallelizing the code. Your compiler will be conscious of the danger, and will be able to generate legal code, and possibly even fast code, if it's clever enough. But output dependencies occasionally become a problem for programmers.

3.1.4.4 Dependencies Within an Iteration


We have looked at dependencies that cross iteration boundaries but we haven't looked at dependencies within the same iteration. Consider the following code fragment:

DO I = 1,N
  D = B(I) * 17
  A(I) = D + 14
ENDDO


When we look at the loop, the variable D has a flow dependency. The second statement cannot start until the first statement has completed. At first glance this might appear to limit parallelism significantly. When we look closer and manually unroll several iterations of the loop, the situation gets worse:

D      = B(I)   * 17
A(I)   = D + 14
D      = B(I+1) * 17
A(I+1) = D + 14
D      = B(I+2) * 17
A(I+2) = D + 14

Now, the variable D has flow, output, and antidependencies. It looks like this loop has no hope of running in parallel. However, there is a simple solution to this problem at the cost of some extra memory space, using a technique called promoting a scalar to a vector. We define D as an array with N elements and rewrite the code as follows:


DO I = 1,N
  D(I) = B(I) * 17
  A(I) = D(I) + 14
ENDDO


Now the iterations are all independent and can be run in parallel. Within each iteration, the first statement must run before the second statement.
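In C, the same scalar expansion looks like the sketch below; the constants 17 and 14 come from the loop above, while the function and array names are illustrative assumptions.

/* d is promoted from a scalar to an array so that iterations no longer
   fight over a single location; only the within-iteration order remains. */
void scale_and_shift(int n, const double *b, double *a, double *d)
{
    int i;
    for (i = 0; i < n; i++) {
        d[i] = b[i] * 17.0;
        a[i] = d[i] + 14.0;
    }
}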

3.1.4.5 Reductions

The sum of an array of numbers is one example of a reduction, so called because it reduces a vector to a scalar. The following loop to determine the total of the values in an array certainly looks as though it might be able to be run in parallel:

SUM = 0.0
DO I=1,N
  SUM = SUM + A(I)
ENDDO


However, if we perform our unrolling trick, it doesn't look very parallel:

SUM = SUM + A(I)
SUM = SUM + A(I+1)
SUM = SUM + A(I+2)


This loop also has all three types of dependencies and looks impossible to parallelize. If we are willing to accept the potential effect of rounding, we can add some parallelism to this loop as follows (again we did not add the preconditioning loop):

SUM0 = 0.0
SUM1 = 0.0
SUM2 = 0.0
SUM3 = 0.0
DO I=1,N,4
  SUM0 = SUM0 + A(I)
  SUM1 = SUM1 + A(I+1)
  SUM2 = SUM2 + A(I+2)
  SUM3 = SUM3 + A(I+3)
ENDDO



SUM = SUM0 + SUM1 + SUM2 + SUM3

Again, this is not precisely the same computation, but all four partial sums can be computed independently. The partial sums are combined at the end of the loop.
Loops that look for the maximum or minimum elements in an array, or multiply all the elements of an array, are also reductions. Likewise, some of these can be reorganized into partial results, as with the sum, to expose more computations. Note that the maximum and minimum are associative operators, so the results of the reorganized loop are identical to the sequential loop.
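Here is the same partial-sum idea cast as a C sketch; as in the FORTRAN version there is no preconditioning loop, so n is assumed to be a multiple of four, and the rounding caveat above still applies.

/* Four independent partial sums that can proceed in parallel; they are
   combined only once, after the loop. */
double sum4(int n, const double *a)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i+1];
        s2 += a[i+2];
        s3 += a[i+3];
    }
    return s0 + s1 + s2 + s3;
}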

3.1.5 Ambiguous References


Every dependency we have looked at so far has been clear cut; you could see exactly what you were dealing with by looking at the source code. But other times, describing a dependency isn't so easy. Recall this loop from the antidependencies section (Section 3.1.4.2) earlier in this chapter:

DO I=1,N
  A(I) = B(I)   * E
  B(I) = A(I+2) * C
ENDDO


Because each variable reference is solely a function of the index, I, it's clear what kind of dependency we are dealing with. Furthermore, we can describe how far apart (in iterations) a variable reference is from its definition. This is called the dependency distance. A negative value represents a flow dependency; a positive value means there is an antidependency. A value of zero says that no dependency exists between the reference and the definition. In this loop, the dependency distance for A is +2 iterations.
However, array subscripts may be functions of other variables besides the loop index. It may be difficult to tell the distance between the use and definition of a particular element. It may even be impossible to tell whether the dependency is a flow dependency or an antidependency, or whether a dependency exists at all. Consequently, it may be impossible to determine if it's safe to overlap execution of different statements, as in the following loop:

DO I=1,N
  A(I) = B(I)   * E
  B(I) = A(I+K) * C   <- K unknown
ENDDO


If the loop made use of A(I+K), where the value of K was unknown, we wouldn't be able to tell (at least by looking at the code) anything about the kind of dependency we might be facing. If K is zero, we have a dependency within the iteration and no loop-carried dependencies. If K is positive we have an antidependency with distance K. Depending on the value for K, we might have enough parallelism for a superscalar processor. If K is negative, we have a loop-carried flow dependency, and we may have to execute the loop serially.
(This content is available online at <http://cnx.org/content/m32788/1.2/>.)
Ambiguous references, like A(I+K) above, have an effect on the parallelism we can detect in a loop. From the compiler perspective, it may be that this loop does contain two independent calculations that the author whimsically decided to throw into a single loop. But when they appear together, the compiler has to treat them conservatively, as if they were interrelated. This has a big effect on performance. If the compiler has to assume that consecutive memory references may ultimately access the same location, the instructions involved cannot be overlapped. One other option is for the compiler to generate two versions of the loop and check the value for K at runtime to determine which version of the loop to execute.
A similar situation occurs when we use integer index arrays in a loop. The loop below contains only a single statement, but you can't be sure that any iteration is independent without knowing the contents of the K and J arrays:

hy saIDx e@u@sAA a e@u@sAA C f@t@sAA B g ixhhy


por instneD wht if ll of the vlues for u@sA were the smec his uses the sme element of the rry e to e rereferened with eh itertion3 ht my seem ridiulous to youD ut the ompiler n9t tellF ith ode like thisD it9s ommon for every vlue of u@sA to e uniqueF his is lled F sf you n tell ompiler tht it is deling with permuttionD the penlty is lessened in some sesF iven soD there is insult eing dded to injuryF sndiret referenes require more memory tivity thn diret referenesD nd this slows you downF

permutation

3.1.5.1 Pointer Ambiguity in Numerical C Applications


FORTRAN compilers depend on programmers to observe aliasing rules. That is, programmers are not supposed to modify locations through pointers that may be aliases of one another. They can become aliases in several ways, such as when two dummy arguments receive pointers to the same storage locations:

      CALL BOB (A,A)
      ...
      END
      SUBROUTINE BOB (X,Y)      X,Y become aliases
C compilers don't enjoy the same restrictions on aliasing. In fact, there are cases where aliasing could be desirable. Additionally, C is blessed with pointer types, increasing the opportunities for aliasing to occur. This means that a C compiler has to approach operations through pointers more conservatively than a FORTRAN compiler would. Let's look at some examples to see why.
The following loop nest looks like a FORTRAN loop cast in C. The arrays are declared or allocated all at once at the top of the routine, and the starting address and leading dimensions are visible to the compiler. This is important because it means that the storage relationship between the array elements is well known. Hence, you could expect good performance:

#define N ...
double *a[N][N], c[N][N], d;
...
for (i=0; i<N; i++)
    for (j=0; j<N; j++)
        a[i][j] = a[i][j] + c[j][i] * d;

Now imagine what happens if you allocate the rows dynamically. This makes the address calculations more complicated. The loop nest hasn't changed; however, there is no guaranteed stride that can get you from one row to the next. This is because the storage relationship between the rows is unknown:

#define N ...
double *a[N], *c[N], d;
for (i=0; i<N; i++) {
    a[i] = (double *) malloc (N*sizeof(double));
    c[i] = (double *) malloc (N*sizeof(double));
}
for (i=0; i<N; i++)
    for (j=0; j<N; j++)
        a[i][j] = a[i][j] + c[j][i] * d;
In fact, your compiler knows even less than you might expect about the storage relationship. For instance, how can it be sure that references to a and c aren't aliases? It may be obvious to you that they're not. You might point out that malloc never overlaps storage. But the compiler isn't free to assume that. Who knows? You may be substituting your own version of malloc!
Let's look at a different example, where storage is allocated all at once, though the declarations are not visible to all routines that are using it. The following subroutine bob performs the same computation as our previous example. However, because the compiler can't see the declarations for a and c (they're in the main routine), it doesn't have enough information to be able to overlap memory references from successive iterations; the references could be aliases:

#define N...
main() {
    double a[N][N], c[N][N], d;
    ...
    bob (a,c,d,N);
}
bob (double *a,double *c,double d,int n) {
    int i,j;
    double *ap, *cp;
    for (i=0;i<n;i++) {
        ap = a + (i*n);
        cp = c + i;
        for (j=0; j<n; j++)
            *(ap+j) = *(ap+j) + *(cp+(j*n)) * d;
    }
}

To get the best performance, make available to the compiler as many details about the size and shape of your data structures as possible. Pointers, whether in the form of formal arguments to a subroutine or explicitly declared, can hide important facts about how you are using memory. The more information the compiler has, the more it can overlap memory references. This information can come from compiler directives or from making declarations visible in the routines where performance is most critical.
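As a hedged illustration of that advice (the code below is our own sketch, not part of the original example), the bob routine could be declared so that the parameter types carry the array shape. With the leading dimension visible in the prototype, the compiler again knows the storage relationship between elements; whether it actually exploits this depends on the compiler.

#include <stdio.h>

#define N 4                      /* small size just for illustration */

/* The parameters are pointer-to-row-of-N doubles, so the compiler can
   compute the stride between rows instead of treating a and c as
   unrelated flat pointers. */
void bob(double a[][N], double c[][N], double d, int n)
{
    int i, j;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[i][j] = a[i][j] + c[j][i] * d;
}

int main(void)
{
    double a[N][N] = {{0.0}}, c[N][N];
    int i, j;

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            c[i][j] = i + j;

    bob(a, c, 2.0, N);
    printf("a[1][2] = %f\n", a[1][2]);
    return 0;
}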

3.1.6 Closing Notes8


You already knew there was a limit to the amount of parallelism in any given program. Now you know why. Clearly, if a program had no dependencies, you could execute the whole thing at once, given suitable hardware. But programs aren't infinitely parallel; they are often hardly parallel at all. This is because they contain dependencies of the types we saw above.
When we are writing and/or tuning our loops, we have a number of (sometimes conflicting) goals to keep in mind:

- Balance memory operations and computations.
- Minimize unnecessary operations.
- Access memory using unit stride if at all possible.
- Allow all of the loop iterations to be computed in parallel.

In the coming chapters, we will begin to learn more about executing our programs on parallel multiprocessors. At some point we will escape the bonds of compiler automatic optimization and begin to explicitly code the parallel portions of our code. To learn more about compilers and dataflow, read The Art of Compiler Design: Theory and Practice by Thomas Pittman and James Peters (Prentice-Hall).

3.1.7 Exercises9
Exercise 3.1
Identify the dependencies (if there are any) in the following loops. Can you think of ways to organize each loop for more parallelism?

a.

      DO I=1,N-2
        A(I+2) = A(I) + 1.
      ENDDO

b.

      DO I=1,N-1,2
        A(I+1) = A(I) + 1.
      ENDDO

c.

      DO I=2,N
        A(I) = A(I-1) * 2.
        B = A(I-1)
      ENDDO

d.

      DO I=1,N
        IF(N .GT. M) A(I) = 1.
      ENDDO

e.

      DO I=1,N
        A(I,J) = A(I,K) + B
      ENDDO

f.

      DO I=1,N-1
        A(I+1,J) = A(I,K) + B
      ENDDO

g.

      for (i=0; i<n; i++)
        a[i] = b[i];

8 This content is available online at <http://cnx.org/content/m32789/1.2/>.
9 This content is available online at <http://cnx.org/content/m32792/1.2/>.

Exercise 3.2

Imagine that you are a parallelizing compiler, trying to generate code for the loop below. Why are references to A a challenge? Why would it help to know that K is equal to zero? Explain how you could partially vectorize the statements involving A if you knew that K had an absolute value of at least 8.

      DO I=1,N
        E(I,M) = E(I-1,M+1) - 1.0
        B(I)   = A(I+K) * C
        A(I)   = D(I) * 2.0
      ENDDO

Exercise 3.3
The following three statements contain a flow dependency, an antidependency and an output dependency. Can you identify each? Given that you are allowed to reorder the statements, can you find a permutation that produces the same values for the variables C and B? Show how you can reduce the dependencies by combining or rearranging calculations and using temporary variables.

      B = A + C
      B = C + D
      C = B + D

3.2 Shared-Memory Multiprocessors


3.2.1 Introduction10
In the mid-1980s, shared-memory multiprocessors were pretty expensive and pretty rare. Now, as hardware costs are dropping, they are becoming commonplace. Many home computer systems in the under-$3000 range have a socket for a second CPU. Home computer operating systems are providing the capability to use more than one processor to improve system performance. Rather than specialized resources locked away in a central computing facility, these shared-memory processors are often viewed as a logical extension of the desktop. These systems run the same operating system (UNIX or NT) as the desktop and many of the same applications from a workstation will execute on these multiprocessor servers.
Typically a workstation will have from 1 to 4 processors and a server system will have 4 to 64 processors. Shared-memory multiprocessors have a significant advantage over other multiprocessors because all the processors share the same view of the memory, as shown in Figure 3.11 (A shared-memory multiprocessor).
These processors are also described as uniform memory access (also known as UMA) systems. This designation indicates that memory is equally accessible to all processors with the same performance.
The popularity of these systems is not due simply to the demand for high performance computing. These systems are excellent at providing high throughput for a multiprocessing load, and function effectively as high-performance database servers, network servers, and Internet servers. Within limits, their throughput is increased linearly as more processors are added.
In this book we are not so interested in the performance of database or Internet servers. That is too passé; buy more processors, get better throughput. We are interested in pure, raw, unadulterated compute speed for our high performance application. Instead of running hundreds of small jobs, we want to utilize all $750,000 worth of hardware for our single job. The challenge is to find techniques that make a program that takes an hour to complete using one processor, complete in less than a minute using 64 processors. This is not trivial. Throughout this book so far, we have been on an endless quest for parallelism. In this and the remaining chapters, we will begin to see the payoff for all of your hard work and dedication!

10 This

content is available online at <http://cnx.org/content/m32797/1.2/>.


The cost of a shared-memory multiprocessor can range from $4000 to $30 million. Some example systems include multiple-processor Intel systems from a wide range of vendors, SGI Power Challenge Series, HP/Convex C-Series, DEC AlphaServers, Cray vector/parallel processors, and Sun Enterprise systems. The SGI Origin 2000, HP/Convex Exemplar, Data General AV-20000, and Sequent NUMA-2000 all are uniform-memory, symmetric multiprocessing systems that can be linked to form even larger shared nonuniform memory-access systems. Among these systems, as the price increases, the number of CPUs increases, the performance of individual CPUs increases, and the memory performance increases.
In this chapter we will study the hardware and software environment in these systems and learn how to execute our programs on these systems.

3.2.2 Symmetric Multiprocessing Hardware11


In Figure 3.11 (A shared-memory multiprocessor), we viewed an ideal shared-memory multiprocessor. In this section, we look in more detail at how such a system is actually constructed. The primary advantage of these systems is the ability for any CPU to access all of the memory and peripherals. Furthermore, the systems need a facility for deciding among themselves who has access to what, and when, which means there will have to be hardware support for arbitration. The two most common architectural underpinnings for symmetric multiprocessing are buses and crossbars. The bus is the simplest of the two approaches. Figure 3.12 (A typical bus architecture) shows processors connected using a bus. A bus can be thought of as a set of parallel wires connecting the components of the computer (CPU, memory, and peripheral controllers), a set of protocols for communication, and some hardware to help carry it out. A bus is less expensive to build, but because all traffic must cross the bus, as the load increases, the bus eventually becomes a performance bottleneck.

A shared-memory multiprocessor

Figure 3.11

11 This

content is available online at <http://cnx.org/content/m32794/1.2/>.


A typical bus architecture

Figure 3.12

A crossbar is a hardware approach to eliminate the bottleneck caused by a single bus. A crossbar is like several buses running side by side with attachments to each of the modules on the machine: CPU, memory, and peripherals. Any module can get to any other by a path through the crossbar, and multiple paths may be active simultaneously. In the 4x5 crossbar of Figure 3.13 (A crossbar), for instance, there can be four active data transfers in progress at one time. In the diagram it looks like a patchwork of wires, but there is actually quite a bit of hardware that goes into constructing a crossbar. Not only does the crossbar connect parties that wish to communicate, but it must also actively arbitrate between two or more CPUs that want access to the same memory or peripheral. In the event that one module is too popular, it's the crossbar that decides who gets access and who doesn't. Crossbars have the best performance because there is no single shared bus. However, they are more expensive to build, and their cost increases as the number of ports is increased. Because of their cost, crossbars typically are only found at the high end of the price and performance spectrum.
Whether the system uses a bus or crossbar, there is only so much memory bandwidth to go around; four or eight processors drawing from one memory system can quickly saturate all available bandwidth. All of the techniques that improve memory performance (as described earlier) also apply here in the design of the memory subsystems attached to these buses or crossbars.



A crossbar

Figure 3.13

3.2.2.1 The Effect of Cache


The most common multiprocessing system is made up of commodity processors connected to memory and peripherals through a bus. Interestingly, the fact that these processors make use of cache somewhat mitigates the bandwidth bottleneck on a bus-based architecture. By connecting the processor to the cache and viewing the main memory through the cache, we significantly reduce the memory traffic across the bus. In this architecture, most of the memory accesses across the bus take the form of cache line loads and flushes. To understand why, consider what happens when the cache hit rate is very high. In Figure 3.14 (High cache hit rate reduces main memory traffic), a high cache hit rate eliminates some of the traffic that would have otherwise gone out across the bus or crossbar to main memory. Again, it is the notion of locality of reference that makes the system work. If you assume that a fair number of the memory references will hit in the cache, the equivalent attainable main memory bandwidth is more than the bus is actually capable of. This assumption explains why multiprocessors are designed with less bus bandwidth than the sum of what the CPUs can consume at once.
Imagine a scenario where two CPUs are accessing different areas of memory using unit stride. Both CPUs access the first element in a cache line at the same time. The bus arbitrarily allows one CPU access to the memory. The first CPU fills a cache line and begins to process the data. The instant the first CPU has completed its cache line fill, the cache line fill for the second CPU begins. Once the second cache line fill has completed, the second CPU begins to process the data in its cache line. If the time to process the data in a cache line is longer than the time to fill a cache line, the cache line fill for processor two completes before the next cache line request arrives from processor one. Once the initial conflict is resolved, both processors appear to have conflict-free access to memory for the remainder of their unit-stride loops.


High cache hit rate reduces main memory traffic

Figure 3.14

In actuality, on some of the fastest bus-based systems, the memory bus is sufficiently fast that up to 20 processors can access memory using unit stride with very little conflict. If the processors are accessing memory using non-unit stride, bus and memory bank conflict becomes apparent, with fewer processors.
This bus architecture combined with local caches is very popular for general-purpose multiprocessing loads. The memory reference patterns for database or Internet servers generally consist of a combination of time periods with a small working set, and time periods that access large data structures using unit stride. Scientific codes tend to perform more non-unit-stride access than general-purpose codes. For this reason, the most expensive parallel-processing systems targeted at scientific codes tend to use crossbars connected to multibanked memory systems.
The main memory system is better shielded when a larger cache is used. For this reason, multiprocessors sometimes incorporate a two-tier cache system, where each processor uses its own small on-chip local cache, backed up by a larger second board-level cache with as much as 4 MB of memory. Only when neither can satisfy a memory request, or when data has to be written back to main memory, does a request go out over the bus or crossbar.

3.2.2.2 Coherency
Now, what happens when one CPU of a multiprocessor running a single program in parallel changes the value of a variable, and another CPU tries to read it? Where does the value come from? These questions are interesting because there can be multiple copies of each variable, and some of them can hold old or stale values.
For illustration, say that you are running a program with a shared variable A. Processor 1 changes the value of A and Processor 2 goes to read it.



Multiple copies of variable A

Figure 3.15

In Figure 3.15 (Multiple copies of variable A), if Processor 1 is keeping A as a register-resident variable, then Processor 2 doesn't stand a chance of getting the correct value when it goes to look for it. There is no way that 2 can know the contents of 1's registers; so assume, at the very least, that Processor 1 writes the new value back out. Now the question is, where does the new value get stored? Does it remain in Processor 1's cache? Is it written to main memory? Does it get updated in Processor 2's cache?
Really, we are asking what kind of cache coherency protocol the vendor uses to assure that all processors see a uniform view of the values in memory. It generally isn't something that the programmer has to worry about, except that in some cases, it can affect performance. The approaches used in these systems are similar to those used in single-processor systems with some extensions. The most straight-forward cache coherency approach is called a write-through policy: variables written into cache are simultaneously written into main memory. As the update takes place, other caches in the system see the main memory reference being performed. This can be done because all of the caches continuously monitor (also known as snooping) the traffic on the bus, checking to see if each address is in their cache. If a cache "notices" that it contains a copy of the data from the locations being written, it may either invalidate its copy of the variable or obtain new values (depending on the policy). One thing to note is that a write-through cache demands a fair amount of main memory bandwidth since each write goes out over the main memory bus. Furthermore, successive writes to the same location or bank are subject to the main memory cycle time and can slow the machine down.
A more sophisticated cache coherency protocol is called copyback or writeback. The idea is that you write values back out to main memory only when the cache housing them needs the space for something else. Updates of cached data are coordinated between the caches, by the caches, without help from the processor. Copyback caching also uses hardware that can monitor (snoop) and respond to the memory transactions of the other caches in the system. The benefit of this method over the write-through method is that memory traffic is reduced considerably. Let's walk through it to see how it works.

3.2.2.3 Cache Line States


For this approach to work, each cache must maintain a state for each line in its cache. The possible states used in the example include:

Modified: This cache line needs to be written back to memory.
Exclusive: There are no other caches that have this cache line.
Shared: There are read-only copies of this line in two or more caches.
Empty/Invalid: This cache line doesn't contain any useful data.
This particular coherency protocol is often called MESI. Other cache coherency protocols are more complicated, but these states give you an idea how multiprocessor writeback cache coherency works.
We start where a particular cache line is in memory and in none of the writeback caches on the systems. The first cache to ask for data from a particular part of memory completes a normal memory access; the main memory system returns data from the requested location in response to a cache miss. The associated cache line is marked exclusive, meaning that this is the only cache in the system containing a copy of the data; it is the owner of the data. If another cache goes to main memory looking for the same thing, the request is intercepted by the first cache, and the data is returned from the first cache, not main memory. Once an interception has occurred and the data is returned, the data is marked shared in both of the caches.
When a particular line is marked shared, the caches have to treat it differently than they would if they were the exclusive owners of the data, especially if any of them wants to modify it. In particular, a write to a shared cache entry is preceded by a broadcast message to all the other caches in the system. It tells them to invalidate their copies of the data. The one remaining cache line gets marked as modified to signal that it has been changed, and that it must be returned to main memory when the space is needed for something else. By these mechanisms, you can maintain cache coherence across the multiprocessor without adding tremendously to the memory traffic.
By the way, even if a variable is not shared, it's possible for copies of it to show up in several caches. On a symmetric multiprocessor, your program can bounce around from CPU to CPU. If you run for a little while on this CPU, and a little while on that, your program will have operated out of separate caches. That means that there can be several copies of seemingly unshared variables scattered around the machine. Operating systems often try to minimize how often a process is moved between physical CPUs during context switches. This is one reason not to overload the available processors in a system.

3.2.2.4 Data Placement


There is one more pitfall regarding shared memory we have so far failed to mention. It involves data movement. Although it would be convenient to think of the multiprocessor memory as one big pool, we have seen that it is actually a carefully crafted system of caches, coherency protocols, and main memory. The problems come when your application causes lots of data to be traded between the caches. Each reference that falls out of a given processor's cache (especially those that require an update in another processor's cache) has to go out on the bus.
Often, it's slower to get memory from another processor's cache than from the main memory because of the protocol and processing overhead involved. Not only do we need to have programs with high locality of reference and unit stride, we also need to minimize the data that must be moved from one CPU to another.
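To make the data-movement pitfall concrete, here is a small, hypothetical sketch (our own example; the 64-byte line size and the structure layout are assumptions, not from the original text). If the two per-thread counters shared one cache line, every increment would force the line to bounce between the processors' caches; padding each counter to its own line keeps the updates local.

#include <pthread.h>
#include <stdio.h>

#define ITERS 10000000
#define CACHE_LINE 64                    /* assumed cache line size */

/* Pad each counter so two threads never write into the same cache line */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

static struct padded_counter counter[2];

static void *worker(void *arg)
{
    int me = *(int *)arg;
    long i;

    for (i = 0; i < ITERS; i++)
        counter[me].value++;             /* touches only this thread's line */
    return NULL;
}

int main(void)
{
    pthread_t tid[2];
    int id[2] = {0, 1};
    int i;

    for (i = 0; i < 2; i++)
        pthread_create(&tid[i], NULL, worker, &id[i]);
    for (i = 0; i < 2; i++)
        pthread_join(tid[i], NULL);

    printf("counts: %ld %ld\n", counter[0].value, counter[1].value);
    return 0;
}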

3.2.3 Multiprocessor Software Concepts 12


Now that we have examined the way shared-memory multiprocessor hardware operates, we need to examine how software operates on these types of computers. We still have to wait until the next chapters to begin making our FORTRAN programs run in parallel. For now, we use C programs to examine the fundamentals of multiprocessing and multithreading. There are several techniques used to implement multithreading, so the topics we will cover include:

- Operating system-supported multiprocessing
- User space multithreading
- Operating system-supported multithreading

The last of these is what we primarily will use to reduce the walltime of our applications.
12 This
content is available online at <http://cnx.org/content/m32800/1.2/>.


3.2.3.1 Operating System-Supported Multiprocessing


Most modern general-purpose operating systems support some form of multiprocessing. Multiprocessing doesn't require more than one physical CPU; it is simply the operating system's ability to run more than one process on the system. The operating system context-switches between each process at fixed time intervals, or on interrupts or input-output activity. For example, in UNIX, if you use the ps command, you can see the processes on the system:

% ps -a
  PID TTY      TIME CMD
28410 pts/34   0:00 tcsh
28213 pts/38   0:00 xterm
10488 pts/51   0:01 telnet
28411 pts/34   0:00 xbiff
11123 pts/25   0:00 pine
 3805 pts/21   0:00 elm
 6773 pts/44   5:48 ansys
...
% ps -a | grep ansys
 6773 pts/44   6:00 ansys
For each process we see the process identifier (PID), the terminal that is executing the command, the amount of CPU time the command has used, and the name of the command. The PID is unique across the entire system. Most UNIX commands are executed in a separate process. In the above example, most of the processes are waiting for some type of event, so they are taking very few resources except for memory. Process 6773 (13) seems to be executing and using resources. Running ps again confirms that the CPU time is increasing for the ansys process:

% vmstat 5
 procs     memory            page             disk          faults      cpu
 r b w   swap  free  re mf pi po fr de sr f0 s0 -- --   in   sy   cs us sy id
 3 0 0 353624 45432   0  0  1  0  0  0  0  0  0  0  0  461 5626  354 91  9  0
 3 0 0 353248 43960   0 22  0  0  0  0  0  0 14  0  0  518 6227  385 89 11  0

Running the vmstat 5 command tells us many things about the activity on the system. First, there are three runnable processes. If we had one CPU, only one would actually be running at a given instant. To allow all three jobs to progress, the operating system time-shares between the processes. Assuming equal priority, each process executes about 1/3 of the time. However, this system is a two-processor system, so each process executes about 2/3 of the time. Looking across the vmstat output, we can see paging activity (pi, po), context switches (cs), overall user time (us), system time (sy), and idle time (id).
Each process can execute a completely different program. While most processes are completely independent, they can cooperate and share information using interprocess communication (pipes, sockets) or various operating system-supported shared-memory areas. We generally don't use multiprocessing on these shared-memory systems as a technique to increase single-application performance.

13 ANSYS

is a commonly used structural-analysis package.


3.2.3.2 Multiprocessing software


In this section, we explore how programs access multiprocessing features.14 In this example, the program creates a new process using the fork( ) function. The new process (child) prints some messages and then changes its identity using exec( ) by loading a new program. The original process (parent) prints some messages and then waits for the child process to complete:

int globvar;            /* A global variable */

main () {

  int pid,status,retval;
  int stackvar;         /* A stack variable */

  globvar = 1;
  stackvar = 1;
  printf("Main - calling fork globvar=%d stackvar=%d\n",globvar,stackvar);
  pid = fork();
  printf("Main - fork returned pid=%d\n",pid);
  if ( pid == 0 ) {
    printf("Child - globvar=%d stackvar=%d\n",globvar,stackvar);
    sleep(1);
    printf("Child - woke up globvar=%d stackvar=%d\n",globvar,stackvar);
    globvar = 100;
    stackvar = 100;
    printf("Child - modified globvar=%d stackvar=%d\n",globvar,stackvar);
    retval = execl("/bin/date", (char *) 0 );
    printf("Child - WHY ARE WE HERE retval=%d\n",retval);
  } else {
    printf("Parent - globvar=%d stackvar=%d\n",globvar,stackvar);
    globvar = 5;
    stackvar = 5;
    printf("Parent - sleeping globvar=%d stackvar=%d\n",globvar,stackvar);
    sleep(2);
    printf("Parent - woke up globvar=%d stackvar=%d\n",globvar,stackvar);
    printf("Parent - waiting for pid=%d\n",pid);
    retval = wait(&status);
    status = status >> 8;    /* Return code in bits 15-8 */
    printf("Parent - status=%d retval=%d\n",status,retval);
  }
}

The key to understanding this code is to understand how the fork( ) function operates. The simple summary is that the fork( ) function is called once in a process and returns twice, once in the original process and once in a newly created process. The newly created process is an identical copy of the original process. All the variables (local and global) have been duplicated. Both processes have access to all of the open files of the original process. Figure 3.16 (How a fork operates) shows how the fork operation creates a new process.
14 These examples are written in C using the POSIX 1003.1 application programming interface. This example runs on most UNIX systems and on other POSIX-compliant systems including OpenNT, OpenVMS, and many others.


The only difference between the processes is that the return value from the fork( ) function call is 0 in the new (child) process and the process identifier (shown by the ps command) in the original (parent) process. This is the program output:

recs % cc -o fork fork.c
recs % fork
Main - calling fork globvar=1 stackvar=1
Main - fork returned pid=19336
Main - fork returned pid=0
Parent - globvar=1 stackvar=1
Parent - sleeping globvar=5 stackvar=5
Child - globvar=1 stackvar=1
Child - woke up globvar=1 stackvar=1
Child - modified globvar=100 stackvar=100
Thu Nov  6 22:40:33
Parent - woke up globvar=5 stackvar=5
Parent - waiting for pid=19336
Parent - status=0 retval=19336
recs %

Tracing this through, first the program sets the global and stack variable to one and then calls fork( ). During the fork( ) call, the operating system suspends the process, makes an exact duplicate of the process, and then restarts both processes. You can see two messages from the statement immediately after the fork. The first line is coming from the original process, and the second line is coming from the new process. If you were to execute a ps command at this moment in time, you would see two processes running called "fork". One would have a process identifier of 19336.


How a fork operates

Figure 3.16

As both processes start, they execute an IF-THEN-ELSE and begin to perform different actions in the parent and child. Notice that globvar and stackvar are set to 5 in the parent, and then the parent sleeps for two seconds. At this point, the child begins executing. The values for globvar and stackvar are unchanged in the child process. This is because these two processes are operating in completely independent memory spaces. The child process sleeps for one second and sets its copies of the variables to 100. Next, the child process calls the execl( ) function to overwrite its memory space with the UNIX date program.
Note that the execl( ) never returns; the date program takes over all of the resources of the child process. If you were to do a ps at this moment in time, you would still see two processes on the system but process 19336 would be called date. The date command executes, and you can see its output.15
The parent wakes up after a brief two-second sleep and notices that its copies of the global and local variables have not been changed by the action of the child process. The parent then calls the wait( ) function to

15 It's not uncommon for a human parent process to fork and create a human child process that initially seems to have the same identity as the parent. It's also not uncommon for the child process to change its overall identity to be something very different from the parent at some later point. Usually human children wait 13 years or so before this change occurs, but in UNIX, this happens in a few microseconds. So, in some ways, in UNIX, there are many parent processes that are disappointed because their children did not turn out like them!


determine if any of its children exited. The wait( ) function returns which child has exited and the status code returned by that child process (in this case, process 19336).
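The example shifts the status word right by eight bits to recover the child's return code. On POSIX systems the same information can be extracted with the macros in <sys/wait.h>; the short sketch below is our own illustration of that idea, not a change to the example above.

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int status;
    pid_t pid = fork();

    if (pid == 0) {
        exit(42);                          /* child returns a status code */
    } else {
        pid_t done = wait(&status);
        if (WIFEXITED(status))             /* portable form of status >> 8 */
            printf("child %d exited with code %d\n",
                   (int)done, WEXITSTATUS(status));
    }
    return 0;
}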

3.2.3.3 User Space Multithreading


A thread is different from a process. When you add threads, they are added to the existing process rather than starting in a new process. Processes start with a single thread of execution and can add or remove threads throughout the duration of the program. Unlike processes, which operate in different memory spaces, all threads in a process share the same memory space. Figure 3.17 (Creating a thread) shows how the creation of a thread differs from the creation of a process. Not all of the memory space in a process is shared between all threads. In addition to the global area that is shared across all threads, each thread has a thread private area for its own local variables. It's important for programmers to know when they are working with shared variables and when they are working with local variables.
When attempting to speed up high performance computing applications, threads have the advantage over processes in that multiple threads can cooperate and work on a shared data structure to hasten the computation. By dividing the work into smaller portions and assigning each smaller portion to a separate thread, the total work can be completed more quickly.
Multiple threads are also used in high performance database and Internet servers to improve the overall throughput of the server. With a single thread, the program can either be waiting for the next network request or reading the disk to satisfy the previous request. With multiple threads, one thread can be waiting for the next network transaction while several other threads are waiting for disk I/O to complete.
The following is an example of a simple multithreaded application.16 It begins with a single master thread that creates three additional threads. Each thread prints some messages, accesses some global and local variables, and then terminates:

#define _REENTRANT            /* basic lines for threads */
#include <stdio.h>
#include <pthread.h>

#define THREAD_COUNT 3
void *TestFunc(void *);
int globvar;                        /* A global variable */
int index[THREAD_COUNT];            /* Local zero-based thread index */
pthread_t thread_id[THREAD_COUNT];  /* POSIX Thread IDs */

main() {
  int i,retval;
  pthread_t tid;

  globvar = 0;
  printf("Main - globvar=%d\n",globvar);
  for(i=0;i<THREAD_COUNT;i++) {
    index[i] = i;
    retval = pthread_create(&tid,NULL,TestFunc,(void *) index[i]);
    printf("Main - creating i=%d tid=%d retval=%d\n",i,tid,retval);
    thread_id[i] = tid;
  }
  printf("Main thread - threads started globvar=%d\n",globvar);
  for(i=0;i<THREAD_COUNT;i++) {
    printf("Main - waiting for join %d\n",thread_id[i]);
    retval = pthread_join( thread_id[i], NULL ) ;
    printf("Main - back from join %d retval=%d\n",i,retval);
  }
  printf("Main thread - threads completed globvar=%d\n",globvar);
}
void *TestFunc(void *parm) {
  int me,self;

  me = (int) parm;          /* My own assigned thread ordinal */
  self = pthread_self();    /* The POSIX Thread library thread number */

  printf("TestFunc me=%d - self=%d globvar=%d\n",me,self,globvar);
  globvar = me + 15;
  printf("TestFunc me=%d - sleeping globvar=%d\n",me,globvar);
  sleep(2);
  printf("TestFunc me=%d - done param=%d globvar=%d\n",me,self,globvar);
}

16 This example uses the IEEE POSIX standard interface for a thread library. If your system supports POSIX threads, this example should work. If not, there should be similar routines on your system for each of the thread functions.



Creating a thread

Figure 3.17

The global shared areas in this case are those variables declared in the static area outside the main( ) code. The local variables are any variables declared within a routine. When threads are added, each thread gets its own function call stack. In C, the automatic variables that are declared at the beginning of each routine are allocated on the stack. As each thread enters a function, these variables are separately allocated on that particular thread's stack. So these are the thread-local variables.
Unlike the fork( ) function, the pthread_create( ) function creates a new thread, and then control is returned to the calling thread. One of the parameters of the pthread_create( ) is the name of a function. New threads begin execution in the function TestFunc( ) and the thread finishes when it returns from this function. When this program is executed, it produces the following output:

recs % cc -o create1 -lpthread -lposix4 create1.c
recs % create1


Main - globvar=0
Main - creating i=0 tid=4 retval=0
Main - creating i=1 tid=5 retval=0
Main - creating i=2 tid=6 retval=0
Main thread - threads started globvar=0
Main - waiting for join 4
TestFunc me=0 - self=4 globvar=0
TestFunc me=0 - sleeping globvar=15
TestFunc me=1 - self=5 globvar=15
TestFunc me=1 - sleeping globvar=16
TestFunc me=2 - self=6 globvar=16
TestFunc me=2 - sleeping globvar=17
TestFunc me=2 - done param=6 globvar=17
TestFunc me=1 - done param=5 globvar=17
TestFunc me=0 - done param=4 globvar=17
Main - back from join 0 retval=0
Main - waiting for join 5
Main - back from join 1 retval=0
Main - waiting for join 6
Main - back from join 2 retval=0
Main thread - threads completed globvar=17
recs %

You can see the threads getting created in the loop. The master thread completes the pthread_create( ) loop, executes the second loop, and calls the pthread_join( ) function. This function suspends the master thread until the specified thread completes. The master thread is waiting for Thread 4 to complete. Once the master thread suspends, one of the new threads is started. Thread 4 starts executing. Initially the variable globvar is set to 0 from the main program. The self, me, and param variables are thread-local variables, so each thread has its own copy. Thread 4 sets globvar to 15 and goes to sleep. Then Thread 5 begins to execute and sees globvar set to 15 from Thread 4; Thread 5 sets globvar to 16, and goes to sleep. This activates Thread 6, which sees the current value for globvar and sets it to 17. Then Threads 6, 5, and 4 wake up from their sleep, all notice the latest value of 17 in globvar, and return from the TestFunc( ) routine, ending the threads.
All this time, the master thread is in the middle of a pthread_join( ) waiting for Thread 4 to complete. As Thread 4 completes, the pthread_join( ) returns. The master thread then calls pthread_join( ) repeatedly to ensure that all three threads have been completed. Finally, the master thread prints out the value for globvar that contains the latest value of 17.
To summarize, when an application is executing with more than one thread, there are shared global areas and thread private areas. Different threads execute at different times, and they can easily work together in shared areas.

3.2.3.4 Limitations of user space multithreading


Multithreaded applications were around long before multiprocessors existed. It is quite practical to have multiple threads with a single CPU. As a matter of fact, the previous example would run on a system with any number of processors, including one. If you look closely at the code, it performs a sleep operation at each critical point in the code. One reason to add the sleep calls is to slow the program down enough that you can actually see what is going on. However, these sleep calls also have another effect. When one thread enters the sleep routine, it causes the thread library to search for other runnable threads. If a runnable thread is found, it begins executing immediately while the calling thread is sleeping. This is called a user-space thread context switch. The process actually has one operating system thread shared among several logical user threads. When library routines (such as sleep) are called, the thread library jumps in and reschedules threads.17
We can explore this effect by substituting the following SpinFunc( ) function, replacing the TestFunc( ) function in the pthread_create( ) call in the previous example:

void *SpinFunc(void *parm) {
  int me;

  me = (int) parm;
  printf("SpinFunc me=%d - sleeping %d seconds ...\n", me, me+1);
  sleep(me+1);
  printf("SpinFunc me=%d - wake globvar=%d...\n", me, globvar);
  globvar ++;
  printf("SpinFunc me=%d - spinning globvar=%d...\n", me, globvar);
  while(globvar < THREAD_COUNT ) ;
  printf("SpinFunc me=%d - done globvar=%d...\n", me, globvar);
  sleep(THREAD_COUNT+1);
}

If you look at the function, each thread entering this function prints a message and goes to sleep for 1, 2, and 3 seconds. Then the function increments globvar (initially set to 0 in main) and begins a while-loop, continuously checking the value of globvar. As time passes, the second and third threads should finish their sleep( ), increment the value for globvar, and begin the while-loop. When the last thread reaches the loop, the value for globvar is 3 and all the threads exit the loop. However, this isn't what happens:

recs % create2 &
[1] 23921
recs %
Main - globvar=0
Main - creating i=0 tid=4 retval=0
Main - creating i=1 tid=5 retval=0
Main - creating i=2 tid=6 retval=0
Main thread - threads started globvar=0
Main - waiting for join 4
SpinFunc me=0 - sleeping 1 seconds ...
SpinFunc me=1 - sleeping 2 seconds ...
SpinFunc me=2 - sleeping 3 seconds ...
SpinFunc me=0 - wake globvar=0...
SpinFunc me=0 - spinning globvar=1...

recs % ps
  PID TTY      TIME CMD
23921 pts/35   0:09 create2
recs % ps
  PID TTY      TIME CMD
23921 pts/35   1:16 create2
recs % kill -9 23921
[1]    Killed          create2
recs %

17 The pthreads library supports both user-space threads and operating-system threads, as we shall soon see. Another popular early threads package was called cthreads.
We run the program in the background18 and everything seems to run fine. All the threads go to sleep for 1, 2, and 3 seconds. The first thread wakes up and starts the loop waiting for globvar to be incremented by the other threads. Unfortunately, with user space threads, there is no automatic time sharing. Because we are in a CPU loop that never makes a system call, the second and third threads never get scheduled so they can complete their sleep( ) call. To fix this problem, we need to make the following change to the code:

    while(globvar < THREAD_COUNT ) sleep(1) ;


With this sleep19 call, Threads 2 and 3 get a chance to be scheduled. They then finish their sleep calls, increment the globvar variable, and the program terminates properly.
You might ask the question, "Then what is the point of user space threads?" Well, when there is a high performance database server or Internet server, the multiple logical threads can overlap network I/O with database I/O and other background computations. This technique is not so useful when the threads all want to perform simultaneous CPU-intensive computations. To do this, you need threads that are created, managed, and scheduled by the operating system rather than a user library.
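The footnote below mentions that some thread libraries provide a sched_yield( ) routine for exactly this situation. As a hedged alternative to sleeping for a full second, the spin loop in SpinFunc( ) could instead give up the processor on every pass; this is only a sketch, and whether sched_yield( ) (declared in <sched.h>) is available depends on your system and thread library.

    while(globvar < THREAD_COUNT ) sched_yield() ;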

3.2.3.5 Operating System-Supported Multithreading


When the operating system supports multiple threads per process, you can begin to use these threads to do simultaneous computational activity. There is still no requirement that these applications be executed on a multiprocessor system. When an application that uses four operating system threads is executed on a single processor machine, the threads execute in a time-shared fashion. If there is no other load on the system, each thread gets 1/4 of the processor. While there are good reasons to have more threads than processors for noncompute applications, it's not a good idea to have more active threads than processors for compute-intensive applications because of the thread-switching overhead. (For more detail on the effect of too many threads, see Appendix D, How FORTRAN Manages Threads at Runtime.)
If you are using the POSIX threads library, it is a simple modification to request that your threads be created as operating-system threads rather than user threads, as the following code shows:

#define _REENTRANT            /* basic 3-lines for threads */
#include <stdio.h>
#include <pthread.h>

#define THREAD_COUNT 2
void *SpinFunc(void *);
int globvar;                        /* A global variable */
int index[THREAD_COUNT];            /* Local zero-based thread index */
pthread_t thread_id[THREAD_COUNT];  /* POSIX Thread IDs */
pthread_attr_t attr;                /* Thread attributes NULL=use default */
18 Because we know it will hang and ignore interrupts.
19 Some thread libraries support a call to a routine sched_yield( ) that checks for runnable threads. If it finds a runnable thread, it runs the thread. If no thread is runnable, it returns immediately to the calling thread. This routine allows a thread that has the CPU to ensure that other threads make progress during CPU-intensive periods of its code.


main() {
  int i,retval;
  pthread_t tid;

  globvar = 0;
  pthread_attr_init(&attr);      /* Initialize attr with defaults */
  pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
  printf("Main - globvar=%d\n",globvar);
  for(i=0;i<THREAD_COUNT;i++) {
    index[i] = i;
    retval = pthread_create(&tid,&attr,SpinFunc,(void *) index[i]);
    printf("Main - creating i=%d tid=%d retval=%d\n",i,tid,retval);
    thread_id[i] = tid;
  }
  printf("Main thread - threads started globvar=%d\n",globvar);
  for(i=0;i<THREAD_COUNT;i++) {
    printf("Main - waiting for join %d\n",thread_id[i]);
    retval = pthread_join( thread_id[i], NULL ) ;
    printf("Main - back from join %d retval=%d\n",i,retval);
  }
  printf("Main thread - threads completed globvar=%d\n",globvar);
}

The code executed by the master thread is modified slightly. We create an attribute data structure and set the PTHREAD_SCOPE_SYSTEM attribute to indicate that we would like our new threads to be created and scheduled by the operating system. We use the attribute information on the call to pthread_create( ). None of the other code has been changed. The following is the execution output of this new program:

recs % create3
Main - globvar=0
Main - creating i=0 tid=4 retval=0
SpinFunc me=0 - sleeping 1 seconds ...
Main - creating i=1 tid=5 retval=0
Main thread - threads started globvar=0
Main - waiting for join 4
SpinFunc me=1 - sleeping 2 seconds ...
SpinFunc me=0 - wake globvar=0...
SpinFunc me=0 - spinning globvar=1...
SpinFunc me=1 - wake globvar=1...
SpinFunc me=1 - spinning globvar=2...
SpinFunc me=1 - done globvar=2...
SpinFunc me=0 - done globvar=2...
Main - back from join 0 retval=0
Main - waiting for join 5
Main - back from join 1 retval=0
Main thread - threads completed globvar=2
recs %

Now the program executes properly. When the first thread starts spinning, the operating system is context switching between all three threads. As the threads come out of their sleep( ), they increment their shared variable, and when the final thread increments the shared variable, the other two threads instantly notice the new value (because of the cache coherency protocol) and finish the loop. If there are fewer than three CPUs, a thread may have to wait for a time-sharing context switch to occur before it notices the updated global variable.
With operating-system threads and multiple processors, a program can realistically break up a large computation between several independent threads and compute the solution more quickly. Of course this presupposes that the computation could be done in parallel in the first place.

3.2.4 Techniques for Multithreaded Programs20


Given that we have multithreaded capabilities and multiprocessors, we must still convince the threads to work together to accomplish some overall goal. Often we need some ways to coordinate and cooperate between the threads. There are several important techniques that are used while the program is running with multiple threads, including:

- Fork-join (or create-join) programming
- Synchronization using a critical section with a lock, semaphore, or mutex
- Barriers

Each of these techniques has an overhead associated with it. Because these overheads are necessary to go parallel, we must make sure that we have sufficient work to make the benefit of parallel operation worth the cost.

3.2.4.1 Fork-Join Programming


This approach is the simplest method of coordinating your threads. As in the earlier examples in this chapter, a master thread sets up some global data structures that describe the tasks each thread is to perform and then uses the pthread_create( ) function to activate the proper number of threads. Each thread checks the global data structure using its thread-id as an index to find its task. The thread then performs the task and completes. The master thread waits at a pthread_join( ) point, and when a thread has completed, it updates the global data structure and creates a new thread. These steps are repeated for each major iteration (such as a time-step) for the duration of the program:

for(ts=0;ts<10000;ts++) {   /* Time Step Loop   */
   /* Setup tasks */
   for (ith=0;ith<NUM_THREADS;ith++) pthread_create(..,work_routine,..)
   for (ith=0;ith<NUM_THREADS;ith++) pthread_join(...)
}
work_routine() {
   /* Perform Task */
   return;
}

The shortcoming of this approach is the overhead cost associated with creating and destroying an operating system thread for a potentially very short task.
The other approach is to have the threads created at the beginning of the program and to have them communicate amongst themselves throughout the duration of the application. To do this, they use such techniques as critical sections or barriers.
20 This
content is available online at <http://cnx.org/content/m32802/1.2/>.


3.2.4.2 Synchronization
Synchronization is needed when there is a particular operation to a shared variable that can only be performed by one processor at a time. For example, in previous SpinFunc( ) examples, consider the line:

    globvar++;

In assembly language, this takes at least three instructions:

    LOAD    R1,globvar
    ADD     R1,1
    STORE   R1,globvar

What if globvar contained 0, Thread 1 was running, and, at the precise moment it completed the LOAD into Register R1 and before it had completed the ADD or STORE instructions, the operating system interrupted the thread and switched to Thread 2? Thread 2 catches up and executes all three instructions using its registers: loading 0, adding 1 and storing the 1 back into globvar. Now Thread 2 goes to sleep and Thread 1 is restarted at the ADD instruction. Register R1 for Thread 1 contains the previously loaded value of 0; Thread 1 adds 1 and then stores 1 into globvar. What is wrong with this picture? We meant to use this code to count the number of threads that have passed this point. Two threads passed the point, but because of a bad case of bad timing, our variable indicates only that one thread passed. This is because the increment of a variable in memory is not atomic. That is, halfway through the increment, something else can happen.
Another way we can have a problem is on a multiprocessor when two processors execute these instructions simultaneously. They both do the LOAD, getting 0. Then they both add 1 and store 1 back to memory.21 Which processor actually got the honor of storing their 1 back to memory is simply a race.
We must have some way of guaranteeing that only one thread can be in these three instructions at the same time. If one thread has started these instructions, all other threads must wait to enter until the first thread has exited. These areas are called critical sections. On single-CPU systems, there was a simple solution to critical sections: you could turn off interrupts for a few instructions and then turn them back on. This way you could guarantee that you would get all the way through before a timer or other interrupt occurred:

    INTOFF                // Turn off Interrupts
    LOAD    R1,globvar
    ADD     R1,1
    STORE   R1,globvar
    INTON                 // Turn on Interrupts

However, this technique does not work for longer critical sections or when there is more than one CPU. In these cases, you need a lock, a semaphore, or a mutex. Most thread libraries provide this type of routine. To use a mutex, we have to make some modifications to our example code:
21 Boy,
this is getting pretty picky. How often will either of these events really happen? Well, if it crashes your airline reservation system every 100,000 transactions or so, that would be way too often.


...
pthread_mutex_t my_mutex;       /* MUTEX data structure */
...
main() {
  ...
  pthread_attr_init(&attr);     /* Initialize attr with defaults */
  pthread_mutex_init (&my_mutex, NULL);
  .... pthread_create( ... )
  ...
}
void *SpinFunc(void *parm) {
  ...
  pthread_mutex_lock (&my_mutex);
  globvar ++;
  pthread_mutex_unlock (&my_mutex);
  while(globvar < THREAD_COUNT ) ;
  printf("SpinFunc me=%d - done globvar=%d...\n", me, globvar);
  ...
}

The mutex data structure must be declared in the shared area of the program. Before the threads are created, pthread_mutex_init must be called to initialize the mutex. Before globvar is incremented, we must lock the mutex and after we finish updating globvar (three instructions later), we unlock the mutex. With the code as shown above, there will never be more than one processor executing the globvar++ line of code, and the code will never hang because an increment was missed. Semaphores and locks are used in a similar way.
Interestingly, when using user space threads, an attempt to lock an already locked mutex, semaphore, or lock can cause a thread context switch. This allows the thread that owns the lock a better chance to make progress toward the point where they will unlock the critical section. Also, the act of unlocking a mutex can cause the thread waiting for the mutex to be dispatched by the thread library.

3.2.4.3 Barriers
Barriers are different than critical sections. Sometimes in a multithreaded application, you need to have all threads arrive at a point before allowing any threads to execute beyond that point. An example of this is a time-based simulation. Each task processes its portion of the simulation but must wait until all of the threads have completed the current time step before any thread can begin the next time step. Typically threads are created, and then each thread executes a loop with one or more barriers in the loop. The rough pseudocode for this type of approach is as follows:

main() {
   for (ith=0;ith<NUM_THREADS;ith++) pthread_create(..,work_routine,..)
   for (ith=0;ith<NUM_THREADS;ith++) pthread_join(...)   /* Wait a long time */
   exit()
}
work_routine() {

   for(ts=0;ts<10000;ts++) {   /* Time Step Loop  */
      /* Compute total forces on particles */
      wait_barrier();
      /* Update particle positions based on the forces */
      wait_barrier();
   }
   return;
}

In a sense, our SpinFunc( ) function implements a barrier. It sets a variable initially to 0. Then as threads arrive, the variable is incremented in a critical section. Immediately after the critical section, the thread spins until the precise moment that all the threads are in the spin loop, at which time all threads exit the spin loop and continue on.
For a critical section, only one processor can be executing in the critical section at the same time. For a barrier, all processors must arrive at the barrier before any of the processors can leave.
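The barrier pseudocode above leaves wait_barrier( ) undefined. Below is a minimal sketch of one way it could be built from a POSIX mutex and condition variable; the names and the fixed NUM_THREADS count are our own assumptions, and modern POSIX systems also provide a ready-made pthread_barrier_t for the same purpose.

#include <pthread.h>

#define NUM_THREADS 4                    /* assumed number of worker threads */

static pthread_mutex_t barrier_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  barrier_cond  = PTHREAD_COND_INITIALIZER;
static int barrier_count = 0;            /* threads that have arrived */
static int barrier_cycle = 0;            /* which use of the barrier this is */

/* Block until NUM_THREADS threads have called wait_barrier() */
void wait_barrier(void)
{
    int my_cycle;

    pthread_mutex_lock(&barrier_mutex);
    my_cycle = barrier_cycle;

    if (++barrier_count == NUM_THREADS) {
        /* Last thread to arrive releases everyone and resets the count */
        barrier_count = 0;
        barrier_cycle++;
        pthread_cond_broadcast(&barrier_cond);
    } else {
        /* Wait for the cycle number to change; the loop guards against
           spurious wakeups */
        while (my_cycle == barrier_cycle)
            pthread_cond_wait(&barrier_cond, &barrier_mutex);
    }
    pthread_mutex_unlock(&barrier_mutex);
}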

3.2.5 A Real Example 22


In all of the above examples, we have focused on the mechanics of shared memory, thread creation, and thread termination. We have used the sleep( ) routine to slow things down sufficiently to see interactions between processes. But we want to go very fast, not just learn threading for threading's sake.
The example code below uses the multithreading techniques described in this chapter to speed up a sum of a large array. The hpcwall routine is from Section 2.2.1.
This code allocates a four-million-element double-precision array and fills it with random numbers between 0 and 1. Then using one, two, three, and four threads, it sums up the elements in the array:

#define _REENTRANT            /* basic 3-lines for threads */
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define MAX_THREAD 4
void *SumFunc(void *);
int ThreadCount;                    /* Threads on this try */
double GlobSum;                     /* A global variable */
int index[MAX_THREAD];              /* Local zero-based thread index */
pthread_t thread_id[MAX_THREAD];    /* POSIX Thread IDs */
pthread_attr_t attr;                /* Thread attributes NULL=use default */
pthread_mutex_t my_mutex;           /* MUTEX data structure */

#define MAX_SIZE 4000000
double array[MAX_SIZE];             /* What we are summing... */

void hpcwall(double *);

main() {
  int i,retval;
  pthread_t tid;
  double single,multi,begtime,endtime;

  /* Initialize things */
  for (i=0; i<MAX_SIZE; i++) array[i] = drand48();
  pthread_attr_init(&attr);           /* Initialize attr with defaults */
  pthread_mutex_init (&my_mutex, NULL);
  pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

  /* Single threaded sum */
  GlobSum = 0;
  hpcwall(&begtime);
  for(i=0; i<MAX_SIZE;i++) GlobSum = GlobSum + array[i];
  hpcwall(&endtime);
  single = endtime - begtime;
  printf("Single sum=%lf time=%lf\n",GlobSum,single);

  /* Use different numbers of threads to accomplish the same thing */
  for(ThreadCount=2;ThreadCount<=MAX_THREAD; ThreadCount++) {
    printf("Threads=%d\n",ThreadCount);
    GlobSum = 0;
    hpcwall(&begtime);
    for(i=0;i<ThreadCount;i++) {
      index[i] = i;
      retval = pthread_create(&tid,&attr,SumFunc,(void *) index[i]);
      thread_id[i] = tid;
    }
    for(i=0;i<ThreadCount;i++) retval = pthread_join(thread_id[i],NULL);
    hpcwall(&endtime);
    multi = endtime - begtime;
    printf("Sum=%lf time=%lf\n",GlobSum,multi);
    printf("Efficiency = %lf\n",single/(multi*ThreadCount));
  }   /* End of the ThreadCount loop */
}

void *SumFunc(void *parm){
  int i,me,chunk,start,end;
  double LocSum;

  /* Decide which iterations belong to me */
  me = (int) parm;
  chunk = MAX_SIZE / ThreadCount;
  start = me * chunk;
  end = start + chunk;    /* C-Style - actual element + 1 */
  if ( me == (ThreadCount-1) ) end = MAX_SIZE;
  printf("SumFunc me=%d start=%d end=%d\n",me,start,end);

  /* Compute sum of our subset */
  LocSum = 0;
  for(i=start;i<end;i++ ) LocSum = LocSum + array[i];

  /* Update the global sum and return to the waiting join */
  pthread_mutex_lock (&my_mutex);
  GlobSum = GlobSum + LocSum;
  pthread_mutex_unlock (&my_mutex);
}

22 This content is available online at <http://cnx.org/content/m32804/1.2/>.

First, the code performs the sum using a single thread using a for-loop. Then for each of the parallel sums, it creates the appropriate number of threads that call SumFunc( ). Each thread starts in SumFunc( ) and initially chooses an area to operate on in the shared array. The "strip" is chosen by dividing the overall array up evenly among the threads with the last thread getting a few extra if the division has a remainder.
Then, each thread independently performs the sum on its area. When a thread has finished its computation, it uses a mutex to update the global sum variable with its contribution to the global sum:

recs % addup
Single sum=7999998000000.000000 time=0.256624
Threads=2
SumFunc me=0 start=0 end=2000000
SumFunc me=1 start=2000000 end=4000000
Sum=7999998000000.000000 time=0.133530
Efficiency = 0.960923
Threads=3
SumFunc me=0 start=0 end=1333333
SumFunc me=1 start=1333333 end=2666666
SumFunc me=2 start=2666666 end=4000000
Sum=7999998000000.000000 time=0.091018
Efficiency = 0.939829
Threads=4
SumFunc me=0 start=0 end=1000000
SumFunc me=1 start=1000000 end=2000000
SumFunc me=2 start=2000000 end=3000000
SumFunc me=3 start=3000000 end=4000000
Sum=7999998000000.000000 time=0.107473
Efficiency = 0.596950
recs %

There are some interesting patterns. Before you interpret the patterns, you must know that this system is a three-processor Sun Enterprise 3000. Note that as we go from one to two threads, the time is reduced to one-half. That is a good result given how much it costs for that extra CPU. We characterize how well the additional resources have been used by computing an efficiency factor that should be 1.0. This is computed by multiplying the wall time by the number of threads. Then the time it takes on a single processor is divided by this number. If you are using the extra processors well, this evaluates to 1.0. If the extra processors are used pretty well, this would be about 0.9. If you had two threads, and the computation did not speed up at all, you would get 0.5.
At two and three threads, wall time is dropping, and the efficiency is well over 0.9. However, at four threads, the wall time increases, and our efficiency drops very dramatically. This is because we now have more threads than processors. Even though we have four threads that could execute, they must be time-sliced between three processors.23 This is even worse than it might seem. As threads are switched, they move from processor to processor and their caches must also move from processor to processor, further slowing performance. This cache-thrashing effect is not too apparent in this example because the data structure is so large, most memory references are not to values previously in cache.
It's important to note that because of the nature of floating-point (see Section 1.2.1), the parallel sum may not be the same as the serial sum. To perform a summation in parallel, you must be willing to tolerate these slight variations in your results.
is important to match the number of runnable threads to the available resources. In compute code, when there are more threads than available processors, the threads compete among themselves, causing unnecessary overhead and reducing the eciency of your computation.

ITW

3.2.6 Closing Notes24


As they drop in price, multiprocessor systems are becoming far more common. These systems have many attractive features, including good price/performance, compatibility with workstations, large memories, high throughput, large shared memories, fast I/O, and many others. While these systems are strong in multiprogrammed server roles, they are also an affordable high performance computing resource for many organizations. Their cache-coherent shared-memory model allows multithreaded applications to be easily developed.
We have also examined some of the software paradigms that must be used to develop multithreaded applications. While you hopefully will never have to write C code with explicit threads like the examples in this chapter, it is nice to understand the fundamental operations at work on these multiprocessor systems. Using the FORTRAN language with an automatic parallelizing compiler, we have the advantage that these and many more details are left to the FORTRAN compiler and runtime library. At some point, especially on the most advanced architectures, you may have to explicitly program a multithreaded program using the types of techniques shown in this chapter.
One trend that has been predicted for some time is that we will begin to see multiple cache-coherent CPUs on a single chip once the ability to increase the clock rate on a single chip slows down. Imagine that your new $2000 workstation has four 1-GHz processors on a single chip. Sounds like a good time to learn how to write multithreaded programs!

3.2.7 Exercises25
Exercise 3.4
Experiment with the fork code in this chapter. Run the program multiple times and see how the order of the messages changes. Explain the results.
Exercise 3.5
Experiment with the create1 and create3 codes in this chapter. Remove all of the sleep( ) calls. Execute the programs several times on single and multiprocessor systems. Can you explain why the output changes from run to run in some situations and doesn't change in others?

Exercise 3.6

ixperiment with the prllel sum ode in this hpterF sn the umpun@ A routineD hnge the forEloop toX

for@iastrtYi<endYiCC A qloum a qloum C rryiY


emove the three lines t the end tht get the mutex nd updte the qloumF ixeute the odeF ixplin the di'erene in vlues tht you see for qloumF ere the ptterns di'erent on single proessor nd multiproessorc ixplin the performne impt on single proessor nd multiproessorF

Exercise 3.7

ixplin how the following ode segment ould use dedlok " two or more proesses witing for resoure tht n9t e relinquishedX

FFF ll ll FFF ll ll F
24 This 25 This

lok @lwordIA lok @lwordPA unlok @lwordIA unlok @lwordPA

content is available online at <http://cnx.org/content/m32807/1.2/>. content is available online at <http://cnx.org/content/m32810/1.2/>.

IUH

CHAPTER 3. SHARED-MEMORY PARALLEL PROCESSORS


F F ll lok @lwordPA ll lok @lwordIA FFF ll unlok @lwordPA ll unlok @lwordIA FFF

Exercise 3.8

sf you were to ode the funtionlity of spinElok in gD it might look like thisX

while @3lokwordAY lokword a 3lokwordY


es you know from the (rst setions of the ookD the sme sttements would e ompiled into expliit lods nd storesD omprisonD nd rnhF here9s dnger tht two proesses ould eh lod lokwordD (nd it unsetD nd ontinue on s if they owned the lok @we hve re onditionAF his suggests tht spinEloks re implemented di'erently " tht they9re not merely the two lines of g oveF row do you suppose they re implementedc

3.3 Programming Shared-Memory Multiprocessors


3.3.1 Introduction26
sn etion QFPFID we exmined the hrdwre used to implement shredEmemory prllel proessors nd the softwre environment for progrmmer who is using threds expliitlyF sn this hpterD we view these proessors from simpler vntge pointF hen progrmming these systems in pyexD you hve the dvntge of the ompiler9s support of these systemsF et the top end of ese of useD we n simply dd )g or two on the ompiltion of our wellEwritten odeD set n environment vrileD nd voilD we re exeuting in prllelF sf you wnt some more ontrolD you n dd diretives to prtiulr loops where you know etter thn the ompiler how the loop should e exeutedF27 pirst we exmine how wellEwritten loops n ene(t from utomti prllelismF hen we will look t the types of diretives you n dd to your progrm to ssist the ompiler in generting prllel odeF hile this hpter refers to running your ode in prllelD most of the tehniques pply to the vetorEproessor superomputers s wellF

3.3.2 Automatic Parallelization28


o fr in the ookD we9ve overed the tough things you need to know to do prllel proessingF et this pointD ssuming tht your loops re lenD they use unit strideD nd the itertions n ll e done in prllelD ll you
26 This content is available online at <http://cnx.org/content/m32812/1.2/>. 27 If you have skipped all the other chapters in the book and jumped to this one,
don't be surprised if some of the terminology

is unfamiliar. While all those chapters seemed to contain endless boring detail, they did contain some basic terminology. So those of us who read all those chapters have some common terminology needed for this chapter. If you don't go back and read all the chapters, don't complain about the big words we keep using in this chapter!

28 This

content is available online at <http://cnx.org/content/m32821/1.2/>.

IUI hve to do is turn on ompiler )g nd uy good prllel proessorF por exmpleD look t the following odeX

eewii@xsiaQHHDxaIHHHHHHA ievBV e@xAD@xADf@xADg hy sswiaIDxsi hy saIDx e@sA a @sA C f@sA B g ixhhy gevv reii@eDDfDgA ixhhy
rere we hve n itertive ode tht stis(es ll the riteri for good prllel loopF yn good prllel proessor with modern ompilerD you re two )gs wy from exeuting in prllelF yn un olris systemsD the utopr )g turns on the utomti prlleliztionD nd the loopinfo )g uses the ompiler to desrie the prtiulr optimiztion performed for eh loopF o ompile this ode under olrisD you simply dd these )gs to your fUU llX

iTHHHX fUU EyQ Eutopr Eloopinfo Eo dxpy dxpyFf dxpyFfX 4dxpyFf4D line TX not prllelizedD ll my e unsfe 4dxpyFf4D line VX eevvivsih iTHHHX GinGtime dxpy rel user sys iTHHHX QHFW QHFU HFI

sf you simply run the odeD it9s exeuted using one thredF roweverD the ode is enled for prllel proessing for those loops tht n e exeuted in prllelF o exeute the ode in prllelD you need to set the xs environment to the numer of prllel threds you wish to use to exeute the odeF yn olrisD this is done using the eevviv vrileX

iTHHHX setenv eevviv I iTHHHX GinGtime dxpy rel QHFW user QHFU sys HFI iTHHHX setenv eevviv P iTHHHX GinGtime dxpy

IUP

CHAPTER 3. SHARED-MEMORY PARALLEL PROCESSORS


rel ISFT user QIFH sys HFP iTHHHX setenv eevviv R iTHHHX GinGtime dxpy rel VFP user QPFH sys HFS iTHHHX setenv eevviv V iTHHHX GinGtime dxpy rel user sys RFQ QQFH HFV

Speedup is the term used to pture how muh fster the jo runs using x proessors ompred to the
performne on one proessorF st is omputed y dividing the single proessor time y the multiproessor time for eh numer of proessorsF pigure QFIV @smproving performne y dding proessorsA shows the wll time nd speedup for this pplitionF

Improving performance by adding processors

Figure 3.18

pigure QFIW @sdel nd tul performne improvementA shows this informtion grphillyD plotting speedup versus the numer of proessorsF

IUQ

Ideal and actual performance improvement

Figure 3.19

xote tht for while we get nerly perfet speedupD ut we egin to see mesurle drop in speedup t four nd eight proessorsF here re severl uses for thisF sn ll prllel pplitionsD there is some portion of the ode tht n9t run in prllelF huring those nonprllel timesD the other proessors re witing for work nd ren9t ontriuting to e0ienyF his nonprllel ode egins to 'et the overll performne s more proessors re dded to the pplitionF o you syD this is more like it3 nd immeditely try to run with IP nd IT thredsF xowD we see the grph in pigure QFPI @himinishing returnsA nd the dt from pigure QFPH @snresing the numer of thredsAF

IUR

CHAPTER 3. SHARED-MEMORY PARALLEL PROCESSORS


Increasing the number of threads

Figure 3.20

Diminishing returns

Figure 3.21

IUS ht hs hppened herec hings were going so wellD nd then they slowed downF e re running this progrm on ITEproessor systemD nd there re eight other tive thredsD s indited elowX

iTHHHXuptime RXHHpm up IW dy@sAD QU min@sAD S usersD lod vergeX VFHHD VFHSD VFIR iTHHHX
yne we pss eight thredsD there re no ville proessors for our thredsF o the threds must e timeE shred etween the proessorsD signi(ntly slowing the overll opertionF fy the endD we re exeuting IT threds on eight proessorsD nd our performne is slower thn with one thredF o it is importnt tht you don9t rete too mny threds in these types of pplitionsF

3.3.2.1 Compiler Considerations


smproving performne y turning on utomti prlleliztion is n exmple of the smrter ompiler we disussed in erlier hptersF he ddition of single ompiler )g hs triggered gret del of nlysis on the prt of the ompiler inludingX

hih loops n exeute in prllelD produing the ext sme results s the sequentil exeutions of the loopsc his is done y heking for dependenies tht spn itertionsF e loop with no interitertion dependenies is lled hyevv loopF hih loops re worth exeuting in prllelc qenerlly very short loops gin no ene(t nd my exeute more slowly when exeuting in prllelF es with loop unrollingD prllelism lwys hs ostF st is est used when the ene(t fr outweighs the ostF sn loop nestD whih loop is the est ndidte to e prllelizedc qenerlly the est performne ours when we prllelize the outermost loop of loop nestF his wy the overhed ssoited with eginning prllel loop is mortized over longer prllel loop durtionF gn nd should the loop nest e interhngedc he ompiler my detet tht the loops in nest n e done in ny orderF yne order my work very well for prllel ode while giving poor memory performneF enother order my give unit stride ut perform poorly with multiple thredsF he ompiler must nlyze the ostGene(t of eh pproh nd mke the est hoieF row do we rek up the itertions mong the threds exeuting prllel loopc ere the itertions short with uniform durtionD or long with wide vrition of exeution timec e will see tht there re numer of di'erent wys to omplish thisF hen the progrmmer hs given no guidneD the ompiler must mke n eduted guessF
iven though it seems omplitedD the ompiler n do surprisingly good jo on wide vriety of odesF st is not mgiD howeverF por exmpleD in the following ode we hve loopErried )ow dependenyX

yqew hi eewii@xsiaQHHDxaIHHHHHHA ievBR e@xA hy sswiaIDxsi gevv reii@eA hy saPDx

IUT

CHAPTER 3. SHARED-MEMORY PARALLEL PROCESSORS


e@sA a e@sEIA C e@sA B g ixhhy ixhhy ixh

hen we ompile the odeD the ompiler gives us the following messgeX

iTHHHX fUU EyQ Eutopr Eloopinfo Eo dep depFf depFfX 4depFf4D line TX not prllelizedD ll my e unsfe 4depFf4D line VX not prllelizedD unsfe dependene @A iTHHHX
he ompiler throws its hnds up in despirD nd lets you know tht the loop t vine V hd n unsfe dependeneD nd so it won9t utomtilly prllelize the loopF hen the ode is exeuted elowD dding thred does not 'et the exeution performneX

iTHHHXsetenv eevviv I iTHHHXGinGtime dep rel IVFI user IVFI sys HFH iTHHHXsetenv eevviv P iTHHHXGinGtime dep rel user sys iTHHHX IVFQ IVFP HFH

e typil pplition hs mny loopsF xot ll the loops re exeuted in prllelF st9s good ide to run pro(le of your pplitionD nd in the routines tht use most of the g timeD hek to (nd out whih loops re not eing prllelizedF ithin loop nestD the ompiler generlly hooses only one loop to exeute in prllelF

3.3.2.2 Other Compiler Flags


sn ddition to the )gs shown oveD you my hve other ompiler )gs ville to you tht pply ross the entire progrmX

ou my hve ompiler )g to enle the utomti prlleliztion of redution opertionsF feuse the order of dditions n 'et the (nl vlue when omputing sum of )otingEpoint numersD the ompiler needs permission to prllelize summtion loopsF

IUU

plgs tht relx the ompline with siii )otingEpoint rules my lso give the ompiler more )exE iility when trying to prllelize loopF roweverD you must e sure tht it9s not using ury prolems in other res of your odeF yften ompiler hs )g lled unsfe optimiztion or ssume no dependeniesF hile this )g my indeed enhne the performne of n pplition with loops tht hve dependeniesD it lmost ertinly produes inorret resultsF
here is some vlue in experimenting with ompiler to see the prtiulr omintion tht will yield good performne ross vriety of pplitionsF hen tht set of ompiler options n e used s strting point when you enounter new pplitionF

3.3.3 Assisting the Compiler29


sf it were ll tht simpleD you wouldn9t need this ookF hile ompilers re extremely leverD there is still lot of wys to improve the performne of your ode without sri(ing its portilityF snsted of onverting the whole progrm to g nd using thred lirryD you n ssist the ompiler y dding ompiler diretives to our soure odeF gompiler diretives re typilly inserted in the form of stylized pyex ommentsF his is done so tht nonprllelizing ompiler n ignore them nd just look t the pyex odeD sns ommentsF his llows to you tune your ode for prllel rhitetures without letting it run dly on wide rnge of singleEproessor systemsF here re two tegories of prllelEproessing ommentsX

essertions wnul prlleliztion diretives


essertions tell the ompiler ertin things tht you s the progrmmer know out the ode tht it might not guess y looking t the odeF hrough the ssertionsD you re ttempting to ssuge the ompiler9s douts out whether or not the loop is eligile for prlleliztionF hen you use diretivesD you re tking full responsiility for the orret exeution of the progrmF ou re telling the ompiler wht to prllelize nd how to do itF ou tke full responsiility for the output of the progrmF sf the progrm produes meningless resultsD you hve no one to lme ut yourselfF

3.3.3.1 Assertions
sn previous exmpleD we ompiled progrm nd reeived the following outputX

iTHHHX fUU EyQ Eutopr Eloopinfo Eo dep depFf depFfX 4depFf4D line TX not prllelizedD ll my e unsfe 4depFf4D line VX not prllelizedD unsfe dependene @A iTHHHX
en uneduted progrmmer who hs not red this ook @or hs not looked t the odeA might exlimD ht unsfe dependeneD s never put one of those in my ode3 nd quikly dd ssertionF his is the essene of n ssertionF snsted of telling the ompiler to simply prllelize the loopD the progrmmer is telling the ompiler tht its onlusion tht there is dependene is inorretF sully the net result is tht the ompiler does indeed prllelize the loopF e will rie)y review the types of ssertions tht re typilly supported y these ompilersF en ssertion is generlly dded to the ode using stylized ommentF

no dependencies

29 This

content is available online at <http://cnx.org/content/m32814/1.2/>.

IUV

CHAPTER 3. SHARED-MEMORY PARALLEL PROCESSORS

3.3.3.1.1 No dependencies
e no dependencies or ignore dependencies diretive tells the ompiler tht referenes don9t overlpF ht isD it tells the ompiler to generte ode tht my exeute inorretly if there are dependeniesF ou9re syingD s know wht s9m doingY it9s yu to overlp referenesF e no dependenies diretive might help the following loopX

hy saIDx e@sA a e@sCuA B f@sA ixhhy


sf you know tht k is greter thn EI or less thn EnD you n get the ompiler to prllelize the loopX

g6ei xyhiixhixgsi hy saIDx e@sA a e@sCuA B f@sA ixhhy


yf ourseD lindly telling the ompiler tht there re no dependenies is presription for dissterF sf k equls EID the exmple ove eomes reursive loopF

3.3.3.1.2 Relations
ou will often see loops tht ontin some potentil dependeniesD mking them d ndidtes for no dependenies diretiveF roweverD you my e le to supply some lol fts out ertin vrilesF his llows prtil prlleliztion without ompromising the resultsF sn the ode elowD there re two potentil dependenies euse of susripts involving k nd jX

for @iaHY i<nY iCCA { i a iCk B iY i a iCj B iY }


erhps we know tht there re no on)its with referenes to i nd iCkF fut mye we ren9t so sure out i nd iCjF hereforeD we n9t sy in generl tht there re no dependeniesF roweverD we my e le to sy something expliit out k @like  k is lwys greter thn EIAD leving j out of itF his informtion out the reltionship of one expression to nother is lled relation assertionF epplying reltion ssertion llows the ompiler to pply its optimiztion to the (rst sttement in the loopD giving us prtil prlleliztionF30 eginD if you supply inurte testimony tht leds the ompiler to mke unsfe optimiztionsD your nswer my e wrongF
30 Notice
that, if you were tuning by hand, you could split this loop into two: one parallelizable and one not.

IUW

3.3.3.1.3 Permutations
es we hve seen elsewhereD when elements of n rry re indiretly ddressedD you hve to worry out whether or not some of the susripts my e repetedF sn the ode elowD re the vlues of u@sA ll uniquec yr re there duplitesc

hy saIDx e@u@sAA a e@u@sAA C f@sA B g ixh hy


sf you know there re no duplites in u @iFeFD tht e@u@sAA is permuttionAD you n inform the ompiler so tht itertions n exeute in prllelF ou supply the informtion using F

permutation assertion

3.3.3.1.4 No equivalences
iquivlened rrys in pyex progrms provide nother hllenge for the ompilerF sf ny elements of two equivlened rrys pper in the sme loopD most ompilers ssume tht referenes ould point to the sme memory storge lotion nd optimize very onservtivelyF his my e true even if it is undntly pprent to you tht there is no overlp whtsoeverF ou inform the ompiler tht referenes to equivlened rrys re sfe with ssertionF yf ourseD if you don9t use equivlenesD this ssertion hs no e'etF

no equivalences

3.3.3.1.5 Trip count


ih loop n e hrterized y n verge numer of itertionsF ome loops re never exeuted or go round just few timesF ythers my go round hundreds of timesX

g6ei sgyx>IHH hy savDx e@sA a f@sA C g@sA ixh hy


our ompiler is going to look t every loop s ndidte for unrolling or prlleliztionF st9s working in the drkD howeverD euse it n9t tell whih loops re importnt nd tries to optimize them llF his n led to the surprising experiene of seeing your runtime go up fter optimiztion3 e provides lue to the ompiler tht helps it deide how muh to unroll loop or when to prllelize loopF31 voops tht ren9t importnt n e identi(ed with low or zero trip ountsF smportnt loops hve high trip ountsF

trip count assertion

3.3.3.1.6 Inline substitution


sf your ompiler supports proedure inliningD you n use diretives nd ommndEline swithes to speify how mny nested levels of proedures you would like to inlineD thresholds for proedure sizeD etF he vendor will hve hosen resonle defultsF
31 The
assertion is made either by hand or from a proler.

IVH

CHAPTER 3. SHARED-MEMORY PARALLEL PROCESSORS

essertions lso let you hoose suroutines tht you think re good ndidtes for inliningF roweverD sujet to its thresholdsD the ompiler my rejet your hoiesF snlining ould expnd the ode so muh tht inresed memory tivity would lim k gins mde y eliminting the proedure llF et higher optimiztion levelsD the ompiler is often ple of mking its own hoies for inlining ndidtesD provided it n (nd the soure ode for the routine under onsidertionF ome ompilers support feture lled F hen this is doneD the ompiler looks ross routine oundries for its dt )ow nlysisF st n perform signi(nt optimiztions ross routine oundriesD inluding utomti inliningD onstnt propgtionD nd othersF

interprocedural analysis

3.3.3.1.7 No side eects


ithout interproedurl nlysisD when looking t loopD if there is suroutine ll in the middle of the loopD the ompiler hs to tret the suroutine s if it will hve the worst possile side e'etsF elsoD it hs to ssume tht there re dependenies tht prevent the routine from exeuting simultneously in two di'erent thredsF wny routines @espeilly funtionsA don9t hve ny side e'ets nd n exeute quite niely in seprte threds euse eh thred hs its own privte ll stk nd lol vrilesF sf the routine is metyD there will e gret del of ene(t in exeuting it in prllelF our omputer my llow you to dd diretive tht tells you if suessive suEroutine lls re indeE pendentX

g6ei xyshiippig hy saIDx gevv fsqpp @eDfDgDsDtDuA ixh hy


iven if the ompiler hs ll the soure odeD use of ommon vriles or equivlenes my msk ll indeE pendeneF

3.3.3.2 Manual Parallelism


et some pointD you get tired of giving the ompiler dvie nd hoping tht it will reh the onlusion to prllelize your loopF et tht point you move into the relm of mnul prllelismF vukily the progrmming model provided in pyex insultes you from muh of the detils of extly how multiple threds re mnged t runtimeF ou generlly ontrol expliit prllelism y dding speilly formtted omment lines to your soure odeF here re wide vriety of formts of these diretivesF sn this setionD we use the syntx tht is prt of the ypenw @see 32 A stndrdF ou generlly (nd similr pilities in eh of the vendor ompilersF he preise syntx vries slightly from vendor to vendorF @ht lone is good reson to hve stndrdFA he si progrmming model is tht you re exeuting setion of ode with either single thred or multiple thredsF he progrmmer dds diretive to summon dditionl threds t vrious points in the odeF he most si onstrut is lled the F

parallel region

3.3.3.2.1 Parallel regions


sn prllel regionD the threds simply pper etween two sttements of strightEline odeF e very trivil exmple might e the following using the ypenw diretive syntxX
32 http://cnx.org/content/m32814/latest/www.openmp.org

IVI

yqew yxi iixev ywqiriehxwD ywqiwerieh sxiqi ywqiriehxwD ywqiwerieh sqvyf a ywqiwerieh@A sx BD9rello here9 g6yw eevviv sei@sewAD reih@sqvyfA sew a ywqiriehxw@A sx BD 9s m 9D sewD 9 of 9D sqvyf g6yw ixh eevviv sx BD9ell hone9 ixh
he g6yw is the sentinel tht indites tht this is diretive nd not just nother ommentF he output of the progrm when run looks s followsX

7 setenv ywxwrieh R 7 Fout rello here s m H of R s m Q of R s m I of R s m P of R ell hone 7


ixeution egins with single thredF es the progrm enounters the eevviv diretiveD the other threds re tivted to join the omputtionF o in senseD s exeution psses the (rst diretiveD one thred eomes fourF pour threds exeute the two sttements etween the diretivesF es the threds re exeuting independentlyD the order in whih the print sttements re displyed is somewht rndomF he threds wit t the ixh eevviv diretive until ll threds hve rrivedF yne ll threds hve ompleted the prllel regionD single thred ontinues exeuting the reminder of the progrmF sn pigure QFPP @ht intertions during prllel regionAD the sei@sewA indites tht the sew vrile is not shred ross ll the threds ut instedD eh thred hs its own privte version of the vrileF he sqvyf vrile is shred ross ll the thredsF eny modi(tion of sqvyf ppers in ll the other threds instntlyD within the limittions of the he oherenyF

IVP

CHAPTER 3. SHARED-MEMORY PARALLEL PROCESSORS


Data interactions during a parallel region

Figure 3.22

huring the prllel regionD the progrmmer typilly divides the work mong the thredsF his pttern of going from singleEthreded to multithreded exeution my e repeted mny times throughout the exeution of n pplitionF feuse input nd output re generlly not thredEsfeD to e ompletely orretD we should indite tht the print sttement in the prllel setion is only to e exeuted on one proessor t ny one timeF e use diretive to indite tht this setion of ode is ritil setionF e lok or other synhroniztion mehnism ensures tht no more thn one proessor is exeuting the sttements in the ritil setion t ny one timeX

g6yw gssgev sx BD 9s m 9D sewD 9 of 9D sqvyf g6yw ixh gssgev

3.3.3.2.2 Parallel loops


uite often the res of the ode tht re most vlule to exeute in prllel re loopsF gonsider the following loopX

IVQ

hy saIDIHHHHHH wI a @ e@sA BB P A C @ f@sA BB P A wP a @wIA f@sA a wP ixhhy


o mnully prllelize this loopD we insert diretive t the eginning of the loopX

g6yw eevviv hy hy saIDIHHHHHH wI a @ e@sA BB P A C @ f@sA BB P A wP a @wIA f@sA a wP ixhhy g6yw ixh eevviv hy
hen this sttement is enountered t runtimeD the single thred gin summons the other threds to join the omputtionF roweverD efore the threds n strt working on the loopD there re few detils tht must e hndledF he eevviv hy diretive epts the dt lssi(tion nd soping luses s in the prllel setion diretive erlierF e must indite whih vriles re shred ross ll threds nd whih vriles hve seprte opy in eh thredF st would e disster to hve wI nd wP shred ross thredsF es one thred tkes the squre root of wID nother thred would e resetting the ontents of wIF e@sA nd f@sA ome from outside the loopD so they must e shredF e need to ugment the diretive s followsX

g6yw eevviv hy reih@eDfA sei@sDwIDwPA hy saIDIHHHHHH wI a @ e@sA BB P A C @ f@sA BB P A wP a @wIA f@sA a wP ixhhy g6yw ixh eevviv hy
he itertion vrile s lso must e thredEprivte vrileF es the di'erent threds inrement their wy through their prtiulr suset of the rrysD they don9t wnt to e modifying glol vlue for sF here re numer of other options s to how dt will e operted on ross the thredsF his summrizes some of the other dt semntis villeX

Firstprivate: hese re thredEprivte vriles tht tke n initil vlue from the glol vrile of the Lastprivate: hese re thredEprivte vriles exept tht the thred tht exeutes the lst itertion of
the loop opies its vlue k into the glol vrile of the sme nmeF sme nme immeditely efore the loop egins exeutingF

IVR

CHAPTER 3. SHARED-MEMORY PARALLEL PROCESSORS


prllelF his is done y forming prtil redution using lol vrile in eh thred nd then omining the prtil results t the end of the loopF

Reduction: his indites tht vrile prtiiptes in redution opertion tht n e sfely done in

ih vendor my hve di'erent terms to indite these dt semntisD ut most support ll of these ommon semntisF pigure QFPQ @riles during prllel regionA shows how the di'erent types of dt semntis operteF xow tht we hve the dt environment set up for the loopD the only remining prolem tht must e solved is whih threds will perform whih itertionsF st turns out tht this is not trivil tskD nd wrong hoie n hve signi(nt negtive impt on our overll performneF

3.3.3.2.3 Iteration scheduling


here re two si tehniques @long with few vritionsA for dividing the itertions in loop etween thredsF e n look t two extreme exmples to get n ide of how this worksX

g igy ehh hy syfaIDIHHHH e@syfA a f@syfA C g@syfA ixhhy g esgvi egusxq hy syfaIDIHHHH exev a exh@syfA gevv sieiixiq@exevA ixhhy ixhhy

IVS

Variables during a parallel region

Figure 3.23

sn oth loopsD ll the omputtions re independentD so if there were IHDHHH proessorsD eh proessor ould exeute single itertionF sn the vetorEdd exmpleD eh itertion would e reltively shortD nd the exeution time would e reltively onstnt from itertion to itertionF sn the prtile trking exmpleD eh itertion hooses rndom numer for n initil prtile position nd itertes to (nd the minimum energyF ih itertion tkes reltively long time to ompleteD nd there will e wide vrition of ompletion times from itertion to itertionF hese two exmples re e'etively the ends of ontinuous spetrum of the itertion sheduling hllenges fing the pyex prllel runtime environmentX

Static

et the eginning of prllel loopD eh thred tkes (xed ontinuous portion of itertions of the loop sed on the numer of threds exeuting the loopF

Dynamic

ith dynmi shedulingD eh thred proesses hunk of dt nd when it hs ompleted proessingD new hunk is proessedF he hunk size n e vried y the progrmmerD ut is (xed for the durtion of the loopF hese two exmple loops n show how these itertion sheduling pprohes might operte when exE

IVT

CHAPTER 3. SHARED-MEMORY PARALLEL PROCESSORS

euting with four thredsF sn the vetorEdd loopD stti sheduling would distriute itertions I!PSHH to hred HD PSHI!SHHH to hred ID SHHI!USHH to hred PD nd USHI!IHHHH to hred QF sn pigure QFPR @stertion ssignment for stti shedulingAD the mpping of itertions to threds is shown for the stti sheduling optionF

Iteration assignment for static scheduling

Figure 3.24

ine the loop ody @ single sttementA is short with onsistent exeution timeD stti sheduling should result in roughly the sme mount of overll work @nd time if you ssume dedited g for eh thredA ssigned to eh thred per loop exeutionF en dvntge of stti sheduling my our if the entire loop is exeuted repetedlyF sf the sme itertions re ssigned to the sme threds tht hppen to e running on the sme proessorsD the he might tully ontin the vlues for eD fD nd g from the previous loop exeutionF33 he runtime pseudoEode for stti sheduling in the (rst loop might look s followsX

g igy ehh E tti heduled se a @riehxwfi B PSHH A C I sixh a se C PRWW hy svygev a seDsixh e@svygevA a f@svygevA C g@svygevA ixhhy
st9s not lwys good strtegy to use the stti pproh of giving (xed numer of itertions to eh thredF sf this is used in the seond loop exmpleD long nd vrying itertion times would result in poor lod
33 The
operating system and runtime library actually go to some lengths to try to make this happen. This is another reason not to have more threads than available processors, which causes unnecessary context switching.

IVU lningF e etter pproh is to hve eh proessor simply get the next vlue for syf eh time t the top of the loopF ht pproh is lled D nd it n dpt to widely vrying itertion timesF sn pigure QFPS @stertion ssignment in dynmi shedulingAD the mpping of itertions to proessors using dynmi sheduling is shownF es soon s proessor (nishes one itertionD it proesses the next ville itertion in orderF

dynamic scheduling

Iteration assignment in dynamic scheduling

Figure 3.25

sf loop is exeuted repetedlyD the ssignment of itertions to threds my vry due to sutle timing issues tht 'et thredsF he pseudoEode for the dynmi sheduled loop t runtime is s followsX

g esgvi egusxq E hynmi heduled syf a H rsvi @syf <a IHHHH A fiqsxgssgevigsyx syf a syf C I svygev a syf ixhgssgevigsyx exev a exh@svygevA gevv sieiixiq@exevA ixhrsvi svygev is used so tht eh thred knows whih itertion is urrently proessingF he syf vlue is ltered y the next thred exeuting the ritil setionF hile the dynmi itertion sheduling pproh works well for this prtiulr loopD there is signi(nt negtive performne impt if the progrmmer were to use the wrong pproh for loopF por exmpleD if the dynmi pproh were used for the vetorEdd loopD the time to proess the ritil setion to determine whih itertion to proess my e lrger thn the time to tully proess the itertionF purthermoreD ny

IVV

CHAPTER 3. SHARED-MEMORY PARALLEL PROCESSORS

he 0nity of the dt would e e'etively lost euse of the virtully rndom ssignment of itertions to proessorsF sn etween these two pprohes re wide vriety of tehniques tht operte on hunk of itertionsF sn some tehniques the hunk size is (xedD nd in others it vries during the exeution of the loopF sn this pprohD hunk of itertions re gred eh time the ritil setion is exeutedF his redues the sheduling overhedD ut n hve prolems in produing lned exeution time for eh proessorF he runtime is modi(ed s follows to perform the prtile trking loop exmple using hunk size of IHHX

syf a I grxusi a IHH rsvi @syf <a IHHHH A fiqsxgssgevigsyx se a syf syf a syf C grxusi ixhgssgevigsyx hy svygev a seDseCgrxusiEI exev a exh@svygevA gevv sieiixiq@exevA ixhhy ixhrsvi
he hoie of hunk size is ompromise etween overhed nd termintion imlneF ypilly the progrmmer must get involved through diretives in order to ontrol hunk sizeF rt of the hllenge of itertion distriution is to lne the ost @or existeneA of the ritil setion ginst the mount of work done per invotion of the ritil setionF sn the idel worldD the ritil setion would e freeD nd ll sheduling would e done dynmillyF rllelGvetor superomputers with hrdwre ssistne for lod lning n nerly hieve the idel using dynmi pprohes with reltively smll hunk sizeF feuse the hoie of loop itertion pproh is so importntD the ompiler relies on diretives from the progrmmer to speify whih pproh to useF he following exmple shows how we n request the proper itertion sheduling for our loopsX

g igy ehh g6yw eevviv hy sei@syfA reih@eDfDgA grihvi@esgA hy syfaIDIHHHH e@syfA a f@syfA C g@syfA ixhhy g6yw ixh eevviv hy g esgvi egusxq g6yw eevviv hy sei@syfDexevA grihvi@hxewsgA hy syfaIDIHHHH exev a exh@syfA gevv sieiixiq@exevA ixhhy g6yw ixh eevviv hy

IVW

3.3.4 Closing Notes34


sing dt )ow nlysis nd other tehniquesD modern ompilers n peer through the lutter tht we progrmmers innoently put into our ode nd see the ptterns of the tul omputtionsF sn the (eld of high performne omputingD hving gret prllel hrdwre nd lousy utomti prllelizing ompiler generlly results in no slesF oo mny of the enhmrk rules llow only few ompiler options to e setF hysiists nd hemists re interested in physis nd hemistryD not omputer sieneF sf it tkes I hour to exeute hemistry ode without modi(tions nd fter six weeks of modi(tions the sme ode exeutes in PH minutesD whih is etterc ell from hemist9s point of viewD one took n hourD nd the other took IHHV hours nd PH minutesD so the nswer is oviousF35 elthough if the progrm were going to e exeuted thousnds of timesD the tuning might e win for the progrmmerF he nswer is even more ovious if it gin tkes six weeks to tune the progrm every time you mke modi(tion to the progrmF sn some wysD ssertions hve eome less populr thn diretivesF his is due to two ftorsX @IA ompilers re getting etter t deteting prllelism even if they hve to rewrite some ode to do soD nd @PA there re two kinds of progrmmersX those who know extly how to prllelize their odes nd those who turn on the sfe utoEprllelize )gs on their odesF essertions fll in the middle groundD somewhere etween where the progrmmer does not wnt to ontrol ll the detils ut kind of feels tht the loop n e prllelizedF ou n get online doumenttion of the ypenw syntx used in these exmples t wwwFopenmpForg36 F

3.3.5 Exercises37
Exercise 3.9
ke sttiD highly prllel progrm with reltive lrge inner loopF gompile the pplition for prllel exeutionF ixeute the pplition inresing the thredsF ixmine the ehvior when the numer of threds exeed the ville proessorsF ee if di'erent itertion sheduling pprohes mke di'ereneF

Exercise 3.10

ke the following loop nd exeute with severl di'erent itertion sheduling hoiesF por hunkE sed shedulingD use lrge hunk sizeD perhps IHHDHHHF ee if ny pproh performs etter thn stti shedulingX

hy saIDRHHHHHH e@sA a f@sA B PFQR ixhhy

Exercise 3.11

ixeute the following loop for rnge of vlues for x from I to IT millionX

hy saIDx
34 This content is available online at <http://cnx.org/content/m32820/1.2/>. 35 On the other hand, if the person is a computer scientist, improving the performance 36 http://cnx.org/content/m32820/latest/www.openmp.org 37 This content is available online at <http://cnx.org/content/m32819/1.2/>.
might result in anything from a poster

session at a conference to a journal article! This makes for lots of intra-departmental masters degree projects.

IWH

CHAPTER 3. SHARED-MEMORY PARALLEL PROCESSORS


e@sA a f@sA B PFQR ixhhy
un the loop in single proessorF hen fore the loop to run in prllelF et wht point do you get etter performne on multiple proessorsc ho the numer of threds 'et your oservtionsc

Exercise 3.12

se n expliit prlleliztion diretive to exeute the following loop in prllel with hunk size of IX

t a H g6yw eevviv hy sei@sA reih@tA grihvi@hxewsgA hy saIDIHHHHHH t a t C I ixhhy sx BD t g6yw ixh eevviv hy
ixeute the loop with vrying numer of thredsD inluding oneF elso ompile nd exeute the ode in serilF gompre the output nd exeution timesF ht do the results tell you out he oherenyc eout the ost of moving dt from one he to notherD nd out ritil setion ostsc

Chapter 4
Scalable Parallel Processing

4.1 Language Support for Performance


4.1.1 Introduction1
his hpter disusses the progrmming lnguges tht re used on the lrgest prllel proessing systemsF sully when you re fed with porting nd tuning your ode on new slle rhiteture rhitetureD you hve to sit k nd think out your pplition for momentF ometimes fundmentl hnges to your lgorithm re needed efore you n egin to work on the new rhitetureF hon9t e surprised if you need to rewrite ll or portions of the pplition in one of these lngugesF wodi(tions on one system my not give performne ene(t on nother systemF fut if the pplition is importnt enoughD it9s worth the e'ort to improve its performneF sn this hpterD we overX

pyex WH rpX righ erformne pyex


hese lnguges re designed for use on highEend omputing systemsF e will follow simple progrm through eh of these lngugesD using simple (niteEdi'erene omputtion tht roughly models het )owF st9s lssi prolem tht ontins gret del of prllelism nd is esily solved on wide vriety of prllel rhiteturesF e introdue nd disuss the onept of single progrm multiple dt @whA in tht we tret wswh omputers s swh omputersF e write our pplitions s if lrge swh system were going to solve the prolemF snsted of tully using swh systemD the resulting pplition is ompiled for wswh systemF he impliit synhroniztion of the swh systems is repled y expliit synhroniztion t runtime on the wswh systemsF

4.1.2 Data-Parallel Problem: Heat Flow2


e lssi prolem tht explores slle prllel proessing is the het )ow prolemF he physis ehind this prolem lie in prtil di'erentil equtionsF e will strt with oneEdimensionl metl plte @lso known s rodAD nd move to twoEdimensionl plte in lter exmplesF e strt with rod tht is t zero degrees elsiusF hen we ple one end in IHH degree stem nd the other end in zero degree ieF e wnt to simulte how the het )ows from one end to notherF end the resulting tempertures long points on the metl rod fter the temperture hs stilizedF
1 This 2 This
content is available online at <http://cnx.org/content/m33744/1.2/>. content is available online at <http://cnx.org/content/m33751/1.2/>.

IWI

IWP

CHAPTER 4. SCALABLE PARALLEL PROCESSING

o do this we rek the rod into IH segments nd trk the temperture over time for eh segmentF sntuitivelyD within time stepD the next temperture of portion of the plte is n verge of the surrounding temperturesF qiven (xed tempertures t some points in the rodD the tempertures eventully onverge to stedy stte fter su0ient time stepsF pigure RFI @ret )ow in rodA shows the setup t the eginning of the simultionF

Heat ow in a rod

Figure 4.1

e simplisti implementtion of this is s followsX

IHH

yqew rieyh eewii@weswiaPHHA sxiqi sguDsDweswi ievBR yh@IHA yh@IA a IHHFH hy saPDW yh@sA a HFH ixhhy yh@IHA a HFH hy sguaIDweswi sp @ wyh@sguDPHA FiF I A sx IHHDsguD@yh@sADsaIDIHA hy saPDW yh@sA a @yh@sEIA C yh@sCIA A G P ixhhy ixhhy pywe@sRDIHpUFPA ixh

he output of this progrm is s followsX

IWQ

7 fUU hetrodFf hetrodFfX wesx hetrodX 7 Fout I IHHFHH HFHH PI IHHFHH VUFHR RI IHHFHH VVFUR TI IHHFHH VVFVV VI IHHFHH VVFVW IHI IHHFHH VVFVW IPI IHHFHH VVFVW IRI IHHFHH VVFVW ITI IHHFHH VVFVW IVI IHHFHH VVFVW 7

HFHH URFSP UUFSI UUFUT UUFUV UUFUV UUFUV UUFUV UUFUV UUFUV

HFHH TPFSR TTFQP TTFTR TTFTT TTFTU TTFTU TTFTU TTFTU TTFTU

HFHH SIFIS SSFIW SSFSQ SSFSS SSFST SSFST SSFST SSFST SSFST

HFHH RHFQH RRFIH RRFRP RRFRR RRFRR RRFRR RRFRR RRFRR RRFRR

HFHH PWFWI QQFHS QQFQI QQFQQ QQFQQ QQFQQ QQFQQ QQFQQ QQFQQ

HFHH IWFVQ PPFHP PPFPI PPFPP PPFPP PPFPP PPFPP PPFPP PPFPP

HFHH WFWP IIFHI IIFIH IIFII IIFII IIFII IIFII IIFII IIFII

HFHH HFHH HFHH HFHH HFHH HFHH HFHH HFHH HFHH HFHH

glerlyD y ime step IHID the simultion hs onverged to two deiml ples of ury s the numers hve stopped hngingF his should e the stedyEstte pproximtion of the temperture t the enter of eh segment of the rF xowD t this pointD stute reders re sying to themselvesD 4mD don9t look nowD ut tht loop hs )ow dependenyF4 ou would lso lim tht this won9t even prllelize little itF st is so d you n9t even unroll the loop for little instrutionElevel prllelism3 e person fmilir with the theory of het )ow will lso point out tht the ove loop doesn9t implement the het )ow modelF he prolem is tht the vlues on the right side of the ssignment in the yh loop re supposed to e from the previous time stepD nd tht the vlue on the left side is the next time stepF feuse of the wy the loop is writtenD the yh@sEIA vlue is from the next time stepD s shown in pF IWQF his n e solved using tehnique lled D where we lternte etween two rrysF pigure RFQ @sing two rrys to eliminte dependenyA shows how the redElk version of the omputtion opertesF his kills two irds with one stone3 xow the mthemtis is preisely orretD there is no reurreneF ounds like rel winEwin situtionF

exactly

red-black

and

Computing the new value for a cell

Figure 4.2

IWR

CHAPTER 4. SCALABLE PARALLEL PROCESSING


Using two arrays to eliminate a dependency

Figure 4.3

he only downside to this pproh is tht it tkes twie the memory storge nd twie the memory ndwidthF3 he modi(ed ode is s followsX

yqew rieih eewii@weswiaPHHA sxiqi sguDsDweswi ievBR ih@IHADfvegu@IHA ih@IA a IHHFH fvegu@IA a IHHFH hy saPDW ih@sA a HFH ixhhy ih@IHA a HFH fvegu@IHA a HFH hy sguaIDweswiDP sp @ wyh@sguDPHA FiF I A sx IHHDsguD@ih@sADsaIDIHA
3 There
passes. is another red-black approach that computes rst the even elements and then the odd elements of the rod in two The ROD array never has all the values from the same This approach has no data dependencies within each pass.

time step. Either the odd or even values are one time step ahead of the other. It ends up with a stride of two and doubles the bandwidth but does not double the memory storage required to solve the problem.

IWS

IHH

hy saPDW fvegu@sA a @ih@sEIA C ih@sCIA A G P ixhhy hy saPDW ih@sA a @fvegu@sEIA C fvegu@sCIA A G P ixhhy ixhhy pywe@sRDIHpUFPA ixh

he output for the modi(ed progrm isX

7 fUU hetredFf hetredFfX wesx hetredX 7 Fout I IHHFHH HFHH PI IHHFHH VPFQV RI IHHFHH VUFHR TI IHHFHH VVFQT VI IHHFHH VVFUR IHI IHHFHH VVFVR IPI IHHFHH VVFVV IRI IHHFHH VVFVW ITI IHHFHH VVFVW IVI IHHFHH VVFVW 7

HFHH TTFQR URFSP UTFVR UUFSI UUFUH UUFUT UUFUU UUFUV UUFUV

HFHH SHFQH TIFWW TSFQP TTFPV TTFSS TTFTQ TTFTT TTFTT TTFTU

HFHH QVFIV SHFST SRFIP SSFIR SSFRR SSFSP SSFSS SSFSS SSFSS

HFHH PTFHT QWFIQ RPFWI RRFHH RRFQP RRFRI RRFRQ RRFRR RRFRR

HFHH IVFPH PVFWR QPFHU QPFWU QQFPQ QQFQH QQFQP QQFQQ QQFQQ

HFHH IHFQS IVFUS PIFPP PIFWQ PPFIR PPFPH PPFPP PPFPP PPFPP

HFHH SFIV WFQV IHFTI IHFWU IIFHU IIFIH IIFII IIFII IIFII

HFHH HFHH HFHH HFHH HFHH HFHH HFHH HFHH HFHH HFHH

snterestinglyD the modi(ed progrm tkes longer to onverge thn the (rst versionF st onverges t ime step IVI rther thn IHIF sf you look t the (rst versionD euse of the reurreneD the het ended up )owing up fster from left to right euse the left element of eh verge ws the nextEtimeEstep vlueF st my seem niftyD ut it9s wrongF4 qenerllyD in this prolemD either pproh onverges to the sme eventul vlues within the limits of )otingEpoint representtionF his het )ow prolem is extremely simpleD nd in its redElk formD it9s inherently very prllel with very simple dt intertionsF st9s good model for wide rnge of prolems where we re disretizing twoEdimensionl or threeEdimensionl spe nd performing some simple simultions in tht speF his prolem n usully e sled up y mking (ner gridF yftenD the ene(t of slle proessors is to llow (ner grid rther thn fster time to solutionF por exmpleD you might e le to to worldwide wether simultion using PHHEmile grid in four hours on one proessorF sing IHH proessorsD you my e le to do the simultion using PHEmile grid in four hours with muh more urte resultsF yrD using RHH proessorsD you n do the (ner grid simultion in one hourF
4 There
are other algorithmic approaches to solving partial dierential equations, such as the "fast multipole method" that accelerates convergence "legally." Don't assume that the brute force approach used here is the only method to solve this particular problem. Programmers should always look for the best available algorithm (parallel or not) before trying to scale up the "wrong" algorithm. For folks other than computer scientists, time to solution is more important than linear speed-up.

IWT

CHAPTER 4. SCALABLE PARALLEL PROCESSING

4.1.3 Explicity Parallel Languages5


es we9ve seen throughout this ookD one of iggest tuning hllenges is getting the ompiler to reognize tht prtiulr ode segment n e prllelizedF his is prtiulrly true for numeril odesD where the potentil pyk is gretestF hink out thisX if you know tht something is prllelD why should there e ny di0ulty getting the ompiler to reognize itc hy n9t you just write it downD nd hve the ompiler sy 4esD this is to e done in prllelF4 he prolem is tht the most ommonly used lnguges don9t o'er ny onstruts for expressing prllel omputtionsF ou re fored to express yourself in primitive termsD s if you were vemn with grnd thought ut no voulry to voie itF his is prtiulrly true of pyex nd gF hey do not support notion of prllel omputtionsD whih mens tht progrmmers must redue lultions to sequentil stepsF ht sounds umersomeD ut most progrmmers do it so nturlly tht they don9t even relize how good they re t itF por exmpleD let9s sy we wnt to dd two vetorsD e nd fF row would we do itc e would proly write little loop without moment9s thoughtX

hy saIDx g@sA a e@sA C f@sA ixh hy


his seems resonleD ut look wht hppenedF e imposed n order on the lultions3 ouldn9t it e enough to sy 4g gets e plus f4c ht would free the ompiler to dd the vetors using ny hrdwre t its disposlD using ny method it likesF his is wht prllel lnguges re outF hey seek to supply primitives suitle for expressing prllel omputtionsF xew prllel lnguges ren9t eing proposed s rpidly s they were in the midEIWVHsF hevelopers hve relized tht you n ome up with wonderful shemeD ut if it isn9t omptile with pyex or gD few people will re out itF he reson is simpleX there re illions of lines of g nd pyex odeD ut only few lines of D or whtever it is you ll your new prllel lngugeF feuse of the predominne of g nd pyexD the most signi(nt prllel lnguge tivities tody seek to extend those lngugesD thus proteting the PH or QH yers of investment in progrms lredy writtenF6 st is too tempting for the developers of new lnguge to test their lnguge on the eightEqueens prolem nd the gme of lifeD get good resultsD then delre it redy for prime time nd egin witing for the hordes of progrmmers onverting to their prtiulr lngugeF

Fizgibbet

4.1.4 FORTRAN 907


he previous emerin xtionl tndrds snstitute @exsA pyex stndrd releseD pyex UU @QFWEIWUVAD ws written to promote portility of pyex progrms etween di'erent pltformsF st didn9t invent new lnguge omponentsD ut insted inorported good fetures tht were lredy ville in prodution ompilersF nlike pyex UUD pyex WH @exs QFIWVEIWWPA rings new extensions nd fetures to the lngugeF ome of these just ring pyex up to dte with newer lnguges like g @dynmi memory llotionD soping rulesA nd gCC @generi funtion interfesAF fut some of the new fetures re unique to pyex @rry opertionsAF snterestinglyD while the pyex WH spei(tion
5 This content is available online at 6 One of the more signicant eorts
<http://cnx.org/content/m33754/1.2/>. in the area of completely new languages is Streams and Iteration in a Single Assignment

Language (SISAL). It's a data ow language that can easily integrate FORTRAN and C modules. The most interesting aspects of SISAL are the number of large computational codes that were ported to SISAL and the fact that the SISAL proponents generally compared their performance to the FORTRAN and C performance of the same applications.

7 This

content is available online at <http://cnx.org/content/m33757/1.2/>.

IWU ws eing developedD the dominnt high performne omputer rhitetures were slle swh systems suh s the gonnetion whine nd shredEmemory vetorEprllel proessor systems from ompnies like gry eserhF pyex WH does surprisingly good jo of meeting the needs of these very di'erent rhiteturesF sts fetures lso mp resonly well onto the new shred uniform memory multiproessorsF roweverD s we will see lterD pyex WH lone is not yet su0ient to meet the needs of the slle distriuted nd nonuniform ess memory systems tht re eoming dominnt t the high end of omputingF he pyex WH extensions to pyex UU inludeX

erry onstruts hynmi memory llotion nd utomti vriles ointers xew dt typesD strutures xew intrinsi funtionsD inluding mny tht operte on vetors or mtries xew ontrol struturesD suh s rii sttement inhned proedure interfes

4.1.4.1 FORTRAN 90 Array Constructs


ith pyex WH rry onstrutsD you n speify whole rrys or rry setions s the prtiipnts in unry nd inry opertionsF hese onstruts re key feture for 4unserilizing4 pplitions so tht they re etter suited to vetor omputers nd prllel proessorsF por exmpleD sy you wish to dd two vetorsD e nd fF sn pyex WHD you n express this s simple ddition opertionD rther thn trditionl loopF ht isD you n writeX

e a e C f
insted of the trditionl pyex UU loopX

hy saIDx e@sA a e@sA C f@sA ixhhy


he ode generted y the ompiler on your worksttion my not look ny di'erentD ut for some of the prllel mhines ville now nd worksttions just round the ornerD the di'erene re signi(ntF he pyex WH version sttes expliitly tht the omputtions n e performed in ny orderD inluding ll in prllel t the sme timeF yne importnt e'et of this is tht if the pyex WH version experiened )otingEpoint fult dding element IUD nd you were to look t the memory in deuggerD it would e perfetly legl for element PU to e lredy omputedF ou re not limited to oneEdimensionl rrysF por instneD the elementEwise ddition of two twoE dimensionl rrys ould e stted like thisX8
8 Just in case you are wondering,
A*B gives you an element-wise multiplication of array members not matrix multiplication. That is covered by a FORTRAN 90 intrinsic function.

IWV

CHAPTER 4. SCALABLE PARALLEL PROCESSING


e a e C f

in lieu ofX

hy taIDw hy saIDx e@sDtA a e@sDtA C f@sDtA ixh hy ixh hy


xturllyD when you wnt to omine two rrys in n opertionD their shpes hve to e omptileF edding sevenEelement vetor to n eightEelement vetor doesn9t mke senseF xeither would multiplying PR rry y QR rryF hen the two rrys hve omptile shpesD reltive to the opertion eing performed upon themD we sy they re in D s in the following odeX

shape conformance

hyfvi igssyx e@VAD f@VA FFF e a e C f


lrs re lwys onsidered to e in shpe onformne with rrys @nd other slrsAF sn inry opertion with n rryD slr is treted s n rry of the sme size with single element duplited throughoutF tillD we re limitedF hen you referene prtiulr rryD eD for exmpleD you referene the whole thingD from the (rst element to the lstF ou n imgine ses where you might e interested in speifying suset of n rryF his ould e either group of onseutive elements or something like 4every eighth element4 @iFeFD nonEunit stride through the rryAF rts of rrysD possily nonontiguousD re lled F pyex WH rry setions n e spei(ed y repling trditionl susripts with triplets of the form XXD mening 4elements through D tken with n inrement of F4 ou n omit prts of the tripletD provided the mening remins lerF por exmpleD X mens 4elements through Y4 X mens 4elements from to the upper ound with n inrement of IF4 ememer tht triplet reples single susriptD so n Edimension rry n hve tripletsF ou n use triplets in expressionsD gin mking sure tht the prts of the expression re in onformneF gonsider these sttementsX

sections n

array

iev @IHDIHAD @IHHA FFF @IHDIXIHA a @WIXIHHA @IHDXA a @WIXIHHA

IWW he (rst sttement ove ssigns the lst IH elements of to the IHth row of F he seond sttement expresses the sme thing slightly di'erentlyF he lone 4 X 4 tells the ompiler tht the whole rnge @I through IHA is impliedF

4.1.4.2 FORTRAN 90 Intrinsics


pyex WH extends the funtionlity of pyex UU intrinsisD nd dds mny new ones s wellD inluding some intrinsi suroutinesF wost n e X they n return rrys setions or slrsD depending on how they re invokedF por exmpleD here9s newD rryEvlued use of the sx intrinsiX

array-valued

iev e@IHHDIHDPA FFF e a sx@eA


ih element of rry e is repled with its sineF pyex WH intrinsis work with rry setions tooD s long s the vrile reeiving the result is in shpe onformne with the one pssedX

iev e@IHHDIHDPA iev f@IHDIHDIHHA FFF f@XDXDIA a gy@e@IXIHHXIHDXDIAA


yther intrinsisD suh s D vyqD etFD hve een extended s wellF emong the new intrinsis reX

Reductions: pyex WH hs vetor redutions suh s weevD wsxevD nd wF por higherEorder rrys

@nything more thn vetorA these funtions n perform redution long prtiulr dimensionF edditionllyD there is hyyhg funtion for the vetorsF Matrix manipulation: sntrinsis wewv nd exyi n mnipulte whole mtriesF Constructing or reshaping arrays: irei llows you to rete new rry from elements of n old one with di'erent shpeF ieh replites n rry long new dimensionF wiqi opies portions of one rry into nother under ontrol of mskF grsp llows n rry to e shifted in one or more dimensionsF Inquiry functions: reiD siD vfyxhD nd fyxh let you sk questions out how n rry is onE strutedF Parallel tests: wo other new redution intrinsisD ex nd evvD re for testing mny rry elements in prllelF

4.1.4.3 New Control Features

pyex WH inludes some new ontrol feturesD inluding onditionl lled riiD tht puts shpeEonforming rry ssignments under ontrol of msk s in the following exmpleF rere9s n exmple of the rii primitiveX

assignment primitive

PHH

CHAPTER 4. SCALABLE PARALLEL PROCESSING


iev e@PDPAD f@PDPAD g@PDPA hee fGIDPDQDRGD gGIDIDSDSG FFF rii @f FiF gA e a IFH g a f C IFH ivirii e a EIFH ixhrii

sn ples where the logil expression is iD e gets IFH nd g gets fCIFHF sn the ivirii luseD e gets EIFHF he result of the opertion ove would e rrys e nd g with the elementsX

e a

IFH EIFH EIFH EIFH

g a

PFH IFH

SFH SFH

eginD no order is implied in these onditionl ssignmentsD mening they n e done in prllelF his lk of implied order is ritil to llowing swh omputer systems nd wh environments to hve )exiility in performing these omputtionsF

4.1.4.4 Automatic and Allocatable Arrays


ivery progrm needs temporry vriles or work speF sn the pstD pyex progrmmers hve often mnged their own srth spe y delring n rry lrge enough to hndle ny temporry requirementsF his prtie goles up memory @leit virtul memoryD usullyAD nd n even hve n e'et on perforE mneF ith the ility to llote memory dynmillyD progrmmers n wit until lter to deide how muh srth spe to set sideF pyex WH supports dynmi memory llotion with two new lnguge feturesX utomti rrys nd llotle rrysF vike the lol vriles of g progrmD pyex WH9s utomti rrys re ssigned storge only for the life of the suroutine or funtion tht ontins themF his is di'erent from trditionl lol storge for pyex rrysD where some spe ws set side t ompile or link timeF he size nd shpe of utomti rrys n e sulpted from omintion of onstnts nd rgumentsF por instneD here9s delrtion of n utomti rryD fD using pyex WH9s new spei(tion syntxX

fysxi ive@xDeA sxiqi x ievD hswixsyx @xA XX eD f


wo rrys re delredX eD the dummy rgumentD nd fD n utomtiD expliit shpe rryF hen the suroutine returnsD f eses to existF xotie tht the size of f is tken from one of the rgumentsD xF ellotle rrys give you the ility to hoose the size of n rry fter exmining other vriles in the progrmF por exmpleD you might wnt to determine the mount of input dt efore lloting the rrysF his little progrm sks the user for the mtrix9s size efore lloting storgeX

PHI

sxiqi wDx ievD evvygeefviD hswixsyx @XDXA XX FFF si @BDBA 9ixi ri hswixsyx yp 9 ieh @BDBA wDx evvygei @@wDxAA FFF do something with FFF hievvygei @A FFF
he evvygei sttement retes n w x rry tht is lter freed y the hievvygei sttementF es with g progrmsD it9s importnt to give k lloted memory when you re done with itY otherwiseD your progrm might onsume ll the virtul storge villeF

4.1.4.5 Heat Flow in FORTRAN 90


he het )ow prolem is n idel progrm to use to demonstrte how niely pyex WH n express regulr rry progrmsX

IHH

yqew rieyh eewii@weswiaPHHA sxiqi sguDsDweswi ievBR yh@IHA yh@IA a IHHFH hy saPDW yh@sA a HFH ixhhy yh@IHA a HFH hy sguaIDweswi sp @ wyh@sguDPHA FiF I A sx IHHDsguD@yh@sADsaIDIHA yh@PXWA a @yh@IXVA C yh@QXIHA A G P ixhhy pywe@sRDIHpUFPA ixh

he progrm is identilD exept the inner loop is now repled y single sttement tht omputes the 4new4 setion y verging strip of the 4left4 elements nd strip of the 4right4 elementsF he output of this progrm is s followsX

iTHHHX fWH hetWHFf

PHP

CHAPTER 4. SCALABLE PARALLEL PROCESSING


HFHH VPFQV VUFHR VVFQT VVFUR VVFVR VVFVV VVFVW VVFVW VVFVW HFHH TTFQR URFSP UTFVR UUFSI UUFUH UUFUT UUFUU UUFUV UUFUV HFHH SHFQH TIFWW TSFQP TTFPV TTFSS TTFTQ TTFTT TTFTT TTFTU HFHH QVFIV SHFST SRFIP SSFIR SSFRR SSFSP SSFSS SSFSS SSFSS HFHH PTFHT QWFIQ RPFWI RRFHH RRFQP RRFRI RRFRQ RRFRR RRFRR HFHH IVFPH PVFWR QPFHU QPFWU QQFPQ QQFQH QQFQP QQFQQ QQFQQ HFHH IHFQS IVFUS PIFPP PIFWQ PPFIR PPFPH PPFPP PPFPP PPFPP HFHH SFIV WFQV IHFTI IHFWU IIFHU IIFIH IIFII IIFII IIFII HFHH HFHH HFHH HFHH HFHH HFHH HFHH HFHH HFHH HFHH

iTHHHXFout I IHHFHH PI IHHFHH RI IHHFHH TI IHHFHH VI IHHFHH IHI IHHFHH IPI IHHFHH IRI IHHFHH ITI IHHFHH IVI IHHFHH iTHHHX

sf you look loselyD this output is the sme s the redElk implementtionF ht is euse in pyex WHX

yh@PXWA a @yh@IXVA C yh@QXIHA A G P


is ssignment sttementF es shown in pigure RFR @ht lignment nd omputtionsAD the right side is ompletely evluted efore the resulting rry setion is ssigned into yh@PXWAF por momentD tht might seem unnturlD ut onsider the following sttementX

single

s a s C I
e know tht if s strts with SD it9s inremented up to six y this sttementF ht hppens euse the right side @SCIA is evluted efore the ssignment of T into s is performedF sn pyex WHD vrile n e n entire rryF oD this redElk opertionF here is n 4old4 yh on the right side nd 4new4 yh on the left side3 o relly 4think4 pyex WHD it9s good to pretend you re on n swh system with millions of little gsF pirst we refully lign the dtD sliding it roundD nd then" whm" in single instrutionD we dd ll the ligned vlues in n instntF pigure RFR @ht lignment nd omputtionsA shows grphilly this t of 4ligning4 the vlues nd then dding themF he dt )ow grph is extremely simpleF he top two rows re redEonlyD nd the dt )ows from top to ottomF sing the temporry spe elimintes the seeming dependenyF his pproh of 4thinking swh4 is one of the wys to fore ourselves to fous our thoughts on the dt rther thn the ontrolF swh my not e good rhiteture for your prolem ut if you n express it so tht swh ould workD good wh environment n tke dvntge of the dt prllelism tht you hve identi(edF he ove exmple tully highlights one of the hllenges in produing n e0ient implementtion of pyex WHF sf these rrys ontined IH million elementsD nd the ompiler used simple pprohD it would need QH million elements for the old 4left4 vluesD the old 4right4 vluesD nd for the new vluesF ht )ow optimiztion is needed to determine just how muh extr dt must e mintined to give the proper resultsF sf the ompiler is leverD the extr memory n e quite smllX

is

PHQ

Data alignment and computations

Figure 4.4

eiI a yh@IA hy saPDW eiP a yh@sA yh@sA a @eiI C yh@sCIA A G P eiI a eiP ixhhy
his does not hve the prllelism tht the full redElk implementtion hsD ut it does produe the orret results with only two extr dt elementsF he trik is to sve the old 4left4 vlue just efore you wipe it outF e good pyex WH ompiler uses dt )ow nlysisD looking t templte of how the omputtion moves ross the dt to see if it n sve few elements for short period of time to llevite the need for omplete extr opy of the dtF he dvntge of the pyex WH lnguge is tht it9s up to the ompiler whether it uses omplete opy of the rry or few dt elements to insure tht the progrm exeutes properlyF wost importntlyD it n hnge its pproh s you move from one rhiteture to notherF



4.1.4.6 FORTRAN 90 Versus FORTRAN 77


Interestingly, FORTRAN 90 has never been fully embraced by the high performance community. There are a few reasons why:

There is a concern that the use of pointers and dynamic data structures would ruin performance and lose the optimization advantages of FORTRAN over C. Some people would say that FORTRAN 90 is trying to be a better C than C. Others would say, "who wants to become more like the slower language!" Whatever the reason, there was some controversy when FORTRAN 90 was implemented, leading to some reluctance in adoption by programmers. Some vendors said, "You can use FORTRAN 90, but FORTRAN 77 will always be faster."
Because vendors often implemented different subsets of FORTRAN 90, it was not as portable as FORTRAN 77. Because of this, users who needed maximum portability stuck with FORTRAN 77.
Sometimes vendors purchased their fully compliant FORTRAN 90 compilers from a third party who demanded high license fees. So, you could get the free (and faster, according to the vendor) FORTRAN 77 or pay for the slower (wink wink) FORTRAN 90 compiler.
Because of these factors, the number of serious applications developed in FORTRAN 90 was small. So the benchmarks used to purchase new systems were almost exclusively FORTRAN 77. This further motivated the vendors to improve their FORTRAN 77 compilers instead of their FORTRAN 90 compilers.
As the FORTRAN 77 compilers became more sophisticated using data flow analysis, it became relatively easy to write portable "parallel" code in FORTRAN 77, using the techniques we have discussed in this book.
One of the greatest potential benefits to FORTRAN 90 was portability between SIMD and the parallel/vector supercomputers. As both of these architectures were replaced with the shared uniform memory multiprocessors, FORTRAN 77 became the language that afforded the maximum portability across the computers typically used by high performance computing programmers.
The FORTRAN 77 compilers supported directives that allowed programmers to fine-tune the performance of their applications by taking full control of the parallelism. Certain dialects of FORTRAN 77 essentially became parallel programming "assembly language." Even highly tuned versions of these codes were relatively portable across the different vendor shared uniform memory multiprocessors.
So, events conspired against FORTRAN 90 in the short run. However, FORTRAN 77 is not well suited for the distributed memory systems because it does not lend itself well to data layout directives. As we need to partition and distribute the data carefully on these new systems, we must give the compiler lots of flexibility. FORTRAN 90 is the language best suited to this purpose.

4.1.4.7 FORTRAN 90 Summary


Well, that's the whirlwind tour of FORTRAN 90. We have probably done the language a disservice by covering it so briefly, but we wanted to give you a feel for it. There are many features that were not discussed. If you would like to learn more, we recommend FORTRAN 90 Explained, by Michael Metcalf and John Reid (Oxford University Press).
FORTRAN 90 by itself is not sufficient to give us scalable performance on distributed memory systems. So far, compilers are not yet capable of performing enough data flow analysis to decide where to store the data and when to retrieve the memory. So, for now, we programmers must get involved with the data layout. We must decompose the problem into parallel chunks that can be individually processed. We have several options. We can use High Performance FORTRAN and leave some of the details to the compiler, or we can use explicit message-passing and take care of all of the details ourselves.


4.1.5 Problem Decomposition9


There are three main approaches to dividing or decomposing work for distribution among multiple CPUs:

Decomposing computations: We have already discussed this technique. When the decomposition is done based on computations, we come up with some mechanism to divide the computations (such as the iterations of a loop) evenly among our processors. The location of the data is generally ignored, and the primary issues are iteration duration and uniformity. This is the preferred technique for the shared uniform memory systems because the data can be equally accessed by any processor.

Decomposing data: When memory access is nonuniform, the tendency is to focus on the distribution of the data rather than computations. The assumption is that retrieving "remote" data is costly and should be minimized. The data is distributed among the memories. The processor that contains the data performs the computations on that data after retrieving any other data necessary to perform the computation.

Decomposing tasks: When the operations that must be performed are very independent, and take some time, a task decomposition can be performed. In this approach a master process/thread maintains a queue of work units. When a processor has available resources, it retrieves the next "task" from the queue and begins processing. This is a very attractive approach for embarrassingly parallel computations.10

In some sense, the rest of this chapter is primarily about data decomposition. In a distributed memory system, the communication costs usually are the dominant performance factor. If your problem is so embarrassingly parallel that it can be distributed as tasks, then nearly any technique will work. Data-parallel problems occur in many disciplines. They vary from those that are extremely parallel to those that are just sort of parallel. For example, fractal calculations are extremely parallel; each point is derived independently of the rest. It's simple to divide fractal calculations among processors. Because the calculations are independent, the processors don't have to coordinate or share data.
Our heat flow problem when expressed in its red-black (or FORTRAN 90) form is extremely parallel but requires some sharing of data. A gravitational model of a galaxy is another kind of parallel program. Each point exerts an influence on every other. Therefore, unlike the fractal calculations, the processors do have to share data. In either case, you want to arrange calculations so that processors can say to one another, "you go over there and work on that, and I'll work on this, and we'll get together when we are finished."
Problems that offer less independence between regions are still very good candidates for domain decomposition. Finite difference problems, short-range particle interaction simulations, and columns of matrices can be treated similarly. If you can divide the domain evenly between the processors, they each do approximately the same amount of work on their way to a solution.
Other physical systems are not so regular or involve long-range interactions. The nodes of an unstructured grid may not be allocated in direct correspondence to their physical locations, for instance. Or perhaps the model involves long-range forces, such as particle attractions. These problems, though more difficult, can be structured for parallel machines as well. Sometimes various simplifications, or "lumping" of intermediate effects, are needed. For instance, the influence of a group of distant particles upon another may be treated as if there were one composite particle acting at a distance. This is done to spare the communications that would be required if every processor had to talk to every other regarding each detail. In other cases, the parallel architecture offers opportunities to express a physical system in different and clever ways that make sense in the context of the machine. For instance, each particle could be assigned to its own processor, and these could slide past one another, summing interactions and updating a time step.
Depending on the architecture of the parallel computer and problem, a choice for either dividing or replicating (portions of) the domain may add unacceptable overhead or cost to the whole project.
9 This content is available online at <http://cnx.org/content/m33762/1.2/>.
10 The distributed RC5 key-cracking effort was coordinated in this fashion. Each processor would check out a block of keys and begin testing those keys. At some point, if the processor was not fast enough or had crashed, the central system would reissue the block to another processor. This allowed the system to recover from problems on individual computers.
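To make the data decomposition bookkeeping concrete, the sketch below computes which contiguous block of columns each process would own when NCOLS columns are divided among NPROC processes. The program and variable names are ours, purely for illustration; the PVM and MPI examples later in this chapter perform exactly this kind of calculation.

      PROGRAM DECOMP
* Illustrative sketch: block decomposition of NCOLS columns over
* NPROC processes; the early processes absorb any remainder.
      INTEGER NCOLS, NPROC, ME, BASE, EXTRA, FIRST, LAST, MYLEN
      PARAMETER (NCOLS=200, NPROC=3)
      BASE  = NCOLS / NPROC
      EXTRA = MOD(NCOLS, NPROC)
      DO ME=0,NPROC-1
        MYLEN = BASE
        IF (ME .LT. EXTRA) MYLEN = BASE + 1
        FIRST = ME*BASE + MIN(ME,EXTRA) + 1
        LAST  = FIRST + MYLEN - 1
        PRINT *, 'Process', ME, ' owns columns', FIRST, ' to', LAST
      ENDDO
      END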



For a large problem, the dollar value of main memory may make keeping separate local copies of the same data out of the question. In fact, a need for more memory is often what drives people to parallel machines; the problem they need to solve can't fit in the memory of a conventional computer.
By investing some effort, you could allow the domain partitioning to evolve as the program runs, in response to an uneven load distribution. That way, if there were a lot of requests for As, then several processors could dynamically get a copy of the A piece of the domain. Or the A piece could be spread out across several processors, each handling a different subset of the A definitions. You could also migrate unique copies of data from place to place, changing their home as needed.
When the data domain is irregular, or changes over time, the parallel program encounters a load-balancing problem. Such a problem becomes especially apparent when one portion of the parallel computations takes much longer to complete than the others. A real-world example might be an engineering analysis on an adaptive grid. As the program runs, the grid becomes more refined in those areas showing the most activity. If the work isn't reapportioned from time to time, the section of the computer with responsibility for the most highly refined portion of the grid falls farther and farther behind the performance of the rest of the machine.

4.1.6 High Performance FORTRAN (HPF)11


In March 1992, the High Performance FORTRAN Forum (HPFF) began meeting to discuss and define a set of additions to FORTRAN 90 to make it more practical for use in a scalable computing environment. The plan was to develop a specification within the calendar year so that vendors could quickly begin to implement the standard. The scope of the effort included the following:
Identify scalars and arrays that will be distributed across a parallel machine.
Say how they will be distributed. Will they be strips, blocks, or something else?
Specify how these variables will be aligned with respect to one another.
Redistribute and realign data structures at runtime.
Add a FORALL control construct for parallel assignments that are difficult or impossible to construct using FORTRAN 90's array syntax.
Make improvements to the FORTRAN 90 WHERE control construct.
Add intrinsic functions for common parallel operations.

There were several sources of inspiration for the HPF effort. Layout directives were already part of the FORTRAN 90 programming environment for some SIMD computers (i.e., the CM-2). Also, PVM, the first portable message-passing environment, had been released a year earlier, and users had a year of experience trying to decompose programs by hand. They had developed some basic usable techniques for data decomposition that worked very well but required far too much bookkeeping.12
The HPF effort brought together a diverse set of interests from all the major high performance computing vendors. Vendors representing all the major architectures were represented. As a result HPF was designed to be implemented on nearly all types of architectures.
There is an effort underway to produce the next FORTRAN standard: FORTRAN 95. FORTRAN 95 is expected to adopt some but not all of the HPF modifications.

4.1.6.1 Programming in HPF


At its core, HPF includes FORTRAN 90. If a FORTRAN 90 program were run through an HPF compiler, it must produce the same results as if it were run through a FORTRAN 90 compiler. Assuming an HPF program only uses FORTRAN 90 constructs and HPF directives, a FORTRAN 90 compiler could ignore the directives, and it should produce the same results as an HPF compiler.
11 This content is available online at <http://cnx.org/content/m33765/1.2/>.
12 As we shall soon see.

As the user adds directives to the program, the semantics of the program are not changed. If the user completely misunderstands the application and inserts extremely ill-conceived directives, the program produces correct results very slowly. An HPF compiler doesn't try to "improve on" the user's directives. It assumes the programmer is omniscient.13
Once the user has determined how the data will be distributed across the processors, the HPF compiler attempts to use the minimum communication necessary and overlaps communication with computation whenever possible. HPF generally uses an "owner computes" rule for the placement of the computations. A particular element in an array is computed on the processor that stores that array element. All the necessary data to perform the computation is gathered from remote processors, if necessary, to perform the computation. If the programmer is clever in decomposition and alignment, much of the data needed will be from the local memory rather than remote memory. The HPF compiler is also responsible for allocating any temporary data structures needed to support communications at runtime.
In general, the HPF compiler is not magic - it simply does a very good job with the communication details when the programmer can design a good data decomposition. At the same time, it retains portability with the single CPU and shared uniform memory systems using FORTRAN 90.

4.1.6.2 HPF data layout directives


Perhaps the most important contributions of HPF are its data layout directives. Using these directives, the programmer can control how data is laid out based on the programmer's knowledge of the data interactions. An example directive is as follows:

      REAL*4 ROD(10)
!HPF$ DISTRIBUTE ROD(BLOCK)


The !HPF$ prefix would be a comment to a non-HPF compiler and can safely be ignored by a straight FORTRAN 90 compiler. The DISTRIBUTE directive indicates that the ROD array is to be distributed across multiple processors. If this directive is not used, the ROD array is allocated on one processor and communicated to the other processors as necessary. There are several distributions that can be done in each dimension:

      REAL*4 BOB(100,100,100),RICH(100,100,100)
!HPF$ DISTRIBUTE BOB(BLOCK,CYCLIC,*)
!HPF$ DISTRIBUTE RICH(CYCLIC(10))


These distributions operate as follows:

BLOCK The array is distributed across the processors using contiguous blocks of the index value. The blocks are made as large as possible.
CYCLIC The array is distributed across the processors, mapping each successive element to the "next" processor, and when the last processor is reached, allocation starts again on the first processor.
CYCLIC(n) The array is distributed the same as CYCLIC except that n successive elements are placed on each processor before moving on to the next processor.
13 Always a safe assumption.




* All the elements in that dimension are placed on the same processor. This is most useful for multidimensional arrays.

Distributing array elements to processors

Figure 4.5

Figure 4.5 (Distributing array elements to processors) shows how the elements of a simple array would be mapped onto three processors with different directives. It must allocate four elements to Processors 1 and 2 because there is no Processor 4 available for the leftover element if it allocated three elements to Processors 1 and 2. In Figure 4.5 (Distributing array elements to processors), the elements are allocated on successive processors, wrapping around to Processor 1 after the last processor. In Figure 4.5 (Distributing array elements to processors), using a chunk size with CYCLIC is a compromise between pure BLOCK and pure CYCLIC.
To explore the use of the *, we can look at a simple two-dimensional array mapped onto four processors. In Figure 4.6 (Two-dimensional distributions), we show the array layout and each cell indicates which processor will hold the data for that cell in the two-dimensional array. In Figure 4.6 (Two-dimensional distributions), the directive decomposes in both dimensions simultaneously. This approach results in roughly square patches in the array. However, this may not be the best approach. In the following example, we use the * to indicate that we want all the elements of a particular column to be allocated on the same processor. So, the column values equally distribute the columns across the processors. Then, all the rows in each column follow where the column has been placed. This allows unit stride for the on-processor portions of the computation and is beneficial in some applications. The * syntax is also called on-processor distribution.


Two-dimensional distributions

Figure 4.6
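Seen as arithmetic, the distributions differ only in how an element index is mapped to a processor number. The short sketch below prints the owner of each element of a 10-element array on three processors under the BLOCK and CYCLIC rules; the program and the 0-based processor numbering are ours, chosen only for illustration, since an HPF compiler performs this bookkeeping internally.

      PROGRAM OWNERS
* Illustrative sketch: owner of each element under BLOCK and CYCLIC
* distributions of N elements across P processors (numbered 0..P-1).
      INTEGER N, P, I, BSIZE, OWNERB, OWNERC
      PARAMETER (N=10, P=3)
* BLOCK: contiguous chunks of CEILING(N/P) elements per processor.
      BSIZE = (N + P - 1) / P
      DO I=1,N
        OWNERB = (I - 1) / BSIZE
        OWNERC = MOD(I - 1, P)
        PRINT *, 'Element', I, ' BLOCK owner', OWNERB,
     &           ' CYCLIC owner', OWNERC
      ENDDO
      END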

When dealing with more than one data structure to perform a computation, you can either separately distribute them or use the ALIGN directive to ensure that corresponding elements of the two data structures are to be allocated together. In the following example, we have a plate array and a scaling factor that must be applied to each column of the plate during the computation:

      DIMENSION PLATE(200,200),SCALE(200)
!HPF$ DISTRIBUTE PLATE(*,BLOCK)
!HPF$ ALIGN SCALE(I) WITH PLATE(J,I)


Or:

      DIMENSION PLATE(200,200),SCALE(200)
!HPF$ DISTRIBUTE PLATE(*,BLOCK)
!HPF$ ALIGN SCALE(:) WITH PLATE(*,:)


In both examples, the PLATE and the SCALE variables are allocated to the same processors as the corresponding columns of PLATE. The * and : syntax communicate the same information. When * is used, that dimension is collapsed, and it doesn't participate in the distribution. When the : is used, it means that dimension follows the corresponding dimension in the variable that has already been distributed. You could also specify the layout of the SCALE variable and have the PLATE variable "follow" the layout of the SCALE variable:




      DIMENSION PLATE(200,200),SCALE(200)
!HPF$ DISTRIBUTE SCALE(BLOCK)
!HPF$ ALIGN PLATE(J,I) WITH SCALE(I)

You can put simple arithmetic expressions into the ALIGN directive subject to some limitations. Other directives include:

PROCESSORS Allows you to create a shape of the processor configuration that can be used to align other data structures.
REDISTRIBUTE and REALIGN Allow you to dynamically reshape data structures at runtime as the communication patterns change during the course of the run.
TEMPLATE Allows you to create an array that uses no space. Instead of distributing one data structure and aligning all the other data structures, some users will create and distribute a template and then align all of the real data structures to that template.

The use of directives can range from very simple to very complex. In some situations, you distribute the one large shared structure, align a few related structures and you are done. In other situations, programmers attempt to optimize communications based on the topology of the interconnection network (hypercube, multistage interconnection network, mesh, or toroid) using very detailed directives. They also might carefully redistribute the data at the various phases of the computation. Hopefully your application will yield good performance without too much effort.

4.1.6.3 HPF control structures


While the HPF designers were in the midst of defining a new language, they set out improving on what they saw as limitations in FORTRAN 90. Interestingly, these modifications are what is being considered as part of the new FORTRAN 95 standard. The FORALL statement allows the user to express simple iterative operations that apply to the entire array without resorting to a do-loop (remember, do-loops force order). For example:

      FORALL (I=1:100, J=1:100) A(I,J) = I + J


This can be expressed in native FORTRAN 90 but it is rather ugly, counterintuitive, and prone to error.
Another control structure is the ability to declare a function as "PURE." A PURE function has no side effects other than through its parameters. The programmer is guaranteeing that a PURE function can execute simultaneously on many processors with no ill effects. This allows HPF to assume that it will only operate on local data and does not need any data communication during the duration of the function execution. The programmer can also declare which parameters of the function are input parameters, output parameters, and input-output parameters.
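As a small illustration of the idea, here is a sketch of such a side-effect-free function with declared parameter intents, written in the Fortran 95 style that adopted this HPF notion. The function name and body are ours, not from the original text.

* Illustrative sketch: a side-effect-free function with declared intents.
      PURE FUNCTION SCALED(X, FACTOR)
      REAL, INTENT(IN) :: X, FACTOR
      REAL :: SCALED
* No global state is touched and the result depends only on the
* arguments, so many processors may evaluate this concurrently.
      SCALED = X * FACTOR
      END FUNCTION SCALED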

4.1.6.4 HPF intrinsics


The companies who marketed SIMD computers needed to come up with significant tools to allow efficient collective operations across all the processors. A perfect example of this is the SUM operation. To SUM the value of an array spread across N processors, the simplistic approach takes N steps. However, it is possible to accomplish it in log(N) steps using a technique called parallel-prefix-sum. By the time HPF was in development, a number of these operations had been identified and implemented. HPF took the opportunity to define standardized syntax for these operations. A sample of these operations includes:

SUM_PREFIX Performs various types of parallel-prefix summations.
ALL_SCATTER Distributes a single value to a set of processors.
GRADE_DOWN Sorts into decreasing order.
IANY Computes the logical OR of a set of values.

While there are a large number of these intrinsic functions, most applications use only a few of the operations.
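The log(N) combining pattern behind these operations can be sketched in ordinary FORTRAN. The recursive-doubling loop below adds N partial values in about log2(N) passes; the serial loop and the array name are ours, meant only to illustrate the exchange pattern that a parallel implementation performs across processors.

      PROGRAM TREESUM
* Illustrative sketch: recursive doubling combines N partial sums in
* about log2(N) passes instead of N-1 serial additions.
      INTEGER N, I, STRIDE
      PARAMETER (N=8)
      REAL PART(N)
      DO I=1,N
        PART(I) = I
      ENDDO
      STRIDE = 1
      DO WHILE (STRIDE .LT. N)
* In each pass, element I collects the partial sum STRIDE away;
* on a parallel machine each pass is one round of exchanges.
        DO I=1, N-STRIDE
          PART(I) = PART(I) + PART(I+STRIDE)
        ENDDO
        STRIDE = 2 * STRIDE
      ENDDO
      PRINT *, 'Total =', PART(1)
      END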

4.1.6.5 HPF extrinsics


In order to allow the vendors with diverse architectures to provide their particular advantage, HPF included the capability to link "extrinsic" functions. These functions didn't need to be written in FORTRAN 90/HPF and performed a number of vendor-supported capabilities. This capability allowed users to perform such tasks as the creation of hybrid applications with some HPF and some message passing. High performance computing programmers always like the ability to do things their own way in order to eke out that last drop of performance.

4.1.6.6 Heat Flow in HPF


To port our heat flow application to HPF, there is really only a single line of code that needs to be added. In the example below, we've changed to a larger two-dimensional array:

      INTEGER PLATESIZ,MAXTIME
      PARAMETER(PLATESIZ=2000,MAXTIME=200)
!HPF$ DISTRIBUTE PLATE(*,BLOCK)
      REAL*4 PLATE(PLATESIZ,PLATESIZ)
      INTEGER TICK

      PLATE = 0.0

* Add Boundaries
      PLATE(1,:) = 100.0
      PLATE(PLATESIZ,:) = -40.0
      PLATE(:,PLATESIZ) = 35.23
      PLATE(:,1) = 4.5

      DO TICK = 1,MAXTIME
        PLATE(2:PLATESIZ-1,2:PLATESIZ-1) = (
     +      PLATE(1:PLATESIZ-2,2:PLATESIZ-1) +
     +      PLATE(3:PLATESIZ-0,2:PLATESIZ-1) +
     +      PLATE(2:PLATESIZ-1,1:PLATESIZ-2) +
     +      PLATE(2:PLATESIZ-1,3:PLATESIZ-0) ) / 4.0
        PRINT 1000,TICK, PLATE(2,2)
1000    FORMAT('TICK = ',I5, F13.8)
      ENDDO
      END



You will notice that the HPF directive distributes the array columns using the BLOCK approach, keeping all the elements within a column on a single processor. At first glance, it might appear that (BLOCK,BLOCK) is the better distribution. However, there are two advantages to a (*,BLOCK) distribution. First, striding down a column is a unit-stride operation and so you might just as well process an entire column. The more significant aspect of the distribution is that a (BLOCK,BLOCK) distribution forces each processor to communicate with up to eight other processors to get its neighboring values. Using the (*,BLOCK) distribution, each processor will have to exchange data with at most two processors each time step.
When we look at PVM, we will look at this same program implemented in a PVM-style message-passing fashion. In that example, you will see some of the details that HPF must handle to properly execute this code. After reviewing that code, you will probably choose to implement all of your future heat flow applications in HPF!
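To put rough numbers on that comparison, the sketch below estimates the per-time-step ghost exchange for one processor under each distribution, assuming a 2000 x 2000 plate on 16 processors arranged 4 x 4. The program and the assumed sizes are ours, for illustration; the point is that (*,BLOCK) sends fewer, larger messages, which is usually the cheaper pattern on a high-latency interconnect.

      PROGRAM GHOSTS
* Illustrative sketch: approximate ghost traffic per processor per
* time step for a PLATESIZ x PLATESIZ plate.
      INTEGER PLATESIZ, NSIDE
      PARAMETER (PLATESIZ=2000, NSIDE=4)
* (*,BLOCK): at most 2 neighbors, one full column exchanged each way.
      PRINT *, '(*,BLOCK):     2 messages, about', 2*PLATESIZ,
     &         ' elements'
* (BLOCK,BLOCK): up to 8 neighbors (edges and corners); the pieces are
* smaller, but the message count, and hence the latency cost, is higher.
      PRINT *, '(BLOCK,BLOCK): 8 messages, about',
     &         4*(PLATESIZ/NSIDE)+4, ' elements'
      END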

4.1.6.7 HPF Summary


In some ways, HPF has been good for FORTRAN 90. Companies such as IBM with its SP-1 needed to provide some high-level language for those users who didn't want to write message-passing codes. Because of this, IBM has invested a great deal of effort in implementing and optimizing HPF. Interestingly, much of this effort will directly benefit the ability to develop more sophisticated FORTRAN 90 compilers. The extensive data flow analysis required to minimize communications and manage the dynamic data structures will carry over into FORTRAN 90 compilers even without using the HPF directives.
Time will tell if the HPF data distribution directives will no longer be needed and compilers will be capable of performing sufficient analysis of straight FORTRAN 90 code to optimize data placement and movement.
In its current form, HPF is an excellent vehicle for expressing the highly data-parallel, grid-based applications. Its weaknesses are irregular communications and dynamic load balancing. A new effort to develop the next version of HPF is underway to address some of these issues. Unfortunately, it is more difficult to solve these runtime problems while maintaining good performance across a wide range of architectures.

4.1.7 Closing Notes14


In this chapter, we have covered some of the efforts in the area of languages that have been developed to allow programs to be written for scalable computing. There is a tension between pure FORTRAN-77, FORTRAN 90, HPF, and message passing as to which will be the ultimate tools for scalable, high performance computing.
Certainly, there have been examples of great successes for both FORTRAN 90 (Thinking Machines CM-5) and HPF (IBM and others) as languages that can make excellent use of scalable computing systems. One of the problems of a high-level language approach is that sometimes using an abstract high-level language actually reduces effective portability. The languages are designed to be portable, but if the vendor of your particular scalable computer doesn't support the language variant in which you have chosen to write your application, then it isn't portable. Even if the vendor has your language available, it may not be tuned to generate the best code for their architecture.
One solution is to purchase your compilers from a third-party company such as Pacific Sierra or Kuck and Associates. These vendors sell one compiler that runs across a wide range of systems. For users who can afford these options, these compilers afford a higher level of portability.
One of the fundamental issues is the chicken-and-egg problem. If users don't use a language, vendors won't improve the language. If all the influential users (with all the money) use message passing, then the existence of an excellent HPF compiler is of no real value to those users.
The good news is that both FORTRAN 90 and HPF provide one road map to portable scalable computing that doesn't require explicit message passing. The only question is which road we users will choose.

14 This content is available online at <http://cnx.org/content/m33775/1.2/>.


4.2 Message-Passing Environments


4.2.1 Introduction15
A message-passing interface is a set of function and subroutine calls for C or FORTRAN that give you a way to split an application for parallel execution. Data is divided and passed out to other processors as messages. The receiving processors unpack them, do some work, and send the results back or pass them along to other processors in the parallel computer.
In some ways, message passing is the assembly language of parallel processing. You get ultimate responsibility, and if you are talented (and your problem cooperates), you get ultimate performance. If you have a nice scalable problem and are not satisfied with the resulting performance, you pretty much have yourself to blame. The compiler is completely unaware of the parallel aspects of the program.
The two most popular message-passing environments are parallel virtual machine (PVM) and message-passing interface (MPI). Most of the important features are available in either environment. Once you have mastered message passing, moving from PVM to MPI won't cause you much trouble. You may also operate on a system that provides only a vendor-specific message-passing interface. However, once you understand message passing concepts and have properly decomposed your application, usually it's not that much more effort to move from one message-passing library to another.16

4.2.2 Parallel Virtual Machine17


The idea behind PVM is to assemble a diverse set of network-connected resources into a "virtual machine." A user could marshal the resources of 35 idle workstations on the Internet and have their own personal scalable processing system. The work on PVM started in the early 1990s at Oak Ridge National Labs. PVM was pretty much an instant success among computer scientists. It provided a rough framework in which to experiment with using a network of workstations as a parallel processor.
In PVM Version 3, your virtual machine can consist of single processors, shared-memory multiprocessors, and scalable multiprocessors. PVM attempts to knit all of these resources into a single, consistent, execution environment.
To run PVM, you simply need a login account on a set of network computers that have the PVM software installed. You can even install it in your home directory. To create your own personal virtual machine, you would create a list of these computers in a file:

% cat hostfile
frodo.egr.msu.edu
gollum.egr.msu.edu
mordor.egr.msu.edu
%


After some nontrivial machinations with paths and environment variables, you can start the PVM console:

% pvm hostfile
pvmd already running.


15 This content is available online at <http://cnx.org/content/m33781/1.2/>.
16 Notice I said not that much more effort.
17 This content is available online at <http://cnx.org/content/m33779/1.2/>.




pvm> conf
1 host, 1 data format
                    HOST     DTID     ARCH   SPEED
                   frodo    40000  SUN4SOL2    1000
                  gollum    40001  SUN4SOL2    1000
                  mordor    40002  SUN4SOL2    1000
pvm> ps
                    HOST      TID   FLAG 0x COMMAND
                   frodo    40042    6/c,f   pvmgs
pvm> reset
pvm> ps
                    HOST      TID   FLAG 0x COMMAND
pvm>

Many different users can be running virtual machines using the same pool of resources. Each user has their own view of an empty machine. The only way you might detect other virtual machines using your resources is in the percentage of the time your applications get the CPU.
There is a wide range of commands you can issue at the PVM console. The ps command shows the running processes in your virtual machine. It's quite possible to have more processes than computer systems. Each process is time-shared on a system along with all the other load on the system. The reset command performs a soft reboot on your virtual machine. You are the virtual system administrator of the virtual machine you have assembled.
To execute programs on your virtual computer, you must compile and link your programs with the PVM library routines:18

% aimk mast slav
making in SUN4SOL2/ for SUN4SOL2
cc -O -I/opt/pvm3/include -DSYSVBFUNC -DSYSVSTR -DNOGETDTBLSIZ
     -DSYSVSIGNAL -DNOWAIT3 -DNOUNIXDOM -o mast ../mast.c
     -L/opt/pvm3/lib/SUN4SOL2 -lpvm3 -lnsl -lsocket
mv mast ~crs/pvm3/bin/SUN4SOL2
cc -O -I/opt/pvm3/include -DSYSVBFUNC -DSYSVSTR -DNOGETDTBLSIZ
     -DSYSVSIGNAL -DNOWAIT3 -DNOUNIXDOM -o slav ../slav.c
     -L/opt/pvm3/lib/SUN4SOL2 -lpvm3 -lnsl -lsocket
mv slav ~crs/pvm3/bin/SUN4SOL2
%
When the first PVM call is encountered, the application contacts your virtual machine and enrolls itself in the virtual machine. At that point it should show up in the output of the ps command issued at the PVM console.
From that point on, your application issues PVM calls to create more processes and interact with those processes. PVM takes the responsibility for distributing the processes on the different systems in the virtual machine, based on the load and your assessment of each system's relative performance. Messages are moved across the network using user datagram protocol (UDP) and delivered to the appropriate process.
Typically, the PVM application starts up some additional PVM processes. These can be additional copies of the same program or each PVM process can run a different PVM application. Then the work is distributed among the processes, and results are gathered as necessary.

18 Note: the exact compilation may be different on your system.

There are several basic models of computing that are typically used when working with PVM:

Master/Slave: When operating in this mode, one process (usually the initial process) is designated as the master that spawns some number of worker processes. Work units are sent to each worker process, and the results are returned to the master. Often the master maintains a queue of work to be done and as a slave finishes, the master delivers a new work item to the slave. This approach works well when there is little data interaction and each work unit is independent. This approach has the advantage that the overall problem is naturally load-balanced even when there is some variation in the execution time of individual processes.

Broadcast/Gather: This type of application is typically characterized by the fact that the shared data structure is relatively small and can be easily copied into every processor's node. At the beginning of the time step, all the global data structures are broadcast from the master process to all of the processes. Each process then operates on their portion of the data. Each process produces a partial result that is sent back and gathered by the master process. This pattern is repeated for each time step.

SPMD/Data decomposition: When the overall data structure is too large to have a copy stored in every process, it must be decomposed across multiple processes. Generally, at the beginning of a time step, all processes must exchange some data with each of their neighboring processes. Then with their local data augmented by the necessary subset of the remote data, they perform their computations. At the end of the time step, necessary data is again exchanged between neighboring processes, and the process is restarted.

The most complicated applications have nonuniform data flows and data that migrates around the system as the application changes and the load changes on the system.
In this section, we have two example programs: one is a master-slave operation, and the other is a data decomposition-style solution to the heat flow problem.

4.2.2.1 Queue of Tasks


In this example, one process (mast) creates five slave processes (slav) and doles out 20 work units (add one to a number). As a slave process responds, it's given new work or told that all of the work units have been exhausted:

% cat mast.c
#include <stdio.h>
#include "pvm3.h"

#define MAXPROC 5
#define JOBS 20

main()
{
  int mytid,info;
  int tids[MAXPROC];
  int tid,input,output,answers,work;

  mytid = pvm_mytid();
  info=pvm_spawn("slav", (char**)0, 0, "", MAXPROC, tids);

  /* Send out the first work */
  for(work=0;work<MAXPROC;work++) {
    pvm_initsend(PvmDataDefault);
    pvm_pkint(&work, 1, 1 ) ;
    pvm_send(tids[work],1) ;/* 1 = msgtype */
  }

  /* Send out the rest of the work requests */
  work = MAXPROC;
  for(answers=0; answers < JOBS ; answers++) {
    pvm_recv( -1, 2 );  /* -1 = any task   2 = msgtype */
    pvm_upkint( &tid, 1, 1 );
    pvm_upkint( &input, 1, 1 );
    pvm_upkint( &output, 1, 1 );
    printf("Thanks to %d 2*%d=%d\n",tid,input,output);
    pvm_initsend(PvmDataDefault);
    if ( work < JOBS ) {
      pvm_pkint(&work, 1, 1 ) ;
      work++;
    } else {
      input = -1;
      pvm_pkint(&input, 1, 1 ) ; /* Tell them to stop */
    }
    pvm_send(tid,1) ;
  }
  pvm_exit();
}
%

One of the interesting aspects of the PVM interface is the separation of calls to prepare a new message, pack data into the message, and send the message. This is done for several reasons. PVM has the capability to convert between different floating-point formats, byte orderings, and character formats. This also allows a single message to have multiple data items with different types.
The purpose of the message type in each PVM send or receive is to allow the sender to wait for a particular type of message. In this example, we use two message types. Type one is a message from the master to the slave, and type two is the response.
When performing a receive, a process can either wait for a message from a specific process or a message from any process. In the second phase of the computation, the master waits for a response from any slave, prints the response, and then doles out another work unit to the slave or tells the slave to terminate by sending a message with a value of -1.
The slave code is quite simple - it waits for a message, unpacks it, checks to see if it is a termination message, returns a response, and repeats:

% cat slav.c
#include <stdio.h>
#include "pvm3.h"

/* A simple program to double integers */
main()
{
  int mytid;
  int input,output;
  mytid = pvm_mytid();

  while(1) {
    pvm_recv( -1, 1 );  /* -1 = any task  1=msgtype */
    pvm_upkint(&input, 1, 1);
    if ( input == -1 ) break;  /* All done */

    output = input * 2;
    pvm_initsend( PvmDataDefault );
    pvm_pkint( &mytid, 1, 1 );
    pvm_pkint( &input, 1, 1 );
    pvm_pkint( &output, 1, 1 );
    pvm_send( pvm_parent(), 2 );
  }
  pvm_exit();
}
%

When the master program is executed, it produces the following output:

% pheat
Thanks to 262204 2*0=0
Thanks to 262205 2*1=2
Thanks to 262206 2*2=4
Thanks to 262207 2*3=6
Thanks to 262204 2*5=10
Thanks to 262205 2*6=12
Thanks to 262206 2*7=14
Thanks to 262207 2*8=16
Thanks to 262204 2*9=18
Thanks to 262205 2*10=20
Thanks to 262206 2*11=22
Thanks to 262207 2*12=24
Thanks to 262205 2*14=28
Thanks to 262207 2*16=32
Thanks to 262205 2*17=34
Thanks to 262207 2*18=36
Thanks to 262204 2*13=26
Thanks to 262205 2*19=38
Thanks to 262206 2*15=30
Thanks to 262208 2*4=8
%



Clearly the processes are operating in parallel, and the order of execution is somewhat random. This code is an excellent skeleton for handling a wide range of computations. In the next example, we perform an SPMD-style computation to solve the heat flow problem using PVM.
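The same initsend/pack/send pattern is available from FORTRAN through the PVMF wrappers, which the heat flow example in the next section uses. The fragment below sketches one round trip of the master's work loop in that style; the subroutine is ours, written for illustration only, and it reuses the message tags 1 and 2 from the C example above.

      SUBROUTINE SENDWORK(TID, WORK)
      INCLUDE 'fpvm3.h'
      INTEGER TID, WORK, INFO, BUFID
      INTEGER RTID, INPUT, OUTPUT
* Pack one integer work unit and send it with message type 1.
      CALL PVMFINITSEND(PVMDEFAULT, BUFID)
      CALL PVMFPACK(INTEGER4, WORK, 1, 1, INFO)
      CALL PVMFSEND(TID, 1, INFO)
* Wait for a type-2 reply from any task (-1) and unpack the answer.
      CALL PVMFRECV(-1, 2, BUFID)
      CALL PVMFUNPACK(INTEGER4, RTID, 1, 1, INFO)
      CALL PVMFUNPACK(INTEGER4, INPUT, 1, 1, INFO)
      CALL PVMFUNPACK(INTEGER4, OUTPUT, 1, 1, INFO)
      PRINT *, 'Task', RTID, ' computed 2 *', INPUT, ' =', OUTPUT
      RETURN
      END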

4.2.2.2 Heat Flow in PVM


This next example is a rather complicated application that implements the heat flow problem in PVM. In many ways, it gives some insight into the work that is performed by the HPF environment. We will solve a heat flow in a two-dimensional plate with four heat sources and the edges in zero-degree water, as shown in Figure 4.7 (A two-dimensional plate with four constant heat sources).

A two-dimensional plate with four constant heat sources

Figure 4.7

The data will be spread across all of the processes using a (*, BLOCK) distribution. Columns are distributed to processes in contiguous blocks, and all the row elements in a column are stored on the same process. As with HPF, the process that "owns" a data cell performs the computations for that cell after retrieving any data necessary to perform the computation.
We use a red-black approach but for simplicity, we copy the data back at the end of each iteration. For a true red-black, you would perform the computation in the opposite direction every other time step.
Note that instead of spawning slave processes, the parent process spawns additional copies of itself. This is typical of SPMD-style programs. Once the additional processes have been spawned, all the processes wait at a barrier before they look for the process numbers of the members of the group. Once the processes have arrived at the barrier, they all retrieve a list of the different process numbers:

7 t phetFf yqew rie sxgvhi 9FFGinludeGfpvmQFh9 sxiqi xygDyDgyvDygyvDyppi eewii@xygaRDweswiaPHHA eewii@yaPHHDygyvaPHHA eewii@gyva@ygyvGxygACQA ievBV ih@HXyCIDHXgyvCIAD fvegu@HXyCIDHXgyvCIA vyqsgev sewpsDsewve sxiqi sxwDsxpyDsh@HXxygEIADsi


sxiqi sDDg sxiqi sguDweswi greegiBQH pxewi B qet the wh thing going E toin the phet group gevv wptysxqy@9phet9D sxwA

B sf we re the first in the phet groupD mke some helpers sp @ sxwFiFH A rix hy saIDxygEI gevv wpex@9phet9D HD 9nywhere9D ID sh@sAD siA ixhhy ixhsp B frrier to mke sure we re ll here so we n look them up gevv wpfesi@ 9phet9D xygD sxpy A

B pind my pls nd get their shs E sh re neessry for sending hy saHDxygEI gevv wpqish@9phet9D sD sh@sAA ixhhy
At this point in the code, we have NPROC processes executing in an SPMD mode. The next step is to determine which subset of the array each process will compute. This is driven by the INUM variable, which ranges from 0 to 3 and uniquely identifies these processes.
We decompose the data and store only one quarter of the data on each process. Using the INUM variable, we choose our continuous set of columns to store and compute. The OFFSET variable maps between a "global" column in the entire array and a local column in our local subset of the array. Figure 4.8 (Assigning grid elements to processors) shows a map that indicates which processors store which data elements. The values marked with a B are boundary values and won't change during the simulation. They are all set to 0. This code is often rather tricky to figure out. Performing a (BLOCK, BLOCK) distribution requires a two-dimensional decomposition and exchanging data with the neighbors above and below, in addition to the neighbors to the left and right:




Assigning grid elements to processors

Figure 4.8

B gompute my geometry E ht suset do s proessc @sxwaH vluesA B etul golumn a yppi C golumn @yppi a HA B golumn H a neighors from left B golumn I a send to left B golumns IFFmylen wy ells to ompute B golumn mylen a end to right @mylenaSHA B golumn mylenCI a xeighors from ight @golumn SIA sewps a @sxw FiF HA sewve a @sxw FiF xygEIA yppi a @yGxyg B sxw A wvix a yGxyg sp @ sewve A wvix a ygyv E yppi sx BD9sxwX9DsxwD9 vol9DIDwvixD C 9 qlol9DyppiCIDyppiCwvix B trt gold hy gaHDgyvCI hy aHDyCI fvegu@DgA a HFH ixhhy ixhhy
Now we run the time steps. The first act in each time step is to reset the heat sources. In this simulation, we have four heat sources placed near the middle of the plate. We must restore all the values each time through the simulation as they are modified in the main loop:


B fegin running the time steps hy sguaIDweswi B et the het persistent soures gevv yi@fveguDyDgyvDyppiDwvixD C yGQDygyvGQDIHFHDsxwA gevv yi@fveguDyDgyvDyppiDwvixD C PByGQDygyvGQDPHFHDsxwA gevv yi@fveguDyDgyvDyppiDwvixD C yGQDPBygyvGQDEPHFHDsxwA gevv yi@fveguDyDgyvDyppiDwvixD C PByGQDPBygyvGQDPHFHDsxwA
Now we perform the exchange of the ghost values with our neighboring processes. For example, Process 0 contains the elements for global column 50. To compute the next time step values for column 50, we need column 51, which is stored in Process 1. Similarly, before Process 1 can compute the new values for column 51, it needs Process 0's values for column 50.
Figure 4.9 (Pattern of communication for ghost values) shows how the data is transferred between processors. Each process sends its leftmost column to the left and its rightmost column to the right. Because the first and last processes border unchanging boundary values on the left and right respectively, this is not necessary for columns one and 200. If all is done properly, each process can receive its ghost values from their left and right neighbors.




Pattern of communication for ghost values

Figure 4.9

The net result of all of the transfers is that for each space that must be computed, it's surrounded by one layer of either boundary values or ghost values from the right or left neighbors:

B end left nd right sp @ FxyF sewps A rix gevv wpsxsixh@whipevDiA gevv wpegu@ ievVD fvegu@IDIAD yD ID sxpy A gevv wpixh@ sh@sxwEIAD ID sxpy A ixhsp sp @ FxyF sewve A rix gevv wpsxsixh@whipevDiA gevv wpegu@ ievVD fvegu@IDwvixAD yD ID sxpy A gevv wpixh@ sh@sxwCIAD PD sxpy A ixhsp B eeive rightD then left sp @ FxyF sewve A rix gevv wpig@ sh@sxwCIAD ID fpsh A gevv wpxegu @ ievVD fvegu@IDwvixCIAD yD ID sxpy ixhsp sp @ FxyF sewps A rix gevv wpig@ sh@sxwEIAD PD fpsh A gevv wpxegu @ ievVD fvegu@IDHAD yD ID sxpyA


ixhsp
This next segment is the easy part. All the appropriate ghost values are in place, so we must simply perform the computation in our subspace. At the end, we copy back from the RED to the BLACK array; in a real simulation, we would perform two time steps, one from BLACK to RED and the other from RED to BLACK, to save this extra copy:

B erform the flow hy gaIDwvix hy aIDy ih@DgA a @ fvegu@DgA C C fvegu@DgEIA C fvegu@EIDgA C C fvegu@CIDgA C fvegu@DgCIA A G SFH ixhhy ixhhy B gopy k E xormlly we would do red nd lk version of the loop hy gaIDwvix hy aIDy fvegu@DgA a ih@DgA ixhhy ixhhy ixhhy
Now we find the center cell and send it to the master process (if necessary) so it can be printed out. We also dump out the data into files for debugging or later visualization of the results. Each file is made unique by appending the instance number to the filename. Then the program terminates:

gevv ixhgivv@ihDyDgyvDyppiDwvixDsxwDsh@HAD yGPDygyvGPA

B hump out dt for verifition sp @ y FviF PH A rix pxewi a 9GtmpGphetoutF9 GG gre@sgre@9H9ACsxwA yix@xsaWDxewiapxewiDpywa9formtted9A hy gaIDwvix si@WDIHHA@fvegu@DgADaIDyA IHH pywe@PHpIPFTA ixhhy gvyi@xsaWA ixhsp B vets ll go together gevv wpfesi@ 9phet9D xygD sxpy A gevv wpis@ sxpy A




ixh

The ENDCELL routine finds a particular cell and prints it out on the master process. This routine is called in an SPMD style: all the processes enter this routine although all not at precisely the same time. Depending on the INUM and the cell that we are looking for, each process may do something different.
If the cell in question is in the master process, and we are the master process, print it out. All other processes do nothing. If the cell in question is stored in another process, the process with the cell sends it to the master process. The master process receives the value and prints it out. All the other processes do nothing.
This is a simple example of the typical style of SPMD code. All the processes execute the code at roughly the same time, but, based on information local to each process, the actions performed by different processes may be quite different:

fysxi ixhgivv@ihDyDgyvDyppiDwvixDsxwDshDDgA sxgvhi 9FFGinludeGfpvmQFh9 sxiqi yDgyvDyppiDwvixDsxwDshDDg ievBV ih@HXyCIDHXgyvCIA ievBV gixi B gompute lol row numer to determine if it is ours s a g E yppi sp @ s FqiF I FexhF sFviF wvix A rix sp @ sxw FiF H A rix sx BD9wster hs9D ih@DsAD D gD s ivi gevv wpsxsixh@whipevDiA gevv wpegu@ ievVD ih@DsAD ID ID sxpy A sx BD 9sxwX9DsxwD9 eturning9DDgDih@DsADs gevv wpixh@ shD QD sxpy A ixhsp ivi sp @ sxw FiF H A rix gevv wpig@ EI D QD fpsh A gevv wpxegu @ ievVD gixiD ID ID sxpyA sx BD 9wster eeived9DDgDgixi ixhsp ixhsp ix ixh
Like the previous routine, the STORE routine is executed on all processes. The idea is to store a value into a global row and column position. First, we must determine if the cell is even in our process. If the cell is in our process, we must compute the local column (I) in our subset of the overall matrix and then store the value:


fysxi yi@ihDyDgyvDyppiDwvixDDgDeviDsxwA ievBV ih@HXyCIDHXgyvCIA iev evi sxiqi yDgyvDyppiDwvixDDgDsDsxw s a g E yppi sp @ s FvF I FyF s FqF wvix A ix ih@DsA a evi ix ixh
When this program executes, it has the following output:

% pheat
 INUM: 0 Local 1 50 Global 1 50
 Master Received 100 100 3.4722390023541D-07
%


We see two lines of print. The first line indicates the values that Process 0 used in its geometry computation. The second line is the output from the master process of the temperature at cell (100,100) after 200 time steps.
One interesting technique that is useful for debugging this type of program is to change the number of processes that are created. If the program is not quite moving its data properly, you usually get different results when different numbers of processes are used. If you look closely, the above code performs correctly with one process or 30 processes.
Notice that there is no barrier operation at the end of each time step. This is in contrast to the way parallel loops operate on shared uniform memory multiprocessors that force a barrier at the end of each loop. Because we have used an "owner computes" rule, and nothing is computed until all the required ghost data is received, there is no need for a barrier. The receipt of the messages with the proper ghost values allows a process to begin computing immediately without regard to what the other processes are currently doing.
This example can be used either as a framework for developing other grid-based computations, or as a good excuse to use HPF and appreciate the hard work that the HPF compiler developers have done. A well-done HPF implementation of this simulation should outperform the PVM implementation because HPF can make tighter optimizations. Unlike us, the HPF compiler doesn't have to keep its generated code readable.

4.2.2.3 PVM Summary


PVM is a widely used tool because it affords portability across every architecture other than SIMD. Once the effort has been invested in making a code message passing, it tends to run well on many architectures.
The primary complaints about PVM include:
The need for a pack step separate from the send step
The fact that it is designed to work in a heterogeneous environment that may incur some overhead
It doesn't automate common tasks such as geometry computations
But all in all, for a certain set of programmers, PVM is the tool to use. If you would like to learn more about PVM see PVM - A User's Guide and Tutorial for Networked Parallel Computing, by Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek, and Vaidy Sunderam (MIT Press). Information is also available at www.netlib.org/pvm3/.19

19 http://cnx.org/content/m33779/latest/www.netlib.org/pvm3/



4.2.3 Message-Passing Interface20


The Message-Passing Interface (MPI) was designed to be an industrial-strength message-passing environment that is portable across a wide range of hardware environments.
Much like High Performance FORTRAN, MPI was developed by a group of computer vendors, application developers, and computer scientists. The idea was to come up with a specification that would take the strengths of many of the existing proprietary message passing environments on a wide variety of architectures and come up with a specification that could be implemented on architectures ranging from SIMD systems with thousands of small processors to MIMD networks of workstations and everything in between.
Interestingly, the MPI effort was completed a year after the High Performance FORTRAN (HPF) effort was completed. Some viewed MPI as a portable message-passing interface that could support a good HPF compiler. Having MPI makes the compiler more portable. Also having the compiler use MPI as its message-passing environment insures that MPI is heavily tested and that sufficient resources are invested into the MPI implementation.

4.2.3.1 PVM Versus MPI


While many of the folks involved in PVM participated in the MPI effort, MPI is not simply a follow-on to PVM. PVM was developed in a university/research lab environment and evolved over time as new features were needed. For example, the group capability was not designed into PVM at a fundamental level. Some of the underlying assumptions of PVM were based on a "network of workstations connected via Ethernet" model and didn't export well to scalable computers.21 In some ways, MPI is more robust than PVM, and in other ways, MPI is simpler than PVM. MPI doesn't specify the system management details as in PVM; MPI doesn't specify how a virtual machine is to be created, operated, and used.

4.2.3.2 MPI Features


MPI has a number of useful features beyond the basic send and receive capabilities. These include:

Communicators: A communicator is a subset of the active processes that can be treated as a group for collective operations such as broadcast, reduction, barriers, sending, or receiving. Within each communicator, a process has a rank that ranges from zero to the size of the group. A process may be a member of more than one communicator and have a different rank within each communicator. There is a default communicator that refers to all the MPI processes that is called MPI_COMM_WORLD.

Topologies: A communicator can have a topology associated with it. This arranges the processes that belong to a communicator into some layout. The most common layout is a Cartesian decomposition. For example, 12 processes may be arranged into a 3 x 4 grid.22 Once these topologies are defined, they can be queried to find the neighboring processes in the topology. In addition to the Cartesian (grid) topology, MPI also supports a graph-based topology.

Communication modes: MPI supports multiple styles of communication, including blocking and nonblocking. Users can also choose to use explicit buffers for sending or allow MPI to manage the buffers. The nonblocking capabilities allow the overlap of communication and computation. MPI can support a model in which there is no available memory space for buffers and the data must be copied directly from the address space of the sending process to the memory space of the receiving process. MPI also supports a single call to perform a send and receive that is quite useful when processes need to exchange data.

Single-call collective operations: Some of the calls in MPI automate collective operations in a single call. For example, the broadcast operation sends values from the master to the slaves and receives the values on the slaves in the same operation. The net result is that the values are updated on all

20 This content is available online at <http://cnx.org/content/m33783/1.2/>.
21 One should not diminish the positive contributions of PVM, however. PVM was the first widely available portable message-passing environment. PVM pioneered the idea of heterogeneous distributed computing with built-in format conversion.
22 Sounds a little like HPF, no?

processes. Similarly, there is a single call to sum a value across all of the processes to a single value. By bundling all this functionality into a single call, systems that have support for collective operations in hardware can make best use of this hardware. Also, when MPI is operating on a shared-memory environment, the broadcast can be simplified as all the slaves simply make a local copy of a shared variable.
Clearly, the developers of the MPI specification had significant experience with developing message-passing applications and added many widely used features to the message-passing library. Without these features, each programmer needed to use more primitive operations to construct their own versions of the higher-level operations.
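A sketch of this single-call style from FORTRAN appears below: one call broadcasts a parameter from rank 0 to every process, and one call sums a per-process partial result into a single total on rank 0. The subroutine and variable names are ours, for illustration; a complete program would surround this with the usual MPI_INIT/MPI_FINALIZE setup.

      SUBROUTINE COLLECT(PARTIAL, TOTAL)
      INCLUDE 'mpif.h'
      DOUBLE PRECISION PARTIAL, TOTAL, FACTOR
      INTEGER IERR
* Broadcast FACTOR from rank 0 so every process works with the same
* value (in a real code only rank 0 would need to set it beforehand).
      FACTOR = 0.25
      CALL MPI_BCAST(FACTOR, 1, MPI_DOUBLE_PRECISION, 0,
     &               MPI_COMM_WORLD, IERR)
* Every process contributes PARTIAL; the sum arrives in TOTAL on rank 0.
      CALL MPI_REDUCE(PARTIAL, TOTAL, 1, MPI_DOUBLE_PRECISION,
     &                MPI_SUM, 0, MPI_COMM_WORLD, IERR)
      RETURN
      END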

4.2.3.3 Heat Flow in MPI


In this example, we implement our heat flow problem in MPI using a similar decomposition to the PVM example. There are several ways to approach the problem. We could almost translate PVM calls to corresponding MPI calls using the MPI_COMM_WORLD communicator. However, to showcase some of the MPI features, we create a Cartesian communicator:

B B B B

yqew wrieg sxgvhi 9mpifFh9 sxgvhi 9mpefFh9 sxiqi yDgyvDygyv eewii@weswiaPHHA his simultion n e run on wsxyg or greter proessesF st is yu to set wsxyg to I for testing purposes por lrge numer of rows nd olumnsD it is est to set wsxyg to the tul numer of runtime proesses eewii@wsxygaPA eewii@yaPHHDygyvaPHHDgyvaygyvGwsxygA hyfvi igssyx ih@HXyCIDHXgyvCIADfvegu@HXyCIDHXgyvCIA sxiqi DiDwvixDDg sxiqi sguDweswi greegiBQH pxewi

The basic data structures are much the same as in the PVM example. We allocate a subset of the heat arrays in each process. In this example, the amount of space allocated in each process is set by the compile-time variable MINPROC. The simulation can execute on more than MINPROC processes (wasting some space in each process), but it can't execute on less than MINPROC processes, or there won't be sufficient total space across all of the processes to hold the array:

sxiqi sxiqi vyqsgev vyqsgev sxiqi sxiqi sxiqi

gywwIhDsxwDxygDsi hsw@IADgyyh@IA isyh@IA iyhi xhsw e@wsesiA sqrygD vipyg



These data structures are used for our interaction with MPI. As we will be doing a one-dimensional Cartesian decomposition, our arrays are dimensioned to one. If you were to do a two-dimensional decomposition, these arrays would need two elements:

sx BD9glling wssxs9 gevv wssxs@ si A sx BD9fk from wssxs9 gevv wsgywwsi@ wsgywwyvhD xygD si A
The call to MPI_INIT creates the appropriate number of processes. Note that in the output, the PRINT statement before the call only appears once, but the second PRINT appears once for each process. We call MPI_COMM_SIZE to determine the size of the global communicator MPI_COMM_WORLD. We use this value to set up our Cartesian topology:

B grete new ommunitor tht hs grtesin topology ssoited B with it E wsgegiei returns gywwIh E e ommunitor desriptor hsw@IA a xyg isyh@IA a FpeviF iyhi a FiF xhsw a I gevv wsgegiei@wsgywwyvhD xhswD hswD isyhD C iyhiD gywwIhD siA
Now we create a one-dimensional (NDIM=1) arrangement of all of our processes (MPI_COMM_WORLD). All of the parameters on this call are input values except for COMM1D and IERR. COMM1D is an integer "communicator handle." If you print it out, it will be a value such as 134. It is not actually data, it is merely a handle that is used in other calls. It is quite similar to a file descriptor or unit number used when performing input-output to and from files.
The topology we use is a one-dimensional decomposition that isn't periodic. If we specified that we wanted a periodic decomposition, the far-left and far-right processes would be neighbors in a wrapped-around fashion making a ring. Given that it isn't periodic, the far-left and far-right processes have no neighbors.
In our PVM example above, we declared that Process 0 was the far-right process, Process NPROC-1 was the far-left process, and the other processes were arranged linearly between those two. If we set REORDER to .FALSE., MPI also chooses this arrangement. However, if we set REORDER to .TRUE., MPI may choose to arrange the processes in some other fashion to achieve better performance, assuming that you are communicating with close neighbors.
Once the communicator is set up, we use it in all of our communication operations:

B qet my rnk in the new ommunitor gevv wsgywwexu@ gywwIhD sxwD siA


Within each communicator, each process has a rank from zero to the size of the communicator minus 1. The MPI_COMM_RANK tells each process its rank within the communicator. A process may have a different rank in the COMM1D communicator than in the MPI_COMM_WORLD communicator because of some reordering.
Given a Cartesian topology communicator,23 we can extract information from the communicator using the MPI_CART_GET routine:

B qiven ommunitor hndle gywwIhD get the topologyD nd my position B in the topology gevv wsgeqi@gywwIhD xhswD hswD isyhD gyyhD siA
In this call, all of the parameters are output values rather than input values as in the MPI_CART_CREATE call. The COORDS variable tells us our coordinates within the communicator. This is not so useful in our one-dimensional example, but in a two-dimensional process decomposition, it would tell our current position in that two-dimensional grid:

B eturns the left nd right neighors I unit wy in the zeroth dimension B of our grtesin mp E sine we re not periodiD our neighors my B not lwys exist E wsgersp hndles this for us gevv wsgersp@gywwIhD HD ID vipygD sqrygD siA gevv wihigywIh@ygyvD xygD sxwD D iA wvix a @ i E A C I sp @ wvixFqFgyv A rix sx BD9xot enough speD need9DwvixD9 hve 9Dgyv sx BDygyvDxygDsxwDDi y ixhsp sx BDsxwDxygDgyyh@IADvipygDsqrygD D i
We can use MPI_CART_SHIFT to determine the rank number of our left and right neighbors, so we can exchange our common points with these neighbors. This is necessary because we can't simply send to INUM-1 and INUM+1 if MPI has chosen to reorder our Cartesian decomposition. If we are the far-left or far-right process, the neighbor that doesn't exist is set to MPI_PROC_NULL, which indicates that we have no neighbor. Later when we are performing message sending, it checks this value and sends messages only to real processes. By not sending the message to the "null process," MPI has saved us an IF test.
To determine which strip of the global array we store and compute in this process, we call a utility routine called MPE_DECOMP1D that simply does several calculations to evenly split our 200 columns among our processes. In the PVM version, we need to perform this computation by hand.
The MPE_DECOMP1D routine is an example of an extended MPI library call (hence the MPE prefix). These extensions include graphics support and logging tools in addition to some general utilities. The MPE library
23 Remember, each communicator may have a topology associated with it. A topology can be a grid, graph, or none. Interestingly, the MPI_COMM_WORLD communicator has no topology associated with it.



consists of routines that were useful enough to standardize but not required to be supported by all MPI implementations. You will find the MPE routines supported on most MPI implementations.
Now that we have our communicator group set up, and we know which strip each process will handle, we begin the computation:

B trt gold hy gaHDgyvCI hy aHDyCI fvegu@DgA a HFH ixhhy ixhhy


As in the PVM example, we set the plate (including boundary values) to zero. All processes begin the time step loop. Interestingly, like in PVM, there is no need for any synchronization. The messages implicitly synchronize our loops.
The first step is to store the permanent heat sources. We need to use a routine because we must make the store operations relative to our strip of the global array:

B fegin running the time steps hy sguaIDweswi B et the persistent het soures gevv yi@fveguDyDgyvDDiDyGQDygyvGQDIHFHDsxwA gevv yi@fveguDyDgyvDDiDPByGQDygyvGQDPHFHDsxwA gevv yi@fveguDyDgyvDDiDyGQDPBygyvGQDEPHFHDsxwA gevv yi@fveguDyDgyvDDiDPByGQDPBygyvGQDPHFHDsxwA
All of the processes set these values independently depending on which process has which strip of the overall array.
Now we exchange the data with our neighbors as determined by the Cartesian communicator. Note that we don't need an IF test to determine if we are the far-left or far-right process. If we are at the edge, our neighbor setting is MPI_PROC_NULL and the MPI_SEND and MPI_RECV calls do nothing when given this as a source or destination value, thus saving us an IF test.
Note that we specify the communicator COMM1D because the rank values we are using in these calls are relative to that communicator:

B end left nd reeive right gevv wsixh@fvegu@IDIADyDwshyfviigssyxD C vipygDIDgywwIhDsiA gevv wsig@fvegu@IDwvixCIADyDwshyfviigssyxD C sqrygDIDgywwIhDeDsiA


B end ight gevv C C C

nd eeive left in single sttement wsixhig@ fvegu@IDwvixADyDgywwIhDsqrygDPD fvegu@IDHADyDgywwIhDvipygD PD wsgywwyvhD eD siA

tust to show o'D we use oth the seprte send nd reeiveD nd the omined send nd reeiveF hen given hoieD it9s proly good ide to use the omined opertions to give the runtime environment more )exiility in terms of u'eringF yne downside to this tht ours on network of worksttions @or ny other highElteny interonnetA is tht you n9t do oth send opertions (rst nd then do oth reeive opertions to overlp some of the ommunition delyF yne we hve ll of our ghost points from our neighorsD we n perform the lgorithm on our suset of the speX

B erform the flow hy gaIDwvix hy aIDy ih@DgA a @ fvegu@DgA C C fvegu@DgEIA C fvegu@EIDgA C C fvegu@CIDgA C fvegu@DgCIA A G SFH ixhhy ixhhy B gopy k E xormlly we would do red nd lk version of the loop hy gaIDwvix hy aIDy fvegu@DgA a ih@DgA ixhhy ixhhy ixhhy
eginD for simpliityD we don9t do the omplete redElk omputtionF24 e hve no synhroniztion t the ottom of the loop euse the messges impliitly synhronize the proesses t the top of the next loopF eginD we dump out the dt for veri(tionF es in the w exmpleD one good test of si orretness is to mke sure you get extly the sme results for vrying numers of proessesX

B hump out dt for verifition sp @ y FviF PH A rix pxewi a 9GtmpGmhetoutF9 GG gre@sgre@9H9ACsxwA yix@xsaWDxewiapxewiDpywa9formtted9A hy gaIDwvix si@WDIHHA@fvegu@DgADaIDyA
24 Note
loop. that you could do two time steps (one black-red-black iteration) if you exchanged two ghost columns at the top of the

PQP

CHAPTER 4. SCALABLE PARALLEL PROCESSING


IHH pywe@PHpIPFTA ixhhy gvyi@xsaWA ixhsp

o terminte the progrmD we ll wspsxevsiX

B vets ll go together gevv wspsxevsi@siA ixh


es in the w exmpleD we need routine to store vlue into the proper strip of the glol rryF his routine simply heks to see if prtiulr glol element is in this proess nd if soD omputes the proper lotion within its strip for the vlueF sf the glol element is not in this proessD this routine simply returns doing nothingX

fysxi yi@ihDyDgyvDDiDDgDeviDsxwA ievBV ih@HXyCIDHXgyvCIA iev evi sxiqi yDgyvDDiDDgDsDsxw sp @ g FvF FyF g FqF i A ix s a @ g E A C I sx BD9yiD sxwDDgDDiDDs9DsxwDDgDDiDDsDevi ih@DsA a evi ix ixh

hen this progrm is exeutedD it hs the following outputX

7 mpifUU E mhetFf mhetFfX wesx mhetX storeX 7 mpifUU Eo mhet mhetFo Elmpe 7 mhet Enp R glling wssxs fk from wssxs fk from wssxs fk from wssxs fk from wssxs H R H EI I I SH P R P I Q IHI ISH Q R Q P EI ISI PHH I R I H P SI IHH

PQQ

7
es you n seeD we ll wssxs to tivte the four proessesF he sx sttement immeditely fter the wssxs ll ppers four timesD one for eh of the tivted proessesF hen eh proess prints out the strip of the rry it will proessF e n lso see the neighors of eh proess inluding EI when proess hs no neighor to the left or rightF xotie tht roess H hs no left neighorD nd roess Q hs no right neighorF ws hs provided us the utilities to simplify messgeEpssing ode tht we need to dd to implement this type of gridE sed pplitionF hen you ompre this exmple with w implementtion of the sme prolemD you n see some of the ontrsts etween the two pprohesF rogrmmers who wrote the sme six lines of ode over nd over in w omined them into single ll in wsF sn wsD you n think dt prllel nd express your progrm in more dtEprllel fshionF sn some wysD ws feels less like ssemly lnguge thn wF roweverD ws does tke little getting used to when ompred to wF he onept of grtesin ommunitor my seem foreign t (rstD ut with understndingD it eomes )exile nd powerful toolF

4.2.3.4 Heat in MPI Using Broadcast/Gather

yne style of prllel progrmming tht we hve not yet seen is the styleF xot ll ppliE tions n e nturlly solved using this style of progrmmingF roweverD if n pplition n use this pproh e'etivelyD the mount of modi(tion tht is required to mke ode run in messgeEpssing environment is minimlF epplitions tht most ene(t from this pproh generlly do lot of omputtion using some smll mount of shred informtionF yne requirement is tht one omplete opy of the shred informtion must (t in eh of the proessesF sf we keep our grid size smll enoughD we n tully progrm our het )ow pplition using this pprohF his is lmost ertinly less e0ient implementtion thn ny of the erlier implementtions of this prolem euse the ore omputtion is so simpleF roweverD if the ore omputtions were more omplex nd needed ess to vlues frther thn one unit wyD this might e good pprohF he dt strutures re simpler for this pproh ndD tullyD re no di'erent thn the singleEproess pyex WH or rp versionsF e will llote omplete ih nd fvegu rry in every proessX

broadcast/gather

yqew wrie sxgvhi 9mpifFh9 sxgvhi 9mpefFh9 sxiqi yDgyv eewii@weswiaPHHA eewii@yaPHHDgyvaPHHA hyfvi igssyx ih@HXyCIDHXgyvCIADfvegu@HXyCIDHXgyvCIA
e need fewer vriles for the ws lls euse we ren9t reting ommunitorF e simply use the defult ommunitor wsgywwyvhF e strt up our proessesD nd (nd the size nd rnk of our proess groupX

sxiqi sxwDxygDsiDgDhiDeq

PQR

CHAPTER 4. SCALABLE PARALLEL PROCESSING


sxiqi DiDvDviDwvix sxiqi e@wsesiA sxiqi sDDg sxiqi sguDweswi greegiBQH pxewi sx BD9glling wssxs9 gevv wssxs@ si A gevv wsgywwsi@ wsgywwyvhD xygD si A gevv wsgywwexu@ wsgywwyvhD sxwD siA gevv wihigywIh@gyvD xygD sxwD D iD siA sx BD9wy hre 9D sxwD xygD D i

ine we re rodsting initil vlues to ll of the proessesD we only hve to set things up on the mster proessX

B trt gold sp @ sxwFiFH A rix hy gaHDgyvCI hy aHDyCI fvegu@DgA a HFH ixhhy ixhhy ixhsp
es we run the time steps @gin with no synhroniztionAD we set the persistent het soures diretlyF ine the shpe of the dt struture is the sme in the mster nd ll other proessesD we n use the rel rry oordintes rther thn mpping them s with the previous exmplesF e ould skip the persistent settings on the nonmster proessesD ut it doesn9t hurt to do it on ll proessesX

B fegin running the time steps hy sguaIDweswi B et the het soures fvegu@yGQD gyvGQAa IHFH fvegu@PByGQD gyvGQA a PHFH fvegu@yGQD PBgyvGQA a EPHFH fvegu@PByGQD PBgyvGQA a PHFH
xow we rodst the entire rry from proess rnk zero to ll of the other proesses in the wsgywwyvh ommunitorF xote tht this ll does the sending on rnk zero proess nd reeiving on the other proessesF he net result of this ll is tht ll the proesses hve the vlues formerly in the mster proess in single llX

PQS

B frodst the rry gevv wsfge@fveguD@yCPAB@gyvCPADwshyfviigssyxD C HDwsgywwyvhDsiA


xow we perform the suset omputtion on eh proessF xote tht we re using glol oordintes euse the rry hs the sme shpe on eh of the proessesF ell we need to do is mke sure we set up our prtiulr strip of olumns ording to nd iX

B erform the flow on our suset hy gaDi hy aIDy ih@DgA a @ fvegu@DgA C fvegu@DgEIA C fvegu@EIDgA C fvegu@CIDgA C fvegu@DgCIA A G SFH ixhhy ixhhy

C C

xow we need to gther the pproprite strips from the proesses into the pproprite strip in the mster rry for rerodst in the next time stepF e ould hnge the loop in the mster to reeive the messges in ny order nd hek the e vrile to see whih strip it reeivedX

B qther k up into the fvegu rry in mster @sxw a HA sp @ sxw FiF H A rix hy gaDi hy aIDy fvegu@DgA a ih@DgA ixhhy ixhhy hy saIDxygEI gevv wihigywIh@gyvD xygD sD vD viD siA wvix a @ vi E v A C I g a s eq a H gevv wsig@fvegu@HDvADwvixB@yCPAD C wshyfviigssyxD gD eqD C wsgywwyvhD eD siA B rint BD9ev9DsDwvix ixhhy ivi wvix a @ i E A C I hi a H eq a H

PQT

CHAPTER 4. SCALABLE PARALLEL PROCESSING


gevv wsixh@ih@HDADwvixB@yCPADwshyfviigssyxD hiD eqD wsgywwyvhD siA rint BD9end9DsxwDwvix ixhsp ixhhy C

e use wihigywIh to determine whih strip we9re reeiving from eh proessF sn some pplitionsD the vlue tht must e gthered is sum or nother single vlueF o omplish thisD you n use one of the ws redution routines tht olese set of distriuted vlues into single vlue using single llF egin t the endD we dump out the dt for testingF roweverD sine it hs ll een gthered k onto the mster proessD we only need to dump it on one proessX

B hump out dt for verifition sp @ sxw FiFH FexhF y FviF PH A rix pxewi a 9GtmpGmhetout9 yix@xsaWDxewiapxewiDpywa9formtted9A hy gaIDgyv si@WDIHHA@fvegu@DgADaIDyA IHH pywe@PHpIPFTA ixhhy gvyi@xsaWA ixhsp gevv wspsxevsi@siA ixh
hen this progrm exeutes with four proessesD it produes the following outputX

7 mpifUU E mhetFf mhetFfX wesx mhetX 7 mpifUU Eo mhet mhetFo Elmpe 7 mhet Enp R glling wssxs wy hre I R SI IHH wy hre H R I SH wy hre Q R ISI PHH wy hre P R IHI ISH 7
he rnks of the proesses nd the susets of the omputtions for eh proess re shown in the outputF o tht is somewht ontrived exmple of the rodstGgther pproh to prllelizing n ppliE tionF sf the dt strutures re the right size nd the mount of omputtion reltive to ommunition is ppropriteD this n e very e'etive pproh tht my require the smllest numer of ode modi(tions ompred to singleEproessor version of the odeF

PQU

4.2.3.5 MPI Summary


hether you hose w or ws depends on whih lirry the vendor of your system prefersF ometimes ws is the etter hoie euse it ontins the newest feturesD suh s support for hrdwreEsupported multist or rodstD tht n signi(ntly improve the overll performne of stterEgther pplitionF e good text on ws is sing D y illim qroppD iwing vuskD nd enthony kjellum @ws ressAF ou my lso wnt to retrieve nd print the ws spei(tion from httpXGGwwwFnetliForgGmpiG25 F

MPI  Portable Parallel Programmingwith the Message-Passing Interface

4.2.4 Closing Notes26


sn this hpter we hve looked t the ssemly lnguge of prllel progrmmingF hile it n seem dunting to rethink your pplitionD there re often some simple hnges you n mke to port your ode to messge pssingF hepending on the pplitionD msterEslveD rodstEgtherD or deomposed dt pproh might e most ppropriteF st9s importnt to relize tht some pplitions just don9t deompose into messge pssing very wellF ou my e working with just suh n pplitionF yne you hve some experiene with messge pssingD it eomes esier to identify the ritil points where dt must e ommunited etween proessesF hile rpD wD nd ws re ll mture nd populr tehnologiesD it9s not ler whether ny of these tehnologies will e the longEterm solution tht we will use IH yers from nowF yne possiility is tht we will use pyex WH @or pyex WSA without ny dt lyout diretives or tht the diretives will e optionlF enother interesting possiility is simply to keep using pyex UUF es slleD heEoherentD nonEuniform memory systems eome more populrD they will evolve their own dt llotion primitivesF por exmpleD the rGixemplr supports the following dt storge ttriutesX shredD nodeEprivteD nd thredEprivteF es dynmi dt strutures re llotedD they n e pled in ny one of these lssesF xodeEprivte memory is shred ross ll the threds on single node ut not shred eyond those thredsF erhps we will only hve to delre the storge lss of the dt ut not the dt lyout for these new mhinesF w nd ws still need the pility of supporting fultEtolernt style of omputing tht llows n pplition to omplete s resoures fil or otherwise eome villeF he mount of ompute power tht will e ville to pplitions tht n tolerte some unreliility in the resoures will e very lrgeF here hve een numer of modertely suessful ttempts in this re suh s gondorD ut none hve relly ught on in the minstremF o run the most powerful omputers in the world t their solute mximum performne levelsD the need to e portle is somewht reduedF wking your prtiulr pplition go ever fster nd sle to ever higher numers of proessors is fsinting tivityF wy the pvy e with you3

25 http://www.netlib.org/mpi/ 26 This content is available online

at <http://cnx.org/content/m33784/1.2/>.

PQV

CHAPTER 4. SCALABLE PARALLEL PROCESSING

Chapter 5
Appendixes

5.1 Appendix C: High Performance Microprocessors


5.1.1 Introduction1
5.1.1.1 High Performance Microprocessors
st hs een sid tht history is rewritten y the vitorsF st is ler tht high performne sgEsed miroproessors re de(ning the urrent history of high performne omputingF e egin our study with the si uilding loks of modern high performne omputingX the high performne sg miroproessorsF e @gsgA instrution set is mde up of powerful primitivesD lose in funtionlity to the primitives of highElevel lnguges like g or pyexF st ptures the sense of don9t do in softwre wht you n do in hrdwreF sgD on the other hndD emphsizes lowElevel primitivesD fr elow the omplexity of highElevel lngugeF ou n ompute nything you wnt using either pprohD though it will proly tke more mhine instrutions if you9re using sgF he importnt di'erene is tht with sg you n trde instrutionEset omplexity for speedF o e firD sg isn9t relly ll tht newF here were some importnt erly mhines tht pioneered sg philosophiesD suh s the ghg TTHH @IWTRA nd the sfw VHI projet @IWUSAF st ws in the midEIWVHsD howeverD tht sg mhines (rst posed diret hllenge to the gsg instlled seF reted dete roke out " sg versus gsg " nd even lingers todyD though it is ler tht the sg2 pproh is in gretest fvorY lteEgenertion gsg mhines re looking more sgElikeD nd some very old fmilies of gsgD suh s the hig eD re eing retiredF his hpter is out gsg nd sg instrution set rhitetures nd the di'erenes etween themF e lso desrie newer proessors tht n exeute more thn one instrution t time nd n exeute instrutions out of orderF

complex instruction set computer

5.1.2 Why CISC?3


ou might skD sf sg is fsterD why did people other with gsg designs in the (rst plec he short nswer is tht in the eginningD gsg the right wy to goY sg wsn9t lwys oth fesile nd 'ordleF ivery kind of design inorportes trdeEo'sD nd over timeD the est systems will mke them di'erentlyF sn the pstD the design vriles fvored gsgF

was

1 This content is available online at <http://cnx.org/content/m33671/1.2/>. 2 One of the most interesting remaining topics is the denition of RISC. Don't

be fooled into thinking there is one denition

of RISC. The best I have heard so far is from John Mashey: RISC is a label most commonly used for a set of instruction set architecture characteristics chosen to ease the use of aggressive implementation techniques found in high performance processors (regardless of RISC, CISC, or irrelevant).

3 This

content is available online at <http://cnx.org/content/m33672/1.2/>.

PQW

PRH

CHAPTER 5. APPENDIXES

5.1.2.1 Space and Time


o strtD we9ll sk you how well you know the ssemly lnguge for your workE sttionF he nswer is proly tht you hven9t even seen itF hy otherc gompilers nd development tools re very goodD nd if you hve prolemD you n deug it t the soure levelF roweverD QH yers goD respetle progrmmers understood the mhine9s instrution setF righElevel lnguge ompilers were ommonly villeD ut they didn9t generte the fstest odeD nd they weren9t terrily thrifty with memoryF hen progrmmingD you needed to sve oth spe nd timeD whih ment you knew how to progrm in ssemly lngugeF eordinglyD you ould develop n opinion out the mhine9s instrution setF e good instrution set ws oth esy to use nd powerfulF sn mny wys these qulities were the smeX powerful instrutions omplished lotD nd sved the progrmmer from speifying mny little steps " whihD in turnD mde them esy to useF fut they hd otherD less pprent @though perhps more importntA fetures s wellX powerful instrutions sved memory nd timeF fk thenD omputers hd very little storge y tody9s stndrdsF en instrution tht ould roll ll the steps of omplex opertionD suh s doEloopD into single opode4 ws plusD euse memory ws preiousF o put some stkes in the groundD onsider the lst vuumEtue omputer tht sfw uiltD the model UHR @IWSTAF st hd hrdwre )otingEpointD inluding division opertionD index registersD nd instrutions tht ould operte diretly on memory lotionsF por instneD you ould dd two numers together nd store the result k into memory with single ommndF he hilo PHHHD n erly trnsistorized mhine @IWSWAD hd n opertion tht ould repet sequene of instrutions until the ontents of ounter ws deremented to zero " very muh like doEloopF hese were omplex opertionsD even y tody9s stndrdsF roweverD oth mhines hd limited mount of memory " QPEu wordsF he less memory your progrm took upD the more you hd ville for dtD nd the less likely tht you would hve to resort to overlying portions of the progrm on top of one notherF gomplex instrutions sved timeD tooF elmost every lrge omputer following the sfw UHR hd memory system tht ws slower thn its entrl proessing unit @gAF hen single instrution n perform severl opertionsD the overll numer of instrutions retrieved from memory n e reduedF winimizing the numer of instrutions ws prtiulrly importnt euseD with few exeptionsD the mhines of the lte IWSHs were very sequentilY not until the urrent instrution ws ompleted did the omputer initite the proess of going out to memory to get the next instrutionF5 fy ontrstD modern mhines form something of uket rigde " pssing instrutions in from memory nd (guring out wht they do on the wy " so there re fewer gps in proessingF sf the designers of erly mhines hd hd very fst nd undnt instrution memoryD sophistited ompilersD nd the wherewithl to uild the instrution uket rigde " heply " they might hve hosen to rete mhines with simple instrution setsF et the timeD howeverD tehnologil hoies indited tht instrutions should e powerful nd thrifty with memoryF

5.1.2.2 Beliefs About Complex Instruction Sets


oD given tht the lot ws st in fvor of omplex instrution setsD omputer rhitets hd liense to experiE ment with mthing them to the intended purposes of the mhinesF por instneD the doEloop instrution on the hilo PHHH looked like good ompnion for proedurl lnguges like pyexF whine designers ssumed tht ompiler writers ould generte ojet progrms using these powerful mhine instrutionsD or possily tht the ompiler ould e elimintedD nd tht the mhine ould exeute soure ode diretly in hrdwreF ou n imgine how these ides set the tone for produt mrketingF p until the erly IWVHsD it ws ommon prtie to equte igger instrution set with more powerful omputerF hen lok speeds were inresing y multiplesD no inrese in instrution set omplexity ould fetter new model of omputer
4 Opcode = operation 5 In 1955, IBM began
code = instruction. constructing a machine known as Stretch. It was the rst computer to process several instructions

at a time in stages, so that they streamed in, rather than being fetched in a piece- meal fashion. The goal was to make it 25 times faster than the then brand-new IBM 704. It was six years before the rst Stretch was delivered to Los Alamos National Laboratory. It was indeed faster, but it was expensive to build. Eight were sold for a loss of $20 million.

PRI enough so tht there wsn9t still tremendous net inrese in speedF gsg mhines kept getting fsterD in spite of the inresed opertion omplexityF es it turned outD ssemly lnguge progrmmers used the omplited mhine instrutionsD ut omE pilers generlly did notF st ws di0ult enough to get ompiler to reognize when omplited instrution ould e usedD ut the rel prolem ws one of optimiztionsX vertim trnsltion of soure onstruts isn9t very e0ientF en optimizing ompiler works y simplifying nd eliminting redundnt omputtionsF efter pss through n optimizing ompilerD opportunities to use the omplited instrutions tend to dispperF

5.1.3 Fundamental of RISC6


e sg mhine ould hve een uilt in IWTHF @sn ftD eymour gry uilt one in IWTR " the ghg TTHHFA roweverD given the sme osts of omponentsD tehnil rriersD nd even expettions for how omputers would e usedD you would proly still hve hosen gsg design " even with the ene(t of hindsightF he ext inspirtion tht led to developing high performne sg miroproessors in the IWVHs is sujet of some deteF egrdless of the motivtion of the sg designersD there were severl ovious pressures tht 'eted the development of sgX

he numer of trnsistors tht ould (t on single hip ws inresingF st ws ler tht one would eventully e le to (t ll the omponents from proessor ord onto single hipF ehniques suh s pipelining were eing explored to improve performneF rileElength instrutions nd vrileElength instrution exeution times @due to vrying numers of miroode stepsA mde implementing pipelines more di0ultF es ompilers improvedD they found tht wellEoptimized sequenes of stremE lined instrutions often outperformed the equivlent omplited multiEyle instrutionsF @ee eppendix eD roessor erhiE teturesD nd eppendix fD vooking t essemly vngugeFA
he sg designers sought to rete high performne singleEhip proessor with fst lok rteF hen g n (t on single hipD its ost is deresedD its reliility is inresedD nd its lok speed n e inresedF hile not ll sg proessors re singleEhip implementtionD most use single hipF o omplish this tskD it ws neessry to disrd the existing gsg instrution sets nd develop new miniml instrution set tht ould (t on single hipF rene the term F sn sense reduing the instrution set ws not n end ut mens to n endF por the (rst genertion of sg hipsD the restritions on the numer of omponents tht ould e mnuftured on single hip were severeD foring the designers to leve out hrdwre support for some instrutionsF he erliest sg proessors hd no )otingEpoint support in hrdwreD nd some did not even support integer multiply in hrdwreF roweverD these instrutions ould e implemented using softwre routines tht omined other instrutions @ miroode of sortsAF hese erliest sg proessors @most severely reduedA were not overwhelming suesses for four resonsX

reduced instruction set computer

st took time for ompilersD operting systemsD nd user softwre to e retuned to tke dvntge of the new proessorsF sf n pplition depended on the performne of one of the softwreEimplemented instrutionsD its performne su'ered drmtillyF feuse sg instrutions were simplerD more instrutions were needed to omplish the tskF feuse ll the sg instrutions were QP its longD nd ommonly used gsg instrutions were s short s V itsD sg progrm exeutles were often lrgerF
es result of these lst two issuesD sg progrm my hve to feth more memory for its instrutions thn gsg progrmF his inresed ppetite for instrutions tully logged the memory ottlenek until su0ient hes were dded to the sg proessorsF sn some senseD you ould view the hes on sg
6 This
content is available online at <http://cnx.org/content/m33673/1.2/>.

PRP

CHAPTER 5. APPENDIXES

proessors s the miroode store in gsg proessorF foth redued the overll ppetite for instrutions tht were loded from memoryF hile the sg proessor designers worked out these issues nd the mnufturing pility improvedD there ws ttle etween the existing @now lled gsgA proessors nd the new sg @not yet suessfulA proessorsF he gsg proessor designers hd mture designs nd wellEtuned populr softwreF hey lso kept dding performne triks to their systemsF fy the time wotorol hd evolved from the wgTVHHH in IWVP tht ws gsg proessor to the wgTVHRH in IWVWD they referred to the wgTVHRH s sg proessorF7 roweverD the sg proessors eventully eme suessfulF es the mount of logi ville on single hip inresedD )otingEpoint opertions were dded k onto the hipF ome of the dditionl logi ws used to dd onEhip he to solve some of the memory ottlenek prolems due to the lrger ppetite for instrution memoryF hese nd other hnges moved the sg rhitetures from the defensive to the o'ensiveF sg proessors quikly eme known for their 'ordle highEspeed )otingE point pility ompred to gsg proessorsF8 his exellent performne on sienti( nd engineering pplitions e'etively reted new type of omputer systemD the worksttionF orksttions were more expensive thn personl omputers ut their ost ws su0iently low tht worksttions were hevily used in the gehD grphisD nd design resF he emerging worksttion mrket e'etively reted three new omputer ompnies in epolloD un wirosystemsD nd ilion qrphisF ome of the existing ompnies hve reted ompetitive sg proessors in ddition to their gsg designsF sfw developed its ETHHH @syA proessorD whih hd exellent )otingEpoint performneF he elph from hig hs exellent performne in numer of omputing enhmrksF rewlettEkrd hs developed the eEsg series of proessors with exellent performneF wotorol nd sfw hve temed to develop the owerg series of sg proessors tht re used in sfw nd epple systemsF fy the end of the sg revolutionD the performne of sg proessors ws so impressive tht single nd multiproessor sgEsed server systems quikly took over the miniomputer mrket nd re urrently enrohing on the trditionl minfrme mrketF

5.1.3.1 Characterizing RISC


sg is more of design philosophy thn set of golsF yf ourse every sg proessor hs its own personlityF roweverD there re numer of fetures ommonly found in mhines people onsider to e sgX

snstrution pipelining ipelining )otingEpoint exeution niform instrution length helyed rnhing vodGstore rhiteture imple ddressing modes

his list highlights the di'erenes etween sg nd gsg proessorsF xturllyD the two types of instrutionEset rhitetures hve muh in ommonY eh uses registersD memoryD etF end mny of these tehniques re used in gsg mhines tooD suh s hes nd instrution pipelinesF st is the fundmentl di'erenes tht give sg its speed dvntgeX fousing on smller set of less powerful instrutions mkes it possile to uild fster omputerF roweverD the notion tht sg mhines re generlly simpler thn gsg mhines isn9t orretF yther feturesD suh s funtionl pipelinesD sophistited memory systemsD nd the ility to issue two or more instrutions per lok mke the ltest sg proessors the most omplited ever uiltF purthermoreD muh of the omplexity tht hs een lifted from the instrution set hs een driven into the ompilersD mking good optimizing ompiler prerequisite for mhine performneF
7 And they did it without ever taking 8 The typical CISC microprocessor in
out a single instruction! the 1980s supported oating-point operations in a separate coprocessor.

PRQ vet9s put ourselves in the role of omputer rhitet gin nd look t eh item in the list ove to understnd why it9s importntF

5.1.3.2 Pipelines

iverything within digitl omputer @sg or gsgA hppens in step with X signl tht pes the omputer9s iruitryF he rte of the lokD or D determines the overll speed of the proessorF here is n upper limit to how fst you n lok given omputerF e numer of prmeters ple n upper limit on the lok speedD inluding the semiondutor tehnologyD pkgingD the length of wires tying the piees togetherD nd the longest pth in the proessorF elthough it my e possile to reh lzing speed y optimizing ll of the prmetersD the ost n e prohiitiveF purthermoreD exoti omputers don9t mke good o0e mtesY they n require too muh powerD produe too muh noise nd hetD or e too lrgeF here is inentive for mnufturers to stik with mnufturle nd mrketle tehnologiesF eduing the numer of lok tiks it tkes to exeute n individul instrution is good ideD though ost nd prtility eome issues eyond ertin pointF e greter ene(t omes from prtilly overlpping instrutions so tht more thn one n e in progress simultneouslyF por instneD if you hve two dditions to performD it would e nie to exeute them oth t the sme timeF row do you do thtc he (rstD nd perhps most oviousD pprohD would e to strt them simultneouslyF wo dditions would exeute together nd omplete together in the mount of time it tkes to perform oneF es resultD the throughput would e e'etively douledF he downside is tht you would need hrdwre for two dders in sitution where spe is usully t premium @espeilly for the erly sg proessorsAF yther pprohes for overlpping exeution re more ostEe'etive thn sideEyEside exeutionF smgine wht it would e like ifD moment fter lunhing one opertionD you ould lunh nother without witing for the (rst to ompleteF erhps you ould strt nother of the sme type right ehind the (rst one " like the two dditionsF his would give you nerly the performne of sideEyEside exeution without duplited hrdwreF uh mehnism does exist to vrying degrees in ll omputers " gsg nd sgF st9s lled pipelineF e pipeline tkes dvntge of the ft tht mny opertions re divided into identi(le stepsD eh of whih uses di'erent resoures on the proessorF9

clock speed

clock

A Pipeline

Figure 5.1

9 Here

is a simple analogy: imagine a line at a fast-food drive up window. If there is only one window, one customer orders

and pays, and the food is bagged and delivered to the customer before the second customer orders. For busier restaurants, there are three windows. First you order, then move ahead. Then at a second window, you pay and move ahead. At the third window you pull up, grab the food and roar o into the distance. While your wait at the three-window (pipelined) drive-up may have been slightly longer than your wait at the one-window (non-pipelined) restaurant, the pipeline solution is signicantly better because multiple customers are being processed simultaneously.

PRR

CHAPTER 5. APPENDIXES

pigure SFI @e ipelineA shows oneptul digrm of pipelineF en opertion entering t the left proeeds on its own for (ve lok tiks efore emerging t the rightF qiven tht the pipeline stges re independent of one notherD up to (ve opertions n e in )ight t time s long s eh instrution is delyed long enough for the previous instrution to ler the pipeline stgeF gonsider how powerful this mehnism isX where efore it would hve tken (ve lok tiks to get single resultD pipeline produes s muh s one result every lok tikF ipelining is useful when proedure n e divided into stgesF snstrution proessing (ts into tht tegoryF he jo of retrieving n instrution from memoryD (guring out wht it doesD nd doing it re seprte steps we usully lump together when we tlk out exeuting n instrutionF he numer of steps vriesD depending on whose proessor you re usingD ut for illustrtionD let9s sy there re (veX

Instruction fetchX he proessor fethes n instrution from memoryF Instruction decodeX he instrution is reognized or deodedF Operand FetchX he proessor fethes the opernds the instrution needsF hese opernds my e in registers or in memoryF RF Execute X he instrution gets exeutedF SF Writeback X he proessor writes the results k to wherever they re supposed to go "possily
IF PF QF registersD possily memoryF IF ill e entering the opernd feth stge s instrution PF enters instrution deode stge nd instrution QF strts instrution fethD nd so onF

sdellyD instrution

yur pipeline is (ve stges deepD so it should e possile to get (ve instrutions in )ight ll t oneF sf we ould keep it upD we would see one instrution omplete per lok yleF imple s this illustrtion seemsD instrution pipelining is omplited in rel lifeF ih step must e le to our on di'erent instrutions simultneouslyD nd delys in ny stge hve to e oordinted with ll those tht followF sn pigure SFP @hree instrutions in )ight through one pipelineA we see three instrutions eing exeuted simultneously y the proessorD with eh instrution in di'erent stge of exeutionF

PRS

Three instructions in ight through one pipeline

Figure 5.2

por instneD if omplited memory ess ours in stge threeD the instrution needs to e delyed efore going on to stge four euse it tkes some time to lulte the opernd9s ddress nd retrieve it from memoryF ell the whileD the rest of the pipeline is stlledF e simpler instrutionD sitting in one of the erlier stgesD n9t ontinue until the tr0 hed lers upF xow imgine how jump to new progrm ddressD perhps used y n if sttementD ould disrupt the pipeline )owF he proessor doesn9t know n instrution is rnh until the deode stgeF st usully doesn9t know whether rnh will e tken or not until the exeute stgeF es shown in pigure SFQ @heteting rnhAD during the four yles fter the rnh instrution ws fethedD the proessor lindly fethes instrutions sequentilly nd strts these instrutions through the pipelineF

PRT

CHAPTER 5. APPENDIXES
Detecting a branch

Figure 5.3

sf the rnh flls throughD then everything is in gret shpeY the pipeline simply exeutes the next instrutionF st9s s if the rnh were noEop instrutionF roweverD if the rnh jumps wyD those three prtilly proessed instrutions never get exeutedF he (rst order of usiness is to disrd these inE)ight instrutions from the pipelineF st turns out tht euse none of these instrutions ws tully going to do nything until its exeute stgeD we n throw them wy without hurting nything @other thn our e0ienyAF omehow the proessor hs to e le to ler out the pipeline nd restrt the pipeline t the rnh destintionF nfortuntelyD rnh instrutions our every (ve to ten instrutions in mny progrmsF sf we exeuted rnh every (fth instrution nd only hlf our rnhes fell throughD the lost e0ieny due to restrting the pipeline fter the rnhes would e PH perentF ou need optiml onditions to keep the pipeline movingF iven in lessEthnEoptiml onditionsD instruE tion pipelining is ig win " espeilly for sg proessorsF snterestinglyD the ide dtes k to the lte IWSHs nd erly IWTHs with the xsE eg veg nd the sfw trethF snstrution pipelining eme minstremed in IWTRD when the ghg TTHH nd the sfw GQTH fmilies were introdued with pipelined instrution units " on mhines tht represented sgEish nd gsg designsD respetivelyF o this dyD ever more sophistited tehniques re eing pplied to instrution pipeliningD s mhines tht n overlp instrution exeution eome ommonpleF

5.1.3.3 Pipelined Floating-Point Operations


feuse the exeution stge for )otingEpoint opertions n tke longer thn the exeution stge for (xedE point omputtionsD these opertions re typilly pipelinedD tooF qenerllyD this inludes )otingEpoint dditionD sutrtionD multiplitionD omprisonsD nd onversionsD though it might not inlude squre roots nd divisionF yne pipelined )otingEpoint opertion is strtedD lultions ontinue through the

PRU severl stges without delying the rest of the proessorF he result ppers in register t some point in the futureF ome proessors re limited in the mount of overlp their )otingEpoint pipelines n supportF snternl omponents of the pipelines my e shred @for ddingD multiplyingD normlizingD nd rounding intermedite resultsAD foring restritions on when nd how often you n egin new opertionsF sn other sesD )otingE point opertions n e strted every yle regrdless of the previous )otingE point opertionsF e sy tht suh opertions re F he numer of stges in )otingEpoint pipelines for 'ordle omputers hs deresed over the lst IH yersF wore trnsistors nd newer lgorithms mke it possile to perform )otingEpoint ddition or multiplition in just one to three ylesF qenerlly the most di0ult instrution to perform in single yle is the )otingEpoint multiplyF roweverD if you dedite enough hrdwre to itD there re designs tht n operte in single yle t moderte lok rteF

fully pipelined

5.1.3.4 Uniform Instruction Length


yur smple instrution pipeline hd (ve stgesX instrution fethD instrution deodeD opernd fethD exeuE tionD nd writekF e wnt this pipeline to e le to proess (ve instrutions in vrious stges without stllingF heomposing eh opertion into (ve identi(le prtsD eh of whih is roughly the sme mount of timeD is hllenging enough for sg omputerF por designer working with gsg instrution setD it9s espeilly di0ult euse gsg instrutions ome in vrying lengthsF e simple return from suroutine instrution might e one yte longD for instneD wheres it would tke longer instrution to sy dd regE ister four to memory lotion PHHS nd leve the result in register (veF he numer of ytes to e fethed must e known y the feth stge of the pipeline s shown in pigure SFR @rile length instrutions mke pipelining di0ultAF

Variable length instructions make pipelining dicult

Figure 5.4

he proessor hs no wy of knowing how long n instrution will e until it rehes the deode stge nd determines wht it isF sf it turns out to e long instrutionD the proessor my hve to go k to memory nd get the portion left ehindY this stlls the pipelineF e ould eliminte the prolem y requiring tht ll instrutions e the sme lengthD nd tht there e limited numer of instrution formts s shown in pigure SFS @rileElength gsg versus (xedElength sg instrutionsAF his wyD every instrution entering the pipeline is known to e omplete " not needing nother memory essF st would lso e esier for the proessor to lote the instrution (elds tht speify registers or onstntsF eltogether euse sg n ssume (xed instrution lengthD the pipeline )ows muh more smoothlyF

a priori

PRV

CHAPTER 5. APPENDIXES
Variable-length CISC versus xed-length RISC instructions

Figure 5.5

5.1.3.5 Delayed Branches


es desried erlierD rnhes re signi(nt prolem in pipelined rhitetureF ther thn tke penlty for lening out the pipeline fter misguessed rnhD mny sg designs require n instrution fter the rnhF his instrutionD in wht is lled the D is exeuted no mtter wht wy the rnh goesF en instrution in this position should e usefulD or t lest hrmlessD whihever wy the rnh proeedsF ht isD you expet the proessor to exeute the instrution following the rnh in either seD nd pln for itF sn pinhD noEop n e usedF e slight vrition would e to give the proessor the ility to @or squshA the instrution ppering in the rnh dely slot if it turns out tht it shouldn9t hve een issued fter llX

branch delay slot

annul

ehh f fe vefivI

IDPDI QDIDQ ywirii iy Q FFF

dd rI to rP nd store in rI sutrt rI from rQD store in rQ rnh somewhere else instrution in rnh dely slot

hile rnh dely slots ppered to e very lever solution to eliminting pipeline stlls ssoited with rnh opertionsD s proessors moved towrd exeE uting two nd four instrutions simultneouslyD nother pproh ws neededF10 e more roust wy of eliminting pipeline stlls ws to predit the diretion of the rnh using tle stored in the deode unitF es prt of the deode stgeD the g would notie tht the instrution ws
10 Interestingly,
while the delay slot is no longer critical in processors that execute four instructions simultaneously, there is not yet a strong reason to remove the feature. Removing the delay slot would be nonupwards-compatible, breaking many existing codes. To some degree, the branch delay slot has become baggage on those new 10-year-old architectures that must continue to support it.

PRW rnh nd onsult tle tht kept the reent ehvior of the rnhY it would then mke guessF fsed on the guessD the g would immeditely egin fething t the predited lotionF es long s the guesses were orretD rnhes ost extly the sme s ny other instrutionF sf the predition ws wrongD the instrutions tht were in proess hd to e nE elledD resulting in wsted time nd e'ortF e simple rnh predition sheme is typilly orret well over WH7 of the timeD signi(ntly reduing the overll negtive performne impt of pipeline stlls due to rnhesF ell reent sg designs inorporte some type of rnh preditionD mking rnh dely slots e'etively unneessryF enother mehnism for reduing rnh penlties is F hese re instrutions tht look like rnhes in soure odeD ut turn out to e speil type of instrution in the ojet odeF hey re very useful euse they reple test nd rnh sequenes ltogetherF he following lines of ode pture the sense of onditionl rnhX

conditional execution

sp @ f < g A rix e a h ivi e a i ixhsp


sing rnhesD this would require t lest two rnhes to ensure tht the proper vlue ended up in eF sing onditionl exeutionD one might generte ode tht looks s followsX

gywei f < g sp i e a h sp pevi e a i

onditionl instrution onditionl instrution

his is sequene of three instrutions with F yne of the two ssignments exeutesD nd the other ts s noEopF xo rnh predition is neededD nd the pipeline opertes perfetlyF here is ost to tking this pproh when there re lrge numer of instrutions in one or the other rnh pths tht would seldom get exeuted using the trditionl rnh instrution modelF

no branches

5.1.3.6 Load/Store Architecture


sn lodGstore instrution set rhitetureD memory referenes re limited to expliit lod nd store instruE tionsF ih instrution my not mke more thn one memory referene per instrutionF sn gsg proessorD rithmeti nd logil instrutions n inlude emedded memory referenesF here re three resons why limiting lods nd stores to their own instrutions is n improvementX

pirstD we wnt ll instrutions to e the sme lengthD for the resons given oveF roweverD (xed lengths impose udget limit when it omes to desriing wht the opertion does nd whih registers it usesF en instrution tht oth referened memory nd performed some lultion wouldn9t (t within one instrution wordF eondD giving every instrution the option to referene memory would omE plite the pipeline euse there would e two omputtions to perform" the ddress lultion plus whtever the instrution is supposed to do " ut there is only one exeution stgeF e ould throw more hrdwre t itD ut y restriting memory referenes to expliit lods nd storesD we n void the prolem entirelyF eny instrution n perform n ddress lultion or some other opertionD ut no instrution n do othF

PSH

CHAPTER 5. APPENDIXES

he third reson for limiting memory referenes to expliit lods nd stores is tht they n tke more time thn other instrutions " sometimes two or three lok yles moreF e generl instrution with n emedded memory referene would get hung up in the opernd feth stge for those extr ylesD witing for the referene to ompleteF egin we would e fed with n instrution pipeline stllF
ixpliit lod nd store instrutions n kik o' memory referenes in the pipeline9s exeute stgeD to e ompleted t lter time @they might omplete immeditelyY it depends on the proessor nd the heAF en opertion downstrem my require the result of the refereneD ut tht9s ll rightD s long s it is fr enough downstrem tht the referene hs hd time to ompleteF

5.1.3.7 Simple Addressing Modes


tust s we wnt to simplify the instrution setD we lso wnt simple set of memory ddressing modesF he resons re the smeX omplited ddress lultionsD or those tht require multiple memory referenesD will tke too muh time nd stll the pipelineF his doesn9t men tht your progrm n9t use elegnt dt struturesY the ompiler expliitly genertes the extr ddress rithmeti when it needs itD s long s it n ount on few fundmentl ddressing modes in hrdwreF sn ftD the extr ddress rithmeti is often esier for the ompiler to optimize into fster forms @see etion SFPFI nd etion PFIFTFV @sndution rile impli(tionAAF yf ourseD utting k the numer of ddressing modes mens tht some memory referenes will tke more rel instrutions thn they might hve tken on gsg mhineF roweverD euse everything exeutes more quiklyD it generlly is still performne winF

5.1.4 Second-Generation RISC Processors11


he roly qril for erly sg mhines ws to hieve one instrution per lokF he idelized sg omputer running tD syD SH wrzD would e le to issue SH million instrutions per seond ssuming perfet pipeline shedulingF es we hve seenD single instrution will tke (ve or more lok tiks to get through the instrution pipelineD ut if the pipeline n e kept fullD the ggregte rte willD in ftD pproh one instrution per lokF yne the si pipelined sg proessor designs eme suessfulD ompetition ensued to determine whih ompny ould uild the est sg proessorF eondEgenertion sg designers used three si methods to develop ompetitive sg proessorsX

smprove the mnufturing proesses to simply mke the lok rte fsterF ke simple designY mke it smller nd fsterF his pproh ws tken y the elph proessors from higF elph proessors typilly hve hd lok rtes doule those of the losest ompetitorF edd duplite ompute elements on the spe ville s we n mnufture hips with more trnsisE torsF his ould llow two instrutions to e exeuted per yle nd ould doule performne without inresing lok rteF his tehnique is lled superslrF snrese the numer of stges in the pipeline ove (veF sf the instrutions n truly e deomE posed evenly intoD syD ten stgesD the lok rte ould theoretilly e douled without requiring new mnufturing proessesF his tehnique ws lled F he ws proessors used this tehnique with some suessF

superpipelining

5.1.4.1 Superscalar Processors


he wy you get two or more instrutions per lok is y strting severl opertions side y sideD possily in seprte pipelinesF sn pigure SFT @heomposing seril stremAD if you hve n integer ddition nd multiplition to performD it should e possile to egin them simultneouslyD provided they re independent of eh other @s long s the multiplition does not need the output of the ddition s one of its opernds
11 This
content is available online at <http://cnx.org/content/m33675/1.2/>.

PSI or vie versAF ou ould lso exeute multiple (xedEpoint instrutions " ompresD integer dditionsD etF " t the sme timeD provided tht theyD tooD re independentF enother term used to desrie superslr proessors is proessorsF

multiple instruction issue

Decomposing a serial stream

Figure 5.6

he numer nd vriety of opertions tht n e run in prllel depends on oth the progrm nd the proessorF he progrm hs to hve enough usle prllelism so tht there re multiple things to doD nd the proessor hs to hve n pproprite ssortment of funtionl units nd the ility to keep them usyF he ide is oneptully simpleD ut it n e hllenge for oth hrdwre designers nd ompiler writersF ivery opportunity to do severl things in prllel exposes the dnger of violting some preedene @iFeFD performing omputtions in the wrong orderAF

5.1.4.2 Superpipelined Processors


oughly sttedD simpler iruitry n run t higher lok speedsF ut yourself in the role of g designer ginF vooking t the instrution pipeline of your proessorD you might deide tht the reson you n9t get more speed out of it is tht some of the stges re too omplited or hve too muh going onD nd they re pling limits on how fst the whole pipeline n goF feuse the stges re loked in unisonD the slowest of them forms wek link in the hinF sf you divide the omplited stges into less omplited portionsD you n inrese the overll speed of the pipelineF his is lled F wore instrution pipeline stges with less omplexity per stge will do the sme work s pipelined proessorD ut with higher throughput due to inresed lok speedF pigure SFU @ws RHHH instrution pipelineA shows n eightEstge pipeline used in the ws RHHH proessorF

superpipelining

PSP

CHAPTER 5. APPENDIXES
MIPS R4000 instruction pipeline

Figure 5.7

heoretillyD if the redued omplexity llows the proessor to lok fsterD you n hieve nerly the sme performne s superslr proessorsD yet without instrution mix preferenesF por illustrtionD piture superslr proessor with two units " (xedE nd )otingEpoint " exeuting progrm tht is omposed solely of (xedEpoint lultionsY the )otingEpoint unit goes unusedF his redues the superslr performne y one hlf ompred to its theoretil mximumF e superpipelined proessorD on the other hndD will e perfetly hppy to hndle n unlned instrution mix t full speedF uperpipelines re not newY deep pipelines hve een employed in the pstD notly on the ghg TTHHF he lel is mrketing retion to drw ontrst to superslr proessingD nd other forms of e0ientD highEspeed omputingF uperpipelining n e omined with other pprohesF ou ould hve superslr mhine with deep pipelines @hig e nd ws EVHHH re exmplesAF sn ftD you should proly expet tht fster pipelines with more stges will eome so ommonple tht noody will rememer to ll them superpipelines fter whileF

5.1.5 RISC Means Fast12


e ll know tht the  in sg mens reduedF vtelyD s the numer of omponents tht n e mnuftured on hip hs inresedD g designers hve een looking t wys to mke their proessors fster y dding feturesF e hve lredy tlked out mny of the fetures suh s onEhip multipliersD very fst )otingEpointD lots of registersD nd onEhip hesF iven with ll of these feturesD there seems to e spe left overF elsoD euse muh of the design of the ontrol setion of the proessor is utomtedD it might not e so d to dd just few new instrutions here nd thereF ispeilly if simultions indite IH7 overll inrese in speed3 oD wht does it men when they dd IS instrutions to sg instrution set rhiteture @seAc ould we ll it notEsoEsgc e suggested term for this trend is psgD or F he point is tht reduing the numer of instrutions is not the golF he gol is to uild the fstest possile proessor within the mnufturing nd ost onstrintsF13 ome of the types of instrutions tht re eing dded into rhitetures inludeX

fast instruction set computer

wore ddressing modes wetEinstrutions suh s derement ounter nd rnh if nonEzero peilized grphis instrutions suh s the un s setD the r grphis instrutionsD the ws higitl wedi ixtentions @whwAD nd the sntel ww instrutions
12 This content is available online at <http://cnx.org/content/m33679/1.2/>. 13 People will argue forever but, in a sense, reducing the instruction set was never
an end in itself, it was a means to an end.

PSQ snterestinglyD the reson tht the (rst two re fesile is tht dder units tke up so little speD it is possile to put one dder into the deode unit nd nother into the lodGstore unitF wost visuliztion instrution sets tke up very little hip reF hey often provide gnged VEit omputtions to llow TREit register to e used to perform eight VEit opertions in single instrutionF

5.1.6 Out-of-Order Execution: The Post-RISC Architecture14


e9re never stis(ed with the performne level of our omputing equipment nd neither re the proessor designersF woEwy superslr proessors were very suessful round IWWRF wny designs were le to exeute IFT!IFV instrutions per yle on vergeD using ll of the triks desried so frF es we eme le to mnufture hips with n everEinresing trnsistor ountD it seemed tht we would nturlly progress to fourEwy nd then eightEwy superslr proessorsF he fundmentl prolem we fe when trying to keep four funtionl units usy is tht it9s di0ult to (nd ontiguous sets of four @or eightA instrutions tht n e exeuted in prllelF st9s n esy opEout to syD the ompiler will solve it llF he solution to these prolems tht will llow these proessors to e'etively use four funtionl units per yle nd hide memory lteny is nd F yutEofEorder exeution llows lter instrution to e proessed efore n erlier instrution is ompletedF he proessor is etting tht the instrution will exeuteD nd the proessor will hve the preomputed nswer the instrution needsF sn some wysD portions of the sg design philosophy re turned insideEout in these new proessorsF

out-of-order execution

speculative execution

5.1.6.1 Speculative Computation


o understnd the postEsg rhitetureD it is importnt to seprte the onept of omputing vlue for n instrution nd tully exeuting the instrutionF vet9s look t simple exmpleX

vh FFF phs

IHDP@HA RDSDT

vod into IH from memory QH snstrutions of vrious kinds @not phsA R a S G T

essume tht @IA we re exeuting the lod instrutionD @PA S nd T re lredy loded from erlier instrutionsD @QA it tkes QH yles to do )otingEpoint divideD nd @RA there re no instrutions tht need the divide unit etween the vh nd the phsF hy not strt the divide unit the phs right nowD storing the result in some temporry srth rec st hs nothing etter to doF hen or if we rrive t the phsD we will know the result of the lultionD opy the srth re into RD nd the phs will pper to in one yleF ound frfethedc xot for postEsg proessorF he postEsg proessor must e le to speultively ompute results efore the proessor knows whether or not n instrution will tully exeuteF st omplishes this y llowing instrutions to strt tht will never (nish nd llowing lter instrutions to strt efore erlier instrutions (nishF o store these instrutions tht re in limo etween strted nd (nishedD the postEsg proessor needs some spe on the proessorF his spe for instrutions is lled the @sfAF

computing

execute

instruction reorder buer

5.1.6.2 The Post-RISC Pipeline


he postEsg proessor pipeline in pigure SFV @ostEsg pipelineA looks somewht di'erent from the sg pipelineF he (rst two stges re still instrution feth nd deodeF heode inludes rnh predition using tle tht indites the prole ehvior of rnhF yne instrutions re deoded nd rnhes re preditedD the instrutions re pled into the sf to e omputed s soon s possileF
14 This
content is available online at <http://cnx.org/content/m33677/1.2/>.

PSR

CHAPTER 5. APPENDIXES
Post-RISC pipeline

Figure 5.8

he sf holds up to TH or so instrutions tht re witing to exeute for one reson or notherF sn senseD the feth nd deodeGpredit phses operte until the u'er (lls upF ih time the deode unit predits rnhD the following instrutions re mrked with di'erent inditor so they n e found esily if the predition turns out to e wrongF ithin the u'erD instrutions re llowed to go to the omputtionl units when the instrution hs ll of its opernd vluesF feuse the instrutions re omputing results without eing exeutedD ny instrution tht hs its input vlues nd n ville omputtion unit n e omputedF he results of these omputtions re stored in extr registers not visile to the progrmmer lled F he proessor llotes renme registersD s they re needed for instrutions eing omputedF he exeution units my hve one or more pipeline stgesD depending on the type of the instrutionF his prt looks very muh like trditionl superslr sg proessorsF ypilly up to four instrutions n egin omputtion from the sf in ny yleD provided four instrutions re ville with input opernds nd there re su0ient omputtionl units for those instrutionsF yne the results for the instrution hve een omputed nd stored in renme registerD the instrution must wit until the preeding instrutions (nish so we know tht the instrution tully exeutesF sn ddition to the omputed resultsD eh instrution hs )gs ssoited with itD suh s exeptionsF por

rename registers

PSS exmpleD you would not e hppy if your progrm rshed with the following messgeX irrorD divide y zeroF s ws preomputing divide in se you got to the instrution to sve some timeD ut the rnh ws mispredited nd it turned out tht you were never going to exeute tht divide nywyF s still hd to low you up thoughF xo hrd feelingsc ignedD the postEsg gF o when speultively omputed instrution divides y zeroD the g must simply store tht ft until it knows the instrution will exeute nd t tht momentD the progrm n e legitimtely rshedF sf rnh does get mispreditedD lot of ookkeeping must our very quiklyF e messge is sent to ll the units to disrd instrutions tht re prt of ll ontrol )ow pths eyond the inorret rnhF snsted of lling the lst phse of the pipeline writekD it9s lled retireF he retire phse is wht exeutes the instrutions tht hve lredy een omputedF he retire phse keeps trk of the instrution exeution order nd retires the instrutions in progrm orderD posting results from the renme registers to the tul registers nd rising exeptions s neessryF ypilly up to four instrutions n e retired per yleF o the postEsg pipeline is tully three pipelines onneted y two u'ers tht llow instrutions to e proessed out of orderF roweverD even with ll of this speultive omputtion going onD the retire unit fores the proessor to pper s simple sg proessor with preditle exeution nd interruptsF

5.1.7 Closing Notes15


gongrtultions for rehing the end of long hpter3 e hve tlked little it out old omputersD gsgD sgD postEsgD nd isgD nd mentioned superomputers in pssingF s think it9s interesting to oserve tht sg proessors re rnh o' longEestlished treeF wny of the ides tht hve gone into sg designs re orrowed from other types of omputersD ut none of them evolved into sg " sg strted t disontinuityF here were hints of sg revolution @the ghg TTHH nd the sfw VHI projetA ut it relly ws fored on the world @for its own goodA y g designers t ferkeley nd tnford in the IWVHsF es sg hs mturedD there hve een mny improvementsF ih time it ppers tht we hve rehed the limit of the performne of our miroproessors there is new rhiteturl rekthrough improving our single g performneF row long n it ontinuec st is ler tht s long s ompetition ontinuesD there is signi(nt performne hedroom using the outEofEorder exeution s the lok rtes move from typil PHH wrz to SHHC wrzF hig9s elph PIPTR is plnned to hve fourEwy outEofEorder exeution t SHH wrz y IWWVF es of IWWVD vendors re eginning to revel their plns for proessors loked t IHHH wrz or I qrzF nfortuntelyD developing new proessor is very expensive tskF sf enough ompnies merge nd ompetition diminishesD the rte of innovtion will slowF ropefully we will e seeing four proessors on hipD eh ITEwy outEofEorder superslrD loked t I qrz for 6PHH efore we eliminte ompetition nd let the g designers rest on their lurelsF et tht pointD slle prllel proessing will suddenly eome interesting ginF row will designers tkle some of the fundmentl rhiteturl prolemsD perhps the lrgest eing memory systemsc iven though the postEsg rhiteture nd the isg llevite the lteny prolems somewhtD the memory ottlenek will lwys e thereF he good news is tht even though memory performne improves more slowly thn g performneD memory system performne does improve over timeF e9ll look next t tehniques for uilding memory systemsF es disussed in etion D the exerises tht ome t the end of most hpters in this ook re not like the exerises in most engineering textsF hese exerises re mostly thought experimentsD without wellEde(ned nswersD designed to get you thinking out the hrdwre on your deskF
15 This
content is available online at <http://cnx.org/content/m33683/1.2/>.

PST

CHAPTER 5. APPENDIXES
Exercise 5.1
peultive exeution is sfe for ertin types of instrutionsY results n e disrded if it turns out tht the instrution shouldn9t hve exeutedF plotingE point instrutions nd memory opertions re two lsses of instrutions for whih speultive exeution is trikierD prtiulrly euse of the hne of generting exeptionsF por instneD dividing y zero or tking the squre root of negtive numer uses n exeptionF nder wht irumstnes will speultive memory referene use n exeptionc

5.1.8 Exercises16

Exercise 5.2

iture mhine with )otingEpoint pipelines tht re IHH stges deep @tht9s ridiulously deepAD eh of whih n deliver new result every nnoseondF ht would give eh pipeline pek throughput rte of I q)opD nd worstE se throughput rte of IH w)opsF ht hrteristis would progrm need to hve to tke dvntge of suh pipelinec

5.2 Appendix B: Looking at Assembly Language


5.2.1 Assembly Language17
In this appendix, we take a look at the assembly language produced by a number of different compilers on a number of different architectures. In this survey we revisit some of the issues of CISC versus RISC, and the strengths and weaknesses of different architectures. For this survey, two roughly identical segments of code were used. The code was a relatively long loop adding two arrays and storing the result in a third array. The loops were written both in FORTRAN and C. The FORTRAN loop was as follows:

      SUBROUTINE ADDEM(A,B,C,N)
      REAL A(10000),B(10000),C(10000)
      INTEGER N,I
      DO 10 I=1,N
        A(I) = B(I) + C(I)
10    ENDDO
      END
The C version was:

for (i=0; i<n; i++)
    a[i] = b[i] + c[i];
We have gathered these examples over the years from a number of different compilers, and the results are not particularly scientific. This is not intended to review a particular architecture or compiler version, but rather just to show an example of the kinds of things you can learn from looking at the output of the compiler.
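If you want to reproduce this kind of survey yourself, something like the following self-contained C version will do; the file name, the fixed array size, and the use of gcc here are our own choices, not part of the original study:

    /* addem.c -- a stand-alone version of the survey loop.
     * Emit assembly with, for example:  gcc -O2 -S addem.c
     * (most compilers have a similar "emit assembly" option, often -S).
     */
    #define N 10000

    static float a[N], b[N], c[N];

    void addem(int n)
    {
        int i;
        for (i = 0; i < n; i++)
            a[i] = b[i] + c[i];   /* the loop whose assembly we examine */
    }

Compiling the same file at different optimization levels and diffing the generated assembly is an easy way to see the transformations discussed below on your own machine.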
16 This content is available online at <http://cnx.org/content/m33682/1.2/>.
17 This content is available online at <http://cnx.org/content/m33787/1.2/>.


5.2.1.1 Intel 8088


The Intel 8088 processor used in the original IBM Personal Computer is a very traditional CISC processing system with features severely limited by its transistor count. It has very few registers, and the registers generally have rather specific functions. To support a large memory model, it must set its segment registers leading up to each memory operation. This limitation means that every memory access takes a minimum of three instructions. Interestingly, a similar pattern occurs on RISC processors. You notice that at one point, the code moves a value from the ax register to the bx register because it needs to perform another computation that can only be done in the bx register. Note that this is only an integer computation, as the Intel

        mov     word ptr -2[bp],0       # bp is I
$11:
        mov     ax,word ptr -2[bp]      # Load I
        cmp     ax,word ptr 18[bp]      # Check I>=N
        bge     $10
        shl     ax,1                    # Multiply I by 2
        mov     bx,ax                   # Done - now move to bx
        add     bx,word ptr 10[bp]      # bx = Address of B + Offset
        mov     es,word ptr 12[bp]      # Top part of address
        mov     ax,es: word ptr [bx]    # Load B(I)
        mov     bx,word ptr -2[bp]      # Load I
        shl     bx,1                    # Multiply I by 2
        add     bx,word ptr 14[bp]      # bx = Address of C + Offset
        mov     es,word ptr 16[bp]      # Top part of address
        add     ax,es: word ptr [bx]    # Load C(I)
        mov     bx,word ptr -2[bp]      # Load I
        shl     bx,1                    # Multiply I by 2
        add     bx,word ptr 6[bp]       # bx = Address of A + Offset
        mov     es,word ptr 8[bp]       # Top part of address
        mov     es: word ptr [bx],ax    # Store
$9:
        inc     word ptr -2[bp]         # Increment I in memory
        jmp     $11
$10:

Because there are so few registers, the variable I is kept in memory and loaded several times throughout the loop. The inc instruction at the end of the loop actually updates the value in memory. Interestingly, at the top of the loop, the value is then reloaded from memory. In this type of architecture, the available registers put such a strain on the flexibility of the compiler, there is often not much optimization that is practical.

5.2.1.2 Motorola MC68020


In this section, we examine another classic CISC processor, the Motorola MC68020, which was used to build Macintosh computers and Sun workstations. We happened to run this code on a BBN GP-1000 Butterfly parallel processing system made up of 96 MC68020 processors. The Motorola architecture is relatively easy to program in assembly language. It has plenty of 32-bit registers, and they are relatively easy to use. It has a CISC instruction set that keeps assembly language programming quite simple. Many instructions can perform multiple operations in a single instruction.



We use this example to show a progression of optimization levels, using a f77 compiler on a floating-point version of the loop. Our first example is with no optimization:

L5:
! Note d0 contains the value I
        movl    d0,L13                  ! Store I to memory if loop ends
        lea     a1@(-4),a0              ! a1 = address of B
        fmoves  a0@(0,d0:l:4),fp0       ! Load of B(I)
        lea     a3@(-4),a0              ! a3 = address of C
        fadds   a0@(0,d0:l:4),fp0       ! Load of C(I) (and Add)
        lea     a2@(-4),a0              ! a2 = address of A
        fmoves  fp0,a0@(0,d0:l:4)       ! Store of A(I)
        addql   #1,d0                   ! Increment I
        subql   #1,d1                   ! Decrement "N"
        tstl    d1
        bnes    L5

The value for I is stored in the d0 register. Each time through the loop, it's incremented by 1. At the same time, register d1 is initialized to the value for N and decremented each time through the loop. Each time through the loop, I is stored into memory, so the proper value for I ends up in memory when the loop terminates. Registers a1, a2, and a3 are preloaded to be the first address of the arrays B, A, and C respectively. However, since FORTRAN arrays begin at 1, we must subtract 4 from each of these addresses before we can use I as the offset. The lea instructions are effectively subtracting 4 from one address register and storing it in another.

The following instruction performs an address computation that is almost a one-to-one translation of an array reference:

        fmoves  a0@(0,d0:l:4),fp0       ! Load of B(I)


his instrution retrieves )otingEpoint vlue from the memoryF he ddress is omputed y (rst multiE plying dH y R @euse these re QPEit )otingEpoint numersA nd dding tht vlue to HF es mtter of ftD the le nd fmoves instrutions ould hve een omined s followsX

fmoves Id@ERDdHXlXRADfpH 3 vod of f@sA


To compute its memory address, this instruction multiplies d0 by 4, adds the contents of a1, and then subtracts 4. The resulting address is used to load 4 bytes into floating-point register fp0. This is almost a literal translation of fetching B(I). You can see how the assembly is set up to track high-level constructs. It is almost as if the compiler were trying to show off and make use of the nifty assembly language instructions.

Like the Intel, this is not a load-store architecture. The fadds instruction adds a value from memory to a value in a register (fp0) and leaves the result of the addition in the register. Unlike the Intel 8088, we have enough registers to store quite a few of the values used throughout the loop (I, N, the addresses of A, B, and C) in registers to save memory operations.

5.2.1.2.1 C on the MC68020


In the next example, we compiled the C version of the loop with the normal optimization (-O) turned on. We see the C perspective on arrays in this code. C views arrays as extensions to pointers; the loop index advances as an offset from a pointer to the beginning of the array:

! d3 = i
! d1 = Address of a
! d2 = Address of b
! d0 = Address of c
! a6@(20) = n
        moveq   #0,d3                   ! Initialize i
        bras    L5                      ! Jump to end of the loop
L1:
        movl    d3,a1                   ! Make a copy of i
        movl    a1,d4                   ! Begin multiply by 4 (word size)
        asll    #2,d4
        movl    d4,a1                   ! Put back in an address register
        fmoves  a1@(0,d2:l),fp0         ! Load b[i]
        movl    a6@(16),d0              ! Get address of c
        fadds   a1@(0,d0:l),fp0         ! Add c[i]
        fmoves  fp0,a1@(0,d1:l)         ! Store into a[i]
        addql   #1,d3                   ! Increment i
L5:
        cmpl    a6@(20),d3
        blts    L1

We first see the value of i being copied into several registers and multiplied by 4 (using a left shift of 2, strength reduction). Interestingly, the value in register a1 is i multiplied by 4. Registers d0, d1, and d2 are the addresses of c, b, and a respectively. In the load, add, and store, a1 is the base of the address computation and d0, d1, and d2 are added as an offset to a1 to compute each address.

This is a simplistic optimization that is primarily trying to maximize the values that are kept in registers during loop execution. Overall, it's a relatively literal translation of the C language semantics from C to assembly. In many ways, C was designed to generate relatively efficient code without requiring a highly sophisticated optimizer.
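As a rough picture of what this generated code is doing, here is a C sketch of our own (the function and variable names are ours, not the compiler's); the multiply by 4 has been strength-reduced to a shift, and each array element is addressed as a base pointer plus a byte offset:

    /* Sketch only: roughly what the -O output computes. */
    void addem_strength_reduced(float *a, float *b, float *c, int n)
    {
        char *ap = (char *) a, *bp = (char *) b, *cp = (char *) c;
        int i;

        for (i = 0; i < n; i++) {
            long off = (long) i << 2;              /* i * 4 done as a shift */
            *(float *) (ap + off) =
                *(float *) (bp + off) + *(float *) (cp + off);
        }
    }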

5.2.1.2.2 More optimization


In this example, we are back to the FORTRAN version on the MC68020. We have compiled it with the highest level of optimization (-OLM) available on this compiler. Now we see a much more aggressive approach to the loop:

! a0 = Address of C(I)
! a1 = Address of B(I)
! a2 = Address of A(I)
L3:
        fmoves  a1@,fp0         ! Load B(I)
        fadds   a0@,fp0         ! Add C(I)
        fmoves  fp0,a2@         ! Store A(I)
        addql   #4,a0           ! Advance by 4
        addql   #4,a1           ! Advance by 4
        addql   #4,a2           ! Advance by 4
        subql   #1,d0           ! Decrement I
        tstl    d0
        bnes    L3

First off, the compiler is smart enough to do all of its address adjustment outside the loop and store the adjusted addresses of A, B, and C in registers. We do the load, add, and store in quick succession. Then we advance the array addresses by 4 and perform the subtraction to determine when the loop is complete. This is very tight code and bears little resemblance to the original FORTRAN code.

5.2.1.3 SPARC Architecture


These next examples were performed using a SPARC architecture system using FORTRAN. The SPARC architecture is a classic RISC processor using load-store access to memory, many registers and delayed branching. We first examine the code at the lowest optimization:

.L18:                                           ! Top of the loop
        ld      [%fp-4],%l2                     ! Address of B
        sethi   %hi(GPB.addem.i),%l0            ! Address of I in %l0
        or      %l0,%lo(GPB.addem.i),%l0
        ld      [%l0+0],%l0                     ! Load I
        sll     %l0,2,%l1                       ! Multiply by 4
        add     %l2,%l1,%l0                     ! Figure effective address of B(I)
        ld      [%l0+0],%f3                     ! Load B(I)
        ld      [%fp-8],%l2                     ! Address of C
        sethi   %hi(GPB.addem.i),%l0            ! Address of I in %l0
        or      %l0,%lo(GPB.addem.i),%l0
        ld      [%l0+0],%l0                     ! Load I
        sll     %l0,2,%l1                       ! Multiply by 4
        add     %l2,%l1,%l0                     ! Figure effective address of C(I)
        ld      [%l0+0],%f2                     ! Load C(I)
        fadds   %f3,%f2,%f2                     ! Do the Floating Point Add
        ld      [%fp-12],%l2                    ! Address of A
        sethi   %hi(GPB.addem.i),%l0            ! Address of I in %l0
        or      %l0,%lo(GPB.addem.i),%l0
        ld      [%l0+0],%l0                     ! Load I
        sll     %l0,2,%l1                       ! Multiply by 4
        add     %l2,%l1,%l0                     ! Figure effective address of A(I)
        st      %f2,[%l0+0]                     ! Store A(I)
        sethi   %hi(GPB.addem.i),%l0            ! Address of I in %l0
        or      %l0,%lo(GPB.addem.i),%l0
        ld      [%l0+0],%l0                     ! Load I
        add     %l0,1,%l1                       ! Increment I
        sethi   %hi(GPB.addem.i),%l0            ! Address of I in %l0
        or      %l0,%lo(GPB.addem.i),%l0
        st      %l1,[%l0+0]                     ! Store I
        sethi   %hi(GPB.addem.i),%l0            ! Address of I in %l0
        or      %l0,%lo(GPB.addem.i),%l0
        ld      [%l0+0],%l1                     ! Load I
        ld      [%fp-20],%l0                    ! Load N
        cmp     %l1,%l0                         ! Compare
        ble     .L18
        nop                                     ! Branch Delay Slot

This is some pretty poor code. We don't need to go through it line by line, but there are a few quick observations we can make. The value for I is loaded from memory five times in the loop. The address of I is computed six times throughout the loop (each time takes two instructions). There are no tricky memory addressing modes, so multiplying I by 4 to get a byte offset is done explicitly three times (at least they use a shift). To add insult to injury, they even put a NO-OP in the branch delay slot.

One might ask, why do they ever generate code this bad? Well, it's not because the compiler isn't capable of generating efficient code, as we shall see below. One explanation is that at this optimization level, it simply does a one-to-one translation of the tuples (intermediate code) into machine language. You can almost draw lines in the above example and precisely identify which instructions came from which tuples.

One reason to generate the code using this simplistic approach is to guarantee that the program will produce the correct results. Looking at the above code, it's pretty easy to argue that it indeed does exactly what the FORTRAN code does. You can track every single assembly statement directly back to part of a FORTRAN statement. It's pretty clear that you don't want to execute this code in a high performance production environment without some more optimization.

5.2.1.3.1 Moderate optimization


In this example, we enable some optimization (-O1):

        save    %sp,-120,%sp                    ! Rotate the register window
        add     %i0,-4,%o0                      ! Address of A(0)
        st      %o0,[%fp-12]                    ! Store on the stack
        add     %i1,-4,%o0                      ! Address of B(0)
        st      %o0,[%fp-4]                     ! Store on the stack
        add     %i2,-4,%o0                      ! Address of C(0)
        st      %o0,[%fp-8]                     ! Store on the stack
        sethi   %hi(GPB.addem.i),%o0            ! Address of I (top portion)
        add     %o0,%lo(GPB.addem.i),%o2        ! Address of I (lower portion)
        ld      [%i3],%o0                       ! %o0 = N (fourth parameter)
        or      %g0,1,%o1                       ! %o1 = 1 (for addition)
        st      %o0,[%fp-20]                    ! Store N on the stack
        st      %o1,[%o2]                       ! Set memory copy of I to 1
        ld      [%o2],%o1                       ! o1 = I (kind of redundant)
        cmp     %o1,%o0                         ! Check I > N (zero-trip?)
        bg      .L12                            ! Don't do loop at all
        nop                                     ! Delay slot
        ld      [%o2],%o0                       ! Pre-load for branch delay slot


.L900000110:                                    ! Top of the loop
        ld      [%fp-4],%o1                     ! o1 = Address of B(0)
        sll     %o0,2,%o0                       ! Multiply I by 4
        ld      [%o1+%o0],%f2                   ! f2 = B(I)
        ld      [%o2],%o0                       ! Load I from memory
        ld      [%fp-8],%o1                     ! o1 = Address of C(0)
        sll     %o0,2,%o0                       ! Multiply I by 4
        ld      [%o1+%o0],%f3                   ! f3 = C(I)
        fadds   %f2,%f3,%f2                     ! Register-to-register Add
        ld      [%o2],%o0                       ! Load I from memory (not again!)
        ld      [%fp-12],%o1                    ! o1 = Address of A(0)
        sll     %o0,2,%o0                       ! Multiply I by 4 (yes, again)
        st      %f2,[%o1+%o0]                   ! A(I) = f2
        ld      [%o2],%o0                       ! Load I from memory
        add     %o0,1,%o0                       ! Increment I in register
        st      %o0,[%o2]                       ! Store I back into memory
        ld      [%o2],%o0                       ! Load I back into register
        ld      [%fp-20],%o1                    ! Load N into register
        cmp     %o0,%o1                         ! I > N ??
        ble,a   .L900000110
        ld      [%o2],%o0                       ! Branch delay slot

This is a significant improvement from the previous example. Some loop constant computations (subtracting 4) were hoisted out of the loop. We only loaded I four times during a loop iteration. Strangely, the compiler didn't choose to store the addresses of A(0), B(0), and C(0) in registers at all even though there were plenty of registers. Even more perplexing is the fact that it loaded a value from memory immediately after it had stored it from the exact same register!

But one bright spot is the branch delay slot. For the first iteration, the load was done before the loop started. For the successive iterations, the first load was done in the branch delay slot at the bottom of the loop.

Comparing this code to the moderate optimization code on the MC68020, you can begin to get a sense of why RISC was not an overnight sensation. It turned out that an unsophisticated compiler could generate much tighter code for a CISC processor than a RISC processor. RISC processors are always executing extra instructions here and there to compensate for the lack of slick features in their instruction set. If a processor has a faster clock rate but has to execute more instructions, it does not always have better performance than a slower, more efficient processor.

But as we shall soon see, this CISC advantage is about to evaporate in this particular example.

5.2.1.3.2 Higher optimization


We now increase the optimization to -O2. Now the compiler generates much better code. It's important you remember that this is the same compiler being used for all three examples.

At this optimization level, the compiler looked through the code sufficiently well to know it didn't even need to rotate the register windows (no save instruction). Clearly the compiler looked at the register usage of the entire routine:

! Note, didn't even rotate the register window
! We just use the %o registers from the caller
!
! %o0 = Address of first element of A (from calling convention)
! %o1 = Address of first element of B (from calling convention)
! %o2 = Address of first element of C (from calling convention)
! %o3 = Address of N (from calling convention)

addem:
        ld      [%o3],%g2               ! Load N
        cmp     %g2,1                   ! Check to see if it is <1
        bl      .L77000006              ! Check for zero trip loop
        or      %g0,1,%g1               ! Delay slot - Set I to 1
.L77000003:
        ld      [%o1],%f0               ! Load B(I) First Time Only
.L900000109:
        ld      [%o2],%f1               ! Load C(I)
        fadds   %f0,%f1,%f0             ! Add
        add     %g1,1,%g1               ! Increment I
        add     %o1,4,%o1               ! Increment address of B
        add     %o2,4,%o2               ! Increment address of C
        cmp     %g1,%g2                 ! Check Loop Termination
        st      %f0,[%o0]               ! Store A(I)
        add     %o0,4,%o0               ! Increment address of A
        ble,a   .L900000109             ! Branch w/ annul
        ld      [%o1],%f0               ! Load the B(I)
.L77000006:
        retl                            ! Leaf Return (No window)
        nop                             ! Branch Delay Slot

3 vef eturn @xo windowA 3 frnh hely lot

This is tight code. The registers %o0, %o1, and %o2 contain the addresses of the first elements of A, B, and C respectively. They already point to the right value for the first iteration of the loop. The value for I is never stored in memory; it is kept in global register %g1. Instead of multiplying I by 4, we simply advance the three addresses by 4 bytes each iteration.

The branch delay slots are utilized for both branches. The branch at the bottom of the loop uses the annul feature to cancel the following load if the branch falls through.

The most interesting observation regarding this code is its striking similarity to the code generated for the MC68020 at its top optimization level:

L3:
        fmoves  a1@,fp0         ! Load B(I)
        fadds   a0@,fp0         ! Add C(I)
        fmoves  fp0,a2@         ! Store A(I)
        addql   #4,a0           ! Advance by 4
        addql   #4,a1           ! Advance by 4
        addql   #4,a2           ! Advance by 4
        subql   #1,d0           ! Decrement I
        tstl    d0
        bnes    L3

The two code sequences are nearly identical! The SPARC does an extra load because of its load-store architecture. On the SPARC, I is incremented and compared to N, while on the MC68020, I is decremented and compared to zero.
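In C terms, both compilers have effectively rewritten the loop into something like the following sketch (our illustration, not compiler output): the index multiply is gone, and each array is walked by a pointer that advances one element per iteration while a trip counter counts down:

    /* Sketch of the transformed loop; the names are ours. */
    void addem_pointer_form(float *a, float *b, float *c, int n)
    {
        float *ap = a, *bp = b, *cp = c;
        int count;

        for (count = n; count > 0; count--) {
            *ap = *bp + *cp;    /* load B(I), add C(I), store A(I) */
            ap++;               /* each pointer advances by 4 bytes */
            bp++;
            cp++;
        }
    }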



This aptly shows how the advancing compiler optimization capabilities quickly made the nifty features of the CISC architectures rather useless. Even on the CISC processor, the post-optimization code used the simple forms of the instructions because they produce the fastest execution times. Note that these code sequences were generated on an MC68020. An MC68060 should be able to eliminate the three addql instructions by using post-increment, saving three instructions. Add a little loop unrolling, and you have some very tight code. Of course, the MC68060 was never broadly deployed as a workstation processor, so we never really got a chance to take it for a test drive.

5.2.1.4 Convex C-240


This section shows the results of compiling on the Convex C-Series of parallel/vector supercomputers. In addition to their normal registers, vector computers have vector registers that contain up to 256 64-bit elements. These processors can perform operations on any subset of these registers with a single instruction.

It is hard to claim that these vector supercomputers are more RISC or CISC. They have simple lean instruction sets and, hence, are RISC-like. However, they have instructions that implement loops, and so they are somewhat CISC-like.

The Convex C-240 has scalar registers (s2), vector registers (v2), and address registers (a3). Each vector register has 128 elements. The vector length register controls how many of the elements of each vector register are processed by vector instructions. If the vector length is above 128, the entire register is processed.

The code to implement our loop is as follows:

L4:
        mov.ws  s2,vl           ; Set the Vector length to N
        ld.w    0(a5),v0        ; Load B into a Vector Register
        ld.w    0(a2),v1        ; Load C into a Vector Register
        add.s   v1,v0,v2        ; Add the vector registers
        st.w    v2,0(a3)        ; Store results into A
        add.w   #-128,s2        ; Decrement "N"
        add.w   #512,a2         ; Advance address for A
        add.w   #512,a3         ; Advance address for B
        add.w   #512,a5         ; Advance address for C
        lt.w    #0,s2           ; Check to see if "N" is < 0
        jbrs.t  L4

Initially, the vector length register is set to N. We assume that for the first iteration, N is greater than 128. The next instruction is a vector load instruction into register v0. This loads 128 32-bit elements into this register. The next instruction also loads 128 elements, and the following instruction adds those two registers and places the results into a third vector register. Then the 128 elements in register v2 are stored back into memory. After those elements have been processed, N is decremented by 128 (after all, we did process 128 elements). Then we add 512 to each of the addresses (4 bytes per element) and loop back up. At some point, during the last iteration, if N is not an exact multiple of 128, the vector length register is less than 128, and the vector instructions only process those remaining elements up to N.

One of the challenges of vector processors is to allow an instruction to begin executing before the previous instruction has completed. For example, once the load into register v1 has partially completed, the processor could actually begin adding the first few elements of v0 and v1 while waiting for the rest of the elements of v1 to arrive. This approach of starting the next vector instruction before the previous vector instruction has completed is called chaining. Chaining is an important feature to get maximum performance from vector processors.
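The effect of the vector length register can be pictured as a strip-mined version of the scalar loop. This is our own C sketch of that behavior, not Convex output, and the names are hypothetical:

    /* Strip-mining sketch: process the arrays in chunks of up to 128
     * elements, the way the hardware does with its vector length register. */
    #define VECLEN 128

    void addem_strip_mined(float *a, float *b, float *c, int n)
    {
        int i, j, vl;

        for (i = 0; i < n; i += vl) {
            vl = (n - i < VECLEN) ? (n - i) : VECLEN;  /* this pass's vector length */
            for (j = 0; j < vl; j++)                   /* one vector instruction's worth */
                a[i + j] = b[i + j] + c[i + j];
        }
    }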


5.2.1.5 IBM RS-6000


The IBM RS-6000 is generally credited as the first RISC processor to have cracked the Linpack 100x100 benchmark. The RS-6000 is characterized by strong floating-point performance and excellent memory bandwidth among RISC workstations. The RS-6000 was the basis for IBM's scalable parallel processor: the IBM SP-1 and SP-2.

When our example program is run on the RS-6000, we can see the use of a CISC-style instruction in the middle of a RISC processor. The RS-6000 supports a branch-on-count instruction that combines the decrement, test, and branch operations into a single instruction. Moreover, there is a special register (the count register) that is part of the instruction fetch unit that stores the current value of the counter. The fetch unit also has its own add unit to perform the decrements for this instruction.

These types of features creeping into RISC architectures are occurring because there is plenty of chip space for them. If a wide range of programs can run faster with this type of instruction, it's often added. The assembly code on the RS-6000 is:

        ai      r3,r3,-4                        # Address of A(0)
        ai      r5,r5,-4                        # Address of B(0)
        ai      r4,r4,-4                        # Address of C(0)
        bcr     BO_IF_NOT,CR0_GT
        mtspr   CTR,r6                          # Store in the Counter Register
L18:
        lfsu    fp0,4(r4)                       # Pre-Increment Load
        lfsu    fp1,4(r5)                       # Pre-Increment Load
        fa      fp0,fp0,fp1
        frsp    fp0,fp0
        stfsu   fp0,4(r3)                       # Pre-Increment Store
        bc      BO_dCTR_NZERO,CR0_LT,L18        # Branch on Counter

The RS-6000 also supports a memory addressing mode that can add a value to its address register before using the address register. Interestingly, these two features (branch on count and pre-increment load) eliminate several instructions when compared to the more pure SPARC processor. The SPARC processor has 10 instructions in the body of its loop, while the RS-6000 has 6 instructions.

The advantage of the RS-6000 in this particular loop may be less significant if both processors were two-way superscalar. The instructions that were eliminated on the RS-6000 were integer instructions. On a two-way superscalar processor, those integer instructions may simply execute on the integer units while the floating-point units are busy performing the floating-point computations.

5.2.1.6 Conclusion
In this section, we have attempted to give you some understanding of the variety of assembly language that is produced by compilers at different optimization levels and on different computer architectures. At some point during the tuning of your code, it can be quite instructive to take a look at the generated assembly language to be sure that the compiler is not doing something really stupid that is slowing you down.

Please don't be tempted to rewrite portions in assembly language. Usually any problems can be solved by cleaning up and streamlining your high-level source code and setting the proper compiler flags.

It is interesting that very few people actually learn assembly language any more. Most folks find that the compiler is the best teacher of assembly language. By adding the appropriate option (often -S), the compiler starts giving you lessons. I suggest that you don't print out all of the code. There are many pages of useless variable declarations, etc. For these examples, I cut out all of that useless information. It is best to view the assembly in an editor and only print out the portion that pertains to the particular loop you are tuning.

Index of Keywords and Terms


Keywords are listed by the section with that keyword (page numbers are in parentheses). Keywords do not necessarily appear in the text of the page. They are merely associated with that section. Ex. apples, 1.1 (1)
Terms are referenced by the page they appear on. Ex. apples, 1

ess ptternsD PFRFU@IIIAD PFRFW@IIRA uryD IFPFT@QTA tive virtul memoryD PFPFS@VIA dvned optimiztionD PFIFS@SQA lgerD IFPFR@QQAD IFPFS@QRA miguous pointersD PFQFI@VSA miguous referenesD QFIFS@IRHA emdhl9s lwD PFPFQ@TVA ntidependeniesD QFIFR@IQSA rry setionsD RFIFR@IWTA rryEvluedD RFIFR@IWTA ssemly lngugeD SFPFI@PSTA ssertionsD QFQFQ@IUUA ssignment primitiveD RFIFR@IWTA ssoitive trnsformtionD IFPFR@QQAD IFPFS@QRA utomti prlleliztionD QFQFP@IUHA verge memory utiliztionD PFPFP@TQA k edgeD QFIFQ@IQQA kwrds dependeniesD QFIFR@IQSA ndwidthD IFIFU@PHA nk stllD IFIFU@PHA si lokD PFPFR@UWA si lok pro(lerD PFPFR@UWA si loksD PFIFR@RWA si optimiztionD PFIFS@SQA enhmrkD PFPFI@TPA enhmrksD @QA inD PFPFP@TQA inry oded deimlD IFPFQ@PWA lokD PFRFW@IIRA lok refereneD PFRFW@IIRA loked sGyD PFPFP@TQA lokingD PFRFI@IHIA rnhesD PFQFQ@WHAD PFQFR@WHAD PFRFR@IHRA rodstGgtherD RFPFQ@PPSA usesD QFPFP@IRTA ypssing heD IFIFU@PHA gCCD PFIFQ@RVA heD PFPFT@VRA

he ohereny protoolD QFPFP@IRTA he orgniztionD IFIFS@IQA hesD IFIFP@VAD IFIFR@WA hingD IFIFR@WA gsgD SFIFP@PQWA lutterD PFQFS@WSAD PFQFT@IHHAD PFQFU@IHHA oherenyD QFPFP@IRTA ommon suexpressionD PFIFT@SRA ommon suexpression elimintionD PFQFS@WSA ompilerD PFIFI@RUAD PFIFQ@RVAD PFIFR@RWAD PFIFT@SRAD PFIFU@TIAD PFIFV@TIAD QFIFT@IRQAD QFIFU@IRQAD QFQFQ@IUUA ompiler )exiilityD QFIFI@IPQA ompilersD PFIFP@RUAD PFQFQ@WHA omplex instrution set omputerD SFIFP@PQWA omputer rhitetureD @QA onstnt expressionD PFIFT@SRA onstnt foldingD PFIFT@SRA ontrol dependenyD QFIFP@IPRA opykD QFPFP@IRTA g speedD IFIFI@UA g timeD PFPFP@TQA ritil setionsD QFPFR@ITQA rossrsD QFPFP@IRTA

dtD QFIFP@IPRA dt dependenyD QFIFP@IPRA dt )ow nlysisD PFIFS@SQAD QFIFP@IPRAD QFQFR@IVVA dt plementD QFPFP@IRTA dt type onversionD PFQFS@WSA dtEprllelD RFIFP@IWIAD RFIFS@PHRA ded odeD PFIFT@SRA deomposing omputtionsD RFIFS@PHRA deomposing dtD RFIFS@PHRA deomposing tskD RFIFS@PHRA dependeniesD QFIFQ@IQQA dependeny distneD QFIFS@IRHA diret mppingD IFIFS@IQA diretEmpped heD IFIFS@IQA direted yli grphD QFIFP@IPRA disk sGyD PFPFS@VIA

PTV division y zeroD IFPFIH@RQA hewD IFIFP@VAD IFIFU@PHA dynmi rndom ess memoryD IFIFP@VA dynmi shedulingD QFQFQ@IUUA hoistingD PFQFS@WSA rp ontrol struturesD RFIFT@PHTA rp intrinsi struturesD RFIFT@PHTA

I

E F

elpsed timeD PFPFP@TQA etimeD PFPFP@TQAD PFPFQ@TVAD PFPFU@VRA ft loopsD PFRFR@IHRA fenesD PFQFI@VSA (xed pointD IFPFQ@PWA )t pro(leD PFPFQ@TVA )otD PFQFS@WSA )otingEpoint nlysisD PFIFS@SQA )otingEpoint numerD IFPFP@PWAD IFPFQ@PWAD IFPFW@RPA )otingEpoint numersD IFPFI@PWAD IFPFR@QQAD IFPFS@QRAD IFPFT@QTAD IFPFU@QUAD IFPFV@RHAD IFPFIH@RQAD IFPFII@RQAD IFPFIP@RRAD IFPFIQ@RSA )op mesurementsD PFPFT@VRA )ow dependeniesD QFIFR@IQSA )ow of ontrolD QFIFP@IPRA forkEjoin progrmmingD QFPFR@ITQA pyexD IFIFP@VAD IFPFII@RQAD PFIFP@RUAD PFIFQ@RVAD PFIFT@SRAD PFQFP@VTAD PFQFS@WSAD PFRFQ@IHQAD PFRFS@IHUAD PFRFU@IIIAD PFRFW@IIRAD PFRFII@IPHAD PFRFIP@IPHAD QFIFS@IRHAD QFIFU@IRQAD QFPFQ@ISIAD QFQFI@IUHAD QFQFQ@IUUAD RFIFI@IWIAD RFIFQ@IWSAD RFIFU@PIPAD RFPFI@PIQAD RFPFQ@PPSAD SFPFI@PSTA pyex UUD RFIFR@IWTAD RFPFR@PQUA pyex WHD RFIFR@IWTAD RFIFS@PHRAD RFIFT@PHTAD RFPFR@PQUA free rel memoryD PFPFS@VIA fully ssoitive heD IFIFS@IQA grdul under)owD IFPFW@RPA grnulrityD QFIFI@IPQA gretest ommon divisorD IFPFQ@PWA gurd digitsD IFPFT@QTA het )owD RFIFP@IWIAD RFIFR@IWTAD RFIFT@PHTAD RFPFP@PIQAD RFPFQ@PPSA righ performne omputingD @QAD RFIFI@IWIAD RFIFP@IWIAD RFIFQ@IWSAD RFIFR@IWTAD RFIFU@PIPAD SFIFI@PQWAD SFIFP@PQWAD SFIFQ@PRIAD SFIFR@PSHAD SFIFS@PSPAD SFIFT@PSQAD SFIFU@PSSAD SFIFV@PSSAD SFPFI@PSTA high performne pyexD RFIFS@PHRAD RFIFT@PHTAD RFIFU@PIPA

siiiD IFPFU@QUA siii opertionsD IFPFV@RHA sxh rryD IFIFR@WA indiret memory referenesD PFQFI@VSA indution vrile simpli(tionD PFIFT@SRA indution vrilesD PFIFT@SRA inext opertionD IFPFIH@RQA inliningD PFQFP@VTA inner loopD PFRFS@IHUA instrution setD SFIFP@PQWA instrutionElevel prllelismD QFIFI@IPQA sntel VHVVD SFPFI@PSTA interhngeD PFRFV@IIQA intermedite lngugeD PFIFR@RWA interproedurl nlysisD PFIFS@SQAD QFQFQ@IUUA invlid opertionsD IFPFIH@RQA invrintD PFQFR@WHA itertion shedulingD QFQFQ@IUUA kernel modeD PFPFP@TQA lngugeD PFIFQ@RVAD RFIFU@PIPA lnguge supportD RFIFI@IWIAD RFIFT@PHTA lrge hesD IFIFU@PHA ltenyD IFIFU@PHA loopD PFQFU@IHHAD QFIFQ@IQQAD QFQFS@IVWAD RFIFP@IWIA loop onditioningD PFQFU@IHHA loop index dependentD PFQFR@WHA loop interhngeD PFRFI@IHIAD PFRFT@IHWA loop nestD PFRFT@IHWA loop nestsD PFRFS@IHUA loop omptimiztionD PFRFR@IHRAD PFRFS@IHUA loop optimiztionD PFRFP@IHPAD PFRFQ@IHQAD PFRFT@IHWAD PFRFU@IIIAD PFRFV@IIQAD PFRFW@IIRAD PFRFIH@IIWA loop optimiztionsD PFRFII@IPHAD PFRFIP@IPHA loop unrollingD PFRFI@IHIAD PFRFQ@IHQAD PFRFR@IHRAD PFRFS@IHUA loopErried dependeniesD QFIFR@IQSA loopEinvrint expressionD PFIFT@SRA loopsD PFPFQ@TVAD PFQFR@WHAD PFRFII@IPHAD PFRFIP@IPHAD QFIFR@IQSA mlloD QFIFS@IRHA mntissGexponentD IFPFQ@PWA mppingD IFIFS@IQA

K L

G H

M mrosD PFQFP@VTA

mtrix mnipultionsD RFIFR@IWTA memoryD @QAD IFIFI@UAD IFIFP@VAD IFIFQ@WAD IFIFR@WAD IFIFT@IUAD IFIFU@PHAD IFIFV@PUAD IFIFW@PVAD PFRFU@IIIAD PFRFW@IIRAD PFRFIH@IIWA memory ess timeD IFIFP@VA memory yle timeD IFIFP@VA memory hierrhyD IFIFP@VA memory referene optimiztionD PFRFI@IHIA messgeEpssingD RFPFI@PIQA messgeEpssing environmentsD RFPFP@PIQAD RFPFQ@PPSAD RFPFR@PQUA messgeEpssing interfeD RFPFI@PIQAD RFPFQ@PPSA miroproessorsD @QA mixed strideD PFRFW@IIRA mixed typeD PFQFS@WSA move omputtionD PFRFT@IHWA multiproessingD QFPFP@IRTA multiproessorsD QFPFI@IRSAD QFPFQ@ISIAD QFPFR@ITQAD QFPFS@ITTAD QFPFT@ITVAD QFPFU@ITWAD QFQFI@IUHAD QFQFP@IUHAD QFQFQ@IUUAD QFQFR@IVVAD QFQFS@IVWA multithredingD QFPFQ@ISIAD QFPFR@ITQA

PTW prllel progrmmingD QFQFS@IVWA prllel regionD QFQFQ@IUUA prllel virtul mhineD RFPFI@PIQAD RFPFP@PIQA prllelismD PFQFT@IHHAD PFQFU@IHHAD QFIFI@IPQAD QFIFR@IQSAD QFIFS@IRHAD QFIFT@IRQAD QFIFU@IRQA prlleliztionD QFQFQ@IUUA perent utiliztionD PFPFP@TQA permuttionD QFIFS@IRHA permuttion ssertionD QFQFQ@IUUA piplinedD QFIFI@IPQA pixieD PFPFR@UWA pointer hsingD IFIFR@WA pointersD PFIFQ@RVA ostEsgD SFIFT@PSQA preonditioning loopD PFRFQ@IHQA proedure llsD PFRFR@IHRA pro(lersD PFPFR@UWA pro(lingD PFPFI@TPAD PFPFP@TQAD PFPFQ@TVAD PFPFR@UWAD PFPFS@VIAD PFPFT@VRAD PFPFU@VRAD PFRFII@IPHAD PFRFIP@IPHA progrm lutterD PFQFT@IHHAD PFQFU@IHHA progrmmingD QFQFI@IUHAD QFQFP@IUHAD QFQFQ@IUUAD QFQFR@IVVAD QFQFS@IVWAD RFIFT@PHTA propgtionD PFIFT@SRA

nested loop optimiztionD PFRFI@IHIA no dependeniesD QFQFQ@IUUA no equivlene ssertionD QFQFQ@IUUA no optimiztionD PFIFS@SQA numeril nlysisD IFPFI@PWA onEproessor distriutionD RFIFT@PHTA opertion ountingD PFRFP@IHPA optimiztionD PFIFQ@RVAD PFIFT@SRAD PFIFU@TIAD PFIFV@TIAD QFIFT@IRQAD QFIFU@IRQAD SFPFI@PSTA optimiztion levelsD PFIFS@SQA optimizeD PFIFR@RWA optimizing ompilerD PFIFI@RUA optimizing ompilersD PFIFP@RUA outEofEore solutionsD PFRFI@IHIAD PFRFIH@IIWA outer loopD PFRFS@IHUA output dependeniesD QFIFR@IQSA over)ow to in(nityD IFPFIH@RQA pge fultsD IFIFT@IUAD PFPFP@TQAD PFPFS@VIA pge tlesD IFIFT@IUA pgesD IFIFT@IUA prllelD RFIFR@IWTA prllel lngugesD RFIFQ@IWSA prllel loopsD QFQFQ@IUUA

Q R

qudruplesD PFIFR@RWA rtionl numersD IFPFP@PWAD IFPFQ@PWA relityD IFPFP@PWA redued instrution set omputerD SFIFS@PSPAD SFIFT@PSQA redued instrution set omputingD SFIFQ@PRIAD SFIFR@PSHA redution opertionsD QFQFP@IUHA redutionsD QFIFR@IQSA registersD IFIFP@VAD IFIFQ@WA reltion ssertionD QFQFQ@IUUA representtionD IFPFQ@PWA sgD PFIFU@TIAD PFIFV@TIAD SFIFQ@PRIAD SFIFR@PSHAD SFIFS@PSPA runtime pro(le nlysisD PFIFS@SQA sle prllel omputingD QFPFQ@ISIA setEssoitive heD IFIFS@IQA shpe onformneD RFIFR@IWTA shredEmemoryD QFPFI@IRSAD QFPFQ@ISIAD QFPFR@ITQAD QFPFS@ITTAD QFPFT@ITVAD QFPFU@ITWAD QFQFI@IUHAD QFQFP@IUHAD QFQFQ@IUUAD QFQFR@IVVAD QFQFS@IVWA

PUH shredEmemory speD PFPFP@TQA signi(ndD IFPFV@RHA sinkingD PFQFS@WSA sizeD PFPFS@VIA snoopingD QFPFP@IRTA softwreD QFPFQ@ISIA egD SFPFI@PSTA sptil lolity of refereneD IFIFR@WA speedupD QFQFP@IUHA ewD IFIFP@VAD IFIFR@WAD IFIFU@PHA sttement funtionD PFQFP@VTA stti rndom ess memoryD IFIFP@VAD IFIFR@WA stremliningD PFQFQ@WHA strength redutionD PFIFT@SRA strideD PFRFU@IIIA suroutine llsD PFQFI@VSAD PFQFP@VTA suroutine pro(lingD PFPFQ@TVA superslrD QFIFI@IPQA swp reD PFPFS@VIA swpsD PFPFP@TQA synhroniztionD QFPFR@ITQA system timeD PFPFP@TQA

thred privteD QFPFQ@ISIA thredElevel prllelismD QFIFI@IPQA thredElol vrilesD QFPFQ@ISIA timeD PFPFP@TQA timeEsed simultionD QFPFR@ITQA timingD PFPFI@TPAD PFPFP@TQAD PFPFQ@TVAD PFPFR@UWAD PFPFS@VIAD PFPFT@VRAD PFPFU@VRAD PFRFII@IPHAD PFRFIP@IPHA trnsltion lookside u'erD IFIFT@IUA trip ount ssertionD QFQFQ@IUUA trip ountsD PFRFR@IHRA tuning tehniquesD PFQFT@IHHAD PFQFU@IHHA

uniform memory essD QFPFI@IRSA unit strideD IFIFR@WA user dtgrm protoolsD RFPFP@PIQA user modeD PFPFP@TQA user timeD PFPFP@TQA useraspe thred ontext swithD QFPFQ@ISIA vrile renmingD PFIFT@SRA virtul memoryD IFIFT@IUAD PFPFS@VIA wordy testsD PFQFI@VSA worklodD PFPFI@TPA writeEthrough poliyD QFPFP@IRTA writekD QFPFP@IRTA

tgD IFIFS@IQA tehniquesD QFPFR@ITQA temporl lolity of refereneD IFIFR@WA thrshingD IFIFS@IQA thredD QFPFQ@ISIA

W wit sttesD IFIFP@VA

Attributions
golletionX idited yX ghrles everne vX httpXGGnxForgGontentGolIIIQTGIFRG vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4IFH sntrodution to the gonnexions idition4 sed here sX 4sntrodution to the gonnexions idition4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUHWGIFIG gesX IEP gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4sntrodution to righ erformne gomputing4 fyX ghrles everne vX httpXGGnxForgGontentGmQPTUTGIFPG gesX QES gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4wemory E sntrodution4 sed here sX 4sntrodution4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUQQGIFPG gesX UEV gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4wemory E wemory ehnology4 sed here sX 4wemory ehnology4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUITGIFPG gesX VEW gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4wemory E egisters4 sed here sX 4egisters4 fyX ghrles everne vX httpXGGnxForgGontentGmQPTVIGIFPG geX W gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4wemory E ghes4 sed here sX 4ghes4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUPSGIFPG gesX WEIQ gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG


High Performance Computing

PUP woduleX 4wemory E ghe yrgniztion4 sed here sX 4ghe yrgniztion4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUPPGIFPG gesX IQEIU gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4wemory E irtul wemory4 sed here sX 4irtul wemory4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUPVGIFPG gesX IUEPH gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4wemory E smproving wemory erformne4 sed here sX 4smproving wemory erformne4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUQTGIFPG gesX PHEPU gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4wemory E glosing xotes4 sed here sX 4glosing xotes4 fyX ghrles everne vX httpXGGnxForgGontentGmQPTWHGIFPG gesX PUEPV gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4wemory E ixerises4 sed here sX 4ixerises4 fyX ghrles everne vX httpXGGnxForgGontentGmQPTWVGIFPG geX PV gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4plotingEoint xumers E sntrodution4 sed here sX 4sntrodution4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUQWGIFPG geX PW gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG


woduleX 4plotingEoint xumers E elity4 sed here sX 4elity4 fyX ghrles everne vX httpXGGnxForgGontentGmQPURIGIFPG geX PW gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4plotingEoint xumers E epresenttion4 sed here sX 4epresenttion4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUUPGIFPG gesX PWEQQ gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4plotingEoint xumers E i'ets of plotingEoint epresenttion4 sed here sX 4i'ets of plotingEoint epresenttion4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUSSGIFPG gesX QQEQR gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4plotingEoint xumers E wore elger ht hoesn9t ork4 sed here sX 4wore elger ht hoesn9t ork4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUSRGIFPG gesX QREQT gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4plotingEoint xumers E smproving eury sing qurd higits4 sed here sX 4smproving eury sing qurd higits4 fyX ghrles everne vX httpXGGnxForgGontentGmQPURRGIFPG gesX QTEQU gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4plotingEoint xumers E ristory of siii plotingEoint pormt4 sed here sX 4ristory of siii plotingEoint pormt4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUUHGIFPG gesX QUERH gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG


PUR woduleX 4plotingEoint xumers E siii ypertions4 sed here sX 4siii ypertions4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUSTGIFPG gesX RHERP gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4plotingEoint xumers E peil lues4 sed here sX 4peil lues4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUSVGIFPG gesX RPERQ gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4plotingEoint xumers E ixeptions nd rps4 sed here sX 4ixeptions nd rps4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUTHGIFPG geX RQ gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4plotingEoint xumers E gompiler sssues4 sed here sX 4gompiler sssues4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUTPGIFPG gesX RQERR gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4plotingEoint xumers E glosing xotes4 sed here sX 4glosing xotes4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUTVGIFPG gesX RRERS gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4plotingEoint xumers E ixerises4 sed here sX 4ixerises4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUTSGIFPG geX RS gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG


woduleX 4ht gompiler hoes E sntrodution4 sed here sX 4sntrodution4 fyX ghrles everne vX httpXGGnxForgGontentGmQQTWHGIFPG geX RU gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4ht gompiler hoes E ristory of gompilers4 sed here sX 4ristory of gompilers4 fyX ghrles everne vX httpXGGnxForgGontentGmQQTVTGIFPG gesX RUERV gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4ht gompiler hoes E hih vnguge o yptimize4 sed here sX 4hih vnguge o yptimize4 fyX ghrles everne vX httpXGGnxForgGontentGmQQTVUGIFPG gesX RVERW gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4ht gompiler hoes E yptimizing gompiler our4 sed here sX 4yptimizing gompiler our4 fyX ghrles everne vX httpXGGnxForgGontentGmQQTWRGIFPG gesX RWESQ gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4ht gompiler hoes E yptimiztion vevels4 sed here sX 4yptimiztion vevels4 fyX ghrles everne vX httpXGGnxForgGontentGmQQTWPGIFPG gesX SQESR gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4ht gompiler hoes E glssil yptimiztions4 sed here sX 4glssil yptimiztions4 fyX ghrles everne vX httpXGGnxForgGontentGmQQTWTGIFPG gesX SRETI gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG


PUT woduleX 4ht gompiler hoes E glosing xotes4 sed here sX 4glosing xotes4 fyX ghrles everne vX httpXGGnxForgGontentGmQQTWWGIFPG geX TI gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4ht gompiler hoes E ixerises4 sed here sX 4ixerises4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUHHGIFPG gesX TIETP gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4iming nd ro(ling E sntrodution4 sed here sX 4sntrodution4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUHRGIFPG gesX TPETQ gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4iming nd ro(ling E iming4 sed here sX 4iming4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUHTGIFPG gesX TQETV gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4iming nd ro(ling E uroutine ro(ling4 sed here sX 4uroutine ro(ling4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUIQGIFPG gesX TVEUW gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4iming nd ro(ling E fsi flok ro(lers4 sed here sX 4fsi flok ro(lers4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUIHGIFPG gesX UWEVI gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG


woduleX 4iming nd ro(ling E irtul wemory4 sed here sX 4irtul wemory4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUIPGIFPG gesX VIEVR gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4iming nd ro(ling E glosing xotes4 sed here sX 4glosing xotes4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUIRGIFPG geX VR gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4iming nd ro(ling E ixerises4 sed here sX 4ixerises4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUIVGIFPG gesX VREVS gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4iliminting glutter E sntrodution4 sed here sX 4sntrodution4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUPHGIFPG gesX VSEVT gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4iliminting glutter E uroutine glls4 sed here sX 4uroutine glls4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUPIGIFPG gesX VTEWH gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4iliminting glutter E frnhes4 sed here sX 4frnhes4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUPPGIFPG geX WH gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG


PUV woduleX 4iliminting glutter E frnhes ith voops4 sed here sX 4frnhes ith voops4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUPQGIFPG gesX WHEWS gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4iliminting glutter E yther glutter4 sed here sX 4yther glutter4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUPRGIFPG gesX WSEIHH gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4iliminting glutter E glosing xotes4 sed here sX 4glosing xotes4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUPSGIFPG geX IHH gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4iliminting glutter E ixerises4 sed here sX 4ixerises4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUPUGIFPG gesX IHHEIHI gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4voop yptimiztions E sntrodution4 sed here sX 4sntrodution4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUPVGIFPG gesX IHIEIHP gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4voop yptimiztions E ypertion gounting4 sed here sX 4ypertion gounting4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUPWGIFPG gesX IHPEIHQ gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG


woduleX 4voop yptimiztions E fsi voop nrolling4 sed here sX 4fsi voop nrolling4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUQPGIFPG gesX IHQEIHR gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4voop yptimiztions E ulifying gndidtes for voop nrolling4 sed here sX 4ulifying gndidtes for voop nrolling p one level 4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUQQGIFPG gesX IHREIHU gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4voop yptimiztions E xested voops4 sed here sX 4xested voops4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUQRGIFPG gesX IHUEIHW gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4voop yptimiztions E voop snterhnge4 sed here sX 4voop snterhnge4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUQTGIFPG gesX IHWEIII gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4voop yptimiztions E wemory eess tterns4 sed here sX 4wemory eess tterns4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUQVGIFPG gesX IIIEIIQ gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4voop yptimiztions E hen snterhnge on9t ork4 sed here sX 4hen snterhnge on9t ork4 fyX ghrles everne vX httpXGGnxForgGontentGmQQURIGIFPG gesX IIQEIIR gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG


PVH woduleX 4voop yptimiztions E floking to ise wemory eess tterns4 sed here sX 4floking to ise wemory eess tterns4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUSTGIFPG gesX IIREIIW gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4voop yptimiztions E rogrms ht equire wore wemory hn ou rve4 sed here sX 4rogrms ht equire wore wemory hn ou rve4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUUHGIFPG gesX IIWEIPH gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4voop yptimiztions E glosing xotes4 sed here sX 4glosing xotes4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUUQGIFPG geX IPH gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4voop yptimiztions E ixerises4 sed here sX 4ixerises4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUUUGIFPG gesX IPHEIPP gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4nderstnding rllelism E sntrodution4 sed here sX 4sntrodution4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUUSGIFPG gesX IPQEIPR gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4nderstnding rllelism E hependenies4 sed here sX 4hependenies4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUUUGIFPG gesX IPREIQQ gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG


woduleX 4nderstnding rllelism E voops4 sed here sX 4voops4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUVRGIFPG gesX IQQEIQS gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4nderstnding rllelism E voopEgrried hependenies4 sed here sX 4voopEgrried hependenies 4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUVPGIFPG gesX IQSEIRH gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4nderstnding rllelism E emiguous eferenes4 sed here sX 4emiguous eferenes4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUVVGIFPG gesX IRHEIRQ gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4nderstnding rllelism E glosing xotes4 sed here sX 4glosing xotes4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUVWGIFPG geX IRQ gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4nderstnding rllelism E ixerises4 sed here sX 4ixerises4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUWPGIFPG gesX IRQEIRS gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4hredEwemory wultiproessors E sntrodution4 sed here sX 4sntrodution4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUWUGIFPG gesX IRSEIRT gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG


PVP woduleX 4hredEwemory wultiproessors E ymmetri wultiproessing rrdwre4 sed here sX 4ymmetri wultiproessing rrdwre4 fyX ghrles everne vX httpXGGnxForgGontentGmQPUWRGIFPG gesX IRTEISI gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4hredEwemory wultiproessors E wultiproessor oftwre gonepts4 sed here sX 4wultiproessor oftwre gonepts 4 fyX ghrles everne vX httpXGGnxForgGontentGmQPVHHGIFPG gesX ISIEITQ gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4hredEwemory wultiproessors E ehniques for wultithreded rogrms4 sed here sX 4ehniques for wultithreded rogrms4 fyX ghrles everne vX httpXGGnxForgGontentGmQPVHPGIFPG gesX ITQEITT gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4hredEwemory wultiproessors E e el ixmple4 sed here sX 4e el ixmple 4 fyX ghrles everne vX httpXGGnxForgGontentGmQPVHRGIFPG gesX ITTEITV gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4hredEwemory wultiproessors E glosing xotes4 sed here sX 4glosing xotes4 fyX ghrles everne vX httpXGGnxForgGontentGmQPVHUGIFPG gesX ITVEITW gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4hredEwemory wultiproessors E ixerises4 sed here sX 4ixerises4 fyX ghrles everne vX httpXGGnxForgGontentGmQPVIHGIFPG gesX ITWEIUH gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG


woduleX 4rogrmming hredEwemory wultiproessors E sntrodution4 sed here sX 4 sntrodution4 fyX ghrles everne vX httpXGGnxForgGontentGmQPVIPGIFPG geX IUH gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4rogrmming hredEwemory wultiproessors E eutomti rlleliztion4 sed here sX 4eutomti rlleliztion4 fyX ghrles everne vX httpXGGnxForgGontentGmQPVPIGIFPG gesX IUHEIUU gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4rogrmming hredEwemory wultiproessors E essisting the gompiler4 sed here sX 4essisting the gompiler4 fyX ghrles everne vX httpXGGnxForgGontentGmQPVIRGIFPG gesX IUUEIVV gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4rogrmming hredEwemory wultiproessors E glosing xotes4 sed here sX 4glosing xotes4 fyX ghrles everne vX httpXGGnxForgGontentGmQPVPHGIFPG gesX IVVEIVW gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4rogrmming hredEwemory wultiproessors E ixerises4 sed here sX 4ixerises4 fyX ghrles everne vX httpXGGnxForgGontentGmQPVIWGIFPG gesX IVWEIWH gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4vnguge upport for erformne E sntrodution4 sed here sX 4sntrodution4 fyX ghrles everne vX httpXGGnxForgGontentGmQQURRGIFPG geX IWI gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG


PVR woduleX 4vnguge upport for erformne E htErllel rolemX ret plow4 sed here sX 4htErllel rolemX ret plow4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUSIGIFPG gesX IWIEIWS gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4vnguge upport for erformne E ixpliity rllel vnguges4 sed here sX 4ixpliity rllel vnguges4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUSRGIFPG gesX IWSEIWT gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4vnguge upport for erformne E pyex WH4 sed here sX 4pyex WH4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUSUGIFPG gesX IWTEPHR gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4vnguge upport for erformne E rolem heomposition4 sed here sX 4rolem heomposition4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUTPGIFPG gesX PHREPHT gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4vnguge upport for erformne E righ erformne pyex @rpA4 sed here sX 4righ erformne pyex @rpA4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUTSGIFPG gesX PHTEPIP gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4vnguge upport for erformne E glosing xotes4 sed here sX 4glosing xotes4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUUSGIFPG geX PIP gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG


woduleX 4wessgeEssing invironments E sntrodution4 sed here sX 4sntrodution4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUVIGIFPG geX PIQ gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4wessgeEssing invironments E rllel irtul whine4 sed here sX 4rllel irtul whine4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUUWGIFPG gesX PIQEPPS gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4wessgeEssing invironments E wessgeEssing snterfe4 sed here sX 4wessgeEssing snterfe4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUVQGIFPG gesX PPSEPQU gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4wessgeEssing invironments E glosing xotes4 sed here sX 4glosing xotes4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUVRGIFPG geX PQU gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4ht is righ erformne gomputing E sntrodution4 sed here sX 4sntrodution4 fyX ghrles everne vX httpXGGnxForgGontentGmQQTUIGIFPG geX PQW gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4ht is righ erformne gomputing E hy gsgc4 sed here sX 4hy gsgc4 fyX ghrles everne vX httpXGGnxForgGontentGmQQTUPGIFPG gesX PQWEPRI gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG


PVT woduleX 4ht is righ erformne gomputing E pundmentl of sg4 sed here sX 4pundmentl of sg4 fyX ghrles everne vX httpXGGnxForgGontentGmQQTUQGIFPG gesX PRIEPSH gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4ht is righ erformne gomputing E eondEqenertion sg roessors4 sed here sX 4eondEqenertion sg roessors4 fyX ghrles everne vX httpXGGnxForgGontentGmQQTUSGIFPG gesX PSHEPSP gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4ht is righ erformne gomputing E sg wens pst4 sed here sX 4sg wens pst4 fyX ghrles everne vX httpXGGnxForgGontentGmQQTUWGIFPG gesX PSPEPSQ gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG


woduleX 4ht is righ erformne gomputing E yutEofEyrder ixeutionX he ostEsg erhiteture4 sed here sX 4yutEofEyrder ixeutionX he ostEsg erhiteture4 fyX ghrles everne vX httpXGGnxForgGontentGmQQTUUGIFPG gesX PSQEPSS gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4ht is righ erformne gomputing E glosing xotes4 sed here sX 4glosing xotes4 fyX ghrles everne vX httpXGGnxForgGontentGmQQTVQGIFPG geX PSS gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG woduleX 4ht is righ erformne gomputing E ixerises4 sed here sX 4ixerises4 fyX ghrles everne vX httpXGGnxForgGontentGmQQTVPGIFPG gesX PSSEPST gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG

woduleX 4eppendix f to 4righ erformne gomputing4X vooking t essemly vnguge4 sed here sX 4essemly vnguge4 fyX ghrles everne vX httpXGGnxForgGontentGmQQUVUGIFPG gesX PSTEPTT gopyrightX ghrles everne vienseX httpXGGretiveommonsForgGliensesGyGQFHG


High Performance Computing

The purpose of this book, High Performance Computing, has always been to teach new programmers and scientists about the basics of High Performance Computing. This book is for learners with a basic understanding of modern computer architecture, not advanced degrees in computer engineering, as it is an easily understood introduction and overview of the topic. Originally published by O'Reilly Media in 1998, the book has since gone out of print and has now been released under the Creative Commons Attribution License on Connexions.

About Connexions

Since 1999, Connexions has been pioneering a global system where anyone can create course materials and make them fully accessible and easily reusable free of charge. We are a Web-based authoring, teaching and learning environment open to anyone interested in education, including students, teachers, professors and lifelong learners. We connect ideas and facilitate educational communities.

Connexions's modular, interactive courses are in use worldwide by universities, community colleges, K-12 schools, distance learners, and lifelong learners. Connexions materials are in many languages, including English, Spanish, Chinese, Japanese, Italian, Vietnamese, French, Portuguese, and Thai. Connexions is part of an exciting new information distribution system that allows for Print on Demand Books. Connexions has partnered with innovative on-demand publisher QOOP to accelerate the delivery of printed course materials and textbooks into classrooms worldwide at lower prices than traditional academic publishers.
