
Whitepaper

NVIDIA's Next Generation CUDA™ Compute Architecture:
Kepler™ GK110

The Fastest, Most Efficient HPC Architecture Ever Built

V1.0

Table of Contents

Kepler GK110 - The Next Generation GPU Computing Architecture
Kepler GK110 - Extreme Performance, Extreme Efficiency
    Dynamic Parallelism
    Hyper-Q
    Grid Management Unit
    NVIDIA GPUDirect
An Overview of the GK110 Kepler Architecture
    Performance per Watt
    Streaming Multiprocessor (SMX) Architecture
        SMX Processing Core Architecture
        Quad Warp Scheduler
        New ISA Encoding: 255 Registers per Thread
        Shuffle Instruction
        Atomic Operations
        Texture Improvements
    Kepler Memory Subsystem - L1, L2, ECC
        64 KB Configurable Shared Memory and L1 Cache
        48 KB Read-Only Data Cache
        Improved L2 Cache
        Memory Protection Support
    Dynamic Parallelism
    Hyper-Q
    Grid Management Unit - Efficiently Keeping the GPU Utilized
    NVIDIA GPUDirect
Conclusion
Appendix A - Quick Refresher on CUDA
    CUDA Hardware Execution

Kepler GK110 - The Next Generation GPU Computing Architecture

As the demand for high performance parallel computing increases across many areas of science, medicine, engineering, and finance, NVIDIA continues to innovate and meet that demand with extraordinarily powerful GPU computing architectures. NVIDIA's existing Fermi GPUs have already redefined and accelerated High Performance Computing (HPC) capabilities in areas such as seismic processing, biochemistry simulations, weather and climate modeling, signal processing, computational finance, computer aided engineering, computational fluid dynamics, and data analysis. NVIDIA's new Kepler GK110 GPU raises the parallel computing bar considerably and will help solve the world's most difficult computing problems.

By offering much higher processing power than the prior GPU generation and by providing new methods to optimize and increase parallel workload execution on the GPU, Kepler GK110 simplifies creation of parallel programs and will further revolutionize high performance computing.

Kepler GK110 - Extreme Performance, Extreme Efficiency

Comprising 7.1 billion transistors, Kepler GK110 is not only the fastest, but also the most architecturally complex microprocessor ever built. Adding many new innovative features focused on compute performance, GK110 was designed to be a parallel processing powerhouse for Tesla and the HPC market.

Kepler GK110 will provide over 1 TFlop of double precision throughput with greater than 80% DGEMM efficiency versus 60-65% on the prior Fermi architecture.

In addition to greatly improved performance, the Kepler architecture offers a huge leap forward in power efficiency, delivering up to 3x the performance per watt of Fermi.

Kepler GK110 Die Photo

The following new features in Kepler GK110 enable increased GPU utilization, simplify parallel program design, and aid in the deployment of GPUs across the spectrum of compute environments, ranging from personal workstations to supercomputers:

Dynamic Parallelism - adds the capability for the GPU to generate new work for itself, synchronize on results, and control the scheduling of that work via dedicated, accelerated hardware paths, all without involving the CPU. By providing the flexibility to adapt to the amount and form of parallelism through the course of a program's execution, programmers can expose more varied kinds of parallel work and make the most efficient use of the GPU as a computation evolves. This capability allows less structured, more complex tasks to run easily and effectively, enabling larger portions of an application to run entirely on the GPU. In addition, programs are easier to create, and the CPU is freed for other tasks.

Hyper-Q - Hyper-Q enables multiple CPU cores to launch work on a single GPU simultaneously, thereby dramatically increasing GPU utilization and significantly reducing CPU idle times. Hyper-Q increases the total number of connections (work queues) between the host and the GK110 GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi). Hyper-Q is a flexible solution that allows separate connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting achieved GPU utilization, can see a dramatic performance increase without changing any existing code.

Grid Management Unit - Enabling Dynamic Parallelism requires an advanced, flexible grid management and dispatch control system. The new GK110 Grid Management Unit (GMU) manages and prioritizes grids to be executed on the GPU. The GMU can pause the dispatch of new grids and queue pending and suspended grids until they are ready to execute, providing the flexibility to enable powerful runtimes, such as Dynamic Parallelism. The GMU ensures both CPU- and GPU-generated workloads are properly managed and dispatched.

NVIDIA GPUDirect - NVIDIA GPUDirect is a capability that enables GPUs within a single computer, or GPUs in different servers located across a network, to directly exchange data without needing to go to CPU/system memory. The RDMA feature in GPUDirect allows third party devices such as SSDs, NICs, and IB adapters to directly access memory on multiple GPUs within the same system, significantly decreasing the latency of MPI send and receive messages to/from GPU memory. It also reduces demands on system memory bandwidth and frees the GPU DMA engines for use by other CUDA tasks. Kepler GK110 also supports other GPUDirect features including Peer-to-Peer and GPUDirect for Video.

An Overview of the GK110 Kepler Architecture

Kepler GK110 was built first and foremost for Tesla, and its goal was to be the highest performing parallel computing microprocessor in the world. GK110 not only greatly exceeds the raw compute horsepower delivered by Fermi, but it does so efficiently, consuming significantly less power and generating much less heat output.

A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers. Different products will use different configurations of GK110. For example, some products may deploy 13 or 14 SMXs.

Key features of the architecture that will be discussed below in more depth include:

- The new SMX processor architecture
- An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation.
- Hardware support throughout the design to enable new programming model capabilities

Kepler GK110 - Full chip block diagram

Kepler GK110 supports the new CUDA Compute Capability 3.5. (For a brief overview of CUDA, see Appendix A - Quick Refresher on CUDA.) The following table compares parameters of different Compute Capabilities for Fermi and Kepler GPU architectures:

                                            FERMI     FERMI     KEPLER       KEPLER
                                            GF100     GF104     GK104        GK110

Compute Capability                          2.0       2.1       3.0          3.5
Threads / Warp                              32        32        32           32
Max Warps / Multiprocessor                  48        48        64           64
Max Threads / Multiprocessor                1536      1536      2048         2048
Max Thread Blocks / Multiprocessor          8         8         16           16
32-bit Registers / Multiprocessor           32768     32768     65536        65536
Max Registers / Thread                      63        63        63           255
Max Threads / Thread Block                  1024      1024      1024         1024
Shared Memory Size Configurations (bytes)   16K/48K   16K/48K   16K/32K/48K  16K/32K/48K
Max X Grid Dimension                        2^16-1    2^16-1    2^32-1       2^32-1
Hyper-Q                                     No        No        No           Yes
Dynamic Parallelism                         No        No        No           Yes

Compute Capability of Fermi and Kepler GPUs

Performance per Watt

A principal design goal for the Kepler architecture was improving power efficiency. When designing Kepler, NVIDIA engineers applied everything learned from Fermi to better optimize the Kepler architecture for highly efficient operation. TSMC's 28nm manufacturing process plays an important role in lowering power consumption, but many GPU architecture modifications were required to further reduce power consumption while maintaining great performance.

Every hardware unit in Kepler was designed and scrubbed to provide outstanding performance per watt. The best example of great perf/watt is seen in the design of Kepler GK110's new Streaming Multiprocessor (SMX), which is similar in many respects to the SMX unit recently introduced in Kepler GK104, but includes substantially more double precision units for compute algorithms.

Streaming Multiprocessor (SMX) Architecture

Kepler GK110's new SMX introduces several architectural innovations that make it not only the most powerful multiprocessor we've built, but also the most programmable and power-efficient.

SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).

SMX Processing Core Architecture

Each of the Kepler GK110 SMX units features 192 single-precision CUDA cores, and each core has fully pipelined floating-point and integer arithmetic logic units. Kepler retains the full IEEE 754-2008 compliant single- and double-precision arithmetic introduced in Fermi, including the fused multiply-add (FMA) operation.

One of the design goals for the Kepler GK110 SMX was to significantly increase the GPU's delivered double precision performance, since double precision arithmetic is at the heart of many HPC applications. Kepler GK110's SMX also retains the special function units (SFUs) for fast approximate transcendental operations as in previous-generation GPUs, providing 8x the number of SFUs of the Fermi GF110 SM.
Similar to GK104 SMX units, the cores within the new GK110 SMX units use the primary GPU clock rather than the 2x shader clock. Recall the 2x shader clock was introduced in the G80 Tesla-architecture GPU and used in all subsequent Tesla- and Fermi-architecture GPUs. Running execution units at a higher clock rate allows a chip to achieve a given target throughput with fewer copies of the execution units, which is essentially an area optimization, but the clocking logic for the faster cores is more power-hungry. For Kepler, our priority was performance per watt. While we made many optimizations that benefitted both area and power, we chose to optimize for power even at the expense of some added area cost, with a larger number of processing cores running at the lower, less power-hungry GPU clock.
Quad Warp Scheduler

The SMX schedules threads in groups of 32 parallel threads called warps. Each SMX features four warp schedulers and eight instruction dispatch units, allowing four warps to be issued and executed concurrently. Kepler's quad warp scheduler selects four warps, and two independent instructions per warp can be dispatched each cycle. Unlike Fermi, which did not permit double precision instructions to be paired with other instructions, Kepler GK110 allows double precision instructions to be paired with other instructions.

Each Kepler SMX contains 4 Warp Schedulers, each with dual Instruction Dispatch Units. A single Warp Scheduler Unit is shown above.

We also looked for opportunities to optimize the power in the SMX warp scheduler logic. For example, both Kepler and Fermi schedulers contain similar hardware units to handle the scheduling function, including:

a) Register scoreboarding for long latency operations (texture and load)
b) Inter-warp scheduling decisions (e.g., pick the best warp to go next among eligible candidates)
c) Thread block level scheduling (e.g., the GigaThread engine)

However, Fermi's scheduler also contains a complex hardware stage to prevent data hazards in the math datapath itself. A multi-port register scoreboard keeps track of any registers that are not yet ready with valid data, and a dependency checker block analyzes register usage across a multitude of fully decoded warp instructions against the scoreboard, to determine which are eligible to issue.

For Kepler, we recognized that this information is deterministic (the math pipeline latencies are not variable), and therefore it is possible for the compiler to determine up front when instructions will be ready to issue, and provide this information in the instruction itself. This allowed us to replace several complex and power-expensive blocks with a simple hardware block that extracts the predetermined latency information and uses it to mask out warps from eligibility at the inter-warp scheduler stage.

New ISA Encoding: 255 Registers per Thread

The number of registers that can be accessed by a thread has been quadrupled in GK110, allowing each thread access to up to 255 registers. Codes that exhibit high register pressure or spilling behavior in Fermi may see substantial speedups as a result of the increased available per-thread register count. A compelling example can be seen in the QUDA library for performing lattice QCD (quantum chromodynamics) calculations using CUDA. QUDA fp64-based algorithms see performance increases of up to 5.3x due to the ability to use many more registers per thread and experiencing fewer spills to local memory.
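As an illustrative sketch (not from this paper) of how a developer might work with the larger per-thread register budget, CUDA's `__launch_bounds__` qualifier tells the compiler the launch configuration it should budget registers against; the kernel body below is hypothetical:

```cuda
#include <cuda_runtime.h>

// Hypothetical register-hungry kernel. __launch_bounds__(maxThreadsPerBlock,
// minBlocksPerMultiprocessor) lets the compiler trade occupancy against the
// 255-register-per-thread limit available on GK110 (compute capability 3.5).
__global__ void __launch_bounds__(256, 4)
heavy_kernel(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // A long dependency chain keeps many values live in registers; on Fermi
    // (63 registers/thread) such code could spill to local memory.
    double a = in[i], acc = 0.0;
    #pragma unroll
    for (int k = 0; k < 32; ++k)
        acc += a * a + k;
    out[i] = acc;
}
```

Compiling with `nvcc -arch=sm_35 -Xptxas -v` reports the per-thread register count the compiler actually chose.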
Shuffle Instruction

To further improve performance, Kepler implements a new Shuffle instruction, which allows threads within a warp to share data. Previously, sharing data between threads within a warp required separate store and load operations to pass the data through shared memory. With the Shuffle instruction, threads within a warp can read values from other threads in the warp in just about any imaginable permutation. Shuffle supports arbitrary indexed references, i.e., any thread reads from any other thread. Useful shuffle subsets, including next-thread (offset up or down by a fixed amount) and XOR "butterfly" style permutations among the threads in a warp, are also available as CUDA intrinsics.

Shuffle offers a performance advantage over shared memory, in that a store-and-load operation is carried out in a single step. Shuffle also can reduce the amount of shared memory needed per thread block, since data exchanged at the warp level never needs to be placed in shared memory. In the case of FFT, which requires data sharing within a warp, a 6% performance gain can be seen just by using Shuffle.

This example shows some of the variations possible using the new Shuffle instruction in Kepler.
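A common use of the Shuffle instruction is a warp-level sum reduction; the sketch below is illustrative rather than taken from this paper, and uses the `__shfl_down_sync` form required by CUDA 9 and later (Kepler-era code used `__shfl_down` without the mask):

```cuda
#include <cuda_runtime.h>

// Warp-level sum reduction via shuffle. Each step halves the number of
// participating lanes; no shared memory is needed for the exchange.
__inline__ __device__ float warpReduceSum(float val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up holding the warp's total
}

__global__ void sumWarps(const float *in, float *out)
{
    float v = warpReduceSum(in[blockIdx.x * blockDim.x + threadIdx.x]);
    if (threadIdx.x % 32 == 0)   // one write per warp
        atomicAdd(out, v);
}
```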

Atomic Operations

Atomic memory operations are important in parallel programming, allowing concurrent threads to correctly perform read-modify-write operations on shared data structures. Atomic operations such as add, min, max, and compare-and-swap are atomic in the sense that the read, modify, and write operations are performed without interruption by other threads. Atomic memory operations are widely used for parallel sorting, reduction operations, and building data structures in parallel without locks that serialize thread execution.

Throughput of global memory atomic operations on Kepler GK110 is substantially improved compared to the Fermi generation. Atomic operation throughput to a common global memory address is improved by 9x, to one operation per clock. Atomic operation throughput to independent global addresses is also significantly accelerated, and logic to handle address conflicts has been made more efficient. Atomic operations can often be processed at rates similar to global load operations. This speed increase makes atomics fast enough to use frequently within kernel inner loops, eliminating the separate reduction passes that were previously required by some algorithms to consolidate results. Kepler GK110 also expands the native support for 64-bit atomic operations in global memory. In addition to atomicAdd, atomicCAS, and atomicExch (which were also supported by Fermi and Kepler GK104), GK110 supports the following:

- atomicMin
- atomicMax
- atomicAnd
- atomicOr
- atomicXor

Other atomic operations which are not supported natively (for example 64-bit floating point atomics) may be emulated using the compare-and-swap (CAS) instruction.
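The CAS emulation mentioned above can be sketched as follows for a double-precision add; this is the well-known retry loop from the CUDA programming guide, shown here for illustration:

```cuda
#include <cuda_runtime.h>

// Emulate a 64-bit floating-point atomic add with atomicCAS, since GK110 has
// no native double-precision atomicAdd. The loop retries until no other
// thread has modified the location between the read and the swap.
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *addr_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *addr_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(addr_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);  // another thread intervened; retry
    return __longlong_as_double(old);
}
```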
Texture Improvements

The GPU's dedicated hardware Texture units are a valuable resource for compute programs with a need to sample or filter image data. The texture throughput in Kepler is significantly increased compared to Fermi: each SMX unit contains 16 texture filtering units, a 4x increase vs. the Fermi GF110 SM.

In addition, Kepler changes the way texture state is managed. In the Fermi generation, for the GPU to reference a texture, it had to be assigned a "slot" in a fixed-size binding table prior to grid launch. The number of slots in that table ultimately limits how many unique textures a program can read from at run time. Ultimately, a program was limited to accessing only 128 simultaneous textures in Fermi.

With bindless textures in Kepler, the additional step of using slots isn't necessary: texture state is now saved as an object in memory and the hardware fetches these state objects on demand, making binding tables obsolete. This effectively eliminates any limits on the number of unique textures that can be referenced by a compute program. Instead, programs can map textures at any time and pass texture handles around as they would any other pointer.
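A minimal sketch of the bindless path using the CUDA 5.0 texture-object API (the helper function and buffer names are hypothetical):

```cuda
#include <cuda_runtime.h>

// Create a texture object over a linear device buffer. The returned handle is
// a plain value that can be passed to any kernel, with no binding-table slot.
cudaTextureObject_t makeTexture(float *devBuf, size_t n)
{
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = devBuf;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, NULL);
    return tex;  // in a kernel: tex1Dfetch<float>(tex, i)
}
```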

Kepler Memory Subsystem - L1, L2, ECC

Kepler's memory hierarchy is organized similarly to Fermi. The Kepler architecture supports a unified memory request path for loads and stores, with an L1 cache per SMX multiprocessor. Kepler GK110 also enables compiler-directed use of an additional new cache for read-only data, as described below.

64 KB Configurable Shared Memory and L1 Cache

In the Kepler GK110 architecture, as in the previous generation Fermi architecture, each SMX has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. Kepler now allows for additional flexibility in configuring the allocation of shared memory and L1 cache by permitting a 32 KB / 32 KB split between shared memory and L1 cache. To support the increased throughput of each SMX unit, the shared memory bandwidth for 64b and larger load operations is also doubled compared to the Fermi SM, to 256B per core clock.
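The three splits described above are selected through the runtime API; a minimal sketch (the `stencil` kernel is a placeholder):

```cuda
#include <cuda_runtime.h>

__global__ void stencil(const float *in, float *out)
{
    out[threadIdx.x] = in[threadIdx.x];  // placeholder body
}

void configure()
{
    // cudaFuncCachePreferShared -> 48 KB shared / 16 KB L1
    // cudaFuncCachePreferL1     -> 16 KB shared / 48 KB L1
    // cudaFuncCachePreferEqual  -> 32 KB shared / 32 KB L1 (new on Kepler)
    cudaFuncSetCacheConfig(stencil, cudaFuncCachePreferEqual);
}
```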
48 KB Read-Only Data Cache

In addition to the L1 cache, Kepler introduces a 48 KB cache for data that is known to be read-only for the duration of the function. In the Fermi generation, this cache was accessible only by the Texture unit. Expert programmers often found it advantageous to load data through this path explicitly by mapping their data as textures, but this approach had many limitations.

In Kepler, in addition to significantly increasing the capacity of this cache along with the texture horsepower increase, we decided to make the cache directly accessible to the SM for general load operations. Use of the read-only path is beneficial because it takes both load and working set footprint off of the Shared/L1 cache path. In addition, the Read-Only Data Cache's higher tag bandwidth supports full speed unaligned memory access patterns among other scenarios.

Use of this path is managed automatically by the compiler: access to any variable or data structure that is known to be constant through programmer use of the C99-standard "const __restrict" keywords will be tagged by the compiler to be loaded through the Read-Only Data Cache.
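A minimal sketch of this compiler-directed path; the kernel is illustrative, and the `__ldg()` intrinsic (compute capability 3.5) shown in the comment is the explicit way to request the same load path:

```cuda
#include <cuda_runtime.h>

// With both pointers qualified const __restrict__, the compiler can prove the
// input is read-only for the kernel's lifetime and route its loads through
// the 48 KB read-only data cache. __ldg(&in[i]) would force that path
// explicitly for a single load.
__global__ void scale(const float * __restrict__ in,
                      float * __restrict__ out, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * in[i];
}
```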
Improved L2 Cache

The Kepler GK110 GPU features 1536 KB of dedicated L2 cache memory, double the amount of L2 available in the Fermi architecture. The L2 cache is the primary point of data unification between the SMX units, servicing all load, store, and texture requests and providing efficient, high speed data sharing across the GPU. The L2 cache on Kepler offers up to 2x the bandwidth per clock available in Fermi. Algorithms for which data addresses are not known beforehand, such as physics solvers, ray tracing, and sparse matrix multiplication, especially benefit from the cache hierarchy. Filter and convolution kernels that require multiple SMs to read the same data also benefit.

Memory Protection Support

Like Fermi, Kepler's register files, shared memories, L1 cache, L2 cache, and DRAM memory are protected by a Single-Error Correct Double-Error Detect (SECDED) ECC code. In addition, the Read-Only Data Cache supports single-error correction through a parity check; in the event of a parity error, the cache unit automatically invalidates the failed line, forcing a read of the correct data from L2.

ECC checkbit fetches from DRAM necessarily consume some amount of DRAM bandwidth, which results in a performance difference between ECC-enabled and ECC-disabled operation, especially on memory bandwidth-sensitive applications. Kepler GK110 implements several optimizations to ECC checkbit fetch handling based on Fermi experience. As a result, the ECC on-vs-off performance delta has been reduced by an average of 66%, as measured across our internal compute application test suite.

Dynamic Parallelism

In a hybrid CPU-GPU system, enabling a larger amount of parallel code in an application to run efficiently and entirely within the GPU improves scalability and performance as GPUs increase in perf/watt. To accelerate these additional parallel portions of the application, GPUs must support more varied types of parallel workloads.

Dynamic Parallelism is a new feature introduced with Kepler GK110 that allows the GPU to generate new work for itself, synchronize on results, and control the scheduling of that work via dedicated, accelerated hardware paths, all without involving the CPU.

Fermi was very good at processing large parallel data structures when the scale and parameters of the problem were known at kernel launch time. All work was launched from the host CPU, would run to completion, and return a result back to the CPU. The result would then be used as part of the final solution, or would be analyzed by the CPU, which would then send additional requests back to the GPU for additional processing.

In Kepler GK110 any kernel can launch another kernel, and can create the necessary streams and events and manage the dependencies needed to process additional work without the need for host CPU interaction. This architectural innovation makes it easier for developers to create and optimize recursive and data-dependent execution patterns, and allows more of a program to be run directly on the GPU. The system CPU can then be freed up for additional tasks, or the system could be configured with a less powerful CPU to carry out the same workload.
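A minimal sketch of a kernel launching another kernel (the kernels and the threshold test are hypothetical; Dynamic Parallelism requires compute capability 3.5 and compilation with `nvcc -arch=sm_35 -rdc=true -lcudadevrt`):

```cuda
#include <cuda_runtime.h>

__global__ void childKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// The parent inspects an intermediate, on-GPU result and launches a child
// grid itself, with no round trip to the host CPU.
__global__ void parentKernel(float *data, int n, float threshold)
{
    if (threadIdx.x == 0 && data[0] > threshold) {
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();  // device-side wait for the child grid
    }
}
```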

Dynamic Parallelism allows more parallel code in an application to be launched directly by the GPU onto itself (right side of image) rather than requiring CPU intervention (left side of image).

Dynamic Parallelism allows more varieties of parallel algorithms to be implemented on the GPU, including nested loops with differing amounts of parallelism, parallel teams of serial control-task threads, or simple serial control code offloaded to the GPU in order to promote data locality with the parallel portion of the application.

Because a kernel has the ability to launch additional workloads based on intermediate, on-GPU results, programmers can now intelligently load balance work to focus the bulk of their resources on the areas of the problem that either require the most processing power or are most relevant to the solution.

One example would be dynamically setting up a grid for a numerical simulation. Typically, grid cells are focused in regions of greatest change, requiring an expensive pre-processing pass through the data. Alternatively, a uniformly coarse grid could be used to prevent wasted GPU resources, or a uniformly fine grid could be used to ensure all the features are captured, but these options risk missing simulation features or over-spending compute resources on regions of less interest.

With Dynamic Parallelism, the grid resolution can be determined dynamically at runtime in a data-dependent manner. Starting with a coarse grid, the simulation can "zoom in" on areas of interest while avoiding unnecessary calculation in areas with little change. Though this could be accomplished using a sequence of CPU-launched kernels, it would be far simpler to allow the GPU to refine the grid itself by analyzing the data and launching additional work as part of a single simulation kernel, eliminating interruption of the CPU and data transfers between the CPU and GPU.

Image attribution: Charles Reid

The above example illustrates the benefits of using a dynamically sized grid in a numerical simulation. To meet peak precision requirements, a fixed resolution simulation must run at an excessively fine resolution across the entire simulation domain, whereas a multi-resolution grid applies the correct simulation resolution to each area based on local variation.

Hyper-Q

One of the challenges in the past has been keeping the GPU supplied with an optimally scheduled load of work from multiple streams. The Fermi architecture supported 16-way concurrency of kernel launches from separate streams, but ultimately the streams were all multiplexed into the same hardware work queue. This allowed for false intra-stream dependencies, requiring dependent kernels within one stream to complete before additional kernels in a separate stream could be executed. While this could be alleviated to some extent through the use of a breadth-first launch order, as program complexity increases, this can become more and more difficult to manage efficiently.

Kepler GK110 improves on this functionality with the new Hyper-Q feature. Hyper-Q increases the total number of connections (work queues) between the host and the CUDA Work Distributor (CWD) logic in the GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi). Hyper-Q is a flexible solution that allows connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting GPU utilization, can see up to a 32x performance increase without changing any existing code.

Hyper-Q permits more simultaneous connections between CPU and GPU.

Each CUDA stream is managed within its own hardware work queue, inter-stream dependencies are optimized, and operations in one stream will no longer block other streams, enabling streams to execute concurrently without needing to specifically tailor the launch order to eliminate possible false dependencies.

Hyper-Q offers significant benefits for use in MPI-based parallel computer systems. Legacy MPI-based algorithms were often created to run on multi-core CPU systems, with the amount of work assigned to each MPI process scaled accordingly. This can lead to a single MPI process having insufficient work to fully occupy the GPU. While it has always been possible for multiple MPI processes to share a GPU, these processes could become bottlenecked by false dependencies. Hyper-Q removes those false dependencies, dramatically increasing the efficiency of GPU sharing across MPI processes.

Hyper-Q working with CUDA Streams: In the Fermi model shown on the left, only (C,P) and (R,X) can run concurrently due to intra-stream dependencies caused by the single hardware work queue. The Kepler Hyper-Q model allows all streams to run concurrently using separate work queues.
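The stream pattern Hyper-Q accelerates can be sketched as below; the per-task kernel and buffers are hypothetical. The same code runs on Fermi, but serializes through the single hardware queue there unless launches are carefully interleaved breadth-first:

```cuda
#include <cuda_runtime.h>

__global__ void smallKernel(float *buf)  // hypothetical per-task kernel
{
    buf[threadIdx.x] += 1.0f;
}

// Launch independent work into 32 streams; with Hyper-Q each stream can map
// to its own hardware work queue and the kernels can run concurrently.
void launchTasks(float *bufs[32])
{
    cudaStream_t streams[32];
    for (int s = 0; s < 32; ++s) {
        cudaStreamCreate(&streams[s]);
        smallKernel<<<1, 64, 0, streams[s]>>>(bufs[s]);
    }
    for (int s = 0; s < 32; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```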

Grid Management Unit - Efficiently Keeping the GPU Utilized

New features in Kepler GK110, such as the ability for CUDA kernels to launch work directly on the GPU with Dynamic Parallelism, required that the CPU-to-GPU workflow in Kepler offer increased functionality over the Fermi design. On Fermi, a grid of thread blocks would be launched by the CPU and would always run to completion, creating a simple unidirectional flow of work from the host to the SMs via the CUDA Work Distributor (CWD) unit. Kepler GK110 was designed to improve the CPU-to-GPU workflow by allowing the GPU to efficiently manage both CPU- and CUDA-created workloads.

We discussed the ability of the Kepler GK110 GPU to allow kernels to launch work directly on the GPU, and it's important to understand the changes made in the Kepler GK110 architecture to facilitate these new functions. In Kepler, a grid can be launched from the CPU just as was the case with Fermi; however, new grids can also be created programmatically by CUDA within the Kepler SMX unit. To manage both CUDA-created and host-originated grids, a new Grid Management Unit (GMU) was introduced in Kepler GK110. This control unit manages and prioritizes grids that are passed into the CWD to be sent to the SMX units for execution.

The CWD in Kepler holds grids that are ready to dispatch, and it is able to dispatch 32 active grids, which is double the capacity of the Fermi CWD. The Kepler CWD communicates with the GMU via a bi-directional link that allows the GMU to pause the dispatch of new grids and to hold pending and suspended grids until needed. The GMU also has a direct connection to the Kepler SMX units to permit grids that launch additional work on the GPU via Dynamic Parallelism to send the new work back to the GMU to be prioritized and dispatched. If the kernel that dispatched the additional workload pauses, the GMU will hold it inactive until the dependent work has completed.

The redesigned Kepler HOST-to-GPU workflow shows the new Grid Management Unit, which allows it to manage the actively dispatching grids, pause dispatch, and hold pending and suspended grids.

NVIDIA GPUDirect

When working with a large amount of data, increasing the data throughput and reducing latency is vital to increasing compute performance. Kepler GK110 supports the RDMA feature in NVIDIA GPUDirect, which is designed to improve performance by allowing direct access to GPU memory by third-party devices such as IB adapters, NICs, and SSDs. When using CUDA 5.0, GPUDirect provides the following important features:

- Direct memory access (DMA) between NIC and GPU without the need for CPU-side data buffering.
- Significantly improved MPISend/MPIRecv efficiency between GPU and other nodes in a network.
- Eliminates CPU bandwidth and latency bottlenecks.
- Works with a variety of 3rd-party network, capture, and storage devices.

Applications like reverse time migration (used in seismic imaging for oil & gas exploration) distribute the large imaging data across several GPUs. Hundreds of GPUs must collaborate to crunch the data, often communicating intermediate results. GPUDirect enables much higher aggregate bandwidth for this GPU-to-GPU communication scenario within a server and across servers with the P2P and RDMA features. Kepler GK110 also supports other GPUDirect features such as Peer-to-Peer and GPUDirect for Video.

GPUDirect RDMA allows direct access to GPU memory from 3rd-party devices such as network adapters, which translates into direct transfers between GPUs across nodes as well.
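Within a single server, the Peer-to-Peer path mentioned above is exposed through the CUDA runtime; a minimal sketch (device numbers and pointer names are assumptions):

```cuda
#include <cuda_runtime.h>

// Copy a buffer from GPU 0's memory to GPU 1's memory. Once peer access is
// enabled, the transfer can go directly between the GPUs, bypassing
// CPU/system memory; cudaMemcpyPeer falls back to staging otherwise.
void copyBetweenGpus(float *dstOnGpu1, const float *srcOnGpu0, size_t bytes)
{
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 1, 0);
    if (canAccess) {
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);  // let GPU 1 access GPU 0's memory
    }
    cudaMemcpyPeer(dstOnGpu1, 1, srcOnGpu0, 0, bytes);
}
```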

Conclusion

With the launch of Fermi in 2010, NVIDIA ushered in a new era in the high performance computing (HPC) industry based on a hybrid computing model where CPUs and GPUs work together to solve computationally intensive workloads. Now, with the new Kepler GK110 GPU, NVIDIA again raises the bar for the HPC industry.

Kepler GK110 was designed from the ground up to maximize computational performance and throughput computing with outstanding power efficiency. The architecture has many new innovations such as SMX, Dynamic Parallelism, and Hyper-Q that make hybrid computing dramatically faster, easier to program, and applicable to a broader set of applications. Kepler GK110 GPUs will be used in numerous systems ranging from workstations to supercomputers to address the most daunting challenges in HPC.

Appendix A - Quick Refresher on CUDA

CUDA is a combination hardware/software platform that enables NVIDIA GPUs to execute programs written with C, C++, Fortran, and other languages. A CUDA program invokes parallel functions called kernels that execute across many parallel threads. The programmer or compiler organizes these threads into thread blocks and grids of thread blocks, as shown in Figure 1. Each thread within a thread block executes an instance of the kernel. Each thread also has thread and block IDs within its thread block and grid, a program counter, registers, per-thread private memory, inputs, and output results.

A thread block is a set of concurrently executing threads that can cooperate among themselves through barrier synchronization and shared memory. A thread block has a block ID within its grid. A grid is an array of thread blocks that execute the same kernel, read inputs from global memory, write results to global memory, and synchronize between dependent kernel calls. In the CUDA parallel programming model, each thread has a per-thread private memory space used for register spills, function calls, and C automatic array variables. Each thread block has a per-block shared memory space used for inter-thread communication, data sharing, and result sharing in parallel algorithms. Grids of thread blocks share results in Global Memory space after kernel-wide global synchronization.

Figure 1: CUDA hierarchy of threads, blocks, and grids, with corresponding per-thread private, per-block shared, and per-application global memory spaces.

CUDA Hardware Execution

CUDA's hierarchy of threads maps to a hierarchy of processors on the GPU; a GPU executes one or more kernel grids; a streaming multiprocessor (SM on Fermi / SMX on Kepler) executes one or more thread blocks; and CUDA cores and other execution units in the SMX execute thread instructions. The SMX executes threads in groups of 32 threads called warps. While programmers can generally ignore warp execution for functional correctness and focus on programming individual scalar threads, they can greatly improve performance by having threads in a warp execute the same code path and access memory with nearby addresses.
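The scalar-thread view described above can be sketched with a simple SAXPY kernel (illustrative, not from this paper): each thread computes its global index from block and thread IDs, and consecutive threads in a warp touch consecutive addresses, so their loads and stores coalesce.

```cuda
#include <cuda_runtime.h>

// y = a*x + y, one element per thread. blockIdx/blockDim/threadIdx give each
// thread a unique global index; adjacent lanes access adjacent memory.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Host-side launch: enough 256-thread blocks to cover n elements.
void runSaxpy(int n, float a, float *d_x, float *d_y)
{
    saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);
    cudaDeviceSynchronize();
}
```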

Notice

ALL INFORMATION PROVIDED IN THIS WHITEPAPER, INCLUDING COMMENTARY, OPINION, NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all information previously supplied. NVIDIA Corporation products are not authorized for use as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks

NVIDIA, the NVIDIA logo, CUDA, FERMI, KEPLER and GeForce are trademarks or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright

© 2012 NVIDIA Corporation. All rights reserved.