
Whitepaper

NVIDIA's Next Generation CUDA™ Compute Architecture:
Kepler™ GK110

The Fastest, Most Efficient HPC Architecture Ever Built

V1.0

Table of Contents

Kepler GK110 - The Next Generation GPU Computing Architecture
Kepler GK110 - Extreme Performance, Extreme Efficiency
    Dynamic Parallelism
    Hyper-Q
    Grid Management Unit
    NVIDIA GPUDirect
An Overview of the GK110 Kepler Architecture
    Performance per Watt
    Streaming Multiprocessor (SMX) Architecture
        SMX Processing Core Architecture
        Quad Warp Scheduler
        New ISA Encoding: 255 Registers per Thread
        Shuffle Instruction
        Atomic Operations
        Texture Improvements
    Kepler Memory Subsystem - L1, L2, ECC
        64 KB Configurable Shared Memory and L1 Cache
        48 KB Read-Only Data Cache
        Improved L2 Cache
        Memory Protection Support
    Dynamic Parallelism
    Hyper-Q
    Grid Management Unit - Efficiently Keeping the GPU Utilized
    NVIDIA GPUDirect
Conclusion
Appendix A - Quick Refresher on CUDA
    CUDA Hardware Execution

Kepler GK110 - The Next Generation GPU Computing Architecture

As the demand for high performance parallel computing increases across many areas of science, medicine, engineering, and finance, NVIDIA continues to innovate and meet that demand with extraordinarily powerful GPU computing architectures. NVIDIA's existing Fermi GPUs have already redefined and accelerated High Performance Computing (HPC) capabilities in areas such as seismic processing, biochemistry simulations, weather and climate modeling, signal processing, computational finance, computer aided engineering, computational fluid dynamics, and data analysis. NVIDIA's new Kepler GK110 GPU raises the parallel computing bar considerably and will help solve the world's most difficult computing problems.

By offering much higher processing power than the prior GPU generation and by providing new methods to optimize and increase parallel workload execution on the GPU, Kepler GK110 simplifies creation of parallel programs and will further revolutionize high performance computing.

Kepler GK110 - Extreme Performance, Extreme Efficiency

Comprising 7.1 billion transistors, Kepler GK110 is not only the fastest, but also the most architecturally complex microprocessor ever built. Adding many new innovative features focused on compute performance, GK110 was designed to be a parallel processing powerhouse for Tesla and the HPC market.

Kepler GK110 will provide over 1 TFlop of double precision throughput with greater than 80% DGEMM efficiency versus 60-65% on the prior Fermi architecture.

In addition to greatly improved performance, the Kepler architecture offers a huge leap forward in power efficiency, delivering up to 3x the performance per watt of Fermi.

Kepler GK110 Die Photo

The following new features in Kepler GK110 enable increased GPU utilization, simplify parallel program design, and aid in the deployment of GPUs across the spectrum of compute environments, ranging from personal workstations to supercomputers:

Dynamic Parallelism - adds the capability for the GPU to generate new work for itself, synchronize on results, and control the scheduling of that work via dedicated, accelerated hardware paths, all without involving the CPU. By providing the flexibility to adapt to the amount and form of parallelism through the course of a program's execution, programmers can expose more varied kinds of parallel work and make the most efficient use of the GPU as a computation evolves. This capability allows less structured, more complex tasks to run easily and effectively, enabling larger portions of an application to run entirely on the GPU. In addition, programs are easier to create, and the CPU is freed for other tasks.

Hyper-Q - Hyper-Q enables multiple CPU cores to launch work on a single GPU simultaneously, thereby dramatically increasing GPU utilization and significantly reducing CPU idle times. Hyper-Q increases the total number of connections (work queues) between the host and the GK110 GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi). Hyper-Q is a flexible solution that allows separate connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting achieved GPU utilization, can see a dramatic performance increase without changing any existing code.

Grid Management Unit - Enabling Dynamic Parallelism requires an advanced, flexible grid management and dispatch control system. The new GK110 Grid Management Unit (GMU) manages and prioritizes grids to be executed on the GPU. The GMU can pause the dispatch of new grids and queue pending and suspended grids until they are ready to execute, providing the flexibility to enable powerful runtimes, such as Dynamic Parallelism. The GMU ensures both CPU- and GPU-generated workloads are properly managed and dispatched.

NVIDIA GPUDirect - NVIDIA GPUDirect is a capability that enables GPUs within a single computer, or GPUs in different servers located across a network, to directly exchange data without needing to go to CPU/system memory. The RDMA feature in GPUDirect allows third party devices such as SSDs, NICs, and IB adapters to directly access memory on multiple GPUs within the same system, significantly decreasing the latency of MPI send and receive messages to/from GPU memory. It also reduces demands on system memory bandwidth and frees the GPU DMA engines for use by other CUDA tasks. Kepler GK110 also supports other GPUDirect features including Peer-to-Peer and GPUDirect for Video.

An Overview of the GK110 Kepler Architecture

Kepler GK110 was built first and foremost for Tesla, and its goal was to be the highest performing parallel computing microprocessor in the world. GK110 not only greatly exceeds the raw compute horsepower delivered by Fermi, but it does so efficiently, consuming significantly less power and generating much less heat output.

A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers. Different products will use different configurations of GK110. For example, some products may deploy 13 or 14 SMXs.

Key features of the architecture that will be discussed below in more depth include:

- The new SMX processor architecture
- An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation.
- Hardware support throughout the design to enable new programming model capabilities

Kepler GK110 - Full chip block diagram

Kepler GK110 supports the new CUDA Compute Capability 3.5. (For a brief overview of CUDA, see Appendix A - Quick Refresher on CUDA.) The following table compares parameters of different Compute Capabilities for Fermi and Kepler GPU architectures:

                                            FERMI     FERMI     KEPLER       KEPLER
                                            GF100     GF104     GK104        GK110

Compute Capability                          2.0       2.1       3.0          3.5
Threads / Warp                              32        32        32           32
Max Warps / Multiprocessor                  48        48        64           64
Max Threads / Multiprocessor                1536      1536      2048         2048
Max Thread Blocks / Multiprocessor          8         8         16           16
32-bit Registers / Multiprocessor           32768     32768     65536        65536
Max Registers / Thread                      63        63        63           255
Max Threads / Thread Block                  1024      1024      1024         1024
Shared Memory Size Configurations (bytes)   16K/48K   16K/48K   16K/32K/48K  16K/32K/48K
Max X Grid Dimension                        2^16-1    2^16-1    2^32-1       2^32-1
Hyper-Q                                     No        No        No           Yes
Dynamic Parallelism                         No        No        No           Yes

Compute Capability of Fermi and Kepler GPUs

Performance per Watt

A principal design goal for the Kepler architecture was improving power efficiency. When designing Kepler, NVIDIA engineers applied everything learned from Fermi to better optimize the Kepler architecture for highly efficient operation. TSMC's 28nm manufacturing process plays an important role in lowering power consumption, but many GPU architecture modifications were required to further reduce power consumption while maintaining great performance.

Every hardware unit in Kepler was designed and scrubbed to provide outstanding performance per watt. The best example of great perf/watt is seen in the design of Kepler GK110's new Streaming Multiprocessor (SMX), which is similar in many respects to the SMX unit recently introduced in Kepler GK104, but includes substantially more double precision units for compute algorithms.

Streaming Multiprocessor (SMX) Architecture

Kepler GK110's new SMX introduces several architectural innovations that make it not only the most powerful multiprocessor we've built, but also the most programmable and power-efficient.

SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).

SMX Processing Core Architecture

Each of the Kepler GK110 SMX units features 192 single-precision CUDA cores, and each core has fully pipelined floating-point and integer arithmetic logic units. Kepler retains the full IEEE 754-2008 compliant single- and double-precision arithmetic introduced in Fermi, including the fused multiply-add (FMA) operation.

One of the design goals for the Kepler GK110 SMX was to significantly increase the GPU's delivered double precision performance, since double precision arithmetic is at the heart of many HPC applications. Kepler GK110's SMX also retains the special function units (SFUs) for fast approximate transcendental operations as in previous-generation GPUs, providing 8x the number of SFUs of the Fermi GF110 SM.
Similar to GK104 SMX units, the cores within the new GK110 SMX units use the primary GPU clock rather than the 2x shader clock. Recall the 2x shader clock was introduced in the G80 Tesla-architecture GPU and used in all subsequent Tesla- and Fermi-architecture GPUs. Running execution units at a higher clock rate allows a chip to achieve a given target throughput with fewer copies of the execution units, which is essentially an area optimization, but the clocking logic for the faster cores is more power-hungry. For Kepler, our priority was performance per watt. While we made many optimizations that benefitted both area and power, we chose to optimize for power even at the expense of some added area cost, with a larger number of processing cores running at the lower, less power-hungry GPU clock.
Quad Warp Scheduler

The SMX schedules threads in groups of 32 parallel threads called warps. Each SMX features four warp schedulers and eight instruction dispatch units, allowing four warps to be issued and executed concurrently. Kepler's quad warp scheduler selects four warps, and two independent instructions per warp can be dispatched each cycle. Unlike Fermi, which did not permit double precision instructions to be paired with other instructions, Kepler GK110 allows double precision instructions to be paired with other instructions.

Each Kepler SMX contains 4 Warp Schedulers, each with dual Instruction Dispatch Units. A single Warp Scheduler Unit is shown above.

We also looked for opportunities to optimize the power in the SMX warp scheduler logic. For example, both Kepler and Fermi schedulers contain similar hardware units to handle the scheduling function, including:

a) Register scoreboarding for long latency operations (texture and load)
b) Inter-warp scheduling decisions (e.g., pick the best warp to go next among eligible candidates)
c) Thread block level scheduling (e.g., the GigaThread engine)

However, Fermi's scheduler also contains a complex hardware stage to prevent data hazards in the math datapath itself. A multi-port register scoreboard keeps track of any registers that are not yet ready with valid data, and a dependency checker block analyzes register usage across a multitude of fully decoded warp instructions against the scoreboard, to determine which are eligible to issue.

For Kepler, we recognized that this information is deterministic (the math pipeline latencies are not variable), and therefore it is possible for the compiler to determine up front when instructions will be ready to issue, and provide this information in the instruction itself. This allowed us to replace several complex and power-expensive blocks with a simple hardware block that extracts the predetermined latency information and uses it to mask out warps from eligibility at the inter-warp scheduler stage.

New ISA Encoding: 255 Registers per Thread

The number of registers that can be accessed by a thread has been quadrupled in GK110, allowing each thread access to up to 255 registers. Codes that exhibit high register pressure or spilling behavior in Fermi may see substantial speedups as a result of the increased available per-thread register count. A compelling example can be seen in the QUDA library for performing lattice QCD (quantum chromodynamics) calculations using CUDA. QUDA fp64-based algorithms see performance increases of up to 5.3x due to the ability to use many more registers per thread and experiencing fewer spills to local memory.
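As an illustrative sketch (not from this paper) of how a developer might work with the larger per-thread register budget, CUDA's `__launch_bounds__` qualifier tells the compiler the launch configuration it should budget registers against; the kernel body below is hypothetical:

```cuda
#include <cuda_runtime.h>

// Hypothetical register-hungry kernel. __launch_bounds__(maxThreadsPerBlock,
// minBlocksPerMultiprocessor) lets the compiler trade occupancy against the
// 255-register-per-thread limit available on GK110 (compute capability 3.5).
__global__ void __launch_bounds__(256, 4)
heavy_kernel(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // A long dependency chain keeps many values live in registers; on Fermi
    // (63 registers/thread) such code could spill to local memory.
    double a = in[i], acc = 0.0;
    #pragma unroll
    for (int k = 0; k < 32; ++k)
        acc += a * a + k;
    out[i] = acc;
}
```

Compiling with `nvcc -arch=sm_35 -Xptxas -v` reports the per-thread register count the compiler actually chose.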
Shuffle Instruction

To further improve performance, Kepler implements a new Shuffle instruction, which allows threads within a warp to share data. Previously, sharing data between threads within a warp required separate store and load operations to pass the data through shared memory. With the Shuffle instruction, threads within a warp can read values from other threads in the warp in just about any imaginable permutation. Shuffle supports arbitrary indexed references, i.e., any thread reads from any other thread. Useful shuffle subsets, including next-thread (offset up or down by a fixed amount) and XOR "butterfly" style permutations among the threads in a warp, are also available as CUDA intrinsics.

Shuffle offers a performance advantage over shared memory, in that a store-and-load operation is carried out in a single step. Shuffle also can reduce the amount of shared memory needed per thread block, since data exchanged at the warp level never needs to be placed in shared memory. In the case of FFT, which requires data sharing within a warp, a 6% performance gain can be seen just by using Shuffle.

This example shows some of the variations possible using the new Shuffle instruction in Kepler.
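A common use of the Shuffle instruction is a warp-level sum reduction; the sketch below is illustrative rather than taken from this paper, and uses the `__shfl_down_sync` form required by CUDA 9 and later (Kepler-era code used `__shfl_down` without the mask):

```cuda
#include <cuda_runtime.h>

// Warp-level sum reduction via shuffle. Each step halves the number of
// participating lanes; no shared memory is needed for the exchange.
__inline__ __device__ float warpReduceSum(float val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up holding the warp's total
}

__global__ void sumWarps(const float *in, float *out)
{
    float v = warpReduceSum(in[blockIdx.x * blockDim.x + threadIdx.x]);
    if (threadIdx.x % 32 == 0)   // one write per warp
        atomicAdd(out, v);
}
```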

Atomic Operations

Atomic memory operations are important in parallel programming, allowing concurrent threads to correctly perform read-modify-write operations on shared data structures. Atomic operations such as add, min, max, and compare-and-swap are atomic in the sense that the read, modify, and write operations are performed without interruption by other threads. Atomic memory operations are widely used for parallel sorting, reduction operations, and building data structures in parallel without locks that serialize thread execution.

Throughput of global memory atomic operations on Kepler GK110 is substantially improved compared to the Fermi generation. Atomic operation throughput to a common global memory address is improved by 9x, to one operation per clock. Atomic operation throughput to independent global addresses is also significantly accelerated, and logic to handle address conflicts has been made more efficient. Atomic operations can often be processed at rates similar to global load operations. This speed increase makes atomics fast enough to use frequently within kernel inner loops, eliminating the separate reduction passes that were previously required by some algorithms to consolidate results. Kepler GK110 also expands the native support for 64-bit atomic operations in global memory. In addition to atomicAdd, atomicCAS, and atomicExch (which were also supported by Fermi and Kepler GK104), GK110 supports the following:

- atomicMin
- atomicMax
- atomicAnd
- atomicOr
- atomicXor

Other atomic operations which are not supported natively (for example 64-bit floating point atomics) may be emulated using the compare-and-swap (CAS) instruction.
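The CAS emulation mentioned above can be sketched as follows for a double-precision add; this is the well-known retry loop from the CUDA programming guide, shown here for illustration:

```cuda
#include <cuda_runtime.h>

// Emulate a 64-bit floating-point atomic add with atomicCAS, since GK110 has
// no native double-precision atomicAdd. The loop retries until no other
// thread has modified the location between the read and the swap.
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *addr_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *addr_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(addr_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);  // another thread intervened; retry
    return __longlong_as_double(old);
}
```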
Texture Improvements

The GPU's dedicated hardware Texture units are a valuable resource for compute programs with a need to sample or filter image data. The texture throughput in Kepler is significantly increased compared to Fermi: each SMX unit contains 16 texture filtering units, a 4x increase vs. the Fermi GF110 SM.

In addition, Kepler changes the way texture state is managed. In the Fermi generation, for the GPU to reference a texture, it had to be assigned a "slot" in a fixed-size binding table prior to grid launch. The number of slots in that table ultimately limits how many unique textures a program can read from at run time. Ultimately, a program was limited to accessing only 128 simultaneous textures in Fermi.

With bindless textures in Kepler, the additional step of using slots isn't necessary: texture state is now saved as an object in memory and the hardware fetches these state objects on demand, making binding tables obsolete. This effectively eliminates any limits on the number of unique textures that can be referenced by a compute program. Instead, programs can map textures at any time and pass texture handles around as they would any other pointer.
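A minimal sketch of the bindless path using the CUDA 5.0 texture-object API (the helper function and buffer names are hypothetical):

```cuda
#include <cuda_runtime.h>

// Create a texture object over a linear device buffer. The returned handle is
// a plain value that can be passed to any kernel, with no binding-table slot.
cudaTextureObject_t makeTexture(float *devBuf, size_t n)
{
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = devBuf;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, NULL);
    return tex;  // in a kernel: tex1Dfetch<float>(tex, i)
}
```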

Kepler Memory Subsystem - L1, L2, ECC

Kepler's memory hierarchy is organized similarly to Fermi. The Kepler architecture supports a unified memory request path for loads and stores, with an L1 cache per SMX multiprocessor. Kepler GK110 also enables compiler-directed use of an additional new cache for read-only data, as described below.

64 KB Configurable Shared Memory and L1 Cache

In the Kepler GK110 architecture, as in the previous generation Fermi architecture, each SMX has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. Kepler now allows for additional flexibility in configuring the allocation of shared memory and L1 cache by permitting a 32 KB / 32 KB split between shared memory and L1 cache. To support the increased throughput of each SMX unit, the shared memory bandwidth for 64b and larger load operations is also doubled compared to the Fermi SM, to 256B per core clock.
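The three splits described above are selected through the runtime API; a minimal sketch (the `stencil` kernel is a placeholder):

```cuda
#include <cuda_runtime.h>

__global__ void stencil(const float *in, float *out)
{
    out[threadIdx.x] = in[threadIdx.x];  // placeholder body
}

void configure()
{
    // cudaFuncCachePreferShared -> 48 KB shared / 16 KB L1
    // cudaFuncCachePreferL1     -> 16 KB shared / 48 KB L1
    // cudaFuncCachePreferEqual  -> 32 KB shared / 32 KB L1 (new on Kepler)
    cudaFuncSetCacheConfig(stencil, cudaFuncCachePreferEqual);
}
```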
48 KB Read-Only Data Cache

In addition to the L1 cache, Kepler introduces a 48 KB cache for data that is known to be read-only for the duration of the function. In the Fermi generation, this cache was accessible only by the Texture unit. Expert programmers often found it advantageous to load data through this path explicitly by mapping their data as textures, but this approach had many limitations.

In Kepler, in addition to significantly increasing the capacity of this cache along with the texture horsepower increase, we decided to make the cache directly accessible to the SM for general load operations. Use of the read-only path is beneficial because it takes both load and working set footprint off of the Shared/L1 cache path. In addition, the Read-Only Data Cache's higher tag bandwidth supports full speed unaligned memory access patterns among other scenarios.

Use of this path is managed automatically by the compiler: access to any variable or data structure that is known to be constant through programmer use of the C99-standard "const __restrict" keywords will be tagged by the compiler to be loaded through the Read-Only Data Cache.
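A minimal sketch of this compiler-directed path; the kernel is illustrative, and the `__ldg()` intrinsic (compute capability 3.5) shown in the comment is the explicit way to request the same load path:

```cuda
#include <cuda_runtime.h>

// With both pointers qualified const __restrict__, the compiler can prove the
// input is read-only for the kernel's lifetime and route its loads through
// the 48 KB read-only data cache. __ldg(&in[i]) would force that path
// explicitly for a single load.
__global__ void scale(const float * __restrict__ in,
                      float * __restrict__ out, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * in[i];
}
```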
Improved L2 Cache

The Kepler GK110 GPU features 1536 KB of dedicated L2 cache memory, double the amount of L2 available in the Fermi architecture. The L2 cache is the primary point of data unification between the SMX units, servicing all load, store, and texture requests and providing efficient, high speed data sharing across the GPU. The L2 cache on Kepler offers up to 2x the bandwidth per clock available in Fermi. Algorithms for which data addresses are not known beforehand, such as physics solvers, ray tracing, and sparse matrix multiplication, especially benefit from the cache hierarchy. Filter and convolution kernels that require multiple SMs to read the same data also benefit.

Memory Protection Support

Like Fermi, Kepler's register files, shared memories, L1 cache, L2 cache, and DRAM memory are protected by a Single-Error Correct Double-Error Detect (SECDED) ECC code. In addition, the Read-Only Data Cache supports single-error correction through a parity check; in the event of a parity error, the cache unit automatically invalidates the failed line, forcing a read of the correct data from L2.

ECC checkbit fetches from DRAM necessarily consume some amount of DRAM bandwidth, which results in a performance difference between ECC-enabled and ECC-disabled operation, especially on memory bandwidth-sensitive applications. Kepler GK110 implements several optimizations to ECC checkbit fetch handling based on Fermi experience. As a result, the ECC on-vs-off performance delta has been reduced by an average of 66%, as measured across our internal compute application test suite.

Dynamic Parallelism

In a hybrid CPU-GPU system, enabling a larger amount of parallel code in an application to run efficiently and entirely within the GPU improves scalability and performance as GPUs increase in perf/watt. To accelerate these additional parallel portions of the application, GPUs must support more varied types of parallel workloads.

Dynamic Parallelism is a new feature introduced with Kepler GK110 that allows the GPU to generate new work for itself, synchronize on results, and control the scheduling of that work via dedicated, accelerated hardware paths, all without involving the CPU.

Fermi was very good at processing large parallel data structures when the scale and parameters of the problem were known at kernel launch time. All work was launched from the host CPU, would run to completion, and return a result back to the CPU. The result would then be used as part of the final solution, or would be analyzed by the CPU, which would then send additional requests back to the GPU for additional processing.

In Kepler GK110 any kernel can launch another kernel, and can create the necessary streams and events and manage the dependencies needed to process additional work without the need for host CPU interaction. This architectural innovation makes it easier for developers to create and optimize recursive and data-dependent execution patterns, and allows more of a program to be run directly on the GPU. The system CPU can then be freed up for additional tasks, or the system could be configured with a less powerful CPU to carry out the same workload.
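A minimal sketch of a kernel launching another kernel (the kernels and the threshold test are hypothetical; Dynamic Parallelism requires compute capability 3.5 and compilation with `nvcc -arch=sm_35 -rdc=true -lcudadevrt`):

```cuda
#include <cuda_runtime.h>

__global__ void childKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// The parent inspects an intermediate, on-GPU result and launches a child
// grid itself, with no round trip to the host CPU.
__global__ void parentKernel(float *data, int n, float threshold)
{
    if (threadIdx.x == 0 && data[0] > threshold) {
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();  // device-side wait for the child grid
    }
}
```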

Dynamic Parallelism allows more parallel code in an application to be launched directly by the GPU onto itself (right side of image) rather than requiring CPU intervention (left side of image).

Dynamic Parallelism allows more varieties of parallel algorithms to be implemented on the GPU, including nested loops with differing amounts of parallelism, parallel teams of serial control-task threads, or simple serial control code offloaded to the GPU in order to promote data locality with the parallel portion of the application.

Because a kernel has the ability to launch additional workloads based on intermediate, on-GPU results, programmers can now intelligently load balance work to focus the bulk of their resources on the areas of the problem that either require the most processing power or are most relevant to the solution.

One example would be dynamically setting up a grid for a numerical simulation. Typically, grid cells are focused in regions of greatest change, requiring an expensive pre-processing pass through the data. Alternatively, a uniformly coarse grid could be used to prevent wasted GPU resources, or a uniformly fine grid could be used to ensure all the features are captured, but these options risk missing simulation features or over-spending compute resources on regions of less interest.

With Dynamic Parallelism, the grid resolution can be determined dynamically at runtime in a data-dependent manner. Starting with a coarse grid, the simulation can "zoom in" on areas of interest while avoiding unnecessary calculation in areas with little change. Though this could be accomplished using a sequence of CPU-launched kernels, it would be far simpler to allow the GPU to refine the grid itself by analyzing the data and launching additional work as part of a single simulation kernel, eliminating interruption of the CPU and data transfers between the CPU and GPU.

Image attribution: Charles Reid

The above example illustrates the benefits of using a dynamically sized grid in a numerical simulation. To meet peak precision requirements, a fixed resolution simulation must run at an excessively fine resolution across the entire simulation domain, whereas a multi-resolution grid applies the correct simulation resolution to each area based on local variation.

Hyper-Q

One of the challenges in the past has been keeping the GPU supplied with an optimally scheduled load of work from multiple streams. The Fermi architecture supported 16-way concurrency of kernel launches from separate streams, but ultimately the streams were all multiplexed into the same hardware work queue. This allowed for false intra-stream dependencies, requiring dependent kernels within one stream to complete before additional kernels in a separate stream could be executed. While this could be alleviated to some extent through the use of a breadth-first launch order, as program complexity increases, this can become more and more difficult to manage efficiently.

Kepler GK110 improves on this functionality with the new Hyper-Q feature. Hyper-Q increases the total number of connections (work queues) between the host and the CUDA Work Distributor (CWD) logic in the GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi). Hyper-Q is a flexible solution that allows connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting GPU utilization, can see up to a 32x performance increase without changing any existing code.

Hyper-Q permits more simultaneous connections between CPU and GPU.

Each CUDA stream is managed within its own hardware work queue, inter-stream dependencies are optimized, and operations in one stream will no longer block other streams, enabling streams to execute concurrently without needing to specifically tailor the launch order to eliminate possible false dependencies.

Hyper-Q offers significant benefits for use in MPI-based parallel computer systems. Legacy MPI-based algorithms were often created to run on multi-core CPU systems, with the amount of work assigned to each MPI process scaled accordingly. This can lead to a single MPI process having insufficient work to fully occupy the GPU. While it has always been possible for multiple MPI processes to share a GPU, these processes could become bottlenecked by false dependencies. Hyper-Q removes those false dependencies, dramatically increasing the efficiency of GPU sharing across MPI processes.

Hyper-Q working with CUDA Streams: In the Fermi model shown on the left, only (C,P) and (R,X) can run concurrently due to intra-stream dependencies caused by the single hardware work queue. The Kepler Hyper-Q model allows all streams to run concurrently using separate work queues.
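The stream pattern Hyper-Q accelerates can be sketched as below; the per-task kernel and buffers are hypothetical. The same code runs on Fermi, but serializes through the single hardware queue there unless launches are carefully interleaved breadth-first:

```cuda
#include <cuda_runtime.h>

__global__ void smallKernel(float *buf)  // hypothetical per-task kernel
{
    buf[threadIdx.x] += 1.0f;
}

// Launch independent work into 32 streams; with Hyper-Q each stream can map
// to its own hardware work queue and the kernels can run concurrently.
void launchTasks(float *bufs[32])
{
    cudaStream_t streams[32];
    for (int s = 0; s < 32; ++s) {
        cudaStreamCreate(&streams[s]);
        smallKernel<<<1, 64, 0, streams[s]>>>(bufs[s]);
    }
    for (int s = 0; s < 32; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}
```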

Grid Management Unit - Efficiently Keeping the GPU Utilized

New features in Kepler GK110, such as the ability for CUDA kernels to launch work directly on the GPU with Dynamic Parallelism, required that the CPU-to-GPU workflow in Kepler offer increased functionality over the Fermi design. On Fermi, a grid of thread blocks would be launched by the CPU and would always run to completion, creating a simple unidirectional flow of work from the host to the SMs via the CUDA Work Distributor (CWD) unit. Kepler GK110 was designed to improve the CPU-to-GPU workflow by allowing the GPU to efficiently manage both CPU- and CUDA-created workloads.

We discussed the ability of the Kepler GK110 GPU to allow kernels to launch work directly on the GPU, and it's important to understand the changes made in the Kepler GK110 architecture to facilitate these new functions. In Kepler, a grid can be launched from the CPU just as was the case with Fermi; however, new grids can also be created programmatically by CUDA within the Kepler SMX unit. To manage both CUDA-created and host-originated grids, a new Grid Management Unit (GMU) was introduced in Kepler GK110. This control unit manages and prioritizes grids that are passed into the CWD to be sent to the SMX units for execution.

The CWD in Kepler holds grids that are ready to dispatch, and it is able to dispatch 32 active grids, which is double the capacity of the Fermi CWD. The Kepler CWD communicates with the GMU via a bi-directional link that allows the GMU to pause the dispatch of new grids and to hold pending and suspended grids until needed. The GMU also has a direct connection to the Kepler SMX units to permit grids that launch additional work on the GPU via Dynamic Parallelism to send the new work back to the GMU to be prioritized and dispatched. If the kernel that dispatched the additional workload pauses, the GMU will hold it inactive until the dependent work has completed.

The redesigned Kepler HOST-to-GPU workflow shows the new Grid Management Unit, which allows it to manage the actively dispatching grids, pause dispatch, and hold pending and suspended grids.

NVIDIA GPUDirect

When working with a large amount of data, increasing the data throughput and reducing latency is vital to increasing compute performance. Kepler GK110 supports the RDMA feature in NVIDIA GPUDirect, which is designed to improve performance by allowing direct access to GPU memory by third-party devices such as IB adapters, NICs, and SSDs. When using CUDA 5.0, GPUDirect provides the following important features:

- Direct memory access (DMA) between NIC and GPU without the need for CPU-side data buffering.
- Significantly improved MPISend/MPIRecv efficiency between GPU and other nodes in a network.
- Eliminates CPU bandwidth and latency bottlenecks.
- Works with a variety of 3rd-party network, capture, and storage devices.

Applications like reverse time migration (used in seismic imaging for oil & gas exploration) distribute the large imaging data across several GPUs. Hundreds of GPUs must collaborate to crunch the data, often communicating intermediate results. GPUDirect enables much higher aggregate bandwidth for this GPU-to-GPU communication scenario within a server and across servers with the P2P and RDMA features. Kepler GK110 also supports other GPUDirect features such as Peer-to-Peer and GPUDirect for Video.

GPUDirect RDMA allows direct access to GPU memory from 3rd-party devices such as network adapters, which translates into direct transfers between GPUs across nodes as well.
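Within a single server, the Peer-to-Peer path mentioned above is exposed through the CUDA runtime; a minimal sketch (device numbers and pointer names are assumptions):

```cuda
#include <cuda_runtime.h>

// Copy a buffer from GPU 0's memory to GPU 1's memory. Once peer access is
// enabled, the transfer can go directly between the GPUs, bypassing
// CPU/system memory; cudaMemcpyPeer falls back to staging otherwise.
void copyBetweenGpus(float *dstOnGpu1, const float *srcOnGpu0, size_t bytes)
{
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 1, 0);
    if (canAccess) {
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);  // let GPU 1 access GPU 0's memory
    }
    cudaMemcpyPeer(dstOnGpu1, 1, srcOnGpu0, 0, bytes);
}
```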

Conclusion

With the launch of Fermi in 2010, NVIDIA ushered in a new era in the high performance computing (HPC) industry based on a hybrid computing model where CPUs and GPUs work together to solve computationally intensive workloads. Now, with the new Kepler GK110 GPU, NVIDIA again raises the bar for the HPC industry.

Kepler GK110 was designed from the ground up to maximize computational performance and throughput computing with outstanding power efficiency. The architecture has many new innovations such as SMX, Dynamic Parallelism, and Hyper-Q that make hybrid computing dramatically faster, easier to program, and applicable to a broader set of applications. Kepler GK110 GPUs will be used in numerous systems ranging from workstations to supercomputers to address the most daunting challenges in HPC.

Appendix A - Quick Refresher on CUDA

CUDA is a combination hardware/software platform that enables NVIDIA GPUs to execute programs written with C, C++, Fortran, and other languages. A CUDA program invokes parallel functions called kernels that execute across many parallel threads. The programmer or compiler organizes these threads into thread blocks and grids of thread blocks, as shown in Figure 1. Each thread within a thread block executes an instance of the kernel. Each thread also has thread and block IDs within its thread block and grid, a program counter, registers, per-thread private memory, inputs, and output results.

A thread block is a set of concurrently executing threads that can cooperate among themselves through barrier synchronization and shared memory. A thread block has a block ID within its grid. A grid is an array of thread blocks that execute the same kernel, read inputs from global memory, write results to global memory, and synchronize between dependent kernel calls. In the CUDA parallel programming model, each thread has a per-thread private memory space used for register spills, function calls, and C automatic array variables. Each thread block has a per-block shared memory space used for inter-thread communication, data sharing, and result sharing in parallel algorithms. Grids of thread blocks share results in Global Memory space after kernel-wide global synchronization.

Figure 1: CUDA hierarchy of threads, blocks, and grids, with corresponding per-thread private, per-block shared, and per-application global memory spaces.

CUDA Hardware Execution

CUDA's hierarchy of threads maps to a hierarchy of processors on the GPU; a GPU executes one or more kernel grids; a streaming multiprocessor (SM on Fermi / SMX on Kepler) executes one or more thread blocks; and CUDA cores and other execution units in the SMX execute thread instructions. The SMX executes threads in groups of 32 threads called warps. While programmers can generally ignore warp execution for functional correctness and focus on programming individual scalar threads, they can greatly improve performance by having threads in a warp execute the same code path and access memory with nearby addresses.
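The scalar-thread view described above can be sketched with a simple SAXPY kernel (illustrative, not from this paper): each thread computes its global index from block and thread IDs, and consecutive threads in a warp touch consecutive addresses, so their loads and stores coalesce.

```cuda
#include <cuda_runtime.h>

// y = a*x + y, one element per thread. blockIdx/blockDim/threadIdx give each
// thread a unique global index; adjacent lanes access adjacent memory.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Host-side launch: enough 256-thread blocks to cover n elements.
void runSaxpy(int n, float a, float *d_x, float *d_y)
{
    saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);
    cudaDeviceSynchronize();
}
```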

Notice

ALL INFORMATION PROVIDED IN THIS WHITEPAPER, INCLUDING COMMENTARY, OPINION, NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.

Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all information previously supplied. NVIDIA Corporation products are not authorized for use as critical components in life support devices or systems without express written approval of NVIDIA Corporation.

Trademarks

NVIDIA, the NVIDIA logo, CUDA, FERMI, KEPLER and GeForce are trademarks or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.

Copyright

© 2012 NVIDIA Corporation. All rights reserved.