
Multicore and Parallel Processing

Hakim Weatherspoon
CS 3410, Spring 2012
Computer Science
Cornell University
P&H Chapter 4.10–11, 7.1–6

xkcd/619

Pitfall: Amdahl's Law
Execution time after improvement =
    (affected execution time / amount of improvement) + execution time unaffected

Pitfall: Amdahl's Law
Improving an aspect of a computer and expecting a proportional improvement in overall performance

Example: multiply accounts for 80s out of 100s
How much improvement do we need in the multiply performance to get 5x overall improvement?
  A 5x overall speedup means finishing in 20s, but the non-multiply 20s is unaffected,
  so the multiply part would have to take 0s.
Can't be done!
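The limit can be checked numerically. A minimal sketch of the formula above (the function name `overall_speedup` is mine, not from the slides):

```python
def overall_speedup(affected, unaffected, improvement):
    """Amdahl's Law: new time = affected/improvement + unaffected."""
    old_time = affected + unaffected
    new_time = affected / improvement + unaffected
    return old_time / new_time

# Multiply is 80s of a 100s run.  A 5x overall speedup needs a 20s
# total time, but the 20s of non-multiply work alone already takes 20s.
for k in (2, 10, 100, 1_000_000):
    print(f"multiply {k:>9}x faster -> overall {overall_speedup(80, 20, k):.3f}x")
```

The overall speedup approaches 5x but never reaches it, no matter how large the multiply improvement.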

Scaling Example
Workload: sum of 10 scalars, and 10x10 matrix sum
Speedup from 10 to 100 processors?

Single processor: Time = (10 + 100) * t_add
10 processors
  Time = 100/10 * t_add + 10 * t_add = 20 * t_add
  Speedup = 110/20 = 5.5 (55% of potential)

100 processors
  Time = 100/100 * t_add + 10 * t_add = 11 * t_add
  Speedup = 110/11 = 10 (10% of potential)

Assumes load can be balanced across processors

Scaling Example
What if matrix size is 100x100?
Single processor: Time = (10 + 10000) * t_add
10 processors
  Time = 10 * t_add + 10000/10 * t_add = 1010 * t_add
  Speedup = 10010/1010 = 9.9 (99% of potential)

100 processors
  Time = 10 * t_add + 10000/100 * t_add = 110 * t_add
  Speedup = 10010/110 = 91 (91% of potential)

Assuming load balanced
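Both scaling examples can be reproduced with a short script; `speedup` is a name I chose for illustration, with the scalar sum kept serial and the matrix sum split evenly:

```python
def speedup(scalars, matrix_elems, procs, t_add=1.0):
    """Scalar sum stays serial; the matrix sum is split evenly across procs."""
    single = (scalars + matrix_elems) * t_add
    parallel = scalars * t_add + (matrix_elems / procs) * t_add
    return single / parallel

# 10 scalars + 10x10 matrix, then 10 scalars + 100x100 matrix
for elems in (100, 10_000):
    for p in (10, 100):
        print(f"{elems:>6} elems, {p:>3} procs: speedup {speedup(10, elems, p):.1f}")
```

The larger matrix amortizes the serial scalar sum, which is why 100 processors reach 91% of potential instead of 10%.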

Goals for Today
How to improve system performance?
  Instruction Level Parallelism (ILP)
  Multicore
    Increase clock frequency vs. multicore

Beware of Amdahl's Law

Next time:
  Concurrency, programming, and synchronization

Problem Statement
Q: How to improve system performance?
   Increase CPU clock rate?
   But I/O speeds are limited
     Disk, memory, networks, etc.
   Recall: Amdahl's Law
Solution: Parallelism

Instruction Level Parallelism (ILP)
Pipelining: execute multiple instructions in parallel
Q: How to get more instruction level parallelism?
A: Deeper pipeline
   E.g. 250 MHz 1-stage; 500 MHz 2-stage; 1 GHz 4-stage; 4 GHz 16-stage

Pipeline depth limited by
  max clock speed (less work per stage means a shorter clock cycle)
  min unit of work
  dependencies, hazards / forwarding logic

Instruction Level Parallelism (ILP)
Pipelining: execute multiple instructions in parallel
Q: How to get more instruction level parallelism?
A: Multiple-issue pipeline
   Start multiple instructions per clock cycle in duplicate stages

Static Multiple Issue
a.k.a. Very Long Instruction Word (VLIW)
  Compiler groups instructions to be issued together
  Packages them into issue slots

Q: How does HW detect and resolve hazards?
A: It doesn't.
   Simple HW, assumes compiler avoids hazards
Example: Static Dual-Issue 32-bit MIPS
  Instructions come in pairs (64-bit aligned)
    One ALU/branch instruction (or nop)
    One load/store instruction (or nop)

MIPS with Static Dual Issue
Two-issue packets
  One ALU/branch instruction
  One load/store instruction
  64-bit aligned
    ALU/branch, then load/store
    Pad an unused instruction with nop

Address | Instruction type | Pipeline stages
n       | ALU/branch       | IF  ID  EX  MEM WB
n + 4   | Load/store       | IF  ID  EX  MEM WB
n + 8   | ALU/branch       |     IF  ID  EX  MEM WB
n + 12  | Load/store       |     IF  ID  EX  MEM WB
n + 16  | ALU/branch       |         IF  ID  EX  MEM WB
n + 20  | Load/store       |         IF  ID  EX  MEM WB

Scheduling Example
Schedule this for dual-issue MIPS

Loop: lw   $t0, 0($s1)       # $t0 = array element
      addu $t0, $t0, $s2     # add scalar in $s2
      sw   $t0, 0($s1)       # store result
      addi $s1, $s1, -4      # decrement pointer
      bne  $s1, $zero, Loop  # branch $s1 != 0

        ALU/branch               Load/store          cycle
Loop:   nop                      lw  $t0, 0($s1)     1
        addi $s1, $s1, -4        nop                 2
        addu $t0, $t0, $s2       nop                 3
        bne  $s1, $zero, Loop    sw  $t0, 4($s1)     4

IPC = 5/4 = 1.25 (c.f. peak IPC = 2)

Scheduling Example
Compiler scheduling for dual-issue MIPS

Loop: lw   $t0, 0($s1)      # $t0 = A[i]
      lw   $t1, 4($s1)      # $t1 = A[i+1]
      addu $t0, $t0, $s2    # add $s2
      addu $t1, $t1, $s2    # add $s2
      sw   $t0, 0($s1)      # store A[i]
      sw   $t1, 4($s1)      # store A[i+1]
      addi $s1, $s1, +8     # increment pointer
      bne  $s1, $s3, Loop   # continue if $s1 != end

        ALU/branch slot          Load/store slot     cycle
Loop:   nop                      lw $t0, 0($s1)      1
        nop                      lw $t1, 4($s1)      2
        addu $t0, $t0, $s2       nop                 3
        addu $t1, $t1, $s2       sw $t0, 0($s1)      4
        addi $s1, $s1, +8        sw $t1, 4($s1)      5
        bne  $s1, $s3, Loop      nop                 6

Scheduling Example
Compiler scheduling for dual-issue MIPS

Loop: lw   $t0, 0($s1)      # $t0 = A[i]
      lw   $t1, 4($s1)      # $t1 = A[i+1]
      addu $t0, $t0, $s2    # add $s2
      addu $t1, $t1, $s2    # add $s2
      sw   $t0, 0($s1)      # store A[i]
      sw   $t1, 4($s1)      # store A[i+1]
      addi $s1, $s1, +8     # increment pointer
      bne  $s1, $s3, Loop   # continue if $s1 != end

        ALU/branch slot          Load/store slot     cycle
Loop:   nop                      lw $t0, 0($s1)      1
        addi $s1, $s1, +8        lw $t1, 4($s1)      2
        addu $t0, $t0, $s2       nop                 3
        addu $t1, $t1, $s2       sw $t0, -8($s1)     4
        bne  $s1, $s3, Loop      sw $t1, -4($s1)     5

(Moving the addi up one cycle means the stores must use offsets -8 and -4.)

Limits of Static Scheduling
Compiler scheduling for dual-issue MIPS

lw   $t0, 0($s1)      # load A
addi $t0, $t0, +1     # increment A
sw   $t0, 0($s1)      # store A
lw   $t0, 0($s2)      # load B
addi $t0, $t0, +1     # increment B
sw   $t0, 0($s2)      # store B

ALU/branch slot          Load/store slot     cycle
nop                      lw $t0, 0($s1)      1
nop                      nop                 2
addi $t0, $t0, +1        nop                 3
nop                      sw $t0, 0($s1)      4
nop                      lw $t0, 0($s2)      5
nop                      nop                 6
addi $t0, $t0, +1        nop                 7
nop                      sw $t0, 0($s2)      8

Limits of Static Scheduling
Compiler scheduling for dual-issue MIPS (B renamed to use $t1)

lw   $t0, 0($s1)      # load A
addi $t0, $t0, +1     # increment A
sw   $t0, 0($s1)      # store A
lw   $t1, 0($s2)      # load B
addi $t1, $t1, +1     # increment B
sw   $t1, 0($s2)      # store B

ALU/branch slot          Load/store slot     cycle
nop                      lw $t0, 0($s1)      1
nop                      lw $t1, 0($s2)      2
addi $t0, $t0, +1        nop                 3
addi $t1, $t1, +1        sw $t0, 0($s1)      4
nop                      sw $t1, 0($s2)      5

Problem: What if $s1 and $s2 are equal (aliasing)? Won't work
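The aliasing problem can be demonstrated with a small sketch (function and variable names are mine; a dict stands in for memory). The compiler's schedule hoists B's load above A's store, which is only legal when the two pointers differ:

```python
def program_order(mem, s1, s2):
    """The original code: increment A, then increment B, in order."""
    mem[s1] = mem[s1] + 1   # lw/addi/sw for A
    mem[s2] = mem[s2] + 1   # lw/addi/sw for B

def compiler_schedule(mem, s1, s2):
    """The dual-issue schedule: B's load hoisted above A's store.
    Legal only if $s1 and $s2 never point to the same word."""
    t0 = mem[s1]            # lw $t0, 0($s1)   cycle 1
    t1 = mem[s2]            # lw $t1, 0($s2)   cycle 2  (hoisted!)
    mem[s1] = t0 + 1        # sw $t0, 0($s1)
    mem[s2] = t1 + 1        # sw $t1, 0($s2)

a = {0: 10, 4: 20}; b = {0: 10, 4: 20}
program_order(a, 0, 4); compiler_schedule(b, 0, 4)
print(a == b)               # distinct addresses: schedules agree

c = {0: 10}; d = {0: 10}
program_order(c, 0, 0); compiler_schedule(d, 0, 0)
print(c[0], d[0])           # aliased: 12 vs 11 -- the hoisted load saw stale data
```

With aliasing, the reordered version produces 11 instead of 12, which is why a static compiler must be conservative when it cannot prove pointers are distinct.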

Dynamic Multiple Issue
a.k.a. Superscalar Processor (c.f. Intel)
  CPU examines instruction stream and chooses multiple instructions to issue each cycle
  Compiler can help by reordering instructions...
  ...but CPU is responsible for resolving hazards

Even better: Speculation / Out-of-order Execution
  Execute instructions as early as possible
  Aggressive register renaming
  Guess results of branches, loads, etc.
  Roll back if guesses were wrong
  Don't commit results until all previous insts. are retired

Dynamic Multiple Issue

Does Multiple Issue Work?
Q: Does multiple issue / ILP work?
A: Kind of... but not as much as we'd like
Limiting factors?
  Programs' dependencies
    Hard to detect dependencies, so be conservative
      e.g. pointer aliasing: A[0] += 1; B[0] *= 2;
  Hard to expose parallelism
    Can only issue a few instructions ahead of PC
  Structural limits
    Memory delays and limited bandwidth
  Hard to keep pipelines full

Power Efficiency
Q: Does multiple issue / ILP cost much?
A: Yes. Dynamic issue and speculation requires power

CPU            | Year | Clock Rate | Pipeline Stages | Issue Width | Out-of-order/Speculation | Cores | Power
i486           | 1989 | 25 MHz     | 5  | 1 | No  | 1 | 5W
Pentium        | 1993 | 66 MHz     | 5  | 2 | No  | 1 | 10W
Pentium Pro    | 1997 | 200 MHz    | 10 | 3 | Yes | 1 | 29W
P4 Willamette  | 2001 | 2000 MHz   | 22 | 3 | Yes | 1 | 75W
UltraSparc III | 2003 | 1950 MHz   | 14 | 4 | No  | 1 | 90W
P4 Prescott    | 2004 | 3600 MHz   | 31 | 3 | Yes | 1 | 103W
Core           | 2006 | 2930 MHz   | 14 | 4 | Yes | 2 | 75W
UltraSparc T1  | 2005 | 1200 MHz   | 6  | 1 | No  | 8 | 70W

Multiple simpler cores may be better?

Moore's Law
[Figure: transistor counts over time, from the 4004, 8008, 8080, and 8088 through the 286, 386, 486, Pentium, P4, K8, Atom, K10, Itanium 2, and Dual-core Itanium 2]

Why Multicore?
Moore's law
  A law about transistors
  Smaller means more transistors per die
  And smaller means faster too

But: power consumption growing too

Power Limits
[Figure: power density over time, from Hot Plate through Xeon and Nuclear Reactor toward Rocket Nozzle and the Surface of the Sun]

Power Wall
Power = capacitance * voltage^2 * frequency
In practice: Power ~ voltage^3
  (attainable clock frequency scales roughly with voltage, so P ~ V^2 * f ~ V^3)
Reducing voltage helps (a lot)
... so does reducing clock speed
Better cooling helps

The power wall
  We can't reduce voltage further
  We can't remove more heat
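A minimal sketch of the dynamic power formula, assuming the first-order model that attainable frequency scales linearly with voltage (function name is mine):

```python
def dynamic_power(capacitance, voltage, frequency):
    """Power = capacitance * voltage^2 * frequency."""
    return capacitance * voltage**2 * frequency

# First-order model: attainable frequency scales with voltage (f = k*V),
# so P = C * V^2 * (k*V), i.e. P ~ V^3.
base = dynamic_power(1.0, 1.0, 1.0)
cut = dynamic_power(1.0, 0.8, 0.8)   # 20% lower voltage and frequency
print(cut / base)                    # 0.8^3 = 0.512: roughly half the power
```

A 20% voltage-and-frequency reduction cuts power almost in half, which is the lever multicore designs exploit.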

Why Multicore?

                                  Performance   Power
Single Core, overclocked +20%     1.2x          1.7x
Single Core                       1.0x          1.0x
Single Core, underclocked -20%    0.8x          0.51x
Dual Core, underclocked -20%      1.6x          1.02x
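These numbers follow from the Power ~ voltage^3 model, assuming performance scales linearly with core count and frequency (i.e. perfectly parallelizable work) and that each core's power scales with frequency cubed; the 1.7x and 1.02x entries are 1.2^3 and 2 * 0.8^3, rounded. A sketch (names are mine):

```python
def perf_power(cores, freq_scale):
    """Relative to one core at 1.0x frequency.  Assumes performance
    scales with cores * frequency (perfectly parallel work) and
    power with cores * frequency^3 (voltage tracks frequency)."""
    return cores * freq_scale, cores * freq_scale**3

configs = [("Single Core, overclocked +20%",  1, 1.2),
           ("Single Core",                    1, 1.0),
           ("Single Core, underclocked -20%", 1, 0.8),
           ("Dual Core, underclocked -20%",   2, 0.8)]
for name, cores, f in configs:
    perf, power = perf_power(cores, f)
    print(f"{name:32s} perf {perf:.2f}x  power {power:.2f}x")
```

Two underclocked cores beat one overclocked core on both axes, provided the workload actually parallelizes.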

Inside the Processor
AMD Barcelona Quad-Core: 4 processor cores

Inside the Processor
Intel Nehalem Hex-Core

Hyperthreading
Multi-Core vs. Multi-Issue vs. HT

                 Multi-Core   Multi-Issue   HT
Programs:        N            1             N
Num. Pipelines:  N            1             1
Pipeline Width:  1            N             N

Hyperthreads
  HT = Multi-Issue + extra PCs and registers - dependency logic
  HT = Multi-Core - redundant functional units + hazard avoidance

Hyperthreads (Intel)
  Illusion of multiple cores on a single core
  Easy to keep HT pipelines full + share functional units

Example: All of the above

Parallel Programming
Q: So let's just all use multicore from now on!
A: Software must be written as a parallel program
Multicore difficulties
  Partitioning work
  Coordination & synchronization
  Communications overhead
  Balancing load over cores
How do you write parallel programs?
  ... without knowing the exact underlying architecture?

Work Partitioning
Partition work so all cores have something to do

Load Balancing
Need to partition so all cores are actually working

Amdahl's Law
If tasks have a serial part and a parallel part
Example:
  step 1: divide input data into n pieces
  step 2: do work on each piece
  step 3: combine all results
Recall: Amdahl's Law
As the number of cores increases...
  time to execute parallel part? goes to zero
  time to execute serial part? remains the same
Serial part eventually dominates
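A minimal sketch of serial-part dominance (names and the 10s/1000s split are mine, chosen for illustration):

```python
def total_time(serial, parallel, cores):
    """Serial part is unchanged; parallel part divides across cores."""
    return serial + parallel / cores

serial, parallel = 10.0, 1000.0
base = total_time(serial, parallel, 1)
for n in (1, 10, 100, 1000, 10**6):
    t = total_time(serial, parallel, n)
    print(f"{n:>7} cores: time {t:9.3f}  speedup {base / t:6.2f}")
# No matter how many cores, speedup < (serial + parallel) / serial = 101x
```

The parallel term vanishes as cores grow, but the fixed 10s serial term caps the speedup at 101x.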

Amdahl's Law


Administrivia
FlameWar Games Night next Friday, April 27th
  5pm in Upson B17
  Please come, eat, drink, and have fun

No Lab4 or Lab Section next week!

Administrivia
PA3: FlameWar is due next Monday, April 23rd
  The goal is to have fun with it
  Recitations today will talk about it

HW6 due next Tuesday, April 24th
Prelim3 next Thursday, April 26th