
McGill ECE    ECSE 425: Computer Organization and Architecture    Winter 2015

Assignment 5 Solutions

Question 1 (5 pts): H&P 3.3

Consider a multiple-issue design. Suppose you have two execution pipelines, each capable of beginning execution of one instruction per cycle, and enough fetch/decode bandwidth in the front end so that it will not stall your execution. Assume results can be immediately forwarded from one execution unit to another, or to itself. Further assume that the only reason an execution pipeline would stall is to observe a true data dependency. Now how many cycles does the loop require? The answer is 22, as shown in Figure S.4. The LD goes first, as before, and the DIVD must wait for it through 4 extra latency cycles. After the DIVD comes the MULTD, which can run in the second pipe along with the DIVD, since there's no dependency between them.

      Execution pipe 0               Execution pipe 1
Loop: LD   F2,0(Rx)                ; <nop>
      <stall for LD latency>       ; <nop>
      <stall for LD latency>       ; <nop>
      <stall for LD latency>       ; <nop>
      <stall for LD latency>       ; <nop>
      DIVD F8,F2,F0                ; MULTD F2,F6,F2
      LD   F4,0(Ry)                ; <nop>
      <stall for LD latency>       ; <nop>
      <stall for LD latency>       ; <nop>
      <stall for LD latency>       ; <nop>
      <stall for LD latency>       ; <nop>
      ADD  F4,F0,F4                ; <nop>
      <stall due to DIVD latency>  ; <nop>
      <stall due to DIVD latency>  ; <nop>
      <stall due to DIVD latency>  ; <nop>
      <stall due to DIVD latency>  ; <nop>
      <stall due to DIVD latency>  ; <nop>
      <stall due to DIVD latency>  ; <nop>
      ADDD F10,F8,F2               ; ADDI Rx,Rx,#8
      ADDI Ry,Ry,#8                ; SD   F4,0(Ry)
      SUB  R20,R4,Rx               ; BNZ  R20,Loop
      <nop>                        ; <stall due to BNZ>

cycles per loop iter: 22

Figure S.4 Number of cycles required per loop.

(Note that they both need the same input, F2, and they must both wait on F2's readiness, but there is no constraint between them.) The LD following the MULTD does not depend on the DIVD or the MULTD, so had this been a superscalar order-3 machine, that LD could conceivably have been executed concurrently with the DIVD and the MULTD. Since this problem posited a two-execution-pipe machine, the LD executes in the cycle following the DIVD/MULTD. The loop overhead instructions at the loop's bottom also exhibit some potential for concurrency because they do not depend on any long-latency instructions.
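As a sanity check on the 22-cycle count, the schedule in Figure S.4 can be verified mechanically with a few lines of Python. This is a minimal sketch, not part of the H&P solution: the latency table is inferred from the stall counts above (an op dependent on LD can issue 5 cycles later, on DIVD 13 cycles later, on MULTD 6 cycles later), and the SUB/BNZ pairing is modeled as zero-latency same-cycle forwarding between the pipes, as the figure assumes.

# Verify Figure S.4: every instruction issues no earlier than its
# source operands become ready, given the latencies implied above.
LAT = {"LD": 5, "DIVD": 13, "MULTD": 6, "ADD": 1, "ADDD": 1,
       "ADDI": 1, "SUB": 0}

# (op, dest, sources, issue cycle) exactly as scheduled in Figure S.4;
# MULTD's result is written "F2'" because it overwrites F2.
sched = [
    ("LD",    "F2",  [],             1),
    ("DIVD",  "F8",  ["F2"],         6),
    ("MULTD", "F2'", ["F2"],         6),
    ("LD",    "F4",  [],             7),
    ("ADD",   "F4'", ["F4"],        12),
    ("ADDD",  "F10", ["F8", "F2'"], 19),
    ("ADDI",  "Rx",  [],            19),
    ("ADDI",  "Ry",  [],            20),
    ("SD",    None,  ["F4'"],       20),
    ("SUB",   "R20", ["Rx"],        21),
    ("BNZ",   None,  ["R20"],       21),
]

ready = {}
for op, dest, srcs, t in sched:
    for s in srcs:
        assert t >= ready[s], f"{op} would issue before {s} is ready"
    if dest is not None:
        ready[dest] = t + LAT[op]

# Last issue in cycle 21, plus the cycle lost to the BNZ stall: 22 total.
print("schedule consistent; 21 + 1 = 22 cycles per iteration")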


Question 2 (5 pts): H&P 3.4


Possible answers:
1. If an interrupt occurs between N and N + 1, then N + 1 must not have been allowed
to write its results to any permanent architectural state. Alternatively, it might be
permissible to delay the interrupt until N + 1 completes.
2. If N and N + 1 happen to target the same register or architectural state (say,
memory), then allowing N to overwrite what N + 1 wrote would be wrong.
3. N might be a long-latency floating-point op that eventually traps. N + 1 cannot be
allowed to change architectural state in case N is to be retried.

Long-latency ops are at highest risk of being passed by a subsequent op. The DIVD
instruction will complete long after the LD F4,0(Ry), for example.
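A common way to guarantee all three properties is to let instructions complete out of order but commit in order through a reorder buffer. The sketch below is illustrative only (the class and function names are ours, not from H&P): N+1 may finish first, but its result stays buffered until N retires.

from collections import deque

class ROBEntry:
    def __init__(self, name):
        self.name = name
        self.done = False
        self.result = None   # (destination, value) once complete

rob = deque()      # program order: N at the head, N+1 behind it
arch_state = {}    # permanent architectural state

def complete(entry, dest, value):
    # Execution finished: buffer the result, do not touch arch state.
    entry.done = True
    entry.result = (dest, value)

def commit():
    # Only the head of the buffer may retire; an instruction that
    # finished early (N+1 passing a long-latency N) must wait.
    while rob and rob[0].done:
        dest, value = rob.popleft().result
        arch_state[dest] = value

n, n1 = ROBEntry("N: DIVD"), ROBEntry("N+1: LD")
rob.extend([n, n1])
complete(n1, "F4", 3.14)        # N+1 finishes first...
commit()
assert "F4" not in arch_state   # ...but cannot commit past N
complete(n, "F8", 2.71)
commit()                        # now both retire, in program order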
Question 3 (5 pts): H&P 3.13
Processor A issues up to two instructions each from two threads, or a total of four
instructions per cycle. In SMT, round-robin scheduling means switching between threads
on each clock cycle and issuing any ready instructions.
Cycle  Issue slots ("-" marks an idle slot)
 1     A A B B
 2     C C D D
 3     A A B B
 4     C C D D
 5     A A B B
 6     C C D D
 7     A A B B
 8     C C D D
 9     A - B -
10     C - D -
11     - - - -
12     A - B -
13     C - D -
14     - - - -
15     A - B -
16     C - D -
17     - - - -
18     A - B -
19     C - D -
20     - - - -
21     A - B -
22     C - D -
23     - - - -
24     A - B -
25     C - D -
26     - - - -
27     A - B -
28     C - D -
29     - - - -
30     A - B -
31     C - D -
32     - - - -
33     A A B B
34     C C D D
35     - - - -
36     A A B B
37     C C D D
38     A A B B
39     C C D D
40     A A B B

Plain: alu. Bold: load; Italics: branch.


The second iteration starts in cycle 36; 35 cycles are required for each iteration, for a total
of 70 cycles for the first two iterations.

McGill ECE

ECSE 425: Computer Organization and Architecture


Winter 2015

Processor B issues up to four instructions per cycle from a single thread, and switches
threads on any stall.
Cycle  Issue slots ("-" marks an idle slot)
 1     A A A A
 2     A A A A
 3     A - - -
 4     B B B B
 5     B B B B
 6     B - - -
 7     C C C C
 8     C C C C
 9     C - - -
10     D D D D
11     D D D D
12     D - - -
13     A - - -
14     B - - -
15     C - - -
16     D - - -
17     A - - -
18     B - - -
19     C - - -
20     D - - -
21     A - - -
22     B - - -
23     C - - -
24     D - - -
25     A - - -
26     B - - -
27     C - - -
28     D - - -
29     A - - -
30     B - - -
31     C - - -
32     D - - -
33     A - - -
34     B - - -
35     C - - -
36     D - - -
37     A - - -
38     B - - -
39     C - - -
40     D - - -
41     A A - -
42     B B - -
43     C C - -
44     D D - -
45     A A A A
46     A A A A
47-60  - - - -

Plain: alu. Bold: load; Italics: branch.


The second iteration starts in cycle 45; 44 cycles are required for each iteration, for a
total of 88 cycles for the first two.
Processor C issues up to eight instructions per cycle from a single thread but only switches
context on an L1 cache miss, i.e., after the first two iterations are complete.
Cycle  Issue slots, 8 wide ("-" marks an idle cycle)
 1     A A A A A A A A
 2     -
 3     -
 4     A
 5     -
 6     -
 7     A
 8     -
 9     -
10     A
11     -
12     -
13     A
14     -
15     -
16     A
17     -
18     -
19     A
20     -
21     -
22     A
23     -
24     -
25     A
26     -
27     -
28     A A
29     -
30     -
31     A A A A A A A A
32     -
33     -
34     A
35     -
36     -
37     A
38     -
39     -
40     A

Plain: alu. Bold: load; Italics: branch.


The second iteration for the first thread starts in cycle 31; 30 cycles are required for the
first iteration. When the third iteration starts, in cycle 30+30+1 = 61, a context switch
occurs, and the second thread starts in cycle 62. The first thread requires 61 cycles to
execute, and so do all subsequent threads, for a total of 61 * 4 = 244.
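Pulling the three results together (a small summary check; the per-iteration and per-thread cycle counts are the ones derived above):

# Totals for the three multithreading designs, as computed above.
results = {
    "A: SMT, 2-wide per thread, 4 threads": (35, 2, "iterations"),
    "B: 4-wide, switch on stall":           (44, 2, "iterations"),
    "C: 8-wide, switch on L1 miss":         (61, 4, "threads"),
}
for name, (cycles_each, count, unit) in results.items():
    print(f"{name}: {cycles_each} x {count} {unit} = {cycles_each * count} cycles")
# Prints 70, 88, and 244 cycles, matching the discussion above.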

4.8
It will perform well, since there are no branch divergences, all memory references are coalesced, and there are 500 threads spread across 6 blocks (3000 total threads), which provides many instructions to hide memory latency.

Question 4 (5 pts): H&P 4.9

a. This code reads four floats and writes two floats for every six FLOPs, so arithmetic intensity = 6/6 = 1.
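Spelled out as a quick check (the six FLOPs are the kernel's four multiplies plus one subtract and one add):

# Arithmetic intensity of the complex-multiply kernel: FLOPs per float touched.
flops = 4 + 1 + 1        # four multiplies, one subtract, one add
floats_touched = 4 + 2   # reads a_re, a_im, b_re, b_im; writes c_re, c_im
print(flops / floats_touched)   # 1.0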
b. Assume MVL = 64:

      li      $VL,44           # perform the first 44 ops
      li      $r1,0            # initialize index
loop: lv      $v1,a_re+$r1     # load a_re
      lv      $v3,b_re+$r1     # load b_re
      mulvv.s $v5,$v1,$v3      # a_re*b_re
      lv      $v2,a_im+$r1     # load a_im
      lv      $v4,b_im+$r1     # load b_im
      mulvv.s $v6,$v2,$v4      # a_im*b_im
      subvv.s $v5,$v5,$v6      # a_re*b_re - a_im*b_im
      sv      $v5,c_re+$r1     # store c_re
      mulvv.s $v5,$v1,$v4      # a_re*b_im
      mulvv.s $v6,$v2,$v3      # a_im*b_re
      addvv.s $v5,$v5,$v6      # a_re*b_im + a_im*b_re
      sv      $v5,c_im+$r1     # store c_im
      bne     $r1,0,else       # check if first iteration
      addi    $r1,$r1,#44      # first iteration, increment by 44
      j       loop             # guaranteed next iteration
else: addi    $r1,$r1,#256     # not first iteration, increment by 256
skip: blt     $r1,1200,loop    # next iteration?
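For reference, the loop above computes an elementwise complex multiply, c = a * b, with the real and imaginary planes held in separate arrays. A minimal NumPy sketch of the same arithmetic (N = 300 is an assumption: one 44-element strip plus four 64-element strips, matching the #44 and #256 increments):

import numpy as np

N = 300                               # assumed: 44 + 4*64 elements
rng = np.random.default_rng(0)
a_re, a_im = rng.random(N, dtype=np.float32), rng.random(N, dtype=np.float32)
b_re, b_im = rng.random(N, dtype=np.float32), rng.random(N, dtype=np.float32)

c_re = a_re * b_re - a_im * b_im      # mulvv.s, mulvv.s, subvv.s
c_im = a_re * b_im + a_im * b_re      # mulvv.s, mulvv.s, addvv.s

# Cross-check against NumPy's built-in complex arithmetic.
expected = (a_re + 1j * a_im) * (b_re + 1j * b_im)
assert np.allclose(c_re, expected.real) and np.allclose(c_im, expected.imag)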
c. 6 chimes:
1. mulvv.s lv       # a_re*b_re (assume already loaded), load a_im
2. lv mulvv.s       # load b_im, a_im*b_im
3. subvv.s sv       # subtract and store c_re
4. mulvv.s lv       # a_re*b_im, load next a_re vector
5. mulvv.s lv       # a_im*b_re, load next b_re vector
6. addvv.s sv       # add and store c_im

6 chimes

d. total cycles per iteration =
   6 chimes × 64 elements + 15 cycles (load/store) × 6 + 8 cycles (multiply) × 4 + 5 cycles (add/subtract) × 2 = 516
   cycles per result = 516/128 = 4
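The same arithmetic, mechanically (a trivial check; the 15/8/5-cycle figures are the start-up latencies used in the terms above):

# Part (d): convoy time plus start-up overheads per 64-element strip.
chimes, mvl = 6, 64
startup = 15 * 6 + 8 * 4 + 5 * 2     # load/store, multiply, add/subtract
total = chimes * mvl + startup
print(total)            # 516 cycles per iteration
print(total / 128)      # ~4 cycles per result (64 c_re + 64 c_im values)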
e. 1. mulvv.s              # a_re*b_re
   2. mulvv.s              # a_im*b_im
   3. subvv.s sv           # subtract and store c_re
   4. mulvv.s              # a_re*b_im
   5. mulvv.s lv           # a_im*b_re, load next a_re
   6. addvv.s sv lv lv lv  # add, store c_im, load next b_re,a_im,b_im

   Same cycles per result as in part c. Adding additional load/store units did not improve performance.
4.10

Vector processor requires:

(200 MB + 100 MB)/(30 GB/s) = 10 ms for vector memory access +

400 ms for scalar execution.
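As a quick check of the memory-access term (assuming decimal units, 1 GB/s = 10^9 bytes/s):

# 4.10: time to move 200 MB of reads + 100 MB of writes at 30 GB/s.
traffic = (200 + 100) * 1e6        # bytes
bandwidth = 30e9                   # bytes per second
print(traffic / bandwidth * 1e3)   # 10.0 ms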

Assuming that vector computation can be overlapped with memory access, total
