ECE Assignment 5 Solutions

Question 1 (5 pts): H&P 3.3

Chapter 3 Solutions
Consider a multiple-issue design. Suppose you have two execution pipelines, each capable of beginning execution of one instruction per cycle, and enough fetch/decode bandwidth in the front end so that it will not stall your execution. Assume results can be immediately forwarded from one execution unit to another, or to itself. Further assume that the only reason an execution pipeline would stall is to observe a true data dependency. Now how many cycles does the loop require? The answer is 22, as shown in Figure S.4. The LD goes first, as before, and the DIVD must wait for it through 4 extra latency cycles. After the DIVD comes the MULTD, which can run in the second pipe along with the DIVD, since there's no dependency between them. (Note that they both need the same input, F2, and they must both wait on F2's readiness, but there is no constraint between them.)

      Execution pipe 0               Execution pipe 1
Loop: LD    F2,0(Rx)               ; <nop>
      <stall for LD latency>       ; <nop>
      <stall for LD latency>       ; <nop>
      <stall for LD latency>       ; <nop>
      <stall for LD latency>       ; <nop>
      DIVD  F8,F2,F0               ; MULTD F2,F6,F2
      LD    F4,0(Ry)               ; <nop>
      <stall for LD latency>       ; <nop>
      <stall for LD latency>       ; <nop>
      <stall for LD latency>       ; <nop>
      <stall for LD latency>       ; <nop>
      ADD   F4,F0,F4               ; <nop>
      <stall due to DIVD latency>  ; <nop>
      <stall due to DIVD latency>  ; <nop>
      <stall due to DIVD latency>  ; <nop>
      <stall due to DIVD latency>  ; <nop>
      <stall due to DIVD latency>  ; <nop>
      <stall due to DIVD latency>  ; <nop>
      ADDD  F10,F8,F2              ; ADDI Rx,Rx,#8
      ADDI  Ry,Ry,#8               ; SD F4,0(Ry)
      SUB   R20,R4,Rx              ; BNZ R20,Loop
      <nop>                        ; <stall due to BNZ>

cycles per loop iter: 22

Figure S.4 Number of cycles required per loop.
The LD following the MULTD does not depend on the DIVD or the MULTD, so had this been a superscalar order-3 machine, that LD could conceivably have been executed concurrently with the DIVD and the MULTD. Since this problem posited a two-execution-pipe machine, the LD executes in the cycle following the DIVD/MULTD. The loop overhead instructions at the loop's bottom also exhibit some potential for concurrency because they do not depend on any long-latency instructions.
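The 22-cycle count can be sanity-checked with a small Python sketch (an illustration, not part of the solution manual) that recomputes each instruction's issue cycle from operand-ready times. The extra-latency values for LD and DIVD are assumptions chosen to match the stall rows in Figure S.4; they are not stated in this excerpt.

```python
# Extra cycles a consumer must wait after its producer issues
# (assumed from the stall counts in Figure S.4, not given here).
LD_EXTRA, DIVD_EXTRA = 4, 12

issue = {}
issue["LD F2"]   = 1
f2_ready         = issue["LD F2"] + 1 + LD_EXTRA    # F2 usable at cycle 6
issue["DIVD"]    = f2_ready                          # pipe 0
issue["MULTD"]   = f2_ready                          # pipe 1: no dep on DIVD
issue["LD F4"]   = issue["DIVD"] + 1                 # in-order; both pipes busy
f4_ready         = issue["LD F4"] + 1 + LD_EXTRA
issue["ADD"]     = f4_ready                          # waits on F4
f8_ready         = issue["DIVD"] + 1 + DIVD_EXTRA    # F8 usable at cycle 19
issue["ADDD"]    = f8_ready                          # pairs with ADDI Rx,Rx,#8
issue["ADDI Ry"] = issue["ADDD"] + 1                 # pairs with SD
issue["SUB"]     = issue["ADDI Ry"] + 1              # pairs with BNZ
cycles_per_iter  = issue["SUB"] + 1                  # one more for the BNZ stall
print(cycles_per_iter)                               # 22
```

Each line mirrors one row group of the figure: the ADDD cannot issue until F8 is forwarded at cycle 19, and the trailing BNZ stall slot brings the total to 22.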
3.4 Possible answers:
1. If an interrupt occurs between N and N + 1, then N + 1 must not have been
allowed to write its results to any permanent architectural state. Alternatively,
it might be permissible to delay the interrupt until N + 1 completes.
2. If N and N + 1 happen to target the same register or architectural state (say,
memory), then allowing N to overwrite what N + 1 wrote would be wrong.
3. N might be a long floating-point op that eventually traps. N + 1 cannot be
allowed to change arch state in case N is to be retried.
Copyright 2012 Elsevier, Inc. All rights reserved.
McGill ECE
Long-latency ops are at highest risk of being passed by a subsequent op. The DIVD instruction will complete long after the LD F4,0(Ry), for example.
Question 3 (5 pts): H&P 3.13
Processor A issues up to two instructions each from two threads, or a total of four instructions per cycle. In SMT, round-robin scheduling means switching between threads on each clock cycle and issuing any ready instructions.
Cycle   Issue slots
  1     A A B B
  2     C C D D
  3     A A B B
  4     C C D D
  5     A A B B
  6     C C D D
  7     A A B B
  8     C C D D
  9     A B
 10     C D
 11     -
 12     A B
 13     C D
 14     -
 15     A B
 16     C D
 17     -
 18     A B
 19     C D
 20     -
 21     A B
 22     C D
 23     -
 24     A B
 25     C D
 26     -
 27     A B
 28     C D
 29     -
 30     A B
 31     C D
 32     -
 33     A A B B
 34     C C D D
 35     -
 36     A A B B
 37     C C D D
 38     A A B B
 39     C C D D
 40     A A B B

(A dash marks a cycle in which no instructions issue.)
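The round-robin pattern in the table can be sketched in a few lines of Python. This is an illustration, not the textbook's traces: the per-thread instruction counts below are hypothetical, and the scheduler simply alternates between the thread pairs (A,B) and (C,D) each cycle, as the full-issue cycles of the table do.

```python
from collections import deque

def smt_round_robin(remaining, width_per_thread=2):
    """remaining: dict thread -> instructions left to issue.
    Alternates between thread pairs each clock cycle, issuing up to
    width_per_thread instructions per selected thread (4-wide total)."""
    pairs = deque([("A", "B"), ("C", "D")])
    schedule, cycles = [], 0
    while any(remaining.values()):
        pair = pairs[0]
        pairs.rotate(-1)                 # switch thread pair every cycle
        issued = []
        for t in pair:
            n = min(width_per_thread, remaining[t])
            issued += [t] * n            # issue whatever is ready
            remaining[t] -= n
        schedule.append(issued)
        cycles += 1
    return cycles, schedule

# Hypothetical workload: four instructions ready in each of threads A-D.
cycles, sched = smt_round_robin({"A": 4, "B": 4, "C": 4, "D": 4})
print(cycles, sched[0])   # 4 ['A', 'A', 'B', 'B']
```

With every thread fully ready, the first cycles reproduce the table's A A B B / C C D D alternation; once a thread runs short of ready instructions, issue slots go unfilled, as in cycles 9 onward.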
Processor B issues up to four instructions per cycle from a single thread, and switches threads on any stall.
Cycle   Issue slots
  1     A A A A
  2     A A A A
  3     A
  4     B B B B
  5     B B B B
  6     B
  7     C C C C
  8     C C C C
  9     C
 10     D D D D
 11     D D D D
 12     D
 13     A
 14     B
 15     C
 16     D
 17     A
 18     B
 19     C
 20     D
 21     A
 22     B
 23     C
 24     D
 25     A
 26     B
 27     C
 28     D
 29     A
 30     B
 31     C
 32     D
 33     A
 34     B
 35     C
 36     D
 37     A
 38     B
 39     C
 40     D
 41     A A
 42     B B
 43     C C
 44     D D
 45     A A A A
 46     A A A A
 47-60  -

(A dash marks cycles in which no instructions issue.)
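Processor B's switch-on-stall behavior can likewise be sketched in Python. This is an illustration under assumptions, not the textbook's traces: each thread is modeled as a "burst" of per-cycle issue counts (here the hypothetical 4, 4, 1 shape visible in cycles 1-12 of the table), after which it stalls and the scheduler moves to the next thread.

```python
from collections import deque

def switch_on_stall(bursts):
    """bursts: dict thread -> deque of bursts; each burst is a list of
    per-cycle issue counts the thread sustains before it next stalls.
    Runs one thread at a time, switching round-robin on each stall."""
    order = deque(bursts)            # round-robin order of threads
    schedule = []                    # (thread, instructions issued) per cycle
    while any(bursts.values()):
        t = order[0]
        order.rotate(-1)             # on a stall, move to the next thread
        if not bursts[t]:
            continue                 # this thread has finished
        for n in bursts[t].popleft():
            schedule.append((t, n))  # run the thread until it stalls
    return schedule

# Hypothetical bursts matching the table's opening shape:
# each thread issues 4, 4, then 1 instructions, then stalls.
sched = switch_on_stall({t: deque([[4, 4, 1]]) for t in "ABCD"})
print(sched[:3])   # [('A', 4), ('A', 4), ('A', 1)]
```

The first twelve scheduled cycles reproduce the A A A A / A A A A / A, then B, C, D blocks of the table; the single-issue cycles later in the table correspond to shorter bursts between stalls.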
Question 4 (5 pts): H&P 4.9

Chapter 4 Solutions

4.8 It will perform well, since there are no branch divergences and there are many instructions available to hide the latency of the memory references.
4.9 a. This code reads four floats and writes two floats for every six FLOPs, so arithmetic intensity = 6/6 = 1.
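The intensity figure follows from a trivial operation count per complex multiply; a Python restatement of the counts given above (an illustration, not part of the solution manual):

```python
# One complex multiply c = a*b does four multiplies, one add, and one
# subtract, while reading a_re, a_im, b_re, b_im and writing c_re, c_im.
flops = 4 + 1 + 1              # 6 FLOPs
floats_moved = 4 + 2           # 4 reads + 2 writes
arithmetic_intensity = flops / floats_moved
print(arithmetic_intensity)    # 1.0
```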
b. Assume MVL = 64:

        li      $VL,44          # vector length for the first, short strip
        li      $r1,0           # initialize index
  loop: lv      $v1,a_re+$r1    # load a_re
        lv      $v3,b_re+$r1    # load b_re
        mulvv.s $v5,$v1,$v3     # a_re * b_re
        lv      $v2,a_im+$r1    # load a_im
        lv      $v4,b_im+$r1    # load b_im
        mulvv.s $v6,$v2,$v4     # a_im * b_im
        subvv.s $v5,$v5,$v6     # a_re*b_re - a_im*b_im
        sv      $v5,c_re+$r1    # store c_re
        mulvv.s $v5,$v1,$v4     # a_re * b_im
        mulvv.s $v6,$v2,$v3     # a_im * b_re
        addvv.s $v5,$v5,$v6     # a_re*b_im + a_im*b_re
        sv      $v5,c_im+$r1    # store c_im
        bne     $r1,0,else      # first iteration?
        addi    $r1,$r1,#44     # first iteration: advance past the short strip
        j       loop
6 chimes:

  1. mulvv.s                  # a_re * b_re
  2. mulvv.s                  # a_im * b_im
  3. subvv.s  sv              # subtract and store c_re
  4. mulvv.s                  # a_re * b_im
  5. mulvv.s  lv              # a_im * b_re, load next a_re
  6. addvv.s  sv  lv  lv  lv  # add, store c_im, load next b_re, a_im, b_im
Same cycles per result as in part c. Adding additional load/store units did not
improve performance.
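As a cross-check on the arithmetic the vector loop performs, here is a plain-Python model of one element-wise complex multiply (a sketch; the list names mirror the assembly's arrays):

```python
# Element-wise complex multiply: c = a * b, with real and imaginary
# parts stored in separate arrays, as in the vector code above.
def complex_multiply(a_re, a_im, b_re, b_im):
    c_re = [ar * br - ai * bi
            for ar, br, ai, bi in zip(a_re, b_re, a_im, b_im)]
    c_im = [ar * bi + ai * br
            for ar, bi, ai, br in zip(a_re, b_im, a_im, b_re)]
    return c_re, c_im

# (1 + 2i) * (3 + 4i) = -5 + 10i
c_re, c_im = complex_multiply([1.0], [2.0], [3.0], [4.0])
print(c_re, c_im)   # [-5.0] [10.0]
```

The two list comprehensions correspond one-for-one to the subvv.s/sv and addvv.s/sv sequences in the vector code.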
4.10
Assuming that vector computation can be overlapped with memory access, total