Documente Academic
Documente Profesional
Documente Cultură
Superpipeline
Dependency issues
Parallel instruction
execution
TDTS 08 Lecture 5
Superscalar Architecture
TDTS 08 Lecture 5
2012-11-13
TDTS 08 Lecture 5
Motivation
TDTS 08 Lecture 5
2012-11-13
Superpipelining
FI
EI
FI
EI
FI
EI
FI
EI
FI
EI
FI
EI
Superpipelining (Contd)
For a given architecture and the corresponding instruction set
there is an optimal number of pipeline stages/sub-stages.
Increasing the number of stages/sub-stages over this limit
reduces the overall performance.
FI
EI
FI
EI
FI
EI
FI
EI
FI
EI
2012-11-13
Instruction fetch
Operation decode
Operation execution
Result write back
Superpipeline of degree 2
Superscalar of degree 2
TDTS 08 Lecture 5
FI
FI
FI
FI
FI
FI
FI
FI
I9
I10
I11
I12
DI
DI
DI
DI
FI
FI
FI
FI
FI
FI
FI
FI
EI
EI
EI
EI
DI
DI
DI
DI
DI
DI
DI
DI
DI
DI
DI
DI
EI
EI
EI
EI
WO
WO
WO
WO
EI WO WO
EI WO WO
EI WO WO
EI WO WO
EI
WO
EI
WO
EI
WO
EI
WO
time
Superpipeline of
degree 3 and
superscalar of
degree 4:
12 times speed-up
over the base
machine.
48 times speedup over sequential
execution.
TDTS 08 Lecture 5
2012-11-13
TDTS 08 Lecture 5
An SSA Example
Instruction
Cache
Instruction Buffer
Ins. Fetch
Unit
Floating
Point Unit
Register Files
Integer
Unit
Instruction
Issuing
Instruction Window
(Queues, reservation
stations, etc.)
Integer
Unit
Memory
Decode,
Rename &
Dispatch
Floating
Point Unit
Commit
Zebo Peng, IDA, LiTH
10
Data
Cache
TDTS 08 Lecture 5
2012-11-13
Superpipeline
Dependency issues
Parallel instruction
execution
11
TDTS 08 Lecture 5
12
TDTS 08 Lecture 5
2012-11-13
Resource Conflicts
13
TDTS 08 Lecture 5
Procedural Dependency
14
TDTS 08 Lecture 5
2012-11-13
Data Conflicts
15
TDTS 08 Lecture 5
Window of Execution
16
TDTS 08 Lecture 5
2012-11-13
L2 move
load
add
load
ble
r3,r7
r8,(r3)
r3,r3,#4
r9,(r3)
r8,r9,L3
move
r3,r7
store r9,(r3)
add
r3,r3,#4
store r8,(r3)
add
r5,r5,#1
L3 add
add
blt
r6,r6,#1
r7,r7,#4
r6,r4,L2
r3 := r7
r8 := a[i]
r3 := r3+4
r9 := a[i+1]
jump if r8r9
r3 := r7
a[i] := r9
r3 := r3+4
a[i+1] := r8
change++
i++
r7 := r7+4
jump if r6<r4
Basic Blocks
17
TDTS 08 Lecture 5
L2 move
load
add
load
ble
r3,r7
r8,(r3)
r3,r3,#4
r9,(r3)
r8,r9,L3
move
r3,r7
store r9,(r3)
add
r3,r3,#4
store r8,(r3)
add
r5,r5,#1
L3:
L3 add
add
blt
r6,r6,#1
r7,r7,#4
r6,r4,L2
r8 := a[i]
r3 := r3+4
r9 := a[i+1]
jump if r8r9
a[i] := r9
a[i+1] := r8
change++
i++
jump if r6<r4
Basic Blocks
Zebo Peng, IDA, LiTH
18
TDTS 08 Lecture 5
2012-11-13
19
TDTS 08 Lecture 5
Data Dependencies
Artificial dependencies
Anti-dependency
20
TDTS 08 Lecture 5
10
2012-11-13
21
TDTS 08 Lecture 5
r3,r7
load
r8,(r3)
add
r3,r3,#4
load
r9,(r3)
ble
r8,r9,L3
22
TDTS 08 Lecture 5
11
2012-11-13
Output Dependency
r3,r7
load
r8,(r3)
add
r3,r3,#4
load
r9,(r3)
ble
r8,r9,L3
23
TDTS 08 Lecture 5
Anti-dependency
r3,r7
load
r8,(r3)
add
r3,r3,#4
load
r9,(r3)
ble
r8,r9,L3
24
TDTS 08 Lecture 5
12
2012-11-13
25
TDTS 08 Lecture 5
(R4 := R3 * R1)
MUL R4,R3,R1
. . .
ADD R9
R3,R2,R5
(R4 := R3 * R1)
(R4
R9 := R2 + R5)
(R3
R9 := R2 + R5)
26
TDTS 08 Lecture 5
13
2012-11-13
Effect of Dependencies
Data
dependency
Procedural
dependency
Resource
dependency
27
TDTS 08 Lecture 5
Superpipeline
Dependency issues
Parallel instruction
execution
28
TDTS 08 Lecture 5
14
2012-11-13
The ideal situation is that we have the same ILP and machine
parallelism.
Zebo Peng, IDA, LiTH
29
TDTS 08 Lecture 5
30
TDTS 08 Lecture 5
15
2012-11-13
31
TDTS 08 Lecture 5
Example:
Assume a processor that can issue and decode two instructions
per cycle, that has three functional units (two single-cycle integer
units, and a two-cycle floating-point unit), and that can complete
and write back two results per cycle.
And an instruction sequence with the characteristics given in the
next slide.
Zebo Peng, IDA, LiTH
32
TDTS 08 Lecture 5
16
2012-11-13
Cycle Issue/Decode
I1
I2
1
I3
I4
2
I3
I4
3
I5
4
I4
I6
I5
5
I6
6
Execute
I1
I1
I2
Stall
I3
I4
I5
I6
7
8
Zebo Peng, IDA, LiTH
Write/Complete
33
I1
I3
I4
I5
I6
I2
TDTS 08 Lecture 5
34
TDTS 08 Lecture 5
17
2012-11-13
I1 needs two
cycles
I2
I3
I4 conflict
with I3
I5 depending
on I4
I6 conflict
with I5
Cycle Issue/Decode
I1
I2
1
I3
I4
2
I5
I4
3
I6
4
I6
5
6
7
8
35
Execute
I1
I1
Write/Complete
I2
I3
I4
I5
I6
I2
I1
I4
I5
I6
I3
TDTS 08 Lecture 5
36
TDTS 08 Lecture 5
18
2012-11-13
conflict with I3
depending on I4
conflict with I5
Cycle
1
2
3
4
5
6
7
8
Decode
I1
I2
I3
I4
I5
I6
Execute
Ins. Window
I1, I2
I3, I4
I4, I5, I6
I5
I1
I1
I2
I6
I5
37
Write/Complete
I3
I4
I2
I1
I4
I5
I3
I6
TDTS 08 Lecture 5
38
TDTS 08 Lecture 5
19
2012-11-13
Summary
Register renaming.
Register renaming can improve performance with more than 30%; in this
case performance is limited only by true dependencies.
39
TDTS 08 Lecture 5
20