Pipeline Optimization
Dinesh Sharma
January 9, 2018
[Figure: data processing and instruction processing units with shared state, connected to a single memory over one bus carrying both data and instructions.]
A common bus is used for data as well as instructions.
The system can become ‘bus bound’.
Bottleneck!
Harvard Architecture
[Figure: data processing and instruction processing units with shared state; a data bus to the data memory and an instruction bus to the instruction memory.]
Separate data and instruction paths
Good performance
Needs 2 buses → expensive!
Traffic on the buses is not balanced.
Instruction bus may remain idle.
[Figure: data processing and instruction processing units with shared state, a mux on the instruction path, a data memory holding variables and a read-only memory holding instructions and constants.]
Constants can be stored with instructions in ROM.
Better bus balancing is possible.
Typically, 1 instruction read, 1 constant read, 1 data read and 1 result write per instruction.
Only 2 mem ops per bus.
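The bus-balancing argument can be checked with a little arithmetic. The sketch below counts memory operations per bus for the four accesses listed above. The names `ARRANGEMENTS`, `bus_load` and `busiest` are illustrative, and the assignment of accesses to buses in the first two arrangements is an assumption made for comparison; only the 2-and-2 split of the last arrangement is stated on the slide.

```python
# Memory accesses per instruction, as listed on the slide.
ACCESSES = ["instruction read", "constant read", "data read", "result write"]

# Hypothetical mapping of each access to a bus for three arrangements.
# Only the last split (2 ops per bus) is stated in the slides; the first
# two mappings are assumptions made for comparison.
ARRANGEMENTS = {
    "common bus": {a: "bus" for a in ACCESSES},
    "separate buses": {
        "instruction read": "instruction bus",
        "constant read": "data bus",
        "data read": "data bus",
        "result write": "data bus",
    },
    "constants in ROM": {
        "instruction read": "instruction bus",
        "constant read": "instruction bus",
        "data read": "data bus",
        "result write": "data bus",
    },
}

def bus_load(mapping):
    """Count memory operations per bus for one instruction."""
    load = {}
    for bus in mapping.values():
        load[bus] = load.get(bus, 0) + 1
    return load

def busiest(mapping):
    """Operations on the most heavily used bus, i.e. the bottleneck."""
    return max(bus_load(mapping).values())

for name, mapping in ARRANGEMENTS.items():
    print(name, bus_load(mapping), "busiest bus:", busiest(mapping))
```

Under these assumptions the busiest bus carries 4, 3 and 2 operations respectively, which is the sense in which storing constants with instructions balances the traffic.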
State
Data Instruction
Processing Processing
Instructions
Pipelining
[Figure: jobs 1, 2, ..., i pass through stages 1, 2, ..., j in turn; stage k carries out task k, and outputs leave the last stage.]
A pipelined processor
[Figure: datapath with ROM address and ROM data lines, RAM address and RAM data lines, and registers for the instruction, the constant and the data. Steps shown: Fetch Op Code (ROM), Fetch variable (RAM), Fetch constant (ROM).]
Resource Reservation
Overlapping Operations
      0  1  2  3  4  5  6  7  8  9  10
ROM   0  0
RAM      0     0
ALU         0
Pipelining
      0  1  2  3  4  5  6  7  8  9  10
ROM   0  0     1  1     2  2
RAM      0     0  1     1  2     2
ALU         0        1        2
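How closely jobs can follow one another depends on when each resource is needed. As a minimal sketch, the code below searches for the smallest fixed launch interval that never puts two jobs on the same resource in the same cycle, using the row vectors this deck quotes for the example (ROM {0,1}, RAM {1,3}, ALU {2}). The function names are illustrative and not taken from the slides.

```python
from itertools import combinations

# Cycles (relative to job start) in which one job uses each resource.
ROW_VECTORS = {"ROM": [0, 1], "RAM": [1, 3], "ALU": [2]}

def collides(row_vectors, period):
    """True if launching a job every `period` cycles makes two jobs
    need the same resource in the same cycle."""
    for uses in row_vectors.values():
        for a, b in combinations(uses, 2):
            # Two uses of one resource clash when their distance is a
            # multiple of the launch interval.
            if (b - a) % period == 0:
                return True
    return False

def smallest_constant_period(row_vectors):
    """Smallest fixed launch interval with no resource collision."""
    period = 1
    while collides(row_vectors, period):
        period += 1
    return period

print(smallest_constant_period(ROW_VECTORS))  # 3
```

An interval of 2 fails because the two RAM uses are exactly 2 cycles apart, so with a fixed interval a new job can start only every 3 cycles.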
Improved Scheduling
Sampling Period
0 1 2 3 4 5 6 7 8 9 10
ROM 0 1 0 1 2 3 2 3 4 5 4
RAM 0 1 0 1 2 3 2 3 4 5
ALU 0 1 2 3 4
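The ROM and RAM rows of this table can be reproduced from a row vector and a list of launch times. The sketch below does so under assumptions that are inferred rather than stated on the slide: that the ROM uses have been moved to {0,2}, that the RAM vector is still {1,3}, and that jobs are launched at cycles 0, 1, 4, 5, 8, 9. The function name `occupancy` is illustrative.

```python
def occupancy(row_vector, launches, horizon):
    """Return, for cycles 0..horizon-1, the job using the resource
    (or None). Raises ValueError if two jobs need it in the same cycle."""
    row = [None] * horizon
    for job, start in enumerate(launches):
        for offset in row_vector:
            t = start + offset
            if t >= horizon:
                continue  # outside the window being displayed
            if row[t] is not None:
                raise ValueError(f"collision at cycle {t}")
            row[t] = job
    return row

# Assumed launch times: two jobs one cycle apart, repeated every four
# cycles, i.e. one job every two cycles on average.
LAUNCHES = [0, 1, 4, 5, 8, 9]

rom = occupancy([0, 2], LAUNCHES, 11)  # assumed delayed ROM vector {0,2}
ram = occupancy([1, 3], LAUNCHES, 11)  # assumed unchanged RAM vector {1,3}
print(rom)
print(ram)
```

With these inputs the ROM row comes out as 0 1 0 1 2 3 2 3 4 5 4, and the RAM row as the same pattern starting one cycle later. The original ROM vector {0,1} raises a collision for the same launch times.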
Pipeline Optimization
Design Sets
Row Vectors
In this example, the row vector for ROM is {0,1}, for RAM is
{1,3} and for ALU is {2}.
        0 1 2 3 4 5 6 7
D       X . X . . X X .       0,2,5,6

        0 1 2 3 4 5 6 7
R       . X . X X . X .       1,3,4,6
D       . X . X . . X X       1,3,6,7

If R(i) < D(i)
Add D(i) - R(i) delays to all members of R at position i and beyond.

        0 1 2 3 4 5 6 7 8
R       . X . X X . X . .     1,3,4,6
D       . X . X . . X X .     1,3,6,7

R       . X . X . . X . X     1,3,6,8
D       . X . X . . X X .     1,3,6,7
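The delay rule can be written out in a few lines. This is a minimal sketch with an illustrative function name (`delay_r`), applied to the R = 1,3,4,6 and D = 1,3,6,7 vectors from the figure; positions are 1-based as on the slide.

```python
def delay_r(r, d, i):
    """Apply the rule at 1-based position i: if R(i) < D(i), add
    D(i) - R(i) to R(i) and to every later member of R."""
    r = list(r)  # work on a copy
    gap = d[i - 1] - r[i - 1]
    if gap > 0:
        for k in range(i - 1, len(r)):
            r[k] += gap
    return r

R = [1, 3, 4, 6]
D = [1, 3, 6, 7]
print(delay_r(R, D, 3))  # [1, 3, 6, 8]
```

After this step R(4) = 8 exceeds D(4) = 7, a case this rule alone cannot resolve because it only ever delays R.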
Periodicity p = 2

If D(i) < R(i)
(for example, p = 2, R = {1,3,4,6}, D = {1,2,5,6}. Now D2 < R2)
1 Add sufficient multiples of p to D(i) such that it is ≥ R(i).
2 Add the same number to members of D beyond i.
3 Now if R(i) < D(i), add D(i) - R(i) delays to all members of R at position i and beyond.

        0 1 2 3 4 5 6 7
R       . X . X X . X .         1,3,4,6
D       . X X . . X X .         1,2,5,6

Break here and move forward by p (=2) steps
        0 1 2 3 4 5 6 7 8 9
R       . X . X X . X . . .     1,3,4,6
D       . X . . X . . X X .     1,4,7,8

Now align R
        0 1 2 3 4 5 6 7 8 9
R       . X . . X X . X . .     1,4,5,7
D       . X . . X . . X X .     1,4,7,8
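Alignment Example
        0 1 2 3 4 5 6 7 8 9 10
D       . X . . X . . X X . .      1,4,7,8
R       . X . . X X . X . . .      1,4,5,7
Move R3 and beyond forward by 2.
R       . X . . X . . X . X .      1,4,7,9
So R = 1,4,7,9 and D = 1,4,7,8.

D4 < R4. Move D4 forward by 2 to 10.
        0 1 2 3 4 5 6 7 8 9 10
R       . X . . X . . X . X .      1,4,7,9
D       . X . . X . . X . . X      1,4,7,10

Now R4 < D4. Move R4 forward by 1 to 10.
        0 1 2 3 4 5 6 7 8 9 10
D       . X . . X . . X . . X      1,4,7,10
R       . X . . X . . X . . X      1,4,7,10

Vectors are now aligned at 1,4,7,10.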
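The whole alignment walk can be traced in code. The sketch below applies the three steps position by position; the function name `align` is illustrative, and taking the smallest sufficient multiple of p in step 1 is an assumption (the slides only say "sufficient multiples").

```python
def align(r, d, p):
    """Align vectors R and D position by position. D is moved only in
    multiples of the period p; R can be delayed by any amount."""
    r, d = list(r), list(d)  # work on copies
    for i in range(len(r)):
        if d[i] < r[i]:
            # Steps 1 and 2: smallest multiple of p that lifts D(i) to
            # at least R(i), added to D(i) and all later members of D.
            steps = (r[i] - d[i] + p - 1) // p
            for k in range(i, len(d)):
                d[k] += steps * p
        if r[i] < d[i]:
            # Step 3: delay R(i) and all later members of R.
            gap = d[i] - r[i]
            for k in range(i, len(r)):
                r[k] += gap
    return r, d

print(align([1, 3, 4, 6], [1, 2, 5, 6], p=2))
# ([1, 4, 7, 10], [1, 4, 7, 10])
```

Starting from R = {1,3,4,6} and D = {1,2,5,6} with p = 2, the intermediate values match the figures: D becomes 1,4,7,8, R becomes 1,4,5,7 and then 1,4,7,9, and both end at 1,4,7,10.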
Example System
Since the ROM and the RAM are used for 2 cycles each in
every operation, MASP = 2.
However, as we had seen before, ASP = 3 in this case.
Therefore, the schedule needs improvement.
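The comparison can be spelled out as a short calculation. This sketch reads MASP the way the slide derives it, as the number of cycles per job spent on the busiest resource; the function name is illustrative, and the ASP value is simply the figure quoted above rather than something the code computes.

```python
# Row vectors of the example: cycles in which one job uses each resource.
ROW_VECTORS = {"ROM": [0, 1], "RAM": [1, 3], "ALU": [2]}

def busiest_resource_cycles(row_vectors):
    """Cycles per job spent on the most heavily used resource."""
    return max(len(uses) for uses in row_vectors.values())

masp = busiest_resource_cycles(ROW_VECTORS)
asp = 3  # value quoted in the slides for the unimproved schedule
print(masp, asp, "needs improvement" if asp > masp else "at the bound")
```

No schedule can launch jobs faster on average than one per `masp` cycles, since the ROM and the RAM are each busy for that long per job.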
Example Application
0 1 2 3
ROM 0 0
RAM 0 0
ALU 0
So no alignment is required.
ALU Schedule
One can trade off power for speed when designing the
ALU.
By using optimization techniques, we are able to reach a
higher throughput, even with a slower ALU!
0 1 2 3
ROM 0 0
RAM 0 0
ALU 0
0 1 2 3
ROM 0 0
RAM 0 0
ALU 0
Time Ordering
0 1 2 3 4 5 6 7 8 9 10
ROM 0 1 0 1 2 3 2 3 4 5 4
RAM 0 1 0 1 2 3 2 3 4 5
ALU 0 1 2 3 4
Time Ordering
0 1 2 3 4 5 6 7 8 9 10
ROM 2 3 2 3 4 5 4 5 6 7 6
RAM 0 1 0 1 2 3 2 3 4 5
ALU 0 1 2 3 4
Conclusions