Instruction Pipelining
• As a simple approach, consider subdividing instruction processing
into two stages: fetch instruction and execute instruction.
• The pipeline has two independent stages. The first stage fetches an
instruction and buffers it. When the second stage is free, the first
stage passes it the buffered instruction. While the second stage is
executing the instruction, the first stage takes advantage of any
unused memory cycles to fetch and buffer the next instruction. This
is called instruction prefetch or fetch overlap. It should be clear
that this process speeds up instruction execution.
Pipelining: A Real-Life Example
Let us see a real-life example that works on the concept of
pipelined operation. Consider a water-bottle packaging plant. Let
there be 3 stages that a bottle should pass through: Inserting the
bottle (I), Filling water in the bottle (F), and Sealing the bottle (S).
Let us consider these stages as stage 1, stage 2 and stage 3
respectively. Let each stage take 1 minute to complete its
operation.
Now, in a non-pipelined operation, a bottle is first inserted in the
plant; after 1 minute it is moved to stage 2, where water is filled.
Now, in stage 1 nothing is happening. Similarly, when the bottle
moves to stage 3, both stage 1 and stage 2 are idle. But in pipelined
operation, when the bottle is in stage 2, another bottle can be
loaded at stage 1. Similarly, when the bottle is in stage 3, there can
be one bottle each in stage 1 and stage 2. So, after each minute, we
get a new bottle at the end of stage 3. Hence, the average time
taken to manufacture 1 bottle is:
Without pipelining = 9/3 minutes = 3 minutes per bottle

I F S | | | | | |
| | | I F S | | |
| | | | | | I F S   (9 minutes)

With pipelining = 5/3 minutes ≈ 1.67 minutes per bottle

I F S | |
| I F S |
| | I F S   (5 minutes)
Thus, pipelined operation increases the efficiency of a system.
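The bottle-plant arithmetic above can be checked with a minimal sketch; the stage count, task count, and 1-minute stage time are taken from the example, while the function names are ours:

```python
# Minimal timing model of the bottling-plant example above:
# n_tasks bottles pass through n_stages stages of equal duration.

def sequential_time(n_tasks, n_stages, stage_time=1):
    # Without pipelining, each bottle passes through all stages alone.
    return n_tasks * n_stages * stage_time

def pipelined_time(n_tasks, n_stages, stage_time=1):
    # With pipelining, the first bottle takes n_stages minutes,
    # then one finished bottle emerges every minute.
    return (n_stages + n_tasks - 1) * stage_time

print(sequential_time(3, 3))  # 9 minutes total, 3 minutes per bottle
print(pipelined_time(3, 3))   # 5 minutes total, about 1.67 per bottle
```

For a long run the advantage grows: 100 bottles take 300 minutes sequentially but only 102 minutes pipelined.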
Instruction Pipelining
• The idea is to have more than one instruction being
processed by the processor at the same time.
• The success of a pipeline depends upon dividing the
execution of an instruction among a number of subunits
(stages), each performing part of the required operations.
• A possible division is to consider the following as the
subtasks needed for the execution of an instruction.
– Instruction fetch (F)
– Instruction decode (D)
– Operand fetch (O)
– Instruction execution (E)
– Store of results (S)
• In this case, it is possible to have up to five instructions in
the pipeline at the same time, thus increasing instruction
throughput.
Instruction Pipelining (cont.)
• Pipelining refers to the technique in which a given
task is divided into a number of subtasks that
need to be performed in sequence.
• Each subtask is performed by a given functional
unit.
• The units are connected in a serial fashion and all
of them operate simultaneously.
• The use of pipelining improves the performance
compared to the traditional sequential execution
of tasks.
Space-Time Chart (Gantt Chart)
• Used to formulate performance measures for how well a
pipeline processes a series of tasks.
• The chart shows the succession of the subtasks in the pipe
with respect to time.
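As a sketch of such a chart, the following hypothetical helper prints one row per task, marking which stage each task occupies in each clock cycle; the five stage letters follow the F/D/O/E/S division given earlier:

```python
# Print a space-time (Gantt) chart for an ideal k-stage pipeline:
# task t (0-indexed) enters stage s at cycle t + s, one stage per cycle.

def gantt_rows(n_tasks, stages):
    k = len(stages)
    total_cycles = k + n_tasks - 1      # cycles until the last task drains
    rows = []
    for t in range(n_tasks):
        cells = ["."] * total_cycles    # "." marks an idle cycle for this task
        for s, name in enumerate(stages):
            cells[t + s] = name
        rows.append(f"T{t + 1}: " + " ".join(cells))
    return rows

for row in gantt_rows(4, ["F", "D", "O", "E", "S"]):
    print(row)
```

This prints a staircase: `T1: F D O E S . . .`, `T2: . F D O E S . .`, and so on, showing all five stages busy on different tasks once the pipe is full.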
The Effect of a Conditional Branch on Instruction Pipeline operation
Six Stage Instruction Pipeline
Pipeline Performance
Consider a ‘k’ segment pipeline with clock cycle time as ‘Tp’. Let there
be ‘n’ tasks to be completed in the pipelined processor. Now, the first
instruction is going to take ‘k’ cycles to come out of the pipeline but
the other ‘n – 1’ instructions will take only ‘1’ cycle each, i.e., a total of
‘n – 1’ cycles. So, the time taken to execute ‘n’ instructions in a pipelined
processor:
ETpipeline = k + n – 1 cycles
= (k + n – 1) Tp
In the same case, for a non-pipelined processor, execution time of ‘n’
instructions will be:
ETnon-pipeline = n * k * Tp
Pipeline Performance
So, speedup (S) of the pipelined processor over non-pipelined processor, when ‘n’
tasks are executed on the same processor is:
S = Performance of pipelined processor /
Performance of non-pipelined processor
As the performance of a processor is inversely proportional to the execution time,
we have:
S = ETnon-pipeline / ETpipeline
=> S = [n * k * Tp] / [(k + n – 1) * Tp]
S = [n * k] / [k + n – 1]
When the number of tasks ‘n’ is significantly larger than ‘k’, that is, n >> k:
S ≈ [n * k] / n
S ≈ k
where ‘k’ is the number of stages in the pipeline.
Also, Efficiency = Given speedup / Maximum speedup = S / Smax
We know that Smax = k.
So, Efficiency = S / k
Throughput = Number of instructions / Total time to complete the instructions
So, Throughput = n / [(k + n – 1) * Tp]
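These formulas translate directly into a small sketch (the function names are ours):

```python
# Pipeline performance formulas: k stages, n tasks, cycle time tp.

def et_pipeline(n, k, tp):
    return (k + n - 1) * tp           # (k + n - 1) cycles in total

def et_nonpipeline(n, k, tp):
    return n * k * tp                 # k cycles per task, no overlap

def speedup(n, k):
    return (n * k) / (k + n - 1)      # ETnon-pipeline / ETpipeline

def efficiency(n, k):
    return speedup(n, k) / k          # Smax = k

def throughput(n, k, tp):
    return n / ((k + n - 1) * tp)     # tasks completed per unit time

print(speedup(4, 4))        # 16/7, well below k for small n
print(speedup(100_000, 4))  # close to k = 4 when n >> k
```

Note how the speedup is far below k for small n (the pipe spends most of its time filling and draining) and approaches k only for long task streams.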
Types of pipeline
Uniform delay pipeline
In this type of pipeline, all the stages take the same time to complete
an operation.
In a uniform delay pipeline, Cycle Time (Tp) = Stage Delay
If buffers are included between the stages, then Cycle Time (Tp) =
Stage Delay + Buffer Delay
Non-uniform delay pipeline
In this type of pipeline, different stages take different times to
complete an operation.
In this type of pipeline, Cycle Time (Tp) = Maximum(Stage Delay)
For example, if there are 4 stages with delays, 1 ns, 2 ns, 3 ns, and 4
ns, then
Tp = Maximum(1 ns, 2 ns, 3 ns, 4 ns) = 4 ns
If buffers are included between the stages,
Tp = Maximum(Stage delay + Buffer delay)
Types of pipeline
Example : Consider a 4 segment pipeline with stage delays (2 ns, 8 ns,
3 ns, 10 ns). Find the time taken to execute 100 tasks in the above
pipeline.
Solution: As the above pipeline is a non-uniform delay pipeline,
Tp = max(2, 8, 3, 10) = 10 ns
We know that ETpipeline = (k + n – 1) Tp = (4 + 100 – 1) 10 ns = 1030 ns
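The same calculation as a one-line check (numbers from the problem statement, function name ours):

```python
# Non-uniform delay pipeline: the cycle time is the slowest stage's delay
# (plus any inter-stage buffer delay).

def pipeline_time(n_tasks, stage_delays, buffer_delay=0):
    tp = max(stage_delays) + buffer_delay       # Tp = max stage delay
    return (len(stage_delays) + n_tasks - 1) * tp  # (k + n - 1) * Tp

print(pipeline_time(100, [2, 8, 3, 10]))  # 1030 (ns)
```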
Types of pipeline
Consider the sequence of machine instructions given below:
MUL R5, R0, R1
DIV R6, R2, R3
ADD R7, R5, R6
SUB R8, R7, R4
In the above sequence, R0 to R8 are general purpose registers. In the
instructions shown, the first register stores the result of the operation
performed on the second and the third registers. This sequence of
instructions is to be executed in a pipelined instruction processor with
the following 4 stages: (1) Instruction Fetch and Decode (IF), (2) Operand
Fetch (OF), (3) Perform Operation (PO) and (4) Write back the Result
(WB). The IF, OF and WB stages take 1 clock cycle each for any
instruction. The PO stage takes 1 clock cycle for ADD or SUB instruction, 3
clock cycles for MUL instruction and 5 clock cycles for DIV instruction.
The pipelined processor uses operand forwarding from the PO stage to
the OF stage. The number of clock cycles taken for the execution of the
above sequence of instructions is 13.
(Timeline: MUL — IF 1, OF 2, PO 3–5, WB 6; DIV — IF 2, OF 3, PO 6–10,
WB 11; ADD — IF 3, OF 10 with R5 and R6 forwarded from PO, PO 11,
WB 12; SUB — IF 4, OF 11 with R7 forwarded, PO 12, WB 13.)
Types of pipeline
A 4-stage pipeline has stage delays of 150, 120, 160 and 140 nanoseconds
respectively. Registers that are used between the stages have a delay of 5
nanoseconds each. Assuming a constant clocking rate, the total time taken to
process 1000 data items on this pipeline will be
165.5 microseconds
(Clock period = max(150, 120, 160, 140) + 5 = 165 ns; total time =
(4 + 1000 – 1) × 165 ns = 165,495 ns ≈ 165.5 microseconds.)
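The same calculation written out, with the register delay added to the slowest stage delay to get the clock period:

```python
# 4-stage pipeline, 5 ns inter-stage register delay, 1000 data items.

tp = max(150, 120, 160, 140) + 5   # clock period = 165 ns
cycles = 4 + 1000 - 1              # k + n - 1 = 1003 cycles
total_ns = cycles * tp
print(total_ns, "ns =", total_ns / 1000, "microseconds")  # 165495 ns
```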
Superscalar Design
• There are multiple functional units, each of which is
implemented as a pipeline, which support parallel execution of
several instructions. In this example, two integer, two floating-
point, and one memory (either load or store) operations can be
executing at the same time.
Superpipeline
• Super pipelining exploits the fact that many pipeline stages
perform tasks that require less than half a clock cycle. Thus, a
doubled internal clock speed allows the performance of two
tasks in one external clock cycle.
Superpipeline
• The base pipeline issues one instruction per clock cycle and can
perform one pipeline stage per clock cycle. The pipeline has
four stages: instruction fetch, operation decode, operation
execution, and result write-back. Note that although several
instructions are executing concurrently, only one instruction is
in its execution stage at any one time.
• The next part of the diagram shows a superpipelined
implementation that is capable of performing two pipeline
stages per clock cycle. An alternative way of looking at this is
that the functions performed in each stage can be split into two
nonoverlapping parts, each of which can execute in half a clock
cycle. A superpipeline implementation that behaves in this
fashion is said to be of degree 2.
• A superscalar implementation is capable of executing two
instances of each stage in parallel. Higher-degree superpipeline
and superscalar implementations are of course possible.
Instruction-Level Parallelism
• The superscalar approach depends on the ability to execute
multiple instructions in parallel. The term instruction-level
parallelism refers to the degree to which, on average, the
instructions of a program can be executed in parallel. A
combination of compiler-based optimization and hardware
techniques can be used to maximize instruction-level
parallelism.
• Five limitations:
• True data dependency
• Procedural dependency
• Resource conflicts
• Output dependency
• Antidependency
Instruction-Level Parallelism
• TRUE DATA DEPENDENCY: Consider the following sequence:
Add r1, r2
Move r3, r1
• The second instruction can be fetched and decoded but cannot
execute until the first instruction executes, because the
second instruction needs data produced by the first instruction.
• PROCEDURAL DEPENDENCY:
• The presence of branches in an instruction sequence complicates
the pipeline operation. The instructions following a branch (taken
or not taken) have a procedural dependency on the branch and
cannot be executed until the branch is executed.
• RESOURCE CONFLICT: A resource conflict is a competition of two or
more instructions for the same resource at the same time.
Examples of resources include memories, caches, buses, register-file
ports, and functional units (e.g., the ALU adder).