
Chap-10: Speed and Efficiency

Instruction Pipelining
• As a simple approach, consider subdividing instruction processing
into two stages: fetch instruction and execute instruction.
• The pipeline has two independent stages. The first stage fetches an
instruction and buffers it. When the second stage is free, the first
stage passes it the buffered instruction. While the second stage is
executing the instruction, the first stage takes advantage of any
unused memory cycles to fetch and buffer the next instruction. This
is called instruction prefetch or fetch overlap. It should be clear
that this process will speed up instruction execution.

Six Stage Instruction Pipeline
Let us see a real-life example that works on the concept of
pipelined operation. Consider a water bottle packaging plant. Let
there be 3 stages that a bottle should pass through: inserting the
bottle (I), filling water in the bottle (F), and sealing the bottle (S). Let
us consider these stages as stage 1, stage 2 and stage 3,
respectively. Let each stage take 1 minute to complete its
operation.
Now, in a non-pipelined operation, a bottle is first inserted in the
plant; after 1 minute it is moved to stage 2 where water is filled.
Now, in stage 1 nothing is happening. Similarly, when the bottle
moves to stage 3, both stage 1 and stage 2 are idle. But in pipelined
operation, when the bottle is in stage 2, another bottle can be
loaded at stage 1. Similarly, when the bottle is in stage 3, there can
be one bottle each in stage 1 and stage 2. So, after each minute, we
get a new bottle at the end of stage 3. Hence, the average time
taken to manufacture 1 bottle is:
Without pipelining, 3 bottles take 9 minutes, i.e. 9/3 = 3 minutes per bottle:
Bottle 1: I F S - - - - - -
Bottle 2: - - - I F S - - -
Bottle 3: - - - - - - I F S   (9 minutes)

With pipelining, 3 bottles take 5 minutes, i.e. 5/3 ≈ 1.67 minutes per bottle:
Bottle 1: I F S - -
Bottle 2: - I F S -
Bottle 3: - - I F S   (5 minutes)
Thus, pipelined operation increases the efficiency of a system.
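The same numbers can be reproduced with a short simulation. The sketch below is illustrative only (the function names are assumed, not from the example); it moves every bottle one stage forward per minute and counts the minutes until all bottles are sealed:

def pipelined_time(n_bottles, n_stages=3):
    """Minutes needed when the stages overlap (a new bottle enters every minute)."""
    stages = [None] * n_stages          # what each stage holds during the current minute
    waiting = list(range(n_bottles))    # bottles not yet inserted
    finished, minutes = 0, 0
    while finished < n_bottles:
        # start of the minute: every bottle moves one stage forward
        for i in range(n_stages - 1, 0, -1):
            stages[i] = stages[i - 1]
        stages[0] = waiting.pop(0) if waiting else None
        minutes += 1                    # all occupied stages work for this minute
        if stages[-1] is not None:      # the bottle in the last stage is now sealed
            finished += 1
    return minutes

def sequential_time(n_bottles, n_stages=3):
    """Minutes needed when only one bottle is in the plant at a time."""
    return n_bottles * n_stages

print(sequential_time(3))   # 9 minutes -> 9/3 = 3 minutes per bottle
print(pipelined_time(3))    # 5 minutes -> 5/3 ≈ 1.67 minutes per bottle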

Instruction Pipelining
• The idea is to have more than one instruction being
processed by the processor at the same time.
• The success of a pipeline depends upon dividing the
execution of an instruction among a number of subunits
(stages), each performing part of the required operations.
• A possible division is to consider the following as the
subtasks needed for the execution of an instruction.
– Instruction fetch (F)
– Instruction decode (D)
– Operand fetch (O)
– Instruction execution (E)
– Store of results (S)
• In this case, it is possible to have up to five instructions in
the pipeline at the same time, thus reducing the overall
execution time of a sequence of instructions.
Instruction Pipelining (cont.)
• Pipelining refers to the technique in which a given
task is divided into a number of subtasks that
need to be performed in sequence.
• Each subtask is performed by a given functional
unit.
• The units are connected in a serial fashion and all
of them operate simultaneously.
• The use of pipelining improves the performance
compared to the traditional sequential execution
of tasks.

Space Time Chart - Gantt’s chart
• Used to formulate some performance
measures for the goodness of a pipeline in
processing a series of tasks
• The chart shows the succession of the
subtasks in the pipe with respect to time.

Space Time Chart - Gantt’s chart

• The vertical axis represents the subunits (four in this case) and
the horizontal axis represents time (measured in terms of the
time unit required for each unit to perform its task).
• In developing the Gantt’s chart, we assume that the time (T)
taken by each subunit to perform its task is the same; we call
this the unit time.
• From the figure, 13 time units are needed to finish executing 10
instructions (I1 to I10). This is to be compared to 40 time
units if sequential processing is used (ten instructions each
requiring four time units).
• The figure illustrates the difference between executing four
subtasks of a given instruction (fetching F, decoding D,
execution E, and writing the results W) using pipelining
and sequential processing.
• The total time required to process three instructions (I1, I2, I3) is
only six time units if four-stage pipelining is used, as compared to
12 time units if sequential processing is used.
• A saving of up to 50% in the execution time of these three
instructions is obtained.
Instruction Pipelining
• The execution time will generally be longer than the fetch time.
Execution will involve reading and storing operands and the
performance of some operation. Thus, the fetch stage may have to
wait for some time before it can empty its buffer.
• A conditional branch instruction makes the address of the next
instruction to be fetched unknown. Thus, the fetch stage must wait
until it receives the next instruction address from the execute stage.
The execute stage may then have to wait while the next instruction
is fetched.
• Guessing can reduce the time lost to the second cause (conditional
branches). A simple rule is the following: when a conditional branch
instruction is passed on from the fetch to the execute stage, the
fetch stage fetches the next instruction in memory after the branch
instruction. Then, if the branch is not taken, no time is lost. If the
branch is taken, the fetched instruction must be discarded and a
new instruction fetched.
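For a rough sense of the cost of guessing wrong (illustrative numbers, not from the slides): if about 20% of the instructions are branches that turn out to be taken, and each taken branch forces one prefetched instruction to be discarded and refetched, then roughly one cycle is wasted per taken branch, i.e. about 0.2 extra cycles per instruction on top of the ideal pipelined time.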

The Effect of a Conditional Branch on Instruction Pipeline Operation

Six Stage Instruction Pipeline

Pipeline Performance
Consider a ‘k’ segment pipeline with clock cycle time as ‘Tp’. Let there
be ‘n’ tasks to be completed in the pipelined processor. Now, the first
instruction is going to take ‘k’ cycles to come out of the pipeline but
the other ‘n – 1’ instructions will take only ‘1’ cycle each, i.e., a total of
‘n – 1’ cycles. So, time taken to execute ‘n’ instructions in a pipelined
processor:
ETpipeline = k + n – 1 cycles
= (k + n – 1) Tp
In the same case, for a non-pipelined processor, execution time of ‘n’
instructions will be:
ETnon-pipeline = n * k * Tp
So, speedup (S) of the pipelined processor over non-pipelined
processor, when ‘n’ tasks are executed on the same processor is:

S = Performance of pipelined processor /
Performance of Non-pipelined processor
As the performance of a processor is inversely proportional to the execution time,
we have,
S = ETnon-pipeline / ETpipeline
=> S = [n * k * Tp] / [(k + n – 1) * Tp]
S = [n * k] / [k + n – 1]
When the number of tasks ‘n’ is significantly larger than k, that is, n >> k:
S = [n * k] / n
S = k
where ‘k’ is the number of stages in the pipeline.
Also, Efficiency = Given speedup / Max speedup = S / Smax
We know that Smax = k
So, Efficiency = S / k
Throughput = Number of instructions / Total time to complete the instructions
So, Throughput = n / [(k + n – 1) * Tp]
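The formulas above are easy to check numerically. Below is a minimal Python sketch (the function and variable names are assumed, not part of the slides):

def pipeline_metrics(n, k, tp):
    """Execution time, speedup, efficiency and throughput of a k-stage
    pipeline with clock period tp running n independent tasks."""
    et_pipeline = (k + n - 1) * tp           # (k + n - 1) cycles of length tp
    et_non_pipeline = n * k * tp             # each task alone needs k * tp
    speedup = et_non_pipeline / et_pipeline  # approaches k as n grows
    efficiency = speedup / k                 # fraction of the ideal speedup
    throughput = n / et_pipeline             # tasks completed per unit time
    return et_pipeline, et_non_pipeline, speedup, efficiency, throughput

# Example: k = 4 stages, Tp = 10 ns, n = 100 tasks (the values used in the
# non-uniform delay example that follows)
print(pipeline_metrics(n=100, k=4, tp=10))
# -> (1030, 4000, 3.883..., 0.970..., 0.0970...)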
Types of pipeline
Uniform delay pipeline
In this type of pipeline, all the stages take the same time to complete
an operation.
In uniform delay pipeline, Cycle Time (Tp) = Stage Delay
If buffers are included between the stages then, Cycle Time (Tp) =
Stage Delay + Buffer Delay
Non-Uniform delay pipeline
In this type of pipeline, different stages take different amounts of time
to complete an operation.
In this type of pipeline, Cycle Time (Tp) = Maximum(Stage Delay)
For example, if there are 4 stages with delays, 1 ns, 2 ns, 3 ns, and 4
ns, then
Tp = Maximum(1 ns, 2 ns, 3 ns, 4 ns) = 4 ns
If buffers are included between the stages,
Tp = Maximum(Stage delay + Buffer delay)
Types of pipeline
Example : Consider a 4 segment pipeline with stage delays (2 ns, 8 ns,
3 ns, 10 ns). Find the time taken to execute 100 tasks in the above
pipeline.
Solution: As the above pipeline has non-uniform stage delays,
Tp = max(2, 8, 3, 10) = 10 ns
We know that ETpipeline = (k + n – 1) Tp = (4 + 100 – 1) 10 ns = 1030 ns

Types of pipeline

We have two designs D1 and D2 for a synchronous pipeline processor.
D1 has 5 pipeline stages with execution times of 3 ns, 2 ns, 4 ns, 2 ns,
and 3 ns, while design D2 has 8 pipeline stages, each with 2 ns
execution time. How much time can be saved using design D2 over
design D1 for executing 100 instructions?
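A possible worked solution (a sketch, using the same (k + n – 1) * Tp model as above, with the clock period set by the slowest stage):
For D1: Tp = max(3, 2, 4, 2, 3) ns = 4 ns, so ET = (5 + 100 – 1) * 4 ns = 416 ns
For D2: Tp = 2 ns, so ET = (8 + 100 – 1) * 2 ns = 214 ns
Time saved with D2 = 416 ns – 214 ns = 202 ns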

Types of pipeline

An instruction pipeline consists of 4 stages: Fetch (F), Decode operand
field (D), Execute (E), and Result-Write (W). The five instructions in a
certain instruction sequence need these stages for different numbers of
clock cycles, as shown by a table of the number of cycles needed for
each stage (the table itself is not reproduced here).
Find the number of clock cycles needed to perform the 5 instructions.

Types of pipeline
Consider the sequence of machine instructions given below:
MUL R5, R0, R1
DIV R6, R2, R3
ADD R7, R5, R6
SUB R8, R7, R4
In the above sequence, R0 to R8 are general purpose registers. In the
instructions shown, the first register stores the result of the operation
performed on the second and the third registers. This sequence of
instructions is to be executed in a pipelined instruction processor with
the following 4 stages: (1) Instruction Fetch and Decode (IF), (2) Operand
Fetch (OF), (3) Perform Operation (PO) and (4) Write back the Result
(WB). The IF, OF and WB stages take 1 clock cycle each for any
instruction. The PO stage takes 1 clock cycle for ADD or SUB instruction, 3
clock cycles for MUL instruction and 5 clock cycles for DIV instruction.
The pipelined processor uses operand forwarding from the PO stage to
the OF stage. The number of clock cycles taken for the execution of the
above sequence of instructions is 13.
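One cycle-by-cycle timing that gives this answer (a sketch; it assumes a value produced by the PO stage can be forwarded to an OF taking place in the same cycle):
MUL R5, R0, R1 : IF 1, OF 2, PO 3–5, WB 6
DIV R6, R2, R3 : IF 2, OF 3, PO 6–10, WB 11
ADD R7, R5, R6 : IF 3, OF 10 (R5 and R6 forwarded), PO 11, WB 12
SUB R8, R7, R4 : IF 4, OF 11 (R7 forwarded), PO 12, WB 13
The last write-back finishes in cycle 13.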
Types of pipeline

A 4-stage pipeline has stage delays of 150, 120, 160 and 140 nanoseconds,
respectively. Registers that are used between the stages have a delay of 5
nanoseconds each. Assuming constant clocking rate, the total time taken to
process 1000 data items on this pipeline will be

Answer: 165.5 microseconds
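A possible worked solution (a sketch): with a buffer register between stages, the clock period is the slowest stage delay plus the register delay, so
Tp = max(150, 120, 160, 140) ns + 5 ns = 165 ns
Total time = (k + n – 1) * Tp = (4 + 1000 – 1) * 165 ns = 165495 ns ≈ 165.5 microseconds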

Types of pipeline

Consider an instruction pipeline with the following four stages:
IF: Instruction Fetch
ID: Instruction Decode and Operand Fetch
EX: Execute
WB: Write Back
The IF, ID and WB stages take one clock cycle each to complete the
operation. The number of clock cycles for the EX stage depends on
the instruction. The ADD and SUB instructions need 1 clock cycle
and the MUL instruction needs 3 clock cycles in the EX stage.
Operand forwarding is used in the pipelined processor. What is the
number of clock cycles taken to complete the following sequence
of instructions?
ADD R2, R1, R0    (R2 ← R1 + R0)
MUL R4, R3, R2    (R4 ← R3 * R2)
SUB R6, R5, R4    (R6 ← R5 – R4)
Answer: 8 clock cycles
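One cycle-by-cycle timing that gives this answer (a sketch; it assumes the EX result can be forwarded directly to the EX stage of the dependent instruction):
ADD R2, R1, R0 : IF 1, ID 2, EX 3, WB 4
MUL R4, R3, R2 : IF 2, ID 3, EX 4–6 (R2 forwarded from ADD), WB 7
SUB R6, R5, R4 : IF 3, ID 4, EX 7 (R4 forwarded from MUL), WB 8
The last write-back finishes in cycle 8.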
Superscalar Design
• A superscalar implementation of a processor architecture is one
in which common instructions—integer and floating-point
arithmetic, loads, stores, and conditional branches—can be
initiated simultaneously and executed independently. Such
implementations raise a number of complex design issues
related to the instruction pipeline.
• The essence of the superscalar approach is the ability to
execute instructions independently and concurrently in
different pipelines. The concept can be further exploited by
allowing instructions to be executed in an order different from
the program order.

Superscalar Design
• There are multiple functional units, each of which is
implemented as a pipeline; together they support parallel execution
of several instructions. In this example, two integer, two floating-
point, and one memory (either load or store) operations can be
executing at the same time.

Superpipeline
• Superpipelining exploits the fact that many pipeline stages
perform tasks that require less than half a clock cycle. Thus, a
doubled internal clock speed allows the performance of two
tasks in one external clock cycle.

Superpipeline
• The base pipeline issues one instruction per clock cycle and can
perform one pipeline stage per clock cycle. The pipeline has
four stages: instruction fetch, operation decode, operation
execution, and result write back. Note that although several
instructions are executing concurrently, only one instruction is
in its execution stage at any one time.
• The next part of the diagram shows a superpipelined
implementation that is capable of performing two pipeline
stages per clock cycle. An alternative way of looking at this is
that the functions performed in each stage can be split into two
nonoverlapping parts and each can execute in half a clock
cycle. A superpipeline implementation that behaves in this
fashion is said to be of degree 2.
• A superscalar implementation is capable of executing two
instances of each stage in parallel. Higher-degree superpipeline
and superscalar implementations are of course possible.

Instruction-Level Parallelism
• The superscalar approach depends on the ability to execute
multiple instructions in parallel. The term instruction-level
parallelism refers to the degree to which, on average, the
instructions of a program can be executed in parallel. A
combination of compiler-based optimization and hardware
techniques can be used to maximize instruction-level
parallelism.
• five limitations:
• True data dependency
• Procedural dependency
• Resource conflicts
• Output dependency
• Antidependency

Instruction-Level Parallelism
• TRUE DATA DEPENDENCY Consider the following sequence:
Add r1, r2
Move r3, r1
• The second instruction can be fetched and decoded but cannot
execute until the first instruction executes. The reason is that the
second instruction needs data produced by the first instruction.
• PROCEDURAL DEPENDENCIES
• The presence of branches in an instruction sequence complicates
the pipeline operation. The instructions following a branch (taken
or not taken) have a procedural dependency on the branch and
cannot be executed until the branch is executed.
• RESOURCE CONFLICT A resource conflict is a competition of two or
more instructions for the same resource at the same time.
Examples of resources include memories, caches, buses, register-
file ports, and functional units (e.g., the ALU adder).

