Superscalar and Superpipelined Processors

The logical evolution of pipeline designs resulted in two high-performance execution techniques: superscalar and superpipelined CPUs.
Superscalar CPUs
Most executed operations are on scalar quantities
Common instructions (arithmetic, load/store, conditional branch) can be initiated and executed independently
The superscalar CPU has more than one pipelined functional unit (e.g. ALU) which can operate in parallel
Superpipelined CPUs
These result from the observation that a large number of pipeline operations do not require a full clock cycle to
complete.
Dividing the clock cycle into smaller subcycles and subdividing the "macro" pipeline stages into smaller
(and faster) substages means that, although the time to complete an individual instruction does not change, the
perceived throughput increases.
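The arithmetic behind this claim can be sketched as follows. This is a rough model, assuming an idealised pipeline with no stalls; the function name pipeline_time and its parameters are illustrative and do not come from the notes.

```python
def pipeline_time(n_instructions, n_stages, degree=1):
    """Time, in base clock cycles, to run n instructions on an ideal pipeline.

    degree is the (assumed) superpipelining degree: every stage is split
    into `degree` substages, each lasting 1/degree of a base clock cycle.
    """
    substages = n_stages * degree
    # The first instruction passes through every substage; each later
    # instruction completes one subcycle after its predecessor.
    subcycles = substages + (n_instructions - 1)
    return subcycles / degree


# Latency of a single instruction is unchanged by superpipelining.
print(pipeline_time(1, 5))        # 5.0
print(pipeline_time(1, 5, 2))     # 5.0

# Throughput over many instructions improves.
print(pipeline_time(100, 5))      # 104.0
print(pipeline_time(100, 5, 2))   # 54.5
```

With 100 instructions on a 5-stage pipeline, splitting each stage in two roughly halves the total time, while a single instruction still takes 5 base cycles.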
[Figure: Comparison of superpipelined and superscalar performance]
Problems in the Pipeline

Pipeline stalls (aka bubbles) are not only caused by the delay due to fetching operands from memory
Hazards causing stalls come in several flavours: data, structural and control
Data hazards are those where one instruction is dependent on a preceding one
add r1, r4, r5
sub r2, r5, r6
In this classic Read After Write (RAW) dependency, the second instruction cannot fetch one of its operands (r5)
until the first instruction has completed the write-back stage of the pipeline (and updated r5). In a 5-stage
pipeline (Fetch, Decode, Operand fetch, Execute, Write-back) this means introducing a bubble to prevent the
second instruction from entering operand fetch before the first has completed write-back. The other classic
hazards are Write After Read (WAR) and Write After Write (WAW).
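The three dependency types can be told apart mechanically by comparing the registers each instruction reads and writes. The sketch below assumes the convention used in these notes, where the last operand is the destination, and handles only instructions with at least one source; parse and hazards are hypothetical helper names.

```python
def parse(instr):
    """Split 'add r1, r4, r5' into (opcode, sources, destination).

    Assumes the last operand is the destination, as in the notes.
    """
    opcode, operands = instr.split(None, 1)
    ops = [o.strip() for o in operands.split(",") if o.strip()]
    return opcode, ops[:-1], ops[-1]


def hazards(first, second):
    """Return the dependency types of `second` on the earlier `first`."""
    _, src1, dst1 = parse(first)
    _, src2, dst2 = parse(second)
    found = set()
    if dst1 in src2:
        found.add("RAW")  # second reads what first writes
    if dst2 in src1:
        found.add("WAR")  # second overwrites what first still has to read
    if dst1 == dst2:
        found.add("WAW")  # both write the same register
    return found


print(hazards("add r1, r4, r5", "sub r2, r5, r6"))  # {'RAW'}
```

Immediate operands such as 1 or 0x7f never match a register name, so they are ignored by the comparison.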
Structural (or resource) hazards occur when the system does not have sufficient resources to handle a particular
combination of successive instructions, e.g. too much data required to/from cache simultaneously, or three
successive ADDs on a machine with only two ALUs
add r1, r3, r5
sub r4, r6, r9
inc r12
addc r8, r7, r10
Although there are no dependencies in the above code, the execution speed will depend on the number of ALUs.
We can always attempt to solve this by adding more hardware (e.g. more ALUs).
Control hazards (aka procedural dependencies) are caused by branch instructions, especially conditional
branches. We cannot execute instructions after the branch in parallel with instructions before it. Moreover, if
the instruction length is not fixed, instructions must be decoded to find out how many fetches are needed.
Instruction Issue/Completion Policy
With superscalar architectures we have the option of issuing and/or completing (retiring)
instructions either in order or out of order.
To understand why these are attractive options, consider the following code:
add r1, r3, r5
and r4, 0x7f, r3
sub r6, r12, r6
load Fred,,r9
There are no true (RAW) data dependencies, but if we only have two ALUs then we shall have to stall the pipeline when we
fetch the sub instruction.
Issuing instructions out of order would mean that the CPU could fetch and begin work on the load instruction which
would not involve the ALU.
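The difference between the two issue policies for the code above can be illustrated with a toy scheduler. This is only a sketch under simplifying assumptions: every instruction occupies its functional unit for one cycle, there are two ALUs and one load/store unit (labelled "mem" here), and data dependencies are ignored. The name issue_schedule and the unit labels are made up for the illustration.

```python
def issue_schedule(program, units, in_order=True):
    """Return the cycle in which each instruction is issued.

    program: list of (name, unit_kind) pairs in program order.
    units:   dict mapping unit_kind to the number of such units.
    """
    for _, kind in program:
        if units.get(kind, 0) < 1:
            raise ValueError(f"no unit of kind {kind!r}")
    issued = {}
    pending = list(range(len(program)))
    cycle = 0
    while pending:
        cycle += 1
        free = dict(units)      # all units become free again each cycle
        still_pending = []
        blocked = False
        for i in pending:
            kind = program[i][1]
            if not blocked and free[kind] > 0:
                free[kind] -= 1
                issued[i] = cycle
            else:
                still_pending.append(i)
                # In-order issue: nothing may overtake a stalled instruction.
                if in_order:
                    blocked = True
        pending = still_pending
    return [issued[i] for i in range(len(program))]


program = [("add", "alu"), ("and", "alu"), ("sub", "alu"), ("load", "mem")]
units = {"alu": 2, "mem": 1}
print(issue_schedule(program, units, in_order=True))   # [1, 1, 2, 2]
print(issue_schedule(program, units, in_order=False))  # [1, 1, 2, 1]
```

With in-order issue the load waits behind the stalled sub; with out-of-order issue it starts in the first cycle, because it does not need an ALU.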
There are four possibilities:
1. In-order Issue, In-order Completion
2. In-order Issue, Out-of-order Completion
3. Out-of-order Issue, In-order Completion
4. Out-of-order Issue, Out-of-order Completion
Doing everything in order is the simplest approach, but also the slowest: we may need to stall the pipeline, as noted
above.
As soon as we either issue or retire instructions out of order, the CPU incurs considerable book-keeping
overhead in order to ensure correctness.
One common technique, known as scoreboarding, was pioneered in what many consider to have been the first
superscalar machine, Seymour Cray's CDC 6600, in 1964.
Examples of the problems which can arise include:
Output dependencies
add r3, r5, r3
add r3, 1, r4
add r5, 1, r3
If the third instruction completes before the first, the final value in r3 will be incorrect.
Antidependencies
add r3, r5, r3
add r3, 1, r4
add r5, 1, r3
add r3, r4, r7
The third instruction must not complete before the second fetches its operands.
Both output dependencies and antidependencies arise because, once instructions execute out of order, register
contents may no longer reflect the original program sequence. Avoiding this may require a pipeline stall, wasting clock cycles.
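The output dependency in the three-instruction example can be demonstrated with a small simulation. This is a sketch with made-up initial register values: it assumes each result is computed correctly (as if operands were read in program order) and only the order of the write-backs varies. The names final_registers and operand are illustrative.

```python
def operand(x, state):
    """Value of a register name, or of an integer literal such as '1'."""
    return state[x] if x in state else int(x)


def final_registers(program, regs, completion_order):
    """Apply write-backs in completion_order and return the register file.

    program: list of (source, source, destination) tuples, all treated as adds.
    regs:    initial values; every register that is read must be present.
    """
    # Step 1: compute each result as a correct in-order machine would.
    state = dict(regs)
    results = []
    for a, b, dest in program:
        value = operand(a, state) + operand(b, state)
        state[dest] = value
        results.append((dest, value))
    # Step 2: write the results back in the order the instructions complete.
    out = dict(regs)
    for i in completion_order:
        dest, value = results[i]
        out[dest] = value
    return out


program = [("r3", "r5", "r3"),   # add r3, r5, r3
           ("r3", "1", "r4"),    # add r3, 1, r4
           ("r5", "1", "r3")]    # add r5, 1, r3
regs = {"r3": 10, "r4": 0, "r5": 2}
print(final_registers(program, regs, [0, 1, 2])["r3"])  # 3
print(final_registers(program, regs, [2, 1, 0])["r3"])  # 12
```

When the third instruction completes before the first, r3 ends up holding the first instruction's result (12) rather than the value the program order requires (3).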
Register Renaming
One form of resource duplication is register renaming, a technique whereby physical registers are dynamically
allocated by the CPU. This also requires a mapping from register names to physical registers.
add r3a, r5a, r3b
add r3b, 1, r4b
add r5a, 1, r3c
add r3c, r4b, r7b
Without register renaming, instruction three cannot be issued before instruction one has completed and instruction
two has been issued.
With renaming, instruction three can be issued immediately. As instructions are retired, their destination registers
become the "real" ones.
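The renaming shown above can be reproduced with a simple rule: every write to a register allocates a fresh version, and every read uses the current version. The sketch below assumes destination-last operands and letter suffixes as in the notes. It is not how hardware is built: a real CPU draws from a finite pool of physical registers and recycles them, which this illustration does not model. The name rename is hypothetical.

```python
import re
from string import ascii_lowercase

REGISTER = re.compile(r"^r\d+$")


def rename(program):
    """Give every register write a fresh name (r3a, r3b, r3c, ...).

    program: list of (opcode, [sources...], destination) tuples.
    Limited to 26 versions per register, since suffixes are single letters.
    """
    version = {}  # architectural register -> index of its current version

    def current(reg):
        return reg + ascii_lowercase[version.setdefault(reg, 0)]

    renamed = []
    for opcode, sources, dest in program:
        # Sources are looked up before the destination gets its new version.
        new_sources = [current(s) if REGISTER.match(s) else s for s in sources]
        version[dest] = version.get(dest, 0) + 1
        renamed.append((opcode, new_sources, current(dest)))
    return renamed


program = [("add", ["r3", "r5"], "r3"),
           ("add", ["r3", "1"], "r4"),
           ("add", ["r5", "1"], "r3"),
           ("add", ["r3", "r4"], "r7")]
for opcode, sources, dest in rename(program):
    print(opcode, ", ".join(sources + [dest]))
# add r3a, r5a, r3b
# add r3b, 1, r4b
# add r5a, 1, r3c
# add r3c, r4b, r7b
```

After renaming, instruction three writes r3c rather than r3b, so it no longer conflicts with instruction one's write or instruction two's read.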
Interrupts
If out-of-order completion is allowed, what PC value do we save at interrupt time in order to ensure that:
Instructions are not repeated on restarting the program
Instructions are not missed on restarting the program
Superscalar Processors
A CPU with a single pipeline is called a scalar processor.
A CPU with multiple pipelines is called superscalar.
A superscalar CPU can execute more than one instruction per clock cycle; thus a superscalar processor with 3
functional units (i.e. 3 pipelines) will, in theory, run up to 3 times faster than a scalar processor with the same clock
speed.