

Principles of Pipelining
Andrew Warfield
CS313

Distractions
News flash: If you haven't started Assignment 1 yet, you may be in trouble.

Learning Goals
Now that we understand how the sequential CPU works, let's talk about how to make it go faster.
This lecture will talk about the basic ideas behind pipelining, performance ramifications, and the challenges that result.
Slides for this unit are slightly modified versions of Bryant and O'Hallaron's Chapter 4 mini-course at http://www.cs.cmu.edu/afs/cs/academic/class/15349-s02/www/lectures.html

Motivation
What's wrong with the sequential y86?

Real-world Pipelines: Car Washes
Sequential
Pipelined
Parallel

Processor Efficiency
How can we measure it?
Latency: the time for one operation to pass through the system, from start to finish.
Throughput: the number of operations completed per unit time.
Computational Example
[Figure: a clocked system with one 300 ps block of combinational logic followed by a 20 ps register. Delay = 320 ps, Throughput = 3.12 GOPS.]
Computation requires a total of 300 picoseconds
Additional 20 picoseconds to save the result in a register
Must have a clock cycle of at least 320 ps
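A quick check of where those numbers come from, as a minimal Python sketch (my own illustration, not part of the slides):

    # Unpipelined system: one 300 ps block of combinational logic plus a 20 ps register write.
    logic_ps = 300
    reg_ps = 20
    cycle_ps = logic_ps + reg_ps                # minimum clock cycle: 320 ps
    throughput_gops = 1000 / cycle_ps           # one operation per cycle, 1000 ps per ns
    print(cycle_ps, round(throughput_gops, 2))  # 320  3.12 (GOPS)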

3-Way Pipelined Version
[Figure: the same clocked system split into three 100 ps blocks of combinational logic (A, B, C), each followed by a 20 ps register. Latency = ?, Throughput = ?]
Divide the combinational logic into 3 blocks of 100 ps each
Can begin a new operation as soon as the previous one passes through stage A.
Begin a new operation every 120 ps
Overall latency increases
360 ps from start to finish
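Filling in the blanks on the slide, the same arithmetic for the 3-stage pipeline (Python sketch, illustrative only):

    # Each stage: 100 ps of logic plus a 20 ps pipeline register.
    stages = 3
    cycle_ps = 100 + 20                           # a new operation can start every 120 ps
    latency_ps = stages * cycle_ps                # 360 ps from start to finish of one operation
    throughput_gops = 1000 / cycle_ps             # ~8.33 GOPS, up from 3.12 GOPS unpipelined
    print(latency_ps, round(throughput_gops, 2))  # 360  8.33

Latency gets slightly worse (360 ps vs. 320 ps) because of the extra register delays, but throughput roughly triples.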

Pipeline Diagrams
Unpipelined
[Diagram: OP1, OP2, OP3 executed back to back along the time axis.]
Cannot start a new operation until the previous one completes
3-Way Pipelined
[Diagram: OP1, OP2, OP3 overlapped, each shifted by one stage (A, B, C) along the time axis.]
Up to 3 operations in process simultaneously

Operating a Pipeline
[Figure: the 3-stage pipeline (100 ps logic + 20 ps register per stage) driven by the clock, with snapshots at 239, 241, 300, and 359 ps. OP1, OP2, and OP3 enter the pipeline at 0, 120, and 240 ps respectively.]
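To make the clock-by-clock behaviour concrete, here is a small schedule sketch (Python, my own illustration; stage and operation names follow the figure):

    # Which stage each operation occupies in each 120 ps cycle of the 3-stage pipeline.
    stages = ["A", "B", "C"]
    ops = ["OP1", "OP2", "OP3"]
    for cycle in range(5):                    # t = 0, 120, 240, 360, 480 ps
        active = {op: stages[cycle - i] for i, op in enumerate(ops)
                  if 0 <= cycle - i < len(stages)}
        print(f"t = {cycle * 120} ps:", active)
    # At t = 240 ps: OP1 is in stage C, OP2 in B, OP3 in A -- three operations in flight at once.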

PIPE- Hardware
Pipeline registers hold intermediate values from instruction execution
Forward (Upward) Paths
Values passed from one stage to the next
Cannot jump past stages
e.g., valC passes through decode

Passing Data Across Stages
Example: rmmovl %edx, 0x1234(%esi)

Limitations: Nonuniform Delays
[Figure: a 3-stage pipeline with unbalanced stages of 50 ps, 150 ps (B), and 100 ps (C) of combinational logic, each followed by a 20 ps register. Delay = 510 ps, Throughput = 5.88 GOPS. OP1, OP2, OP3 shown overlapping in time.]
Throughput limited by the slowest stage
Other stages sit idle for much of the time
Challenging to partition the system into balanced stages
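Working out the numbers on the slide: the clock has to accommodate the slowest stage (Python sketch, illustrative only):

    stage_logic_ps = [50, 150, 100]               # unbalanced combinational delays
    reg_ps = 20
    cycle_ps = max(stage_logic_ps) + reg_ps       # 170 ps, set by the 150 ps stage
    latency_ps = len(stage_logic_ps) * cycle_ps   # 3 * 170 = 510 ps
    throughput_gops = 1000 / cycle_ps             # ~5.88 GOPS
    print(cycle_ps, latency_ps, round(throughput_gops, 2))  # 170  510  5.88

The 50 ps and 100 ps stages finish early and then wait, which is the idle time mentioned above.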

Limitations: Register Overhead
[Figure: a 6-stage pipeline, each stage 50 ps of combinational logic followed by a 20 ps register. Delay = 420 ps, Throughput = 14.29 GOPS.]
As we try to deepen the pipeline, the overhead of loading registers becomes more significant
Percentage of the clock cycle spent loading the register (worked out in the sketch below):
1-stage pipeline: 6.25%
3-stage pipeline: 16.67%
6-stage pipeline: 28.57%
The high speeds of modern processor designs are obtained through very deep pipelining
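Where those percentages come from (Python sketch, my own working; it splits the same 300 ps of logic over 1, 3, and 6 stages):

    reg_ps = 20
    total_logic_ps = 300                      # same total work, divided into more stages
    for stages in (1, 3, 6):
        cycle_ps = total_logic_ps / stages + reg_ps
        print(stages, f"{reg_ps / cycle_ps:.2%}")   # 1 -> 6.25%, 3 -> 16.67%, 6 -> 28.57%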

Pipeline Depths
Early MIPS (RISC in the early '80s): 5 stages
Sparc, PowerPC: 9 stages
Intel Pentium IV Prescott: 31 stages!
Intel i7 core: 14 stages

Instruction-Level Parallelism
Instruction-level parallelism: instructions that can run concurrently
Effective pipeline usage requires some parallelism
Some processors can even run instructions out of order
The program must have the same output as if instructions were executed sequentially.
Dependencies constrain parallelism
Compilers can reorder instructions to expose as much instruction-level parallelism as possible.
However, they cannot know every detail of the processor's pipeline (e.g., later Pentium IVs had more stages than earlier ones), so the pipeline must handle all dependencies correctly.

Sequential Consistency
Programming languages like C, C++, and Java are based on the sequential consistency model:
The effect of executing the program must be the same as if instructions were executed one by one in the order they are written.
If people were smarter and there was only one CPU implementation, we could go faster.
Let's learn why this might be hard.
(human pipeline example)

Dependencies
Types of dependencies:
Data dependencies (a small illustration follows this list):
Causal: A → B if B reads a value written by A.
Output: A → B if B writes to a location written by A.
Alias (anti): A → B if B writes to a location read by A.
Control dependencies:
Whether a branch is taken or not taken (jmp, jxx, call, ret).
When an instruction writes to instruction memory (self-modifying code).
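To keep the three data-dependency types apart, here is a tiny illustration using plain assignments (Python, my own example, not from the slides):

    x, y = 4, 5
    a = x + 1     # instruction A: writes a, reads x
    b = a * 2     # causal (read-after-write): reads the value A wrote to a
    a = y - 3     # output (write-after-write): writes the location A wrote
    x = 7         # alias/anti (write-after-read): writes a location A read

Reordering any of the last three lines ahead of instruction A would change the program's result, which is exactly why these dependencies constrain parallelism.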

Data Dependencies
[Figure: the unpipelined clocked system (combinational logic plus register); OP1, OP2, OP3 run back to back, each operation's input coming from the previous operation's result.]
Each operation depends on the result from the preceding one

Data Hazards
[Figure: the 3-stage pipeline (stages A, B, C); OP1 through OP4 overlap in time, and a later operation needs a result before the earlier one has left the pipeline.]
Result does not feed back around in time for the next operation
Pipelining has changed the behavior of the system

Data Dependencies in Processors

    irmovl $50, %eax
    addl %eax, %ebx
    mrmovl 100(%ebx), %edx

Result from one instruction used as an operand for another
Read-after-write (RAW) dependency
Very common in actual programs
Must make sure our pipeline handles these properly
Get correct results
Minimize performance impact

Control Hazards
Conditional branches.
Self-modifying code.

Learning Goals
Now that we understand how the sequential CPU works, let's talk about how to make it go faster.
This lecture will talk about the basic ideas behind pipelining, performance ramifications, and the challenges that result.
Slides for this unit are slightly modified versions of Bryant and O'Hallaron's Chapter 4 mini-course at http://www.cs.cmu.edu/afs/cs/academic/class/15349-s02/www/lectures.html
