

Principles of Pipelining
Andrew Warfield
CS313

Distractions
News flash: If you haven't started Assignment 1 yet, you may be in trouble.

Learning Goals
Now that we understand how the sequential CPU works, let's talk about how to make it go faster.
This lecture will talk about the basic ideas behind pipelining, performance ramifications, and the challenges that result.
Slides for this unit are slightly modified versions of Bryant and O'Hallaron's Chapter 4 mini-course at http://www.cs.cmu.edu/afs/cs/academic/class/15349-s02/www/lectures.html

Motivation
What's wrong with the sequential y86?

Real-world Pipelines: Car Washes
Sequential
Pipelined
Parallel

Processor Efficiency
How can we measure it?
Latency: the time for one operation to pass through the system, from start to finish.
Throughput: the number of operations completed per unit time.
Computational Example
[Figure: a clocked system with one 300 ps block of combinational logic followed by a 20 ps register. Delay = 320 ps, Throughput = 3.12 GOPS.]
Computation requires a total of 300 picoseconds
Additional 20 picoseconds to save the result in a register
Must have a clock cycle of at least 320 ps
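A quick check of where those numbers come from, as a minimal Python sketch (my own illustration, not part of the slides):

    # Unpipelined system: one 300 ps block of combinational logic plus a 20 ps register write.
    logic_ps = 300
    reg_ps = 20
    cycle_ps = logic_ps + reg_ps                # minimum clock cycle: 320 ps
    throughput_gops = 1000 / cycle_ps           # one operation per cycle, 1000 ps per ns
    print(cycle_ps, round(throughput_gops, 2))  # 320  3.12 (GOPS)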

3-Way Pipelined Version
[Figure: the same clocked system split into three 100 ps blocks of combinational logic (A, B, C), each followed by a 20 ps register. Latency = ?, Throughput = ?]
Divide the combinational logic into 3 blocks of 100 ps each
Can begin a new operation as soon as the previous one passes through stage A.
Begin a new operation every 120 ps
Overall latency increases
360 ps from start to finish
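Filling in the blanks on the slide, the same arithmetic for the 3-stage pipeline (Python sketch, illustrative only):

    # Each stage: 100 ps of logic plus a 20 ps pipeline register.
    stages = 3
    cycle_ps = 100 + 20                           # a new operation can start every 120 ps
    latency_ps = stages * cycle_ps                # 360 ps from start to finish of one operation
    throughput_gops = 1000 / cycle_ps             # ~8.33 GOPS, up from 3.12 GOPS unpipelined
    print(latency_ps, round(throughput_gops, 2))  # 360  8.33

Latency gets slightly worse (360 ps vs. 320 ps) because of the extra register delays, but throughput roughly triples.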

Pipeline Diagrams
Unpipelined
[Diagram: OP1, OP2, OP3 executed back to back along the time axis.]
Cannot start a new operation until the previous one completes
3-Way Pipelined
[Diagram: OP1, OP2, OP3 overlapped, each shifted by one stage (A, B, C) along the time axis.]
Up to 3 operations in process simultaneously

Operating a Pipeline
[Figure: the 3-stage pipeline (100 ps logic + 20 ps register per stage) driven by the clock, with snapshots at 239, 241, 300, and 359 ps. OP1, OP2, and OP3 enter the pipeline at 0, 120, and 240 ps respectively.]
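To make the clock-by-clock behaviour concrete, here is a small schedule sketch (Python, my own illustration; stage and operation names follow the figure):

    # Which stage each operation occupies in each 120 ps cycle of the 3-stage pipeline.
    stages = ["A", "B", "C"]
    ops = ["OP1", "OP2", "OP3"]
    for cycle in range(5):                    # t = 0, 120, 240, 360, 480 ps
        active = {op: stages[cycle - i] for i, op in enumerate(ops)
                  if 0 <= cycle - i < len(stages)}
        print(f"t = {cycle * 120} ps:", active)
    # At t = 240 ps: OP1 is in stage C, OP2 in B, OP3 in A -- three operations in flight at once.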

PIPE- Hardware
Pipeline registers hold intermediate values from instruction execution
Forward (Upward) Paths
Values passed from one stage to the next
Cannot jump past stages
e.g., valC passes through decode

Passing Data Across Stages
Example: rmmovl %edx, 0x1234(%esi)

Limitations: Nonuniform Delays
[Figure: a 3-stage pipeline with unbalanced stages of 50 ps, 150 ps (B), and 100 ps (C) of combinational logic, each followed by a 20 ps register. Delay = 510 ps, Throughput = 5.88 GOPS. OP1, OP2, OP3 shown overlapping in time.]
Throughput limited by the slowest stage
Other stages sit idle for much of the time
Challenging to partition the system into balanced stages
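Working out the numbers on the slide: the clock has to accommodate the slowest stage (Python sketch, illustrative only):

    stage_logic_ps = [50, 150, 100]               # unbalanced combinational delays
    reg_ps = 20
    cycle_ps = max(stage_logic_ps) + reg_ps       # 170 ps, set by the 150 ps stage
    latency_ps = len(stage_logic_ps) * cycle_ps   # 3 * 170 = 510 ps
    throughput_gops = 1000 / cycle_ps             # ~5.88 GOPS
    print(cycle_ps, latency_ps, round(throughput_gops, 2))  # 170  510  5.88

The 50 ps and 100 ps stages finish early and then wait, which is the idle time mentioned above.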

Limitations: Register Overhead
[Figure: a 6-stage pipeline, each stage 50 ps of combinational logic followed by a 20 ps register. Delay = 420 ps, Throughput = 14.29 GOPS.]
As we try to deepen the pipeline, the overhead of loading registers becomes more significant
Percentage of the clock cycle spent loading the register (worked out in the sketch below):
1-stage pipeline: 6.25%
3-stage pipeline: 16.67%
6-stage pipeline: 28.57%
The high speeds of modern processor designs are obtained through very deep pipelining
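Where those percentages come from (Python sketch, my own working; it splits the same 300 ps of logic over 1, 3, and 6 stages):

    reg_ps = 20
    total_logic_ps = 300                      # same total work, divided into more stages
    for stages in (1, 3, 6):
        cycle_ps = total_logic_ps / stages + reg_ps
        print(stages, f"{reg_ps / cycle_ps:.2%}")   # 1 -> 6.25%, 3 -> 16.67%, 6 -> 28.57%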

Pipeline Depths
Early MIPS (RISC in the early '80s): 5 stages
Sparc, PowerPC: 9 stages
Intel Pentium IV Prescott: 31 stages!
Intel i7 core: 14 stages

Instruction-Level Parallelism
Instruction-level parallelism: instructions that can run concurrently
Effective pipeline usage requires some parallelism
Some processors can even run instructions out of order
The program must have the same output as if instructions were executed sequentially.
Dependencies constrain parallelism
Compilers can reorder instructions to expose as much instruction-level parallelism as possible.
However, they cannot know every detail of the processor's pipeline (e.g., later Pentium IVs had more stages than earlier ones), so the pipeline must handle all dependencies correctly.

Sequential Consistency
Programming languages like C, C++, and Java are based on the sequential consistency model:
The effect of executing the program must be the same as if instructions were executed one by one in the order they are written.
If people were smarter and there was only one CPU implementation, we could go faster.
Let's learn why this might be hard.
(human pipeline example)

Dependencies
Types of dependencies:
Data dependencies (a small illustration follows this list):
Causal: A → B if B reads a value written by A.
Output: A → B if B writes to a location written by A.
Alias (anti): A → B if B writes to a location read by A.
Control dependencies:
Whether a branch is taken or not taken (jmp, jxx, call, ret).
When an instruction writes to instruction memory (self-modifying code).
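To keep the three data-dependency types apart, here is a tiny illustration using plain assignments (Python, my own example, not from the slides):

    x, y = 4, 5
    a = x + 1     # instruction A: writes a, reads x
    b = a * 2     # causal (read-after-write): reads the value A wrote to a
    a = y - 3     # output (write-after-write): writes the location A wrote
    x = 7         # alias/anti (write-after-read): writes a location A read

Reordering any of the last three lines ahead of instruction A would change the program's result, which is exactly why these dependencies constrain parallelism.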

Data Dependencies
[Figure: the unpipelined clocked system (combinational logic plus register); OP1, OP2, OP3 run back to back, each operation's input coming from the previous operation's result.]
Each operation depends on the result from the preceding one

Data Hazards
[Figure: the 3-stage pipeline (stages A, B, C); OP1 through OP4 overlap in time, and a later operation needs a result before the earlier one has left the pipeline.]
Result does not feed back around in time for the next operation
Pipelining has changed the behavior of the system

Data Dependencies in Processors

    irmovl $50, %eax
    addl %eax, %ebx
    mrmovl 100(%ebx), %edx

Result from one instruction used as an operand for another
Read-after-write (RAW) dependency
Very common in actual programs
Must make sure our pipeline handles these properly
Get correct results
Minimize performance impact

Control Hazards
Conditional branches.
Self-modifying code.

Learning Goals
Now that we understand how the sequential CPU works, let's talk about how to make it go faster.
This lecture will talk about the basic ideas behind pipelining, performance ramifications, and the challenges that result.
Slides for this unit are slightly modified versions of Bryant and O'Hallaron's Chapter 4 mini-course at http://www.cs.cmu.edu/afs/cs/academic/class/15349-s02/www/lectures.html
