

The ISA is a contract between hardware and software. It defines an interface that separates what is done statically at compile time from what is done dynamically at run time. This interface is called the DSI (Dynamic-Static Interface).

Dynamic-Static Interface

There is a semantic gap between software and hardware; the placement of the DSI determines how that gap is bridged.

Dynamic-Static Interface

A low-level DSI exposes more knowledge of the hardware through the ISA

Optimized code becomes specific to the implementation

Places a greater burden on the compiler/programmer

In fact, this also happens for a higher-level DSI

SIMD and VLIW identify independent operations

and place them in the same instruction word

Dependence vs. Independence Architecture

Perhaps it is more important to identify dependent operations

It is easier for dynamic hardware to collocate dependent instructions (e.g., accumulator-based ISAs)


Defining Performance
What is important to whom?

Computer system user: minimize elapsed time for a program = time_end - time_start. Called response time.

Computer center manager: maximize completion rate = #jobs/second. Called throughput.

Response Time vs. Throughput

Is throughput = 1 / average response time?
Only if there is NO overlap. Otherwise, throughput > 1 / average response time.

Improve Performance
Improve (a) response time or (b) throughput?

A faster CPU helps both (a) and (b).

Adding more CPUs helps (b), and perhaps (a) due to less queuing.

Iron Law

Processor Performance = 1 / (Time / Program)

Time / Program = (Instructions / Program) x (Cycles / Instruction) x (Time / Cycle)
               = code size x CPI x cycle time

Architecture --> Implementation --> Realization

Compiler Designer Processor Designer Chip Designer
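The Iron Law is easy to check numerically. A minimal sketch in Python (the program and machine parameters below are hypothetical, chosen only for illustration):

```python
def execution_time(instructions, cpi, cycle_time):
    """Iron Law: Time/Program = (Instructions/Program) * CPI * (Time/Cycle)."""
    return instructions * cpi * cycle_time

def performance(instructions, cpi, cycle_time):
    """Processor performance is the reciprocal of execution time."""
    return 1 / execution_time(instructions, cpi, cycle_time)

# Hypothetical program: 1e9 instructions, CPI = 1.5, 1 GHz clock (1 ns cycle)
print(execution_time(1e9, 1.5, 1e-9))  # ~1.5 seconds
```

The three factors map onto the three designers above: the compiler designer influences code size, the processor designer CPI, and the chip designer the cycle time.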

CPI ..?
What is the CPI for a CISC machine? For a RISC machine? For a pipelined machine?

With pipelining ..
CPI can be brought down to 1 by overlapping instructions. This is the best figure for a simple pipeline: a CPI of 1, or inversely, an IPC (instructions per cycle) of 1.

With superscalar..
The aim is to reduce CPI to < 1, or inversely, to increase IPC to > 1.

For some program running on machine X, PerformanceX = 1 / Execution timeX. "X is n times faster than Y" means:

n = PerformanceX / PerformanceY = Execution timeY / Execution timeX

Problem: Machine A runs a program in 10 seconds and machine B in 15 seconds. How much faster is A than B?

Answer: n = PerformanceA / PerformanceB = Execution timeB/Execution timeA = 15/10 = 1.5 A is 1.5 times faster than B.
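The worked problem above can be sketched directly (a trivial helper, names are illustrative):

```python
def n_times_faster(exec_time_x, exec_time_y):
    """n = Performance_X / Performance_Y = ExecutionTime_Y / ExecutionTime_X."""
    return exec_time_y / exec_time_x

# Machine A runs the program in 10 s, machine B in 15 s
print(n_times_faster(10, 15))  # 1.5 -> A is 1.5 times faster than B
```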

Performance is best determined by running a real application

Use programs typical of the expected workload, or typical of the expected class of applications, e.g., compilers/editors, scientific applications, graphics, etc.

Small benchmarks: nice for architects and designers, easy to standardize, but can be abused

SPEC (System Performance Evaluation Cooperative): companies have agreed on a set of real programs and inputs; can still be abused, but a valuable indicator of performance (and compiler technology)

Version 1: Execution Time After Improvement = Execution Time Unaffected + (Execution Time Affected / Amount of Improvement)

Version 2: Speedup = Performance after improvement / Performance before improvement = Execution time before improvement / Execution time after improvement

Before: execution time = n + a (n = unaffected time, a = affected time)
After: execution time = n + a/p (p = amount of improvement)

Speedup = (n + a) / (n + a/p)

Amdahl's Law
Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?

100/4 = 80/n + 20, so 5 = 80/n and n = 80/5 = 16.
i.e., to improve overall performance by a factor of 4, the multiplication part has to be enhanced 16 times.

A benchmark program spends half of the time executing floating point instructions. We improve the performance of the floating point unit by a factor of four. What is the speedup?

Time before: 10 (assumed)

Time after = 5 + 5/4 = 6.25. Speedup = 10/6.25 = 1.6
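Both worked examples follow from Version 1 of the formula; a short sketch:

```python
def time_after(unaffected, affected, improvement):
    """Version 1: new time = unaffected part + affected part / improvement."""
    return unaffected + affected / improvement

# Multiply example: 100 s total, 80 s in multiply, multiply improved 16x
print(100 / time_after(20, 80, 16))  # 4.0

# Floating-point example: 10 s total, 5 s in FP, FP unit improved 4x
print(10 / time_after(5, 5, 4))      # 1.6
```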

Amdahl's Law

[Figure: execution profile on N processors — serial fraction h = 1 - f, vectorizable fraction f, versus time]

h = fraction of time in serial code
f = fraction that is vectorizable
v = speedup for f

Overall speedup: Speedup = 1 / ((1 - f) + f/v)
If N is the number of processors used for the parallel computations, Speedup = 1/[(1-f) + (f/N)]. This simply means that even if we increase the number of processors, the speedup is limited by the non-vectorizable part of the program.

Consider that half the program is vectorizable. Speedup = 1/(0.5 + 0.5/N). Find the speedup for various values of N: N = 1, 2, 4, 10, 100.

N=1, SU=1; N=2, SU=1.33; N=4, SU=1.6; N=10, SU=1.82; N=100, SU=1.98; N=∞, SU=2.
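The values above can be reproduced with the formula (a minimal sketch):

```python
def amdahl_speedup(f, n):
    """Speedup = 1 / ((1 - f) + f/n) for vectorizable fraction f on n processors."""
    return 1 / ((1 - f) + f / n)

for n in (1, 2, 4, 10, 100):
    print(n, round(amdahl_speedup(0.5, n), 2))
# The limit as n grows without bound is 1 / (1 - f) = 2
```

Even at N = 100 the speedup is only 1.98, illustrating how quickly the serial half dominates.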

Revisit Amdahl's Law

lim_{v→∞} 1/((1-f) + f/v) = 1/(1-f)  — the sequential bottleneck

Even if v is infinite, performance is limited by the non-vectorizable portion (1-f).

[Figure: execution profile on N processors — serial fraction h = 1 - f, vectorizable fraction f, versus time]


Amdahl's Law
A parallel application cannot run faster than the sum of its sequential parts!

Parallelization ideally yields T = Ts + Tp, where Ts = Ts1 + Ts2 + Ts3 + Ts4 and Tp = Tp1 + Tp2 + Tp3.

Pipelined Performance Model

[Figure: pipeline utilization profile — pipeline depth N; filled fraction g, stalled fraction 1-g]

g = fraction of time pipeline is filled
1-g = fraction of time pipeline is not filled (stalled)



This profile is similar to the parallel-processor profile we have seen before.

1-g is the fraction of time the pipeline is stalled. Speedup = 1/[(1-g) + (g/N)]. As g drops even slightly from 100%, the speedup drops off quickly.

We can borrow from the parallel-processor model to interpret pipeline effects; g is now the fraction of time the pipeline is full.

This simply means that the performance gain obtained from pipelining is strongly degraded by even a small number of stall cycles.
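This sensitivity is easy to see numerically; a small sketch, assuming a hypothetical pipeline depth of N = 8:

```python
def pipeline_speedup(g, n):
    """Speedup = 1 / ((1 - g) + g/n): g = fraction of time the pipeline is filled."""
    return 1 / ((1 - g) + g / n)

# Hypothetical pipeline depth N = 8: a few percent of stall time costs a lot
for g in (1.0, 0.99, 0.95, 0.90):
    print(g, round(pipeline_speedup(g, 8), 2))
```

Dropping g from 100% to 95% already cuts the speedup from 8 to about 5.9.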


Tyranny of Amdahl's Law [Bob Colwell]


When g is even slightly below 100%, a big performance drop results. Stall cycles are the key adversary and must be minimized as much as possible.

Stall cycles...
...constitute the sequential bottleneck for pipelined processors. When a pipeline is stalled, there is only one instruction in the pipeline; no overlapping of instructions occurs. Thus the pipeline is stalled for N cycles.

The superscalar proposal

S = 1/[(1-f) + (f/N)]: machine parallelism is measured by N, program parallelism (i.e., vectorizability) by f. This formulation is influenced by traditional computers, which have a scalar unit and a vector unit.

For N = 6, the speedup is S = 1/[(1-f) + f/6].

If the sequential part can be parallelized by a factor of 2, S = 1/[((1-f)/2) + f/6] (see the curve in dotted lines).

Motivation for Superscalar [Agerwala and Cocke]

Speedup jumps from 3 to 4.3 for N = 6 and f = 0.8 when s = 2 instead of s = 1 (scalar).


The goal of superscalar processors is to allow generalized instruction-level parallelism and thus achieve speedup even for programs that are not vectorizable.

Superscalar Proposal
Moderate the tyranny of Amdahl's Law:
ease the sequential bottleneck; more generally applicable; robust (less sensitive to f).

Revised Amdahl's Law:

Speedup = 1 / ((1-f)/s + f/v)

If, in the previous equation, v = N = 6, s = 2, and f = 0.5, the speedup is 3.
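The revised law, including the scalar-unit speedup s, can be sketched as:

```python
def revised_speedup(f, v, s=1.0):
    """Revised Amdahl's Law: Speedup = 1 / ((1 - f)/s + f/v).
    f = vectorizable fraction, v = vector speedup, s = scalar speedup."""
    return 1 / ((1 - f) / s + f / v)

print(round(revised_speedup(0.8, 6), 1))       # 3.0 (plain scalar unit, s = 1)
print(round(revised_speedup(0.8, 6, s=2), 1))  # 4.3
print(round(revised_speedup(0.5, 6, s=2), 1))  # 3.0
```

The first two lines reproduce the Agerwala-Cocke jump from 3 to 4.3; the third is the example just above.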

Instruction-level parallelism (ILP) is defined as the aggregate degree of parallelism (measured as the number of instructions) that can be achieved by the concurrent execution of multiple instructions.

Limits on Instruction Level Parallelism (ILP)

Weiss and Smith [1984]: 1.58
Sohi and Vajapeyam [1987]: 1.81
Tjaden and Flynn [1970]: 1.86 (Flynn's bottleneck)
Tjaden and Flynn [1973]: 1.96
Uht [1986]: 2.00
Smith et al. [1989]: 2.00
Jouppi and Wall [1988]: 2.40
Johnson [1991]: 2.50
Acosta et al. [1986]: 2.79
Wedig [1982]: 3.00
Butler et al. [1991]: 5.8
Melvin and Patt [1991]: 6
Wall [1991]: 7 (Jouppi disagreed)
Kuck et al. [1972]: 8
Riseman and Foster [1972]: 51 (no control dependences)
Nicolau and Fisher [1984]: 90 (Fisher's optimism)

Superscalar Proposal
Go beyond the single-instruction pipeline and achieve IPC > 1
Dispatch multiple instructions per cycle
Provide a more generally applicable form of concurrency (not just vectors)
Geared for sequential code that is hard to parallelize otherwise
Exploit fine-grained or instruction-level parallelism (ILP)

Terms related to ILP

Machines used for exploiting ILP are uniprocessors, with parallelism obtained from certain techniques. Key terms:
i) OL - Operation Latency
ii) MP - Machine Parallelism
iii) IL - Issue Latency
iv) IP - Issue Parallelism

Operation Latency
The number of machine cycles until the result of an instruction is available for use by a subsequent instruction. For a reference instruction, OL is the number of machine cycles required for its execution.

MP - Machine Parallelism: the maximum number of simultaneously executing instructions the machine can support. This is the maximum number of instructions that can be in the pipeline at any one time.

IL - Issue Latency
is the number of machine cycles required between issuing two consecutive instructions. Issuing means initiating a new instruction into the pipeline.

IP - Issue Parallelism
is the maximum number of instructions that can be issued in every machine cycle.

Classifying ILP Machines

[Jouppi, DECWRL 1991] Baseline scalar RISC
Issue parallelism = IP = 1
Operation latency = OP = 1
Peak IPC = 1

[Pipeline timing diagram: successive instructions proceed through IF, DE, EX, WB, one issued per cycle]



Classifying ILP Machines

[Jouppi, DECWRL 1991] Superpipelined: cycle time = 1/m of baseline
Issue parallelism = IP = 1 instruction / minor cycle
Operation latency = OP = m minor cycles
Peak IPC = m instructions / major cycle (m x speedup?)

Superpipelined machine

[Pipeline timing diagram: IF, DE, EX, WB stages subdivided into minor cycles; one instruction issued per minor cycle]

Classifying ILP Machines

[Jouppi, DECWRL 1991] Superscalar:
Issue parallelism = IP = n instructions / cycle
Operation latency = OP = 1 cycle
Peak IPC = n instructions / cycle (n x speedup?)

[Pipeline timing diagram: n instructions enter IF, DE, EX, WB together each cycle]

Classifying ILP Machines

[Jouppi, DECWRL 1991] VLIW: Very Long Instruction Word
Issue parallelism = IP = n instructions / cycle
Operation latency = OP = 1 cycle
Peak IPC = n instructions / cycle = 1 VLIW / cycle




Classifying ILP Machines

[Jouppi, DECWRL 1991] Superpipelined-Superscalar
Issue parallelism = IP = n instructions / minor cycle
Operation latency = OP = m minor cycles
Peak IPC = n x m instructions / major cycle

[Pipeline timing diagram: n instructions issued per minor cycle through IF, DE, EX, WB]
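Jouppi's classes differ only in their peak issue rates per base-machine (major) cycle; a toy sketch (the n and m values are illustrative, not from the source):

```python
def peak_ipc(n, m=1):
    """Peak IPC per major cycle: n instructions per (minor) cycle,
    with m minor cycles per major cycle -> n x m."""
    return n * m

print(peak_ipc(1))     # baseline scalar RISC: 1
print(peak_ipc(1, 3))  # superpipelined with m = 3: 3
print(peak_ipc(4))     # superscalar (or VLIW) with n = 4: 4
print(peak_ipc(4, 3))  # superpipelined-superscalar: 12
```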

Superscalar vs. Superpipelined

Roughly equivalent performance: if n = m, then both have about the same IPC.
Key difference: parallelism exposed in space (superscalar) vs. in time (superpipelined).

[Figure: overlapped Fetch, Decode, Execute, Writeback stages; horizontal axis is time in cycles of the base machine]





Superpipelining is a new and special term meaning pipelining. The prefix is attached to increase the probability of funding for research proposals. There is no theoretical basis distinguishing superpipelining from pipelining. The etymology of the term is probably similar to the derivation of the now-common terms methodology and functionality as pompous substitutes for method and function. The novelty of the term superpipelining lies in its reliance on a prefix rather than a suffix for the pompous extension of the root word.

- Nick Tredennick, 1991

Superpipelining: Hype vs. Reality

Superpipelining [Jouppi, 1989] essentially describes a pipelined execution stage.

Underpipelined machines cannot issue instructions as fast as they are executed; Jouppi's base machine is an underpipelined machine.

Note: the key characteristic of superpipelined machines is that results are not available to the M-1 successive instructions.

[Figure: pipeline timing diagrams for Jouppi's base machine, an underpipelined machine, and a superpipelined machine]