

The ISA is a contract between hardware and software. It defines an interface that separates what is done statically at compile time from what is done dynamically at run time. This interface is called the DSI (Dynamic-Static Interface).

Dynamic-Static Interface

There is a semantic gap between software and hardware; the placement of the DSI determines how that gap is bridged.

Dynamic-Static Interface

A low-level DSI exposes more knowledge of the hardware through the ISA

Optimized code becomes specific to the implementation

Places a greater burden on the compiler/programmer

In fact, this also happens for a higher-level DSI

SIMD and VLIW identify independent operations

and place them in the same instruction word

Dependence vs. Independence Architecture

Perhaps it is more important to identify dependent operations

It is easier for dynamic hardware to collocate dependent instructions (e.g., accumulator-based ISAs)


Defining Performance
What is important to whom?

Computer system user: minimize elapsed time for a program = time_end - time_start. Called response time.

Computer center manager: maximize completion rate = #jobs/second. Called throughput.

Response Time vs. Throughput

Is throughput = 1 / average response time?
Only if there is NO overlap. Otherwise, throughput > 1 / average response time.

Improve Performance
Improve (a) response time or (b) throughput?

A faster CPU helps both (a) and (b).

Adding more CPUs helps (b), and perhaps (a) due to less queuing.

Iron Law

Processor Performance = 1 / (Time / Program)

Time / Program = (Instructions / Program) x (Cycles / Instruction) x (Time / Cycle)
               = code size x CPI x cycle time

Architecture --> Implementation --> Realization

Compiler Designer Processor Designer Chip Designer
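The Iron Law is easy to check numerically. A minimal sketch in Python (the program and machine parameters below are hypothetical, chosen only for illustration):

```python
def execution_time(instructions, cpi, cycle_time):
    """Iron Law: Time/Program = (Instructions/Program) * CPI * (Time/Cycle)."""
    return instructions * cpi * cycle_time

def performance(instructions, cpi, cycle_time):
    """Processor performance is the reciprocal of execution time."""
    return 1 / execution_time(instructions, cpi, cycle_time)

# Hypothetical program: 1e9 instructions, CPI = 1.5, 1 GHz clock (1 ns cycle)
print(execution_time(1e9, 1.5, 1e-9))  # ~1.5 seconds
```

The three factors map onto the three designers above: the compiler designer influences code size, the processor designer CPI, and the chip designer the cycle time.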

CPI ..?
What is the CPI for a CISC machine? For a RISC machine? For a pipelined machine?

With pipelining ..
CPI can be brought down to 1 by overlapping instructions. This is the best figure for a simple pipeline: a CPI of 1, or inversely, an IPC (instructions per cycle) of 1.

With superscalar..
The aim is to reduce CPI to < 1, or inversely, to increase IPC to > 1.

For some program running on machine X, PerformanceX = 1 / Execution timeX. "X is n times faster than Y" means:

n = PerformanceX / PerformanceY = Execution timeY / Execution timeX

Problem: Machine A runs a program in 10 seconds and machine B in 15 seconds. How much faster is A than B?

Answer: n = PerformanceA / PerformanceB = Execution timeB/Execution timeA = 15/10 = 1.5 A is 1.5 times faster than B.
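The worked problem above can be sketched directly (a trivial helper, names are illustrative):

```python
def n_times_faster(exec_time_x, exec_time_y):
    """n = Performance_X / Performance_Y = ExecutionTime_Y / ExecutionTime_X."""
    return exec_time_y / exec_time_x

# Machine A runs the program in 10 s, machine B in 15 s
print(n_times_faster(10, 15))  # 1.5 -> A is 1.5 times faster than B
```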

Performance is best determined by running a real application

Use programs typical of the expected workload, or typical of the expected class of applications, e.g., compilers/editors, scientific applications, graphics, etc.

Small benchmarks: nice for architects and designers, easy to standardize, but can be abused

SPEC (System Performance Evaluation Cooperative): companies have agreed on a set of real programs and inputs; can still be abused, but a valuable indicator of performance (and compiler technology)

Version 1: Execution Time After Improvement = Execution Time Unaffected + (Execution Time Affected / Amount of Improvement)

Version 2: Speedup = Performance after improvement / Performance before improvement = Execution time before improvement / Execution time after improvement

Before: execution time = n + a (n = unaffected time, a = affected time)
After: execution time = n + a/p (p = amount of improvement)

Speedup = (n + a) / (n + a/p)

Amdahl's Law
Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?

100/4 = 80/n + 20, so 5 = 80/n and n = 80/5 = 16.
i.e., to improve overall performance by a factor of 4, the multiplication part has to be enhanced 16 times.

A benchmark program spends half of the time executing floating point instructions. We improve the performance of the floating point unit by a factor of four. What is the speedup?

Time before: 10 (assumed)

Time after = 5 + 5/4 = 6.25. Speedup = 10/6.25 = 1.6
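Both worked examples follow from Version 1 of the formula; a short sketch:

```python
def time_after(unaffected, affected, improvement):
    """Version 1: new time = unaffected part + affected part / improvement."""
    return unaffected + affected / improvement

# Multiply example: 100 s total, 80 s in multiply, multiply improved 16x
print(100 / time_after(20, 80, 16))  # 4.0

# Floating-point example: 10 s total, 5 s in FP, FP unit improved 4x
print(10 / time_after(5, 5, 4))      # 1.6
```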

Amdahl's Law

[Figure: execution profile on N processors — serial fraction h = 1 - f, vectorizable fraction f, versus time]

h = fraction of time in serial code
f = fraction that is vectorizable
v = speedup for f

Overall speedup: Speedup = 1 / ((1 - f) + f/v)
If N is the number of processors used for the parallel computations, Speedup = 1/[(1-f) + (f/N)]. This simply means that even if we increase the number of processors, the speedup is limited by the non-vectorizable part of the program.

Consider that half the program is vectorizable. Speedup = 1/(0.5 + 0.5/N). Find the speedup for various values of N: N = 1, 2, 4, 10, 100.

N=1, SU=1; N=2, SU=1.33; N=4, SU=1.6; N=10, SU=1.82; N=100, SU=1.98; N=∞, SU=2.
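The values above can be reproduced with the formula (a minimal sketch):

```python
def amdahl_speedup(f, n):
    """Speedup = 1 / ((1 - f) + f/n) for vectorizable fraction f on n processors."""
    return 1 / ((1 - f) + f / n)

for n in (1, 2, 4, 10, 100):
    print(n, round(amdahl_speedup(0.5, n), 2))
# The limit as n grows without bound is 1 / (1 - f) = 2
```

Even at N = 100 the speedup is only 1.98, illustrating how quickly the serial half dominates.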

Revisit Amdahl's Law

lim_{v→∞} 1/((1-f) + f/v) = 1/(1-f)  — the sequential bottleneck

Even if v is infinite, performance is limited by the non-vectorizable portion (1-f).

[Figure: execution profile on N processors — serial fraction h = 1 - f, vectorizable fraction f, versus time]


Amdahl's Law
A parallel application cannot run faster than the sum of its sequential parts!

Parallelization ideally yields T = Ts + Tp, where Ts = Ts1 + Ts2 + Ts3 + Ts4 and Tp = Tp1 + Tp2 + Tp3.

Pipelined Performance Model

[Figure: pipeline utilization profile — pipeline depth N; filled fraction g, stalled fraction 1-g]

g = fraction of time pipeline is filled
1-g = fraction of time pipeline is not filled (stalled)



This profile is similar to the parallel-processor profile we have seen before.

1-g is the fraction of time the pipeline is stalled. Speedup = 1/[(1-g) + (g/N)]. As g drops even slightly from 100%, the speedup drops off quickly.

We can borrow from the parallel-processor model to interpret pipeline effects; g is now the fraction of time the pipeline is full.

This simply means that the performance gain obtained from pipelining is strongly degraded by even a small number of stall cycles.
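This sensitivity is easy to see numerically; a small sketch, assuming a hypothetical pipeline depth of N = 8:

```python
def pipeline_speedup(g, n):
    """Speedup = 1 / ((1 - g) + g/n): g = fraction of time the pipeline is filled."""
    return 1 / ((1 - g) + g / n)

# Hypothetical pipeline depth N = 8: a few percent of stall time costs a lot
for g in (1.0, 0.99, 0.95, 0.90):
    print(g, round(pipeline_speedup(g, 8), 2))
```

Dropping g from 100% to 95% already cuts the speedup from 8 to about 5.9.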


Tyranny of Amdahl's Law [Bob Colwell]


When g is even slightly below 100%, a big performance drop results. Stall cycles are the key adversary and must be minimized as much as possible.

Stall cycles...
...constitute the sequential bottleneck for pipelined processors. When a pipeline is stalled, there is only one instruction in the pipeline; no overlapping of instructions occurs. Thus the pipeline is stalled for N cycles.

The superscalar proposal

S = 1/[(1-f) + (f/N)]: machine parallelism is measured by N, program parallelism (i.e., vectorizability) by f. This formulation is influenced by traditional computers, which have a scalar unit and a vector unit.

For N = 6, the speedup is S = 1/[(1-f) + f/6].

If the sequential part can be parallelized by a factor of 2, S = 1/[((1-f)/2) + f/6] (see the curve in dotted lines).

Motivation for Superscalar [Agerwala and Cocke]

Speedup jumps from 3 to 4.3 for N = 6 and f = 0.8 when s = 2 instead of s = 1 (scalar).


The goal of superscalar processors is to allow generalized instruction-level parallelism and thus achieve speedup even for programs that are not vectorizable.

Superscalar Proposal
Moderate the tyranny of Amdahl's Law:
ease the sequential bottleneck; more generally applicable; robust (less sensitive to f).

Revised Amdahl's Law:

Speedup = 1 / ((1-f)/s + f/v)

If, in the previous equation, v = N = 6, s = 2, and f = 0.5, the speedup is 3.
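The revised law, including the scalar-unit speedup s, can be sketched as:

```python
def revised_speedup(f, v, s=1.0):
    """Revised Amdahl's Law: Speedup = 1 / ((1 - f)/s + f/v).
    f = vectorizable fraction, v = vector speedup, s = scalar speedup."""
    return 1 / ((1 - f) / s + f / v)

print(round(revised_speedup(0.8, 6), 1))       # 3.0 (plain scalar unit, s = 1)
print(round(revised_speedup(0.8, 6, s=2), 1))  # 4.3
print(round(revised_speedup(0.5, 6, s=2), 1))  # 3.0
```

The first two lines reproduce the Agerwala-Cocke jump from 3 to 4.3; the third is the example just above.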

Instruction-level parallelism (ILP) is defined as the aggregate degree of parallelism (measured as the number of instructions) that can be achieved by the concurrent execution of multiple instructions.

Limits on Instruction Level Parallelism (ILP)

Weiss and Smith [1984]: 1.58
Sohi and Vajapeyam [1987]: 1.81
Tjaden and Flynn [1970]: 1.86 (Flynn's bottleneck)
Tjaden and Flynn [1973]: 1.96
Uht [1986]: 2.00
Smith et al. [1989]: 2.00
Jouppi and Wall [1988]: 2.40
Johnson [1991]: 2.50
Acosta et al. [1986]: 2.79
Wedig [1982]: 3.00
Butler et al. [1991]: 5.8
Melvin and Patt [1991]: 6
Wall [1991]: 7 (Jouppi disagreed)
Kuck et al. [1972]: 8
Riseman and Foster [1972]: 51 (no control dependences)
Nicolau and Fisher [1984]: 90 (Fisher's optimism)

Superscalar Proposal
Go beyond the single-instruction pipeline and achieve IPC > 1
Dispatch multiple instructions per cycle
Provide a more generally applicable form of concurrency (not just vectors)
Geared for sequential code that is hard to parallelize otherwise
Exploit fine-grained or instruction-level parallelism (ILP)

Terms related to ILP

Machines used for exploiting ILP are uniprocessors, with parallelism obtained from certain techniques. Key terms:
i) OL - Operation Latency
ii) MP - Machine Parallelism
iii) IL - Issue Latency
iv) IP - Issue Parallelism

Operation Latency
The number of machine cycles until the result of an instruction is available for use by a subsequent instruction. For a reference instruction, OL is the number of machine cycles required for its execution.

MP - Machine Parallelism: the maximum number of simultaneously executing instructions the machine can support. This is the maximum number of instructions that can be in the pipeline at any one time.

IL - Issue Latency
is the number of machine cycles required between issuing two consecutive instructions. Issuing means initiating a new instruction into the pipeline.

IP - Issue Parallelism
is the maximum number of instructions that can be issued in every machine cycle.

Classifying ILP Machines

[Jouppi, DECWRL 1991] Baseline scalar RISC
Issue parallelism = IP = 1
Operation latency = OP = 1
Peak IPC = 1

[Pipeline timing diagram: successive instructions proceed through IF, DE, EX, WB, one issued per cycle]



Classifying ILP Machines

[Jouppi, DECWRL 1991] Superpipelined: cycle time = 1/m of baseline
Issue parallelism = IP = 1 instruction / minor cycle
Operation latency = OP = m minor cycles
Peak IPC = m instructions / major cycle (m x speedup?)

Superpipelined machine

[Pipeline timing diagram: IF, DE, EX, WB stages subdivided into minor cycles; one instruction issued per minor cycle]

Classifying ILP Machines

[Jouppi, DECWRL 1991] Superscalar:
Issue parallelism = IP = n instructions / cycle
Operation latency = OP = 1 cycle
Peak IPC = n instructions / cycle (n x speedup?)

[Pipeline timing diagram: n instructions enter IF, DE, EX, WB together each cycle]

Classifying ILP Machines

[Jouppi, DECWRL 1991] VLIW: Very Long Instruction Word
Issue parallelism = IP = n instructions / cycle
Operation latency = OP = 1 cycle
Peak IPC = n instructions / cycle = 1 VLIW / cycle




Classifying ILP Machines

[Jouppi, DECWRL 1991] Superpipelined-Superscalar
Issue parallelism = IP = n instructions / minor cycle
Operation latency = OP = m minor cycles
Peak IPC = n x m instructions / major cycle

[Pipeline timing diagram: n instructions issued per minor cycle through IF, DE, EX, WB]
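Jouppi's classes differ only in their peak issue rates per base-machine (major) cycle; a toy sketch (the n and m values are illustrative, not from the source):

```python
def peak_ipc(n, m=1):
    """Peak IPC per major cycle: n instructions per (minor) cycle,
    with m minor cycles per major cycle -> n x m."""
    return n * m

print(peak_ipc(1))     # baseline scalar RISC: 1
print(peak_ipc(1, 3))  # superpipelined with m = 3: 3
print(peak_ipc(4))     # superscalar (or VLIW) with n = 4: 4
print(peak_ipc(4, 3))  # superpipelined-superscalar: 12
```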

Superscalar vs. Superpipelined

Roughly equivalent performance: if n = m, then both have about the same IPC.
Key difference: parallelism exposed in space (superscalar) vs. in time (superpipelined).

[Figure: overlapped Fetch, Decode, Execute, Writeback stages; horizontal axis is time in cycles of the base machine]





Superpipelining is a new and special term meaning pipelining. The prefix is attached to increase the probability of funding for research proposals. There is no theoretical basis distinguishing superpipelining from pipelining. The etymology of the term is probably similar to the derivation of the now-common terms methodology and functionality as pompous substitutes for method and function. The novelty of the term superpipelining lies in its reliance on a prefix rather than a suffix for the pompous extension of the root word.

- Nick Tredennick, 1991

Superpipelining: Hype vs. Reality

Superpipelining [Jouppi, 1989] essentially describes a pipelined execution stage.

Underpipelined machines cannot issue instructions as fast as they are executed; Jouppi's base machine is an underpipelined machine.

Note: the key characteristic of superpipelined machines is that results are not available to the M-1 successive instructions.

[Figure: pipeline timing diagrams for Jouppi's base machine, an underpipelined machine, and a superpipelined machine]