X. Sun (IIT) CS546 Lecture 5 Page 1


Performance Evaluation of
Parallel Processing
Xian-He Sun
Illinois Institute of Technology
Sun@iit.edu
X. Sun (IIT) CS546 Lecture 5 Page 2
Outline
Performance metrics
Speedup
Efficiency
Scalability
Examples
Reading: Kumar ch 5
X. Sun (IIT) CS546 Lecture 5 Page 3
Performance Evaluation
(Improving performance is the goal)
Performance Measurement
Metric, Parameter
Performance Prediction
Model, Application-Resource
Performance Diagnosis/Optimization
Post-execution analysis, Algorithm improvement,
Architecture improvement, State-of-the-art,
Resource management/Scheduling
X. Sun (IIT) CS546 Lecture 5 Page 4
Parallel Performance Metrics
(Run-time is the dominant metric)
Run-Time (Execution Time)
Speed: mflops, mips, cpi
Efficiency: throughput
Speedup
Parallel Efficiency
Scalability: The ability to maintain performance gain when
system and problem size increase
Others: portability, programmability, etc.
S_p = Uniprocessor Execution Time / Parallel Execution Time
X. Sun (IIT) CS546 Lecture 5 Page 5
Models of Speedup
Speedup
Scaled Speedup
Parallel processing gain over sequential
processing, where problem size scales up with
computing power (having sufficient
workload/parallelism)
S_p = Uniprocessor Execution Time / Parallel Execution Time
X. Sun (IIT) CS546 Lecture 5 Page 6
Speedup
T_s = time for the best serial algorithm
T_p = time for the parallel algorithm using p processors
S_p = T_s / T_p
X. Sun (IIT) CS546 Lecture 5 Page 7
Example
(a) Processor 1: time = 100
(b) Processors 1-4: 25 each
S_p = 100/25 = 4.0, perfect parallelization
(c) Processors 1-4: 35 each
S_p = 100/35 = 2.85, perfect load balancing, but synchronization cost of 10
X. Sun (IIT) CS546 Lecture 5 Page 8
Example (cont.)
(d) Processors 1-4: 30, 20, 40, 10
S_p = 100/40 = 2.5, no synchronization, but load imbalance
(e) Processors 1-4: 50 each
S_p = 100/50 = 2.0, load imbalance and synchronization cost
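The arithmetic behind (a)-(e) can be checked with a minimal C sketch (not from the original slides): the parallel time is the slowest processor's time, so S_p = T_serial / max_i T_i.

#include <stdio.h>

/* Speedup when a 100-unit serial job is split across p processors:
   the parallel time is the slowest processor's time. */
double speedup(double serial_time, const double *times, int p) {
    double max = times[0];
    for (int i = 1; i < p; i++)
        if (times[i] > max) max = times[i];
    return serial_time / max;
}

int main(void) {
    double b[] = {25, 25, 25, 25};   /* (b) perfect parallelization        -> 4.00 */
    double c[] = {35, 35, 35, 35};   /* (c) balanced, but synch cost of 10 -> 2.86 */
    double d[] = {30, 20, 40, 10};   /* (d) load imbalance, no synch       -> 2.50 */
    double e[] = {50, 50, 50, 50};   /* (e) imbalance and synch cost       -> 2.00 */
    printf("%.2f %.2f %.2f %.2f\n",
           speedup(100, b, 4), speedup(100, c, 4),
           speedup(100, d, 4), speedup(100, e, 4));
    return 0;
}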
X. Sun (IIT) CS546 Lecture 5 Page 9
What Is Good Speedup?
Linear speedup: S_p = p
Superlinear speedup: S_p > p
Sub-linear speedup: S_p < p
X. Sun (IIT) CS546 Lecture 5 Page 10
Speedup
(Figure: speedup as a function of the number of processors p)
X. Sun (IIT) CS546 Lecture 5 Page 11
Sources of Parallel Overheads
Interprocessor communication
Load imbalance
Synchronization
Extra computation
X. Sun (IIT) CS546 Lecture 5 Page 12
Degradations of Parallel Processing
Unbalanced Workload
Communication Delay
Overhead Increases with the Ensemble Size
X. Sun (IIT) CS546 Lecture 5 Page 13
Degradations of Distributed Computing
Unbalanced Computing Power and Workload
Shared Computing and Communication Resource
Uncertainty, Heterogeneity, and Overhead Increases
with the Ensemble Size
X. Sun (IIT) CS546 Lecture 5 Page 14
Causes of Superlinear Speedup
Cache size increased
Overhead reduced
Latency hidden
Randomized algorithms
Mathematical inefficiency of the serial algorithm
Higher memory access cost in sequential
processing
X.H. Sun, and J. Zhu, "Performance Considerations of Shared Virtual Memory Machines,"
IEEE Trans. on Parallel and Distributed Systems, Nov. 1995
X. Sun (IIT) CS546 Lecture 5 Page 15
Fixed-Size Speedup (Amdahl's Law)
Emphasis on turnaround time
Problem size, W, is fixed
S_p = Uniprocessor Execution Time / Parallel Execution Time
    = Uniprocessor Time of Solving W / Parallel Time of Solving W
X. Sun (IIT) CS546 Lecture 5 Page 16
Amdahl's Law
The performance improvement that can be gained by a parallel implementation is limited by the fraction of time parallelism can actually be used in an application
Let α = the fraction of the program (algorithm) that is serial and cannot be parallelized. For instance:
Loop initialization
Reading/writing to a single disk
Procedure call overhead
Parallel run time is given by
T_p = α·T_s + (1-α)·T_s/p
X. Sun (IIT) CS546 Lecture 5 Page 17
Amdahl's Law
Amdahl's law gives a limit on speedup in terms of α:
T_p = α·T_s + (1-α)·T_s/p
S_p = T_s / T_p = T_s / (α·T_s + (1-α)·T_s/p) = 1 / (α + (1-α)/p)
X. Sun (IIT) CS546 Lecture 5 Page 18
Enhanced Amdahl's Law
To include overhead:
Speedup_FS = T_1 / (α·T_1 + (1-α)·T_1/p + T_overhead) → 1 / (α + T_overhead/T_1)  as p → ∞
The overhead includes parallelism and interaction overheads
Amdahl's law: an argument against massively parallel systems
X. Sun (IIT) CS546 Lecture 5 Page 19
Fixed-Size Speedup (Amdahl's Law, 1967)
(Figure: as the number of processors p grows from 1 to 5, the total amount of work, split into a sequential part W_1 and a parallel part W_p, stays fixed, while in elapsed time the parallel part T_p shrinks with p and the sequential part T_1 does not.)
X. Sun (IIT) CS546 Lecture 5 Page 20
Amdahl's Law
The speedup that is achievable on p processors is:
S_p = T_s / T_p = 1 / (α + (1-α)/p)
If we assume that the serial fraction α is fixed, then the speedup for an infinite number of processors is limited by 1/α:
lim_{p→∞} S_p = 1/α
For example, if α = 10%, then the maximum speedup is 10, even if we use an infinite number of processors
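As an illustration of the bound above, a minimal C sketch (not part of the original slides; amdahl_speedup is a name chosen here) evaluates S_p for α = 10% and growing p:

#include <stdio.h>

/* Amdahl's law: speedup on p processors when a fraction alpha of the
   work is inherently serial.  Illustrative sketch only. */
double amdahl_speedup(double alpha, int p) {
    return 1.0 / (alpha + (1.0 - alpha) / p);
}

int main(void) {
    int p;
    for (p = 1; p <= 1024; p *= 4)
        printf("p = %4d  S_p = %.2f\n", p, amdahl_speedup(0.10, p));
    /* As p grows, S_p approaches 1/alpha = 10. */
    return 0;
}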
X. Sun (IIT) CS546 Lecture 5 Page 21
Comments on Amdahl's Law
The Amdahl fraction α in practice depends on the problem size n and the number of processors p
An effective parallel algorithm has:
α(n,p) → 0 as n → ∞
For such a case, even if one fixes p, we can get linear speedup by choosing a suitably large problem size:
S_p = T_s / T_p = p / (1 + (p-1)·α(n,p)) → p as n → ∞
Scalable speedup
Practically, the problem size that we can run for a particular problem is limited by the time and memory of the parallel computer
X. Sun (IIT) CS546 Lecture 5 Page 22
Fixed-Time Speedup (Gustafson, 88)
Emphasis on work finished in a fixed time
Problem size is scaled from W to W'
W': Work finished within the fixed time with parallel
processing
S'_p = Uniprocessor Time of Solving W' / Parallel Time of Solving W'
     = Uniprocessor Time of Solving W' / Uniprocessor Time of Solving W
     = W' / W
X. Sun (IIT) CS546 Lecture 5 Page 23
Gustafson's Law (Without Overhead)
(Figure: in a fixed execution time, the serial fraction a and parallel fraction 1-a of the work on one processor become a and (1-a)p worth of work on p processors.)
α = t_s / (t_s + t_p)
Speedup_FT = Work(p) / Work(1) = (α·W + (1-α)·p·W) / W = α + (1-α)·p
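A companion C sketch (an illustration rather than original slide material) evaluates the fixed-time speedup α + (1-α)p for the same serial fraction used in the Amdahl example:

#include <stdio.h>

/* Gustafson's fixed-time (scaled) speedup with serial fraction alpha:
   S_FT(p) = alpha + (1 - alpha) * p.  Illustrative sketch only. */
double gustafson_speedup(double alpha, int p) {
    return alpha + (1.0 - alpha) * p;
}

int main(void) {
    int p;
    for (p = 1; p <= 1024; p *= 4)
        printf("p = %4d  S_FT = %.1f\n", p, gustafson_speedup(0.10, p));
    /* Unlike Amdahl's fixed-size bound of 1/alpha = 10, the scaled
       speedup keeps growing nearly linearly with p. */
    return 0;
}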
X. Sun (IIT) CS546 Lecture 5 Page 24
Fixed-Time Speedup (Gustafson)
(Figure: as the number of processors p grows from 1 to 5, the amount of work, split into W_1 and W_p, grows proportionally with p, while the elapsed time, split into T_1 and T_p, stays fixed.)
X. Sun (IIT) CS546 Lecture 5 Page 25
Converting α between Amdahl's and Gustafson's Laws
α_A = α_G / (α_G + (1 - α_G)·p)
Based on this observation, Amdahl's and Gustafson's laws are identical.
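A small C sketch, assuming the conversion formula above, checks numerically that the two laws give the same speedup once the serial fractions are converted; the values α_G = 0.10 and p = 64 are arbitrary:

#include <stdio.h>

/* Convert the Gustafson serial fraction alpha_G into the Amdahl fraction
   alpha_A that yields the same speedup on p processors:
   alpha_A = alpha_G / (alpha_G + (1 - alpha_G) * p).  Sketch only. */
double amdahl_from_gustafson(double alpha_g, int p) {
    return alpha_g / (alpha_g + (1.0 - alpha_g) * p);
}

int main(void) {
    double alpha_g = 0.10;
    int p = 64;
    double alpha_a = amdahl_from_gustafson(alpha_g, p);
    double s_gustafson = alpha_g + (1.0 - alpha_g) * p;          /* scaled speedup     */
    double s_amdahl    = 1.0 / (alpha_a + (1.0 - alpha_a) / p);  /* fixed-size speedup */
    /* Both print the same value (57.7 here), illustrating that the two
       laws agree once the serial fractions are converted. */
    printf("Gustafson: %.3f   Amdahl: %.3f\n", s_gustafson, s_amdahl);
    return 0;
}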
X. Sun (IIT) CS546 Lecture 5 Page 27
Memory Constrained Scaling:
Sun and Ni's Law
Scale to the largest possible solution limited by the memory space. Or, fix memory usage per processor
(ex) N-body problem
Problem size is scaled from W to W*
W* is the work executed under the memory limitation of a parallel computer
For a simple profile, G(p) is the increase of parallel workload as the memory capacity increases p times:
W* = G(p)·W
X. Sun (IIT) CS546 Lecture 5 Page 28
Sun & Ni's Law
Speedup_MB = [Work(p)/Time(p)] / [Work(1)/Time(1)] = Increase in work / Increase in time
Speedup_MB = (α + (1-α)·G(p)) / (α + (1-α)·G(p)/p)
(Figure: the serial fraction a and parallel fraction 1-a of the work; when the memory grows p times, the parallel part grows to (1-a)·G(p).)
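A short C sketch of the memory-bounded model above; G(p) = 1 and G(p) = p reduce to the fixed-size and fixed-time cases (as noted on a later slide), while G(p) = p^1.5 is an assumed example of computation growing faster than memory:

#include <stdio.h>

/* Memory-bounded (Sun & Ni) speedup model:
   S_MB(p) = (alpha + (1-alpha)*G(p)) / (alpha + (1-alpha)*G(p)/p).
   G(p) models how the parallel workload grows when memory grows p times. */
double memory_bounded_speedup(double alpha, double p, double G) {
    return (alpha + (1.0 - alpha) * G) / (alpha + (1.0 - alpha) * G / p);
}

int main(void) {
    double alpha = 0.10, p = 64.0;
    /* G(p) = 1     -> fixed-size (Amdahl) scaling
       G(p) = p     -> fixed-time (Gustafson) scaling
       G(p) = p^1.5 -> assumed example of memory-bounded scaling where
                       computation grows faster than memory. */
    printf("G = 1    : %.2f\n", memory_bounded_speedup(alpha, p, 1.0));
    printf("G = p    : %.2f\n", memory_bounded_speedup(alpha, p, p));
    printf("G = p^1.5: %.2f\n", memory_bounded_speedup(alpha, p, p * 8.0)); /* 64^1.5 = 512 */
    return 0;
}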
X. Sun (IIT) CS546 Lecture 5 Page 29
Memory-Bounded Speedup (Sun & Ni, 90)
Emphasis on work finished under the current physical limitation
Problem size is scaled from W to W*
W*: work executed under memory limitation with parallel processing
S*_p = Uniprocessor Time of Solving W* / Parallel Time of Solving W*
X.H. Sun, and L. Ni , "Scalable Problems and Memory-Bounded Speedup,"
Journal of Parallel and Distributed Computing, Vol. 19, pp.27-37, Sept. 1993 (SC90).
X. Sun (IIT) CS546 Lecture 5 Page 30
Memory-Bounded Speedup (Sun & Ni)
(Figure: as the number of processors p grows from 1 to 5, the amount of work, split into W_1 and W_p, grows with the available memory, and the elapsed time T_1 + T_p may grow as well.)
Work executed under memory limitation
Hierarchical memory
X. Sun (IIT) CS546 Lecture 5 Page 31
Characteristics
Connection to other scaling models
G(p) = 1, problem constrained scaling
G(p) = p, time constrained scaling
With overhead
G(p) > p, can lead to large increase in
execution time
(ex) 10K x 10K matrix factorization: 800MB, 1 hr in
uniprocessor
with 1024 processors, 320K x 320K matrix, 32 hrs
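The matrix-factorization example can be reproduced with a small C sketch (illustrative only; the 800 MB and 1 hr baseline are the figures quoted above):

#include <math.h>
#include <stdio.h>

/* Memory-bounded scaling of the matrix-factorization example above:
   memory grows as n^2 and work as n^3.  Scaling total memory by p lets
   the matrix dimension grow by sqrt(p); the parallel time then grows by
   sqrt(p)^3 / p = sqrt(p). */
int main(void) {
    double p = 1024.0;
    double n0 = 10000.0;          /* 10K x 10K matrix              */
    double mem_per_proc = 800.0;  /* MB, unchanged by construction */
    double hours0 = 1.0;          /* 1 hr on a uniprocessor        */

    double scale = sqrt(p);                      /* dimension grows 32x     */
    double n1 = n0 * scale;                      /* 320K x 320K             */
    double hours1 = hours0 * pow(scale, 3) / p;  /* 32768 / 1024 = 32 hours */

    printf("n = %.0f, memory per processor = %.0f MB, time = %.0f hr\n",
           n1, mem_per_proc, hours1);
    return 0;
}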
X. Sun (IIT) CS546 Lecture 5 Page 32
Why Scalable Computing
Scalable
More accurate solution
Sufficient parallelism
Maintain efficiency
Efficient in parallel computing
Load balance
Communication
Mathematically effective
Adaptive
Accuracy
X. Sun (IIT) CS546 Lecture 5 Page 33
Memory-Bounded Speedup
Natural for domain decomposition based computing
Show the potential of parallel processing (in general, the computing requirement increases faster with problem size than the communication requirement does)
Impacts extend to architecture design: trade-off of
memory size and computing speed
X. Sun (IIT) CS546 Lecture 5 Page 34
Why Scalable Computing (2)
Small Work
Appropriate for small machines
Parallelism overheads begin to dominate benefits for larger machines
Load imbalance
Communication-to-computation ratio
May even achieve slowdowns
Does not reflect real usage, and inappropriate for large machines
Can exaggerate benefits of improvements
X. Sun (IIT) CS546 Lecture 5 Page 35
Why Scalable Computing (3)
Large Work
Appropriate for big machines
Difficult to measure improvement
May not fit on a small machine
Can't run
Thrashing to disk
Working set doesn't fit in cache
Fits at some p, leading to superlinear speedup
X. Sun (IIT) CS546 Lecture 5 Page 36
Demonstrating Scaling Problems
Users want to scale problems as machines grow!
(Figures: a small Ocean problem and a big equation-solver problem, both on the SGI Origin2000, illustrating parallelism overhead and superlinear speedup.)
X. Sun (IIT) CS546 Lecture 5 Page 37
How to Scale
Scaling a machine
Make a machine more powerful
Machine size
<processor, memory, communication, I/O>
Scaling a machine in parallel processing
Add more identical nodes
Problem size
Input configuration
data set size : the amount of storage required to
run it on a single processor
memory usage : the amount of memory used by
the program
X. Sun (IIT) CS546 Lecture 5 Page 38
Two Key Issues in Problem Scaling
Under what constraints should the problem
be scaled?
Some properties must be fixed as the machine
scales
How should the problem be scaled?
Which parameters?
How?
X. Sun (IIT) CS546 Lecture 5 Page 39
Constraints To Scale
Two types of constraints
Problem-oriented
Ex) Time
Resource-oriented
Ex) Memory
Work to scale
Metric-oriented
Floating point operation, instructions
User-oriented
Easy to change but may be difficult to compare
Ex) particles, rows, transactions
Difficult cross comparison
X. Sun (IIT) CS546 Lecture 5 Page 40
Rethinking of Speedup
Speedup:
S_p = Uniprocessor Execution Time / Parallel Execution Time
Why is it called speedup when it compares time?
Could we compare speeds directly?
Generalized speedup:
S_p = Parallel Speed / Sequential Speed
X.H. Sun, and J. Gustafson, "Toward A Better Parallel Performance Metric,"
Parallel Computing, Vol. 17, pp.1093-1109, Dec. 1991.
X. Sun (IIT) CS546 Lecture 5 Page 41
X. Sun (IIT) CS546 Lecture 5 Page 42
Compute π: Problem
Consider a parallel algorithm for computing the value of π = 3.1415... through the following numerical integration:
π = ∫₀¹ 4/(1+x²) dx
X. Sun (IIT) CS546 Lecture 5 Page 43
Compute π: Sequential Algorithm
computepi()
{
  h = 1.0/n;
  sum = 0.0;
  for (i = 0; i < n; i++) {
    x = h*(i+0.5);              /* midpoint of the i-th subinterval */
    sum = sum + 4.0/(1 + x*x);
  }
  pi = h*sum;
}
X. Sun (IIT) CS546 Lecture 5 Page 44
Compute π: Parallel Algorithm
Each processor computes on a set of about n/p points, which are allocated to the processors in a cyclic manner
Finally, we assume that the local values of π are accumulated among the p processors under synchronization
(Figure: cyclic assignment of points to processors 0 1 2 3 | 0 1 2 3 | 0 1 2 3 | ...)
X. Sun (IIT) CS546 Lecture 5 Page 45
Compute π: Parallel Algorithm
computepi()
{
  id = my_proc_id();
  nprocs = number_of_procs();
  h = 1.0/n;
  sum = 0.0;
  /* cyclic (interleaved) assignment of the n points to the processors */
  for (i = id; i < n; i = i + nprocs) {
    x = h*(i+0.5);
    sum = sum + 4.0/(1 + x*x);
  }
  localpi = sum*h;
  /* accumulate the local values of pi with tree-based combining */
  use_tree_based_combining_for_critical_section();
  pi = pi + localpi;
  end_critical_section();
}
X. Sun (IIT) CS546 Lecture 5 Page 46
Compute π: Analysis
Assume that the computation of π is performed over n points
The sequential algorithm performs 6 operations (two multiplications, one division, three additions) per point on the x-axis. Hence, for n points, the number of operations executed in the sequential algorithm is:
T_s = 6n
for (i=0;i<n;i++) {
x=h*(i+0.5);
sum=sum+4.0/(1+x*x);
}

3 additions
2 multiplications
1 division
X. Sun (IIT) CS546 Lecture 5 Page 47
Compute π: Analysis
The parallel algorithm uses p processors with static interleaved scheduling. Each processor computes on a set of m points which are allocated to it in a cyclic manner
The expression for m is given by m ≤ n/p + 1 if p does not exactly divide n. The runtime of the parallel computation of the local values of π is:
T_p = 6m·t_0 = (6·n/p + 6)·t_0
X. Sun (IIT) CS546 Lecture 5 Page 48
Compute π: Analysis
The accumulation of the local values of π using tree-based combining can be optimally performed in log₂(p) steps
The total runtime for the parallel algorithm, including the parallel computation and the combining, is (t_0: time per operation, t_c: communication cost per combining step):
T_p = (6·n/p + 6)·t_0 + log(p)·(t_0 + t_c)
The speedup of the parallel algorithm is:
S_p = T_s / T_p = 6n / (6·n/p + 6 + log(p)·(1 + t_c/t_0))
X. Sun (IIT) CS546 Lecture 5 Page 49
Compute π: Analysis
The Amdahl fraction for this parallel algorithm can be determined by rewriting the previous equation as:
S_p = p / (1 + p·(6 + log(p)·(1 + t_c/t_0)) / (6n)) = p / (1 + (p-1)·α(n,p))
Hence, the Amdahl fraction α(n,p) is:
α(n,p) = p·(6 + log(p)·(1 + t_c/t_0)) / (6n·(p-1))
The parallel algorithm is effective because:
α(n,p) → 0 as n → ∞, for fixed p
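A small C sketch (not from the slides) that evaluates the speedup model reconstructed above; the ratio t_c/t_0 = 100 and the values of n and p are assumed for illustration:

#include <math.h>
#include <stdio.h>

/* Speedup model for the parallel pi computation:
   S_p = 6n / (6n/p + 6 + log2(p)*(1 + tc/t0)).
   tc/t0 = 100 is an assumed value, for illustration only. */
double pi_speedup(double n, double p, double tc_over_t0) {
    return 6.0 * n / (6.0 * n / p + 6.0 + log2(p) * (1.0 + tc_over_t0));
}

int main(void) {
    double p = 64.0, tc_over_t0 = 100.0;
    double n;
    for (n = 1e3; n <= 1e7; n *= 100)
        printf("n = %.0e  S_p = %.1f\n", n, pi_speedup(n, p, tc_over_t0));
    /* As n grows with p fixed, S_p approaches p, i.e. alpha(n,p) -> 0. */
    return 0;
}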
X. Sun (IIT) CS546 Lecture 5 Page 50
Finite Differences: Problem
Consider a finite difference iterative method applied to a 2D grid where:
X^{t+1}_{i,j} = ω·(X^t_{i,j-1} + X^t_{i,j+1} + X^t_{i-1,j} + X^t_{i+1,j}) + (1-ω)·X^t_{i,j}
X. Sun (IIT) CS546 Lecture 5 Page 51
Finite Differences: Serial Algorithm
finitediff()
{
  for (t = 0; t < T; t++) {
    for (i = 0; i < n; i++) {
      for (j = 0; j < n; j++) {
        x[i,j] = w_1*(x[i,j-1] + x[i,j+1] + x[i-1,j] + x[i+1,j]) + w_2*x[i,j];
      }
    }
  }
}
X. Sun (IIT) CS546 Lecture 5 Page 52
Finite Differences: Parallel Algorithm
Each processor computes on a sub-grid of (n/√p) × (n/√p) points
Synchronization between processors after every iteration ensures correct values being used for subsequent iterations
X. Sun (IIT) CS546 Lecture 5 Page 53
Finite Differences: Parallel Algorithm
finitediff()
{
  row_id = my_processor_row_id();
  col_id = my_processor_col_id();
  p = number_of_processors();
  sp = sqrt(p);
  rows = cols = ceil(n/sp);
  row_start = row_id*rows;
  col_start = col_id*cols;
  for (t = 0; t < T; t++) {
    for (i = row_start; i < min(row_start+rows, n); i++) {
      for (j = col_start; j < min(col_start+cols, n); j++) {
        x[i,j] = w_1*(x[i,j-1] + x[i,j+1] + x[i-1,j] + x[i+1,j]) + w_2*x[i,j];
      }
    }
    barrier();   /* synchronize all processors after each iteration */
  }
}
X. Sun (IIT) CS546 Lecture 5 Page 54
Finite Differences: Analysis
The sequential algorithm performs 6 operations (2 multiplications, 4 additions) per grid point in every iteration. Hence, for an n×n grid, the number of operations executed in each iteration of the sequential algorithm is:
T_s = 6n²·t_0

x[i,j] = w_1*(x[i,j-1] + x[i,j+1] + x[i-1,j] + x[i+1,j]) + w_2*x[i,j];

2 multiplications
4 additions
X. Sun (IIT) CS546 Lecture 5 Page 55
Finite Differences: Analysis
The parallel algorithm uses p processors with static blockwise scheduling. Each processor computes on an m×m sub-grid allocated to it in a blockwise manner
The expression for m is given by m = ⌈n/√p⌉. The runtime of the parallel computation in each iteration is:
T_p = 6m²·t_0 = 6·⌈n/√p⌉²·t_0
X. Sun (IIT) CS546 Lecture 5 Page 56
Finite Differences: Analysis
The barrier synchronization needed for each iteration can be optimally performed in log(p) steps
The total runtime for the parallel algorithm for the computation is:
T_p = 6·⌈n/√p⌉²·t_0 + log(p)·(t_0 + t_c) ≈ 6·(n²/p)·t_0 + log(p)·(t_0 + t_c)
The speedup of the parallel algorithm is:
S_p = T_s / T_p = 6n² / (6·n²/p + log(p)·(1 + t_c/t_0))
X. Sun (IIT) CS546 Lecture 5 Page 57
Finite Differences: Analysis
The Amdahl fraction for this parallel algorithm can be determined by rewriting the previous equation as:
S_p = p / (1 + p·log(p)·(1 + t_c/t_0) / (6n²)) = p / (1 + (p-1)·α(n,p))
Hence, the Amdahl fraction α(n,p) is:
α(n,p) = p·log(p)·(1 + t_c/t_0) / (6n²·(p-1))
We finally note that:
α(n,p) → 0 as n → ∞, for fixed p
Hence, the parallel algorithm is effective
X. Sun (IIT) CS546 Lecture 5 Page 58
Equation Solver
A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])
(n × n grid)
procedure solve (A)
  while (!done) do
    diff = 0;
    for i = 1 to n do
      for j = 1 to n do
        temp = A[i,j];
        A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j]);
        diff += abs(A[i,j] - temp);
      end for
    end for
    if (diff/(n*n) < TOL) then done = 1;
  end while
end procedure
X. Sun (IIT) CS546 Lecture 5 Page 59
Workloads
Basic properties
Memory requirement: O(n²)
Computational complexity: O(n³), assuming the number of iterations to converge to be O(n)
Assume speedup equal to the number of processors p
Grid size (k: scaled grid size)
Fixed-size: fixed at n
Fixed-time: k³ = p·n³  ⇒  k = p^(1/3)·n
Memory-bound: k² = p·n²  ⇒  k = √p·n
X. Sun (IIT) CS546 Lecture 5 Page 60
Memory Requirement of Equation Solver
Fixed-size: n², i.e., n²/p per processor
Fixed-time: k² = (p^(1/3)·n)² = p^(2/3)·n², i.e., n²/p^(1/3) per processor
Memory-bound: k² = p·n², i.e., n² per processor
X. Sun (IIT) CS546 Lecture 5 Page 61
Time Complexity of Equation Solver
Sequential time complexity: n³
Fixed-size: n³/p
Fixed-time: k³ = p·n³, so the parallel time is k³/p = n³ (unchanged, by construction)
Memory-bound: k = √p·n, so the parallel time is k³/p = (√p·n)³/p = √p·n³ (grows with p)
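A small C sketch (with n normalized to 1 and the table layout chosen here) tabulating the scaled grid size and parallel time implied by the three scaling models above:

#include <math.h>
#include <stdio.h>

/* Equation-solver scaling models: memory ~ k^2, work ~ k^3, assumed ideal
   speedup p.  n is the original grid size, normalized to 1. */
int main(void) {
    double n = 1.0;
    double p;
    printf("    p  k(fixed-time)  k(memory-bound)  time(fixed-size)  time(memory-bound)\n");
    for (p = 1; p <= 1024; p *= 4) {
        double k_ft = cbrt(p) * n;        /* fixed-time:   k^3 = p*n^3 */
        double k_mb = sqrt(p) * n;        /* memory-bound: k^2 = p*n^2 */
        double t_fs = pow(n, 3) / p;      /* fixed-size parallel time  */
        double t_mb = pow(k_mb, 3) / p;   /* memory-bound parallel time = sqrt(p)*n^3 */
        printf("%5.0f  %13.2f  %15.2f  %16.4f  %18.2f\n", p, k_ft, k_mb, t_fs, t_mb);
    }
    return 0;
}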
X. Sun (IIT) CS546 Lecture 5 Page 62
Concurrency
Concurrency is proportional to the number of grid points
Fixed-size: n²
Fixed-time: k³ = p·n³, so concurrency = k² = (p^(1/3)·n)² = p^(2/3)·n²
Memory-bound: k² = p·n²
X. Sun (IIT) CS546 Lecture 5 Page 63
Communication to Computation Ratio
Fixed-size:
CCR = (n/√p) / (n²/p) = √p / n
Fixed-time: k = p^(1/3)·n
CCR = (k/√p) / (k²/p) = √p / k = p^(1/6) / n
Memory-bound: k = √p·n
CCR = √p / k = 1 / n
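A brief C sketch evaluating the three communication-to-computation ratios above for an assumed original grid size n = 1000:

#include <math.h>
#include <stdio.h>

/* Communication-to-computation ratio of the equation solver under the
   three scaling models above (n is the original grid size). */
int main(void) {
    double n = 1000.0;
    double p;
    printf("    p   fixed-size    fixed-time   memory-bound\n");
    for (p = 4; p <= 4096; p *= 4)
        printf("%5.0f  %10.5f  %12.5f  %13.5f\n", p,
               sqrt(p) / n,            /* grows with p        */
               pow(p, 1.0/6.0) / n,    /* grows slowly with p */
               1.0 / n);               /* independent of p    */
    return 0;
}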
X. Sun (IIT) CS546 Lecture 5 Page 64
Scalability
The Need for New Metrics
Comparison of performances with different workload
Availability of massively parallel processing
Scalability
Ability to maintain parallel processing gain when both
problem size and system size increase
X. Sun (IIT) CS546 Lecture 5 Page 65
Parallel Efficiency
The achieved fraction of the total potential parallel processing gain
Assuming linear speedup, S_p = p is the ideal case
The ability to maintain efficiency when problem size increases
E_p = S_p / p
X. Sun (IIT) CS546 Lecture 5 Page 66
Maintain Efficiency
Efficiency of adding n numbers in parallel:
E = 1/(1 + 2p·log(p)/n)
For an efficiency of 0.80 on 4 procs, n = 64
For an efficiency of 0.80 on 8 procs, n = 192
For an efficiency of 0.80 on 16 procs, n = 512
(Figure: "Efficiency for Various Data Sizes", efficiency vs. number of processors (1 to 32) for n = 64, 192, 320, 512)
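A minimal C sketch of the efficiency model above; n_for_efficiency (a name chosen here, not from the slides) inverts E = 1/(1 + 2p·log₂(p)/n) to reproduce the n = 64, 192, 512 figures:

#include <math.h>
#include <stdio.h>

/* Efficiency of adding n numbers on p processors (model above):
   E = 1 / (1 + 2*p*log2(p)/n). */
double efficiency(double n, double p) {
    return 1.0 / (1.0 + 2.0 * p * log2(p) / n);
}

/* Smallest n in this model that keeps efficiency at target E:
   n = 2*p*log2(p) * E/(1-E). */
double n_for_efficiency(double p, double E) {
    return 2.0 * p * log2(p) * E / (1.0 - E);
}

int main(void) {
    int p;
    for (p = 4; p <= 16; p *= 2)
        printf("p = %2d  n = %.0f\n", p, n_for_efficiency(p, 0.80));
    /* prints n = 64, 192, 512, matching the slide */
    return 0;
}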
X. Sun (IIT) CS546 Lecture 5 Page 67
Ideally Scalable
T(m p, m W) = T(p, W)
T: execution time
W: work executed
p: number of processors used
m: scale up m times
work: flop count based on the best practical serial algorithm
Fact:
T(m p, m W) = T(p, W)
if and only if
The Average Unit Speed Is Fixed
X. Sun (IIT) CS546 Lecture 5 Page 68
Definition:
The average unit speed is the achieved speed divided by
the number of processors

Definition (Isospeed Scalability):
An algorithm-machine combination is scalable if the
achieved average unit speed can remain constant with
increasing numbers of processors, provided the problem
size is increased proportionally
X. Sun (IIT) CS546 Lecture 5 Page 69
Isospeed Scalability (Sun & Rover, 91)
W: work executed when p processors are employed
W': work executed when p' > p processors are employed to maintain the average speed
Scalability ψ(p,p') = (p'·W) / (p·W')
Ideal case:
W' = p'·W / p,  so ψ(p,p') = 1
Scalability in terms of time:
ψ(p,p') = T_p(W) / T_p'(W') = (time for work W on p processors) / (time for work W' on p' processors)
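A small C sketch of the isospeed scalability metric above; the work figures W and W' are assumed sample values, not measurements from the slides:

#include <stdio.h>

/* Isospeed scalability: psi(p, p') = (p' * W) / (p * W'), where W' is the
   work needed on p' processors to keep the average unit speed constant. */
double isospeed_scalability(double p, double W, double p2, double W2) {
    return (p2 * W) / (p * W2);
}

int main(void) {
    double p = 4,  W  = 1.0e9;   /* baseline: 4 processors, 1 Gflop of work        */
    double p2 = 16, W2 = 5.0e9;  /* assumed: 5 Gflop needed on 16 processors       */
    /* Ideal scaling would need only (p2/p)*W = 4 Gflop, so psi = 4/5 = 0.8 here. */
    printf("psi(%g,%g) = %.2f\n", p, p2, isospeed_scalability(p, W, p2, W2));
    return 0;
}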
X. Sun (IIT) CS546 Lecture 5 Page 70
Isospeed Scalability (Sun & Rover)
W: work executed when p processors are employed
W': work executed when p' > p processors are employed to maintain the average speed
Scalability ψ(p,p') = (p'·W) / (p·W')
Ideal case: W' = p'·W / p,  so ψ(p,p') = 1
X. H. Sun, and D. Rover, "Scalability of Parallel Algorithm-Machine Combinations,"
IEEE Trans. on Parallel and Distributed Systems, May, 1994 (Ames TR91)
X. Sun (IIT) CS546 Lecture 5 Page 71
The Relation of Scalability and Time
More scalable leads to smaller time
Better initial run-time and higher scalability lead to
superior run-time
Same initial run-time and same scalability lead to
same scaled performance
Superior initial performance may not last long if
scalability is low
Range Comparison
X.H. Sun, "Scalability Versus Execution Time in Scalable Systems,"
Journal of Parallel and Distributed Computing, Vol. 62, No. 2, pp. 173-192, Feb 2002.
X. Sun (IIT) CS546 Lecture 5 Page 72
Range Comparison Via Performance Crossing Point
Assume program 1 is α times slower than program 2 at the initial state
Begin (Range Comparison)
  p' = p;
  Repeat
    p' = p' + 1;
    Compute the scalability of program 1, Φ(p,p');
    Compute the scalability of program 2, Ψ(p,p');
  Until (Φ(p,p') > α·Ψ(p,p') or p' = the limit of ensemble size)
  If Φ(p,p') > α·Ψ(p,p') Then
    p' is the smallest scaled crossing point;
    program 2 is superior at any ensemble size p*, p ≤ p* < p'
  Else
    program 2 is superior at any ensemble size p*, p ≤ p* ≤ p'
  End {if}
End {Range Comparison}
X. Sun (IIT) CS546 Lecture 5 Page 73
Range Comparison
(Figures: influence of communication speed; influence of computing speed)
X.H. Sun, M. Pantano, and Thomas Fahringer, "Integrated Range Comparison for Data-Parallel
Compilation Systems," IEEE Trans. on Parallel and Distributed Processing, May 1999.
X. Sun (IIT) CS546 Lecture 5 Page 74
The SCALA (SCALability Analyzer) System
Design Goals
Predict performance
Support program optimization
Estimate the influence of hardware variations
Uniqueness
Designed to be integrated into advanced compiler
systems
Based on scalability analysis
X. Sun (IIT) CS546 Lecture 5 Page 75
Vienna Fortran Compilation System
A data-parallel restructuring compilation system
Consists of a parallelizing compiler for VF/HPF
and tools for program analysis and restructuring
Under a major upgrade for HPF2
Performance prediction is crucial for
appropriate program restructuring
X. Sun (IIT) CS546 Lecture 5 Page 76
The Structure of SCALA
X. Sun (IIT) CS546 Lecture 5 Page 77
Prototype Implementation
Automatic range comparison for different data distributions
The P³T static performance estimator
Test cases: Jacobi and Red-Black
(Figures: a case with no crossing point and a case with a crossing point)
X. Sun (IIT) CS546 Lecture 5 Page 78
Summary
Relation between iso-speed scalability and iso-efficiency scalability
Both measure the ability to maintain parallel efficiency, defined as
E_p = S_p / p
where iso-efficiency's speedup is the traditional speedup, defined as
S_p = Uniprocessor Execution Time / Parallel Execution Time
Iso-speed's speedup is the generalized speedup, defined as
S_p = Parallel Speed / Sequential Speed
If the sequential execution speed is independent of problem size, iso-speed and iso-efficiency are equivalent
Due to the memory hierarchy, sequential execution performance varies largely with problem size
X. Sun (IIT) CS546 Lecture 5 Page 79
Summary
Predicting the sequential execution performance becomes a major task of SCALA due to the advanced memory hierarchy
Memory-LogP model is introduced for data access cost
New challenge in distributed computing
Generalized iso-speed scalability
Generalized performance tool: GHS
K. Cameron and X.-H. Sun, "Quantifying Locality Effect in Data Access Delay: Memory logP,"
Proc. of 2003 IEEE IPDPS 2003, Nice, France, April, 2003.
X.-H. Sun and M. Wu, "Grid Harvest Service: A System for Long-Term, Application-Level Task
Scheduling," Proc. of 2003 IEEE IPDPS 2003, Nice, France, April, 2003.
