X. Sun (IIT) CS546 Lecture 5 Page 1


Performance Evaluation of
Parallel Processing
Xian-He Sun
Illinois Institute of Technology
Sun@iit.edu
X. Sun (IIT) CS546 Lecture 5 Page 2
Outline
Performance metrics
Speedup
Efficiency
Scalability
Examples
Reading: Kumar ch 5
X. Sun (IIT) CS546 Lecture 5 Page 3
Performance Evaluation
(Improving performance is the goal)
Performance Measurement
Metric, Parameter
Performance Prediction
Model, Application-Resource
Performance Diagnosis/Optimization
Post-execution analysis, Algorithm improvement,
Architecture improvement, State-of-the-art,
Resource management/Scheduling
X. Sun (IIT) CS546 Lecture 5 Page 4
Parallel Performance Metrics
(Run-time is the dominant metric)
Run-Time (Execution Time)
Speed: mflops, mips, cpi
Efficiency: throughput
Speedup
Parallel Efficiency
Scalability: The ability to maintain performance gain when
system and problem size increase
Others: portability, programmability, etc.
S_p = Uniprocessor Execution Time / Parallel Execution Time
X. Sun (IIT) CS546 Lecture 5 Page 5
Models of Speedup
Speedup
Scaled Speedup
Parallel processing gain over sequential
processing, where problem size scales up with
computing power (having sufficient
workload/parallelism)
S_p = Uniprocessor Execution Time / Parallel Execution Time
X. Sun (IIT) CS546 Lecture 5 Page 6
Speedup
T_s = time for the best serial algorithm
T_p = time for the parallel algorithm using p processors
S_p = T_s / T_p
X. Sun (IIT) CS546 Lecture 5 Page 7
Example
(a) Processor 1: time = 100
(b) Processors 1-4: 25 each
S_p = 100/25 = 4.0, perfect parallelization
(c) Processors 1-4: 35 each
S_p = 100/35 = 2.85, perfect load balancing, but synchronization cost of 10
X. Sun (IIT) CS546 Lecture 5 Page 8
Example (cont.)
(d) Processors 1-4: 30, 20, 40, 10
S_p = 100/40 = 2.5, no synchronization, but load imbalance
(e) Processors 1-4: 50 each
S_p = 100/50 = 2.0, load imbalance and synchronization cost
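The arithmetic behind (a)-(e) can be checked with a minimal C sketch (not from the original slides): the parallel time is the slowest processor's time, so S_p = T_serial / max_i T_i.

#include <stdio.h>

/* Speedup when a 100-unit serial job is split across p processors:
   the parallel time is the slowest processor's time. */
double speedup(double serial_time, const double *times, int p) {
    double max = times[0];
    for (int i = 1; i < p; i++)
        if (times[i] > max) max = times[i];
    return serial_time / max;
}

int main(void) {
    double b[] = {25, 25, 25, 25};   /* (b) perfect parallelization        -> 4.00 */
    double c[] = {35, 35, 35, 35};   /* (c) balanced, but synch cost of 10 -> 2.86 */
    double d[] = {30, 20, 40, 10};   /* (d) load imbalance, no synch       -> 2.50 */
    double e[] = {50, 50, 50, 50};   /* (e) imbalance and synch cost       -> 2.00 */
    printf("%.2f %.2f %.2f %.2f\n",
           speedup(100, b, 4), speedup(100, c, 4),
           speedup(100, d, 4), speedup(100, e, 4));
    return 0;
}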
X. Sun (IIT) CS546 Lecture 5 Page 9
What Is Good Speedup?
Linear speedup: S_p = p
Superlinear speedup: S_p > p
Sub-linear speedup: S_p < p
X. Sun (IIT) CS546 Lecture 5 Page 10
Speedup
(Figure: speedup as a function of the number of processors p)
X. Sun (IIT) CS546 Lecture 5 Page 11
Sources of Parallel Overheads
Interprocessor communication
Load imbalance
Synchronization
Extra computation
X. Sun (IIT) CS546 Lecture 5 Page 12
Degradations of Parallel Processing
Unbalanced Workload
Communication Delay
Overhead Increases with the Ensemble Size
X. Sun (IIT) CS546 Lecture 5 Page 13
Degradations of Distributed Computing
Unbalanced Computing Power and Workload
Shared Computing and Communication Resource
Uncertainty, Heterogeneity, and Overhead Increases
with the Ensemble Size
X. Sun (IIT) CS546 Lecture 5 Page 14
Causes of Superlinear Speedup
Cache size increased
Overhead reduced
Latency hidden
Randomized algorithms
Mathematical inefficiency of the serial algorithm
Higher memory access cost in sequential
processing
X.H. Sun, and J. Zhu, "Performance Considerations of Shared Virtual Memory Machines,"
IEEE Trans. on Parallel and Distributed Systems, Nov. 1995
X. Sun (IIT) CS546 Lecture 5 Page 15
Fixed-Size Speedup (Amdahl's Law)
Emphasis on turnaround time
Problem size, W, is fixed
S_p = Uniprocessor Execution Time / Parallel Execution Time
    = Uniprocessor Time of Solving W / Parallel Time of Solving W
X. Sun (IIT) CS546 Lecture 5 Page 16
Amdahl's Law
The performance improvement that can be gained by a parallel implementation is limited by the fraction of time parallelism can actually be used in an application
Let α = the fraction of the program (algorithm) that is serial and cannot be parallelized. For instance:
Loop initialization
Reading/writing to a single disk
Procedure call overhead
Parallel run time is given by
T_p = α·T_s + (1-α)·T_s/p
X. Sun (IIT) CS546 Lecture 5 Page 17
Amdahl's Law
Amdahl's law gives a limit on speedup in terms of α:
T_p = α·T_s + (1-α)·T_s/p
S_p = T_s / T_p = T_s / (α·T_s + (1-α)·T_s/p) = 1 / (α + (1-α)/p)
X. Sun (IIT) CS546 Lecture 5 Page 18
Enhanced Amdahl's Law
To include overhead:
Speedup_FS = T_1 / (α·T_1 + (1-α)·T_1/p + T_overhead) → 1 / (α + T_overhead/T_1)  as p → ∞
The overhead includes parallelism and interaction overheads
Amdahl's law: an argument against massively parallel systems
X. Sun (IIT) CS546 Lecture 5 Page 19
Fixed-Size Speedup (Amdahl's Law, 1967)
(Figure: as the number of processors p grows from 1 to 5, the total amount of work, split into a sequential part W_1 and a parallel part W_p, stays fixed, while in elapsed time the parallel part T_p shrinks with p and the sequential part T_1 does not.)
X. Sun (IIT) CS546 Lecture 5 Page 20
Amdahl's Law
The speedup that is achievable on p processors is:
S_p = T_s / T_p = 1 / (α + (1-α)/p)
If we assume that the serial fraction α is fixed, then the speedup for an infinite number of processors is limited by 1/α:
lim_{p→∞} S_p = 1/α
For example, if α = 10%, then the maximum speedup is 10, even if we use an infinite number of processors
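As an illustration of the bound above, a minimal C sketch (not part of the original slides; amdahl_speedup is a name chosen here) evaluates S_p for α = 10% and growing p:

#include <stdio.h>

/* Amdahl's law: speedup on p processors when a fraction alpha of the
   work is inherently serial.  Illustrative sketch only. */
double amdahl_speedup(double alpha, int p) {
    return 1.0 / (alpha + (1.0 - alpha) / p);
}

int main(void) {
    int p;
    for (p = 1; p <= 1024; p *= 4)
        printf("p = %4d  S_p = %.2f\n", p, amdahl_speedup(0.10, p));
    /* As p grows, S_p approaches 1/alpha = 10. */
    return 0;
}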
X. Sun (IIT) CS546 Lecture 5 Page 21
Comments on Amdahl's Law
The Amdahl fraction α in practice depends on the problem size n and the number of processors p
An effective parallel algorithm has:
α(n,p) → 0 as n → ∞
For such a case, even if one fixes p, we can get linear speedup by choosing a suitably large problem size:
S_p = T_s / T_p = p / (1 + (p-1)·α(n,p)) → p as n → ∞
Scalable speedup
Practically, the problem size that we can run for a particular problem is limited by the time and memory of the parallel computer
X. Sun (IIT) CS546 Lecture 5 Page 22
Fixed-Time Speedup (Gustafson, 88)
Emphasis on work finished in a fixed time
Problem size is scaled from W to W'
W': Work finished within the fixed time with parallel
processing
S'_p = Uniprocessor Time of Solving W' / Parallel Time of Solving W'
     = Uniprocessor Time of Solving W' / Uniprocessor Time of Solving W
     = W' / W
X. Sun (IIT) CS546 Lecture 5 Page 23
Gustafson's Law (Without Overhead)
(Figure: in a fixed execution time, the serial fraction a and parallel fraction 1-a of the work on one processor become a and (1-a)p worth of work on p processors.)
α = t_s / (t_s + t_p)
Speedup_FT = Work(p) / Work(1) = (α·W + (1-α)·p·W) / W = α + (1-α)·p
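A companion C sketch (an illustration rather than original slide material) evaluates the fixed-time speedup α + (1-α)p for the same serial fraction used in the Amdahl example:

#include <stdio.h>

/* Gustafson's fixed-time (scaled) speedup with serial fraction alpha:
   S_FT(p) = alpha + (1 - alpha) * p.  Illustrative sketch only. */
double gustafson_speedup(double alpha, int p) {
    return alpha + (1.0 - alpha) * p;
}

int main(void) {
    int p;
    for (p = 1; p <= 1024; p *= 4)
        printf("p = %4d  S_FT = %.1f\n", p, gustafson_speedup(0.10, p));
    /* Unlike Amdahl's fixed-size bound of 1/alpha = 10, the scaled
       speedup keeps growing nearly linearly with p. */
    return 0;
}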
X. Sun (IIT) CS546 Lecture 5 Page 24
Fixed-Time Speedup (Gustafson)
(Figure: as the number of processors p grows from 1 to 5, the amount of work, split into W_1 and W_p, grows proportionally with p, while the elapsed time, split into T_1 and T_p, stays fixed.)
X. Sun (IIT) CS546 Lecture 5 Page 25
Converting α between Amdahl's and Gustafson's Laws
α_A = α_G / (α_G + (1 - α_G)·p)
Based on this observation, Amdahl's and Gustafson's laws are identical.
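A small C sketch, assuming the conversion formula above, checks numerically that the two laws give the same speedup once the serial fractions are converted; the values α_G = 0.10 and p = 64 are arbitrary:

#include <stdio.h>

/* Convert the Gustafson serial fraction alpha_G into the Amdahl fraction
   alpha_A that yields the same speedup on p processors:
   alpha_A = alpha_G / (alpha_G + (1 - alpha_G) * p).  Sketch only. */
double amdahl_from_gustafson(double alpha_g, int p) {
    return alpha_g / (alpha_g + (1.0 - alpha_g) * p);
}

int main(void) {
    double alpha_g = 0.10;
    int p = 64;
    double alpha_a = amdahl_from_gustafson(alpha_g, p);
    double s_gustafson = alpha_g + (1.0 - alpha_g) * p;          /* scaled speedup     */
    double s_amdahl    = 1.0 / (alpha_a + (1.0 - alpha_a) / p);  /* fixed-size speedup */
    /* Both print the same value (57.7 here), illustrating that the two
       laws agree once the serial fractions are converted. */
    printf("Gustafson: %.3f   Amdahl: %.3f\n", s_gustafson, s_amdahl);
    return 0;
}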
X. Sun (IIT) CS546 Lecture 5 Page 27
Memory Constrained Scaling:
Sun and Ni's Law
Scale to the largest possible solution limited by the memory space. Or, fix memory usage per processor
(ex) N-body problem
Problem size is scaled from W to W*
W* is the work executed under the memory limitation of a parallel computer
For a simple profile, G(p) is the increase of parallel workload as the memory capacity increases p times:
W* = G(p)·W
X. Sun (IIT) CS546 Lecture 5 Page 28
Sun & Ni's Law
Speedup_MB = [Work(p)/Time(p)] / [Work(1)/Time(1)] = Increase in work / Increase in time
Speedup_MB = (α + (1-α)·G(p)) / (α + (1-α)·G(p)/p)
(Figure: the serial fraction a and parallel fraction 1-a of the work; when the memory grows p times, the parallel part grows to (1-a)·G(p).)
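A short C sketch of the memory-bounded model above; G(p) = 1 and G(p) = p reduce to the fixed-size and fixed-time cases (as noted on a later slide), while G(p) = p^1.5 is an assumed example of computation growing faster than memory:

#include <stdio.h>

/* Memory-bounded (Sun & Ni) speedup model:
   S_MB(p) = (alpha + (1-alpha)*G(p)) / (alpha + (1-alpha)*G(p)/p).
   G(p) models how the parallel workload grows when memory grows p times. */
double memory_bounded_speedup(double alpha, double p, double G) {
    return (alpha + (1.0 - alpha) * G) / (alpha + (1.0 - alpha) * G / p);
}

int main(void) {
    double alpha = 0.10, p = 64.0;
    /* G(p) = 1     -> fixed-size (Amdahl) scaling
       G(p) = p     -> fixed-time (Gustafson) scaling
       G(p) = p^1.5 -> assumed example of memory-bounded scaling where
                       computation grows faster than memory. */
    printf("G = 1    : %.2f\n", memory_bounded_speedup(alpha, p, 1.0));
    printf("G = p    : %.2f\n", memory_bounded_speedup(alpha, p, p));
    printf("G = p^1.5: %.2f\n", memory_bounded_speedup(alpha, p, p * 8.0)); /* 64^1.5 = 512 */
    return 0;
}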
X. Sun (IIT) CS546 Lecture 5 Page 29
Memory-Bounded Speedup (Sun & Ni, 90)
Emphasis on work finished under the current physical limitation
Problem size is scaled from W to W*
W*: work executed under memory limitation with parallel processing
S*_p = Uniprocessor Time of Solving W* / Parallel Time of Solving W*
X.H. Sun, and L. Ni , "Scalable Problems and Memory-Bounded Speedup,"
Journal of Parallel and Distributed Computing, Vol. 19, pp.27-37, Sept. 1993 (SC90).
X. Sun (IIT) CS546 Lecture 5 Page 30
Memory-Bounded Speedup (Sun & Ni)
(Figure: as the number of processors p grows from 1 to 5, the amount of work, split into W_1 and W_p, grows with the available memory, and the elapsed time T_1 + T_p may grow as well.)
Work executed under memory limitation
Hierarchical memory
X. Sun (IIT) CS546 Lecture 5 Page 31
Characteristics
Connection to other scaling models
G(p) = 1, problem constrained scaling
G(p) = p, time constrained scaling
With overhead
G(p) > p, can lead to large increase in
execution time
(ex) 10K x 10K matrix factorization: 800MB, 1 hr in
uniprocessor
with 1024 processors, 320K x 320K matrix, 32 hrs
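The matrix-factorization example can be reproduced with a small C sketch (illustrative only; the 800 MB and 1 hr baseline are the figures quoted above):

#include <math.h>
#include <stdio.h>

/* Memory-bounded scaling of the matrix-factorization example above:
   memory grows as n^2 and work as n^3.  Scaling total memory by p lets
   the matrix dimension grow by sqrt(p); the parallel time then grows by
   sqrt(p)^3 / p = sqrt(p). */
int main(void) {
    double p = 1024.0;
    double n0 = 10000.0;          /* 10K x 10K matrix              */
    double mem_per_proc = 800.0;  /* MB, unchanged by construction */
    double hours0 = 1.0;          /* 1 hr on a uniprocessor        */

    double scale = sqrt(p);                      /* dimension grows 32x     */
    double n1 = n0 * scale;                      /* 320K x 320K             */
    double hours1 = hours0 * pow(scale, 3) / p;  /* 32768 / 1024 = 32 hours */

    printf("n = %.0f, memory per processor = %.0f MB, time = %.0f hr\n",
           n1, mem_per_proc, hours1);
    return 0;
}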
X. Sun (IIT) CS546 Lecture 5 Page 32
Why Scalable Computing
Scalable
More accurate solution
Sufficient parallelism
Maintain efficiency
Efficient in parallel computing
Load balance
Communication
Mathematically effective
Adaptive
Accuracy
X. Sun (IIT) CS546 Lecture 5 Page 33
Memory-Bounded Speedup
Natural for domain decomposition based computing
Show the potential of parallel processing (in general, the computing requirement increases faster with problem size than the communication requirement does)
Impacts extend to architecture design: trade-off of
memory size and computing speed
X. Sun (IIT) CS546 Lecture 5 Page 34
Why Scalable Computing (2)
Small Work
Appropriate for small machines
Parallelism overheads begin to dominate benefits for larger machines
Load imbalance
Communication-to-computation ratio
May even achieve slowdowns
Does not reflect real usage, and inappropriate for large machines
Can exaggerate benefits of improvements
X. Sun (IIT) CS546 Lecture 5 Page 35
Why Scalable Computing (3)
Large Work
Appropriate for big machines
Difficult to measure improvement
May not fit on a small machine
Can't run
Thrashing to disk
Working set doesn't fit in cache
Fits at some p, leading to superlinear speedup
X. Sun (IIT) CS546 Lecture 5 Page 36
Demonstrating Scaling Problems
Users want to scale problems as machines grow!
(Figures: a small Ocean problem and a big equation-solver problem, both on the SGI Origin2000, illustrating parallelism overhead and superlinear speedup.)
X. Sun (IIT) CS546 Lecture 5 Page 37
How to Scale
Scaling a machine
Make a machine more powerful
Machine size
<processor, memory, communication, I/O>
Scaling a machine in parallel processing
Add more identical nodes
Problem size
Input configuration
data set size : the amount of storage required to
run it on a single processor
memory usage : the amount of memory used by
the program
X. Sun (IIT) CS546 Lecture 5 Page 38
Two Key Issues in Problem Scaling
Under what constraints should the problem
be scaled?
Some properties must be fixed as the machine
scales
How should the problem be scaled?
Which parameters?
How?
X. Sun (IIT) CS546 Lecture 5 Page 39
Constraints To Scale
Two types of constraints
Problem-oriented
Ex) Time
Resource-oriented
Ex) Memory
Work to scale
Metric-oriented
Floating point operation, instructions
User-oriented
Easy to change but may be difficult to compare
Ex) particles, rows, transactions
Difficult cross comparison
X. Sun (IIT) CS546 Lecture 5 Page 40
Rethinking of Speedup
Speedup:
S_p = Uniprocessor Execution Time / Parallel Execution Time
Why is it called speedup when it compares time?
Could we compare speeds directly?
Generalized speedup:
S_p = Parallel Speed / Sequential Speed
X.H. Sun, and J. Gustafson, "Toward A Better Parallel Performance Metric,"
Parallel Computing, Vol. 17, pp.1093-1109, Dec. 1991.
X. Sun (IIT) CS546 Lecture 5 Page 41
X. Sun (IIT) CS546 Lecture 5 Page 42
Compute π: Problem
Consider a parallel algorithm for computing the value of π = 3.1415... through the following numerical integration:
π = ∫₀¹ 4/(1+x²) dx
X. Sun (IIT) CS546 Lecture 5 Page 43
Compute π: Sequential Algorithm
computepi()
{
  h = 1.0/n;
  sum = 0.0;
  for (i = 0; i < n; i++) {
    x = h*(i+0.5);              /* midpoint of the i-th subinterval */
    sum = sum + 4.0/(1 + x*x);
  }
  pi = h*sum;
}
X. Sun (IIT) CS546 Lecture 5 Page 44
Compute π: Parallel Algorithm
Each processor computes on a set of about n/p points, which are allocated to the processors in a cyclic manner
Finally, we assume that the local values of π are accumulated among the p processors under synchronization
(Figure: cyclic assignment of points to processors 0 1 2 3 | 0 1 2 3 | 0 1 2 3 | ...)
X. Sun (IIT) CS546 Lecture 5 Page 45
Compute π: Parallel Algorithm
computepi()
{
  id = my_proc_id();
  nprocs = number_of_procs();
  h = 1.0/n;
  sum = 0.0;
  /* cyclic (interleaved) assignment of the n points to the processors */
  for (i = id; i < n; i = i + nprocs) {
    x = h*(i+0.5);
    sum = sum + 4.0/(1 + x*x);
  }
  localpi = sum*h;
  /* accumulate the local values of pi with tree-based combining */
  use_tree_based_combining_for_critical_section();
  pi = pi + localpi;
  end_critical_section();
}
X. Sun (IIT) CS546 Lecture 5 Page 46
Compute π: Analysis
Assume that the computation of π is performed over n points
The sequential algorithm performs 6 operations (two multiplications, one division, three additions) per point on the x-axis. Hence, for n points, the number of operations executed in the sequential algorithm is:
T_s = 6n
for (i=0;i<n;i++) {
x=h*(i+0.5);
sum=sum+4.0/(1+x*x);
}

3 additions
2 multiplications
1 division
X. Sun (IIT) CS546 Lecture 5 Page 47
Compute π: Analysis
The parallel algorithm uses p processors with static interleaved scheduling. Each processor computes on a set of m points which are allocated to it in a cyclic manner
The expression for m is given by m ≤ n/p + 1 if p does not exactly divide n. The runtime of the parallel computation of the local values of π is:
T_p = 6m·t_0 = (6·n/p + 6)·t_0
X. Sun (IIT) CS546 Lecture 5 Page 48
Compute π: Analysis
The accumulation of the local values of π using tree-based combining can be optimally performed in log₂(p) steps
The total runtime for the parallel algorithm, including the parallel computation and the combining, is (t_0: time per operation, t_c: communication cost per combining step):
T_p = (6·n/p + 6)·t_0 + log(p)·(t_0 + t_c)
The speedup of the parallel algorithm is:
S_p = T_s / T_p = 6n / (6·n/p + 6 + log(p)·(1 + t_c/t_0))
X. Sun (IIT) CS546 Lecture 5 Page 49
Compute π: Analysis
The Amdahl fraction for this parallel algorithm can be determined by rewriting the previous equation as:
S_p = p / (1 + p·(6 + log(p)·(1 + t_c/t_0)) / (6n)) = p / (1 + (p-1)·α(n,p))
Hence, the Amdahl fraction α(n,p) is:
α(n,p) = p·(6 + log(p)·(1 + t_c/t_0)) / (6n·(p-1))
The parallel algorithm is effective because:
α(n,p) → 0 as n → ∞, for fixed p
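A small C sketch (not from the slides) that evaluates the speedup model reconstructed above; the ratio t_c/t_0 = 100 and the values of n and p are assumed for illustration:

#include <math.h>
#include <stdio.h>

/* Speedup model for the parallel pi computation:
   S_p = 6n / (6n/p + 6 + log2(p)*(1 + tc/t0)).
   tc/t0 = 100 is an assumed value, for illustration only. */
double pi_speedup(double n, double p, double tc_over_t0) {
    return 6.0 * n / (6.0 * n / p + 6.0 + log2(p) * (1.0 + tc_over_t0));
}

int main(void) {
    double p = 64.0, tc_over_t0 = 100.0;
    double n;
    for (n = 1e3; n <= 1e7; n *= 100)
        printf("n = %.0e  S_p = %.1f\n", n, pi_speedup(n, p, tc_over_t0));
    /* As n grows with p fixed, S_p approaches p, i.e. alpha(n,p) -> 0. */
    return 0;
}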
X. Sun (IIT) CS546 Lecture 5 Page 50
Finite Differences: Problem
Consider a finite difference iterative method applied to a 2D grid where:
X^{t+1}_{i,j} = ω·(X^t_{i,j-1} + X^t_{i,j+1} + X^t_{i-1,j} + X^t_{i+1,j}) + (1-ω)·X^t_{i,j}
X. Sun (IIT) CS546 Lecture 5 Page 51
Finite Differences: Serial Algorithm
finitediff()
{
  for (t = 0; t < T; t++) {
    for (i = 0; i < n; i++) {
      for (j = 0; j < n; j++) {
        x[i,j] = w_1*(x[i,j-1] + x[i,j+1] + x[i-1,j] + x[i+1,j]) + w_2*x[i,j];
      }
    }
  }
}
X. Sun (IIT) CS546 Lecture 5 Page 52
Finite Differences: Parallel Algorithm
Each processor computes on a sub-grid of (n/√p) × (n/√p) points
Synchronization between processors after every iteration ensures correct values being used for subsequent iterations
X. Sun (IIT) CS546 Lecture 5 Page 53
Finite Differences: Parallel Algorithm
finitediff()
{
  row_id = my_processor_row_id();
  col_id = my_processor_col_id();
  p = number_of_processors();
  sp = sqrt(p);
  rows = cols = ceil(n/sp);
  row_start = row_id*rows;
  col_start = col_id*cols;
  for (t = 0; t < T; t++) {
    for (i = row_start; i < min(row_start+rows, n); i++) {
      for (j = col_start; j < min(col_start+cols, n); j++) {
        x[i,j] = w_1*(x[i,j-1] + x[i,j+1] + x[i-1,j] + x[i+1,j]) + w_2*x[i,j];
      }
    }
    barrier();   /* synchronize all processors after each iteration */
  }
}
X. Sun (IIT) CS546 Lecture 5 Page 54
Finite Differences: Analysis
The sequential algorithm performs 6 operations (2 multiplications, 4 additions) per grid point in every iteration. Hence, for an n×n grid, the number of operations executed in each iteration of the sequential algorithm is:
T_s = 6n²·t_0

x[i,j] = w_1*(x[i,j-1] + x[i,j+1] + x[i-1,j] + x[i+1,j]) + w_2*x[i,j];

2 multiplications
4 additions
X. Sun (IIT) CS546 Lecture 5 Page 55
Finite Differences: Analysis
The parallel algorithm uses p processors with static blockwise scheduling. Each processor computes on an m×m sub-grid allocated to it in a blockwise manner
The expression for m is given by m = ⌈n/√p⌉. The runtime of the parallel computation in each iteration is:
T_p = 6m²·t_0 = 6·⌈n/√p⌉²·t_0
X. Sun (IIT) CS546 Lecture 5 Page 56
Finite Differences: Analysis
The barrier synchronization needed for each iteration can be optimally performed in log(p) steps
The total runtime for the parallel algorithm for the computation is:
T_p = 6·⌈n/√p⌉²·t_0 + log(p)·(t_0 + t_c) ≈ 6·(n²/p)·t_0 + log(p)·(t_0 + t_c)
The speedup of the parallel algorithm is:
S_p = T_s / T_p = 6n² / (6·n²/p + log(p)·(1 + t_c/t_0))
X. Sun (IIT) CS546 Lecture 5 Page 57
Finite Differences: Analysis
The Amdahl fraction for this parallel algorithm can be determined by rewriting the previous equation as:
S_p = p / (1 + p·log(p)·(1 + t_c/t_0) / (6n²)) = p / (1 + (p-1)·α(n,p))
Hence, the Amdahl fraction α(n,p) is:
α(n,p) = p·log(p)·(1 + t_c/t_0) / (6n²·(p-1))
We finally note that:
α(n,p) → 0 as n → ∞, for fixed p
Hence, the parallel algorithm is effective
X. Sun (IIT) CS546 Lecture 5 Page 58
Equation Solver
A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])
(n × n grid)
procedure solve (A)
  while (!done) do
    diff = 0;
    for i = 1 to n do
      for j = 1 to n do
        temp = A[i,j];
        A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j]);
        diff += abs(A[i,j] - temp);
      end for
    end for
    if (diff/(n*n) < TOL) then done = 1;
  end while
end procedure
X. Sun (IIT) CS546 Lecture 5 Page 59
Workloads
Basic properties
Memory requirement: O(n²)
Computational complexity: O(n³), assuming the number of iterations to converge to be O(n)
Assume speedup equal to the number of processors p
Grid size (k: scaled grid size)
Fixed-size: fixed at n
Fixed-time: k³ = p·n³  ⇒  k = p^(1/3)·n
Memory-bound: k² = p·n²  ⇒  k = √p·n
X. Sun (IIT) CS546 Lecture 5 Page 60
Memory Requirement of Equation Solver
Fixed-size: n², i.e., n²/p per processor
Fixed-time: k² = (p^(1/3)·n)² = p^(2/3)·n², i.e., n²/p^(1/3) per processor
Memory-bound: k² = p·n², i.e., n² per processor
X. Sun (IIT) CS546 Lecture 5 Page 61
Time Complexity of Equation Solver
Sequential time complexity: n³
Fixed-size: n³/p
Fixed-time: k³ = p·n³, so the parallel time is k³/p = n³ (unchanged, by construction)
Memory-bound: k = √p·n, so the parallel time is k³/p = (√p·n)³/p = √p·n³ (grows with p)
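A small C sketch (with n normalized to 1 and the table layout chosen here) tabulating the scaled grid size and parallel time implied by the three scaling models above:

#include <math.h>
#include <stdio.h>

/* Equation-solver scaling models: memory ~ k^2, work ~ k^3, assumed ideal
   speedup p.  n is the original grid size, normalized to 1. */
int main(void) {
    double n = 1.0;
    double p;
    printf("    p  k(fixed-time)  k(memory-bound)  time(fixed-size)  time(memory-bound)\n");
    for (p = 1; p <= 1024; p *= 4) {
        double k_ft = cbrt(p) * n;        /* fixed-time:   k^3 = p*n^3 */
        double k_mb = sqrt(p) * n;        /* memory-bound: k^2 = p*n^2 */
        double t_fs = pow(n, 3) / p;      /* fixed-size parallel time  */
        double t_mb = pow(k_mb, 3) / p;   /* memory-bound parallel time = sqrt(p)*n^3 */
        printf("%5.0f  %13.2f  %15.2f  %16.4f  %18.2f\n", p, k_ft, k_mb, t_fs, t_mb);
    }
    return 0;
}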
X. Sun (IIT) CS546 Lecture 5 Page 62
Concurrency
Concurrency is proportional to the number of grid points
Fixed-size: n²
Fixed-time: k³ = p·n³, so concurrency = k² = (p^(1/3)·n)² = p^(2/3)·n²
Memory-bound: k² = p·n²
X. Sun (IIT) CS546 Lecture 5 Page 63
Communication to Computation Ratio
Fixed-size:
CCR = (n/√p) / (n²/p) = √p / n
Fixed-time: k = p^(1/3)·n
CCR = (k/√p) / (k²/p) = √p / k = p^(1/6) / n
Memory-bound: k = √p·n
CCR = √p / k = 1 / n
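A brief C sketch evaluating the three communication-to-computation ratios above for an assumed original grid size n = 1000:

#include <math.h>
#include <stdio.h>

/* Communication-to-computation ratio of the equation solver under the
   three scaling models above (n is the original grid size). */
int main(void) {
    double n = 1000.0;
    double p;
    printf("    p   fixed-size    fixed-time   memory-bound\n");
    for (p = 4; p <= 4096; p *= 4)
        printf("%5.0f  %10.5f  %12.5f  %13.5f\n", p,
               sqrt(p) / n,            /* grows with p        */
               pow(p, 1.0/6.0) / n,    /* grows slowly with p */
               1.0 / n);               /* independent of p    */
    return 0;
}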
X. Sun (IIT) CS546 Lecture 5 Page 64
Scalability
The Need for New Metrics
Comparison of performances with different workload
Availability of massively parallel processing
Scalability
Ability to maintain parallel processing gain when both
problem size and system size increase
X. Sun (IIT) CS546 Lecture 5 Page 65
Parallel Efficiency
The achieved fraction of the total potential parallel processing gain
Assuming linear speedup, S_p = p is the ideal case
The ability to maintain efficiency when problem size increases
E_p = S_p / p
X. Sun (IIT) CS546 Lecture 5 Page 66
Maintain Efficiency
Efficiency of adding n numbers in parallel:
E = 1/(1 + 2p·log(p)/n)
For an efficiency of 0.80 on 4 procs, n = 64
For an efficiency of 0.80 on 8 procs, n = 192
For an efficiency of 0.80 on 16 procs, n = 512
(Figure: "Efficiency for Various Data Sizes", efficiency vs. number of processors (1 to 32) for n = 64, 192, 320, 512)
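A minimal C sketch of the efficiency model above; n_for_efficiency (a name chosen here, not from the slides) inverts E = 1/(1 + 2p·log₂(p)/n) to reproduce the n = 64, 192, 512 figures:

#include <math.h>
#include <stdio.h>

/* Efficiency of adding n numbers on p processors (model above):
   E = 1 / (1 + 2*p*log2(p)/n). */
double efficiency(double n, double p) {
    return 1.0 / (1.0 + 2.0 * p * log2(p) / n);
}

/* Smallest n in this model that keeps efficiency at target E:
   n = 2*p*log2(p) * E/(1-E). */
double n_for_efficiency(double p, double E) {
    return 2.0 * p * log2(p) * E / (1.0 - E);
}

int main(void) {
    int p;
    for (p = 4; p <= 16; p *= 2)
        printf("p = %2d  n = %.0f\n", p, n_for_efficiency(p, 0.80));
    /* prints n = 64, 192, 512, matching the slide */
    return 0;
}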
X. Sun (IIT) CS546 Lecture 5 Page 67
Ideally Scalable
T(m p, m W) = T(p, W)
T: execution time
W: work executed
p: number of processors used
m: scale up m times
work: flop count based on the best practical serial algorithm
Fact:
T(m p, m W) = T(p, W)
if and only if
The Average Unit Speed Is Fixed
X. Sun (IIT) CS546 Lecture 5 Page 68
Definition:
The average unit speed is the achieved speed divided by
the number of processors

Definition (Isospeed Scalability):
An algorithm-machine combination is scalable if the
achieved average unit speed can remain constant with
increasing numbers of processors, provided the problem
size is increased proportionally
X. Sun (IIT) CS546 Lecture 5 Page 69
Isospeed Scalability (Sun & Rover, 91)
W: work executed when p processors are employed
W': work executed when p' > p processors are employed to maintain the average speed
Scalability ψ(p,p') = (p'·W) / (p·W')
Ideal case:
W' = p'·W / p,  so ψ(p,p') = 1
Scalability in terms of time:
ψ(p,p') = T_p(W) / T_p'(W') = (time for work W on p processors) / (time for work W' on p' processors)
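A small C sketch of the isospeed scalability metric above; the work figures W and W' are assumed sample values, not measurements from the slides:

#include <stdio.h>

/* Isospeed scalability: psi(p, p') = (p' * W) / (p * W'), where W' is the
   work needed on p' processors to keep the average unit speed constant. */
double isospeed_scalability(double p, double W, double p2, double W2) {
    return (p2 * W) / (p * W2);
}

int main(void) {
    double p = 4,  W  = 1.0e9;   /* baseline: 4 processors, 1 Gflop of work        */
    double p2 = 16, W2 = 5.0e9;  /* assumed: 5 Gflop needed on 16 processors       */
    /* Ideal scaling would need only (p2/p)*W = 4 Gflop, so psi = 4/5 = 0.8 here. */
    printf("psi(%g,%g) = %.2f\n", p, p2, isospeed_scalability(p, W, p2, W2));
    return 0;
}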
X. Sun (IIT) CS546 Lecture 5 Page 70
Isospeed Scalability (Sun & Rover)
W: work executed when p processors are employed
W': work executed when p' > p processors are employed to maintain the average speed
Scalability ψ(p,p') = (p'·W) / (p·W')
Ideal case: W' = p'·W / p,  so ψ(p,p') = 1
X. H. Sun, and D. Rover, "Scalability of Parallel Algorithm-Machine Combinations,"
IEEE Trans. on Parallel and Distributed Systems, May, 1994 (Ames TR91)
X. Sun (IIT) CS546 Lecture 5 Page 71
The Relation of Scalability and Time
More scalable leads to smaller time
Better initial run-time and higher scalability lead to
superior run-time
Same initial run-time and same scalability lead to
same scaled performance
Superior initial performance may not last long if
scalability is low
Range Comparison
X.H. Sun, "Scalability Versus Execution Time in Scalable Systems,"
Journal of Parallel and Distributed Computing, Vol. 62, No. 2, pp. 173-192, Feb 2002.
X. Sun (IIT) CS546 Lecture 5 Page 72
Range Comparison Via Performance Crossing Point
Assume program 1 is α times slower than program 2 at the initial state
Begin (Range Comparison)
  p' = p;
  Repeat
    p' = p' + 1;
    Compute the scalability of program 1, Φ(p,p');
    Compute the scalability of program 2, Ψ(p,p');
  Until (Φ(p,p') > α·Ψ(p,p') or p' = the limit of ensemble size)
  If Φ(p,p') > α·Ψ(p,p') Then
    p' is the smallest scaled crossing point;
    program 2 is superior at any ensemble size p*, p ≤ p* < p'
  Else
    program 2 is superior at any ensemble size p*, p ≤ p* ≤ p'
  End {if}
End {Range Comparison}
X. Sun (IIT) CS546 Lecture 5 Page 73
Range Comparison
(Figures: influence of communication speed; influence of computing speed)
X.H. Sun, M. Pantano, and Thomas Fahringer, "Integrated Range Comparison for Data-Parallel
Compilation Systems," IEEE Trans. on Parallel and Distributed Processing, May 1999.
X. Sun (IIT) CS546 Lecture 5 Page 74
The SCALA (SCALability Analyzer) System
Design Goals
Predict performance
Support program optimization
Estimate the influence of hardware variations
Uniqueness
Designed to be integrated into advanced compiler
systems
Based on scalability analysis
X. Sun (IIT) CS546 Lecture 5 Page 75
Vienna Fortran Compilation System
A data-parallel restructuring compilation system
Consists of a parallelizing compiler for VF/HPF
and tools for program analysis and restructuring
Under a major upgrade for HPF2
Performance prediction is crucial for
appropriate program restructuring
X. Sun (IIT) CS546 Lecture 5 Page 76
The Structure of SCALA
X. Sun (IIT) CS546 Lecture 5 Page 77
Prototype Implementation
Automatic range comparison for different data distributions
The P³T static performance estimator
Test cases: Jacobi and Red-Black
(Figures: a case with no crossing point and a case with a crossing point)
X. Sun (IIT) CS546 Lecture 5 Page 78
Summary
Relation between iso-speed scalability and iso-efficiency scalability
Both measure the ability to maintain parallel efficiency, defined as
E_p = S_p / p
where iso-efficiency's speedup is the traditional speedup, defined as
S_p = Uniprocessor Execution Time / Parallel Execution Time
Iso-speed's speedup is the generalized speedup, defined as
S_p = Parallel Speed / Sequential Speed
If the sequential execution speed is independent of problem size, iso-speed and iso-efficiency are equivalent
Due to the memory hierarchy, sequential execution performance varies largely with problem size
X. Sun (IIT) CS546 Lecture 5 Page 79
Summary
Predicting the sequential execution performance becomes a major task of SCALA due to the advanced memory hierarchy
Memory-LogP model is introduced for data access cost
New challenge in distributed computing
Generalized iso-speed scalability
Generalized performance tool: GHS
K. Cameron and X.-H. Sun, "Quantifying Locality Effect in Data Access Delay: Memory logP,"
Proc. of 2003 IEEE IPDPS 2003, Nice, France, April, 2003.
X.-H. Sun and M. Wu, "Grid Harvest Service: A System for Long-Term, Application-Level Task
Scheduling," Proc. of 2003 IEEE IPDPS 2003, Nice, France, April, 2003.
