
Motivation for Parallelism

- The speed of an application is determined by more than just processor speed:
  - Memory speed
  - Disk speed
  - Network speed
  - ...
- Multiprocessors typically improve the aggregate speeds:
  - Memory bandwidth is improved by separate memories.
  - Multiprocessors usually have more aggregate cache memory.
  - Each processor in a cluster can have its own disk and network adapter, improving aggregate speeds.

Motivation for Parallelism

- Communication enables parallel applications:
  - Harnessing the computing power of distributed systems over the Internet is a popular example of parallel processing (SETI, Folding@home, ...).
- Constraints on the location of data:
  - Huge data sets could be difficult, expensive, or otherwise infeasible to store in a central location.
  - Distributed data and parallel processing is a practical solution.

Types of Parallelism

- Instruction Level Parallelism (ILP)
  - Instructions near each other in an instruction stream could be independent.
  - These can then execute in parallel:
    - either partially (pipelining),
    - or fully (superscalar).
  - Hardware is needed for dependency tracking.
  - The amount of ILP available per instruction stream is limited.
  - ILP is usually considered an implicit parallelism, since the hardware automatically exploits it without programmer/compiler intervention.
  - Programmers/compilers could transform applications to expose more ILP.

ILP Example: Loop Unrolling

    for( i = 0; i < 64; i++ ) {
        vec[i] = a * vec[i];
    }

This loop has one load, multiply, store sequence per iteration. The amount of ILP is very limited.

    for( i = 0; i < 64; i += 4 ) {
        vec[i+0] = a * vec[i+0];
        vec[i+1] = a * vec[i+1];
        vec[i+2] = a * vec[i+2];
        vec[i+3] = a * vec[i+3];
    }

4-fold loop unrolling gives four independent load, multiply, store sequences per iteration, increasing the amount of ILP exploitable by the hardware.



Types of Parallelism

- Task Level Parallelism (TLP)
  - Several instruction streams are independent.
  - Parallel execution of the streams is possible.
  - Much more coarse-grain parallelism compared with ILP.
  - Typically requires the programmer and/or compiler to:
    - decompose the application into tasks,
    - enforce dependencies,
    - and expose parallelism.
  - Some experimental techniques, such as speculative multithreading, aim to remove these burdens from the programmer.

TLP Example: Quicksort

    QuickSort( A, B ):
        if( B - A < 10 ) {
            /* Base case: use a fast sort */
            FastSort( A, B );
        } else {
            /* Continue recursively */
            Partition [A,B] into [A,C] and [C+1,B];   /* Task X */
            QuickSort( A, C );                        /* Task Y */
            QuickSort( C+1, B );                      /* Task Z */
        }

Task dependence graph: both Y and Z depend on X but are mutually independent, so Y and Z can execute in parallel (see the sketch below).
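A minimal sketch of how the two independent tasks could actually run in parallel, assuming C with POSIX threads; the helper routines (insertion_sort, partition_range) and the one-thread-per-call strategy are illustrative choices, not something the slides prescribe.

    #include <pthread.h>

    struct range { int *v; int lo; int hi; };          /* inclusive bounds */

    static void quicksort_range(int *v, int lo, int hi);

    static void insertion_sort(int *v, int lo, int hi) /* base case ("FastSort") */
    {
        for (int i = lo + 1; i <= hi; i++) {
            int key = v[i], j = i - 1;
            while (j >= lo && v[j] > key) { v[j + 1] = v[j]; j--; }
            v[j + 1] = key;
        }
    }

    static int partition_range(int *v, int lo, int hi) /* Task X */
    {
        int pivot = v[hi], i = lo;
        for (int j = lo; j < hi; j++)
            if (v[j] < pivot) { int t = v[i]; v[i] = v[j]; v[j] = t; i++; }
        int t = v[i]; v[i] = v[hi]; v[hi] = t;
        return i;                                      /* pivot position C */
    }

    static void *quicksort_task(void *arg)             /* thread entry point */
    {
        struct range *r = arg;
        quicksort_range(r->v, r->lo, r->hi);
        return NULL;
    }

    static void quicksort_range(int *v, int lo, int hi)
    {
        if (hi - lo < 10) {
            insertion_sort(v, lo, hi);                 /* base case */
        } else {
            int c = partition_range(v, lo, hi);        /* Task X */
            struct range left = { v, lo, c - 1 };
            pthread_t t;
            pthread_create(&t, NULL, quicksort_task, &left);  /* Task Y */
            quicksort_range(v, c + 1, hi);                    /* Task Z */
            pthread_join(t, NULL);   /* Y and Z never wait on each other */
        }
    }

In practice one would cap the number of threads spawned (for example by recursion depth); the sketch keeps one thread per task to mirror the dependence graph above. It builds with something like cc -pthread.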


Types of Parallelism

- Data Parallelism (DP)
  - In many applications a collection of data is transformed in such a way that the operations on each element are largely independent of the others.
  - A typical scenario is when we apply the same instruction to a collection of data.
  - Example: adding two arrays (see the code sketch below).

          3 7 4 3 6 5 4
        + 4 1 5 6 2 4 3
        = 7 8 9 9 8 9 7

    The same operation (+) is applied to a whole collection of data.

Superscalar and OoO execution

- Scalar processors can issue one instruction per cycle.
- Superscalar processors can issue more than one instruction per cycle.
  - A common feature in most modern processors.
  - By replicating functional units and adding hardware to detect and track instruction dependencies, a superscalar processor takes advantage of Instruction Level Parallelism.
- A related technique (applicable to both scalar and superscalar processors) is out-of-order (OoO) execution.
  - Instructions are reordered (by hardware) for better utilization of the pipeline(s).
- An excellent example of using extra transistors to speed up execution without programmer intervention.
- Naturally limited by the available ILP.
- Also severely limited by the hardware complexity of dependency checking.
- In practice: 2-way superscalar architectures are common; more than 4-way is unlikely.
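A minimal code sketch of the data-parallel array addition above; OpenMP's parallel for is used purely as an illustration, the slides do not tie data parallelism to any particular API.

    #include <stdio.h>

    #define N 7

    int main(void)
    {
        int a[N] = { 3, 7, 4, 3, 6, 5, 4 };
        int b[N] = { 4, 1, 5, 6, 2, 4, 3 };
        int c[N];

        /* Every element-wise addition is independent of the others,
           so the iterations can be distributed over the available cores. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        for (int i = 0; i < N; i++)
            printf("%d ", c[i]);         /* prints: 7 8 9 9 8 9 7 */
        printf("\n");
        return 0;
    }

Compiled with OpenMP support (e.g. cc -fopenmp) the loop iterations are spread over the cores; without it the pragma is ignored and the loop simply runs sequentially.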



Vector Processors

- Vector processors refer to a previously common supercomputer architecture in which the vector is a basic memory abstraction.
  - Examples: Cray 1, IBM 3090/VF.
  - Vector: a 1D array of numbers.
- Example: adding two vectors
  - Scalar solution:
    - Loop through the vectors and add each scalar element.
    - Repeated address translations.
    - Branches.
  - Vector solution:
    - Add the vectors via a single vector addition instruction.
    - Address translation is done once.
    - No branches.
  - Independent operations enable:
    - deeper pipelines,
    - higher clock frequencies, and
    - multiple concurrent functional units (e.g., SIMD units).

Vector Processors

- Vector processors are suitable for a certain range of applications.
- Traditional scalar processor designs have been successful over a larger spectrum of applications.
- Economic realities favor large clusters of commodity processors or small-scale SMPs.


Single Instruction Multiple Data (SIMD)

- Several functional units execute the same instruction on different data streams simultaneously and in a synchronized fashion.
- A suitable architecture for many data-parallel applications:
  - Matrix computations
  - Graphics processing
  - Image analysis
  - ...
- Found primarily in common microprocessors and GPUs:
  - SIMD instruction extensions such as MMX, SSE, and AltiVec (see the sketch below).

Shared Memory Multiprocessor

- Multiprocessors in which all processors share a single address space are commonly called Shared Memory Multiprocessors.
- They can be classified by the access time different processors have to different memory areas:
  - Uniform Memory Access (UMA): each processor has the same access time.
  - Non-Uniform Memory Access (NUMA): some memory is closer to a processor; access time to distant memory is higher.
- Furthermore, their caches may or may not be coherent.
  - CC-UMA: Cache-Coherent Uniform Memory Access, also known as a Symmetric MultiProcessor (SMP).

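A minimal sketch of the SIMD idea using SSE intrinsics, one of the extensions named on the SIMD slide above; it assumes an x86 compiler with SSE support and is not taken from the slides.

    #include <xmmintrin.h>                   /* SSE intrinsics */

    /* Element-wise addition of two float arrays; for simplicity the
       length n is assumed to be a multiple of 4.                    */
    void vec_add_sse(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]); /* load 4 floats from a       */
            __m128 vb = _mm_loadu_ps(&b[i]); /* load 4 floats from b       */
            __m128 vc = _mm_add_ps(va, vb);  /* 4 additions, 1 instruction */
            _mm_storeu_ps(&c[i], vc);        /* store 4 results into c     */
        }
    }

Each SSE instruction operates on four packed single-precision values at once, which is exactly the "same instruction, multiple data streams" pattern described above.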


Bus-Based UMA MP

[Figure: four chips, each containing a processor and its cache, attached to a shared bus; a single main memory is reached over the bus, so every processor sees the same access time.]

NUMA MP

[Figure: processors paired with their own local memories, the pairs connected through an interconnect; every processor can reach all memories, but remote memory takes longer to access.]


Multicore

- When several processor cores are physically located in the same processor socket we refer to it as a multicore processor.
- Both Intel and AMD now have quad-core (4 cores) processors in their product portfolios.
- A new desktop computer today is definitely a multiprocessor.
- Multicores usually have a single address space and are cache-coherent.
- They are very similar to SMPs, but they:
  - typically share one or more levels of cache, and
  - have more favorable inter-processor/core communication speed.

Multicore

[Figure: a single chip with four cores, each with a private L1 cache, all sharing one L2 cache that connects to main memory.]



Multicore

- Multicore chips have multiple benefits:
  - Higher peak performance.
  - Power consumption control:
    - some cores can be turned off.
  - Production yield increase:
    - 8-core chips with a defective core are sold with one core disabled.
- ...but also some potential drawbacks:
  - Memory bandwidth per core is limited:
    - physical limits such as the number of pins.
  - Lower peak performance per thread:
    - some inherently sequential applications may actually run slower.

Distributed Memory Multiprocessor

- In contrast to a single address space, machines with multiple private memories are commonly called Distributed Memory Machines.
- Data is exchanged between memories via messages communicated over a dedicated network (see the sketch below).
- When the processor/memory pairs are physically separated, such as on different boards or in different casings, such machines are called Clusters.

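A minimal sketch of message-based data exchange between private memories; MPI is used here as a representative message-passing interface, which the slides do not name. Run with (at least) two processes, one per node.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                    /* data lives in rank 0's memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* The only way for rank 1 to obtain the data is to receive
               a message; there is no shared address space.             */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

With an MPI installation this would typically be built and launched with something like mpicc and mpirun -np 2.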

Cluster

[Figure: several nodes, each with its own processor, cache, and memory, connected to one another by a network.]

Clusters: Past and Present

- In the past, clusters were exclusively high-end machines with:
  - custom supercomputer processors, and
  - custom high-performance interconnects.
- These machines were very expensive and therefore limited to research and big corporations.
- From the '90s onwards it has become increasingly common to build clusters from off-the-shelf components:
  - commodity processors, and
  - commodity interconnects (e.g. Ethernet with switches).
- The economic benefits and programmer familiarity with commodity components far outweigh the performance issues.
- This has helped "democratize" supercomputing:
  - many corporations and universities have clusters today.



Networks

- Access to memory on other nodes is very expensive:
  - Data must be transferred over a relatively high-latency, low-bandwidth network.
  - Algorithms with low data locality will suffer.
  - High synchronization requirements will also degrade performance, for the same reason.
- The network design is a tradeoff between conflicting goals:
  - Maximum bandwidth and low latency favor full connectivity.
  - Low cost and power consumption favor a tree network.
- Switch-based networks are common today; other example topologies include rings, meshes, hypercubes, and trees.
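As a rough illustration of these costs (the numbers below are assumptions, not from the slides): modelling a transfer as time = latency + message size / bandwidth, a 1 KB message over a network with 50 µs latency and 100 MB/s bandwidth costs about 50 µs + 10 µs = 60 µs. Latency dominates small messages, which is why frequent synchronization and poor data locality are so costly.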

