The speed of an application is determined by more than just
processor speed:
- Memory speed
- Disk speed
- Network speed
- ...

Multiprocessors typically improve the aggregate speeds:
- Memory bandwidth is improved by separate memories.
- Multiprocessors usually have more aggregate cache memory.
- Each processor in a cluster can have its own disk and network
  adapter, improving aggregate speeds.

Communication enables parallel applications:
- Harnessing the computing power of distributed systems over the
  Internet is a popular example of parallel processing
  (SETI, Folding@home, ...).

Constraints on the location of data:
- Huge data sets could be difficult, expensive, or otherwise
  infeasible to store in a central location.
- Distributed data and parallel processing is a practical solution.
Instruction Level Parallelism (ILP)

Instructions near each other in an instruction stream could be
independent. These can then execute in parallel, either partially
(pipelining) or fully (superscalar).
- Hardware is needed for dependency tracking.
- The amount of ILP available per instruction stream is limited.

ILP is usually considered an implicit parallelism, since the hardware
automatically exploits it without programmer/compiler intervention.
Programmers/compilers could, however, transform applications to
expose more ILP.

    for( i = 0; i < 64; i++ ) {
        vec[i] = a * vec[i];
    }

This loop has one sequence of load, multiply, store per iteration.
The amount of ILP is very limited.

    for( i = 0; i < 64; i += 4 ) {
        vec[i+0] = a * vec[i+0];
        vec[i+1] = a * vec[i+1];
        vec[i+2] = a * vec[i+2];
        vec[i+3] = a * vec[i+3];
    }

4-fold loop unrolling yields four independent load, multiply, store
sequences per iteration, increasing the amount of ILP exploitable by
the hardware.
Superscalar Processors

Scalar processors can issue one instruction per cycle. Superscalar
processors can issue more than one instruction per cycle, a common
feature in most modern processors.
- By replicating functional units and adding hardware to detect and
  track instruction dependencies, a superscalar processor takes
  advantage of Instruction Level Parallelism.
- A related technique (applicable to both scalar and superscalar
  processors) is out-of-order (OoO) execution: instructions are
  reordered (by hardware) for better utilization of the pipeline(s).

This is an excellent example of the use of extra transistors to speed
up execution without programmer intervention. It is naturally limited
by the available ILP, and also severely limited by the hardware
complexity of dependency checking. In practice, 2-way superscalar
architectures are common; more than 4-way is unlikely.

Data Parallelism (DP)

In many applications a collection of data is transformed in such a
way that the operation on each element is largely independent of the
others. A typical scenario is when we apply the same instruction to a
collection of data. Example: adding two arrays, where the same
operation (+) is applied across the whole collection:

      3   7   4   3   6   5   4
    + 4 + 1 + 5 + 6 + 2 + 4 + 3
    ---------------------------
      7   8   9   9   8   9   7
SIMD

Several functional units execute the same instruction on different
data streams, simultaneously and synchronized. This is a suitable
architecture for many data parallel applications:
- Matrix Computations
- Graphics Processing
- Image Analysis
- ...

Found primarily in common microprocessors and GPUs:
SIMD instruction extensions such as MMX, SSE, AltiVec.

Shared Memory Multiprocessors

Multiprocessors where all processors share a single address space are
commonly called Shared Memory Multiprocessors. They can be classified
based on how long access time different processors have to different
memory areas:
- Uniform Memory Access (UMA): each processor has the same access
  time.
- Non-Uniform Memory Access (NUMA): some memory is closer to a
  processor; access time is higher to distant memory.

Furthermore, their caches could be coherent or not.
CC-UMA (Cache-Coherent Uniform Memory Access) is also known as a
Symmetric MultiProcessor (SMP).
[Figure: two multicore chips connected by a bus interconnect]
Multicore Processors

When several processor cores are physically located in the same
processor socket, we refer to it as a multicore processor. Both Intel
and AMD now have quad-core (4 cores) processors in their product
portfolios; a new desktop computer today is definitely a
multiprocessor.

Multicores usually have a single address space and are
cache-coherent. They are very similar to SMPs, but they
- typically share one or more levels of cache, and
- have more favorable inter-processor/core communication speed.

[Figure: one chip with four cores, each with a private L1 cache,
sharing an L2 cache in front of memory]
Multicore chips have multiple benefits:
- Higher peak performance
- Power consumption control: some cores can be turned off.
- Production yield increase: 8-core chips with a defective core can
  be sold with one core disabled.
- ...

...but also some potential drawbacks:
- Memory bandwidth per core is limited by physical limits such as
  the number of pins.
- Lower peak performance per thread: some inherently sequential
  applications may actually run slower.

Distributed Memory Machines

In contrast to a single address space, machines with multiple
private memories are commonly called Distributed Memory Machines.
Data is exchanged between memories via messages communicated over a
dedicated network. When the processor/memory pairs are physically
separated, such as on different boards or in different casings, such
machines are called Clusters.