
UNIT 5

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM

CONTENT

INTRODUCTION

SYMMETRIC AND SHARED MEMORY ARCHITECTURES

PERFORMANCE OF SYMMETRIC SHARED MEMORY ARCHITECTURES

DISTRIBUTED SHARED MEMORY AND DIRECTORY-BASED COHERENCE

BASICS OF SYNCHRONIZATION

MODELS OF MEMORY CONSISTENCY

FACTORS THAT TREND TOWARD MULTIPROCESSORS

1. A growing interest in servers and server performance
2. A growth in data-intensive applications
3. The insight that increasing performance on the desktop is less important
4. An improved understanding of how to use multiprocessors effectively
5. The advantage of leveraging a design investment by replication rather than unique design

A TAXONOMY OF PARALLEL ARCHITECTURES

1. Single instruction stream, single data stream (SISD)
2. Single instruction stream, multiple data streams (SIMD)
3. Multiple instruction streams, single data stream (MISD)
4. Multiple instruction streams, multiple data streams (MIMD)

SISD
Uniprocessor

SIMD
The same instruction is executed by multiple processors using different data streams
Exploits data-level parallelism
Each processor has its own data memory
There is a single instruction memory
A control processor fetches and dispatches instructions

MIMD
Each processor fetches its own instructions and operates on its own data
Exploits thread-level parallelism

FACTORS THAT CONTRIBUTED TO THE RISE OF MIMD

1. Flexibility
Can function as a single-user multiprocessor focusing on high performance for one application
Can run multiple tasks simultaneously

2. Cost-performance
Uses the same microprocessors found in workstations and single-processor servers
Multicore chips leverage the design investment by replication

CLUSTERS
One class of MIMD machines
Built from standard components and network technology
Two types:

Commodity clusters
Custom clusters

COMMODITY CLUSTERS
Rely on third-party processors and interconnect technology
Are often blade or rack-mounted servers
Focus on throughput
Applications involve no communication among threads
Assembled by users rather than vendors

CUSTOM CLUSTERS
The designer customizes the detailed node design, the interconnect design, or both
Exploit large amounts of parallelism
Require a significant amount of communication during computation
More efficient
Ex.: IBM Blue Gene

MULTICORE
Multiple processors placed on a single die
A.k.a. on-chip multiprocessing or single-chip multiprocessing
The cores share resources (caches, I/O buses)
Ex.: IBM Power5

PROCESS
A segment of code that may be run independently
The process state contains all the information necessary to execute that program
Each process is independent of the others: a multiprogramming environment

THREADS
Multiple processors executing a single program, sharing the code and address space
Grain size must be large to exploit parallelism efficiently
Independent threads within a process are identified by the programmer or created by the compiler
Loop iterations within a thread can exploit data-level parallelism

MIMD CLASSIFICATION

1. Centralized shared memory architectures
2. Distributed memory multiprocessors

CENTRALIZED SHARED MEMORY ARCHITECTURES

A few dozen processors share a single centralized memory
Large caches or multiple memory banks are used
Scaling is done using point-to-point connections, switches, and multiple memory banks
Symmetric relationship among the processors
Uniform access time to memory
Called a Symmetric Shared Memory Multiprocessor (SMP) or Uniform Memory Access (UMA) architecture

DISTRIBUTED MEMORY MULTIPROCESSORS

Physically distributed memory
Supports a large number of processors and high memory bandwidth
Raises the need for a high-bandwidth interconnect
Both direct networks (switches) and indirect networks (multidimensional meshes) are used

BENEFITS:
1. Cost-effective way to scale memory bandwidth
2. Reduced latency for accesses to local memory

DRAWBACKS:
1. Communicating data between processors is more complex
2. Software effort is needed to take advantage of the increased memory bandwidth

MODELS FOR COMMUNICATION AND MEMORY ARCHITECTURE

1. Communication occurs through a shared address space
Physically separated memories form one logical shared address space
Called a Distributed Shared Memory (DSM) or Non-Uniform Memory Access (NUMA) architecture
A memory reference can be made by any processor to any memory location
Access time depends on where the data is located in memory

2. The address space consists of multiple private address spaces
The addresses are logically disjoint
Memory cannot be addressed by a remote processor
The same physical address on different processors refers to different memory locations
Each processor-memory module is a separate computer
Communication is done via message passing
A.k.a. message-passing multiprocessors

CHALLENGES OF MULTIPROCESSING

1. Limited parallelism available in programs
2. Relatively high cost of communication
3. Large latency of remote access
4. Difficult to achieve good speedup; performance is measured using Amdahl's law

SOLUTIONS
Limited parallelism: algorithms with better parallel performance
Access latency: better architecture design and programming
Reduce the frequency of remote accesses: hardware and software mechanisms
Tolerate latency: multithreading and prefetching

PROBLEM

Suppose you want to achieve a speedup of 80 with 100 processors. What fraction of the original computation can be sequential?

Assume that the program operates in only two modes:
1. Parallel, with all processors fully used (the enhanced mode)
2. Serial, with only one processor in use

The speedup in enhanced mode equals the number of processors, and the fraction of enhanced mode is the fraction of time spent in parallel mode.

The parallel fraction must be 99.75%, so at most 0.25% of the original computation can be sequential.
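A worked derivation via Amdahl's law (writing F for the fraction of time in the fully parallel mode):

\[
80 = \frac{1}{(1 - F) + \dfrac{F}{100}}
\;\Rightarrow\; (1 - F) + \frac{F}{100} = \frac{1}{80} = 0.0125
\;\Rightarrow\; 0.99\,F = 0.9875
\;\Rightarrow\; F \approx 0.9975
\]

Hence the sequential fraction is \(1 - F \approx 0.0025\), i.e., 0.25%.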

SYMMETRIC SHARED MEMORY ARCHITECTURES

The use of multilevel caches substantially reduces the memory bandwidth demands of a processor
Solution: build small-scale multiprocessors where several processors share a single physical memory connected by a shared bus
Benefit: cost-effective
They support caching of both private and shared data

Private data: used by a single processor
Shared data: shared between multiple processors
How are these cached?

WHAT IS MULTIPROCESSOR CACHE COHERENCE?

A memory system is said to be coherent if:
1. A read by processor P to a location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P
2. A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses
3. Writes to the same location are serialized; two writes to the same location by any two processors are seen in the same order by all processors

Coherence: defines the behavior of reads and writes to the same memory location
Consistency: defines the behavior of reads and writes with respect to accesses to other memory locations

BASIC SCHEMES FOR ENFORCING COHERENCE

Coherent caches provide:
1. Migration: a data item can be moved to a local cache and used there
2. Replication: shared data can be read simultaneously in multiple caches
The protocols that maintain coherence for multiple processors are called cache coherence protocols

1. Directory based: the sharing status of a block of physical memory is kept in just one location, the directory
2. Snooping: every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block; no centralized state is kept

SNOOPING PROTOCOLS

1. Write invalidate
2. Write update

BASIC IMPLEMENTATION TECHNIQUES

1. The processor acquires bus access and broadcasts the address to be invalidated on the bus
2. All processors continuously snoop on the bus, watching for addresses
3. Each processor checks whether the address on the bus is in its cache
4. If so, it invalidates the corresponding data in its cache
5. If two processors attempt to write shared blocks at the same time, their attempts to broadcast an invalidate operation are serialized by the bus

The C sketch after this list illustrates the snooping side (steps 2 through 4).
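A minimal C sketch of the snooping side, assuming a hypothetical direct-mapped cache of tagged lines; snoop_invalidate() is the check each snooping cache would run for every invalidation address observed on the bus:

#include <stdint.h>

#define NUM_LINES 256                 /* assumed direct-mapped cache size */
#define LINE_SHIFT 6                  /* assumed 64-byte blocks */

typedef enum { INVALID, SHARED, EXCLUSIVE } line_state_t;

typedef struct {
    uint64_t     tag;                 /* which block this line holds */
    line_state_t state;
} cache_line_t;

static cache_line_t cache[NUM_LINES];

/* Run by every snooping cache for each invalidation address on the bus. */
void snoop_invalidate(uint64_t addr)
{
    uint64_t block = addr >> LINE_SHIFT;
    cache_line_t *line = &cache[block % NUM_LINES];

    /* Step 3: is the broadcast address present in this cache? */
    if (line->state != INVALID && line->tag == block) {
        /* Step 4: invalidate the local copy. */
        line->state = INVALID;
    }
}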

Write update: broadcasts the written data to all the cached copies of the block
Consumes more bus bandwidth

Write-through cache: written data is always sent to memory
The most recent value of a data item can therefore always be fetched from memory

Write-back cache: every processor snoops each address placed on the bus
If a processor finds that it has a dirty copy of the requested cache block, it provides that cache block in response to the read request
This causes the memory access to be aborted
The cache block is then retrieved from that processor's cache

To track whether a cache block is shared, an extra state bit is associated with each cache block
When a write to a shared block occurs, the cache generates an invalidation on the bus and marks the block as exclusive
The processor with this sole copy of the block is called the owner of the block

When the invalidation is sent, the state of the block in the owner's cache is changed from shared to exclusive
Later, if another processor requests the cache block, the state has to be made shared again

WRITE INVALIDATE FOR A WRITE-BACK CACHE

In the state diagram:
Circles: cache states
Arcs: state transitions
Labels on the arcs: the stimulus that causes each transition
Bold: bus actions caused by transitions
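The diagram itself is not reproduced here, but the processor-side transitions of an MSI-style write-back invalidate protocol can be sketched in C; the state names and bus-action helpers below are illustrative assumptions, not a specific machine's protocol:

#include <stdint.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } state_t;

/* Hypothetical bus actions (the bold labels in the diagram). */
void bus_read_miss(uint64_t block);
void bus_write_miss(uint64_t block);
void bus_invalidate(uint64_t block);

/* Transitions of one cache block in response to its own processor. */
state_t cpu_read(state_t s, uint64_t block)
{
    if (s == INVALID) {               /* read miss: fetch the block */
        bus_read_miss(block);
        return SHARED;
    }
    return s;                         /* SHARED or EXCLUSIVE: read hit */
}

state_t cpu_write(state_t s, uint64_t block)
{
    if (s == INVALID)                 /* write miss: fetch with ownership */
        bus_write_miss(block);
    else if (s == SHARED)             /* write hit on a shared block: */
        bus_invalidate(block);        /* invalidate all other copies */
    return EXCLUSIVE;                 /* this cache is now the owner */
}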

LIMITATIONS
As the number of processors in a multiprocessor grows, or as the memory demands of each processor grow, any centralized resource becomes a bottleneck
A single bus has to carry both the coherence traffic and the normal memory traffic
Designers can instead use multiple buses and interconnection networks
This yields designs midway between a centralized shared memory and a distributed memory

PERFORMANCE OF SYMMETRIC SHARED MEMORY MULTIPROCESSORS

Coherence misses can be broken into two sources:
1. True sharing misses: the first write by a processor to a shared cache block causes an invalidation to establish ownership of the block; a subsequent attempt by another processor to read a modified word in that cache block results in a miss
2. False sharing misses: the block is invalidated because some word in the cache block other than the one being read is written into

PROBLEM 3: Assume that words x1 and x2 are in the same cache block, which is in the shared state in the caches of both P1 and P2. Assuming the following sequence of events, identify each miss as a true sharing miss, a false sharing miss, or a hit. Any miss that would occur if the block size were one word is designated a true sharing miss.
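False sharing is easy to reproduce in practice. The following C sketch is only an illustration (the struct layout stands in for "x1 and x2 in the same cache block," assuming 64-byte lines): each thread writes its own word, yet every write invalidates the block in the other processor's cache.

#include <pthread.h>
#include <stdio.h>

/* x1 and x2 sit in the same cache block (assuming 64-byte lines), so
 * writes to x1 invalidate the cached copy holding x2, and vice versa. */
static struct { long x1; long x2; } block;

static void *writer1(void *arg)
{
    (void)arg;
    for (long i = 0; i < 10000000; i++)
        block.x1 = i;                 /* false sharing misses on P2 */
    return NULL;
}

static void *writer2(void *arg)
{
    (void)arg;
    for (long i = 0; i < 10000000; i++)
        block.x2 = i;                 /* false sharing misses on P1 */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer1, NULL);
    pthread_create(&t2, NULL, writer2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    puts("done");
    return 0;
}

Padding the struct so that x1 and x2 fall in different cache lines removes these misses without changing the program's logic.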

DISTRIBUTED SHARED MEMORY AND DIRECTORY-BASED COHERENCE

A directory keeps the state of every cached block
The information in the directory includes which caches have copies of the block, whether it is dirty, and so on
A directory entry is associated with each memory block
To prevent the directory from becoming a bottleneck, it is distributed along with the memory

DIRECTORY-BASED CACHE COHERENCE PROTOCOLS

The state of each cache block can be one of the following:
1. Shared: one or more processors have the block cached, and the value in memory (and in all the caches) is up to date
2. Uncached: no processor has a copy of the cache block
3. Modified: exactly one processor has a copy of the cache block and it has written the block, so the memory copy is out of date; that processor is the owner of the block

To keep track of each potentially shared block, a bit vector is maintained per block; each bit indicates whether the corresponding processor has a copy of the block
Local node: the node where the request originates
Home node: the node where the memory location and the directory entry of an address reside
Remote node: a node that has a copy of the block, whether exclusive or shared
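A directory entry can be sketched as a small C struct; this is a minimal illustration (the field names and the 64-processor limit are assumptions) combining the three states above with the presence bit vector:

#include <stdint.h>

typedef enum { UNCACHED, SHARED, MODIFIED } dir_state_t;

/* One directory entry per memory block, kept at the block's home node. */
typedef struct {
    dir_state_t state;
    uint64_t    sharers;   /* presence bit vector: bit p is set if
                              processor p has a copy (assumes at most
                              64 processors) */
    int         owner;     /* meaningful only in the MODIFIED state */
} dir_entry_t;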

DIRECTORY-BASED CACHE COHERENCE PROTOCOLS

When a block is in the uncached state, the possible requests for it are:
1. Read miss: the requesting processor is sent the block from memory; the state of the block is made shared
2. Write miss: the requesting processor is sent the value and becomes the sharing node; the block is made exclusive

When a block is in the shared state, the memory value is up to date:
1. Read miss: the requesting processor is sent the requested data from memory, and the requesting processor is added to the sharing set
2. Write miss: the requesting processor is sent the value; all other processors in the sharing set are sent invalidate messages that contain the identity of the requesting processor; the sharing set is made to contain only the requesting processor, and the state of the block is made exclusive

When a block is in the exclusive state, the current value of the block is held in the owner processor's cache:
1. Read miss: the owner processor is sent a data fetch message; the state of the block is made shared; the requesting processor is added to the sharing set, which still contains the identity of the owner

2. Data write-back: the owner processor is replacing the block and therefore must write it back; the memory copy is made up to date, the block becomes uncached, and the sharing set is emptied
3. Write miss: the block has a new owner; a message is sent to the old owner telling it to invalidate the block and supply the value, which is forwarded to the requester; the sharing set is made to contain the new owner, and the state of the block remains exclusive
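The uncached, shared, and exclusive cases above can be folded into two dispatch routines at the home node. The sketch below is illustrative only: it repeats the dir_entry_t from earlier for self-containment (its MODIFIED state corresponds to "exclusive" here), and send_data(), send_fetch(), and send_invalidate() are hypothetical message helpers, not a real interconnect API.

#include <stdint.h>

typedef enum { UNCACHED, SHARED, MODIFIED } dir_state_t;
typedef struct { dir_state_t state; uint64_t sharers; int owner; } dir_entry_t;

void send_data(int dest, uint64_t block);            /* reply with the block */
void send_fetch(int owner, uint64_t block);          /* make owner write back */
void send_invalidate(int dest, uint64_t block, int requester);

void handle_read_miss(dir_entry_t *e, uint64_t block, int requester)
{
    if (e->state == MODIFIED) {
        send_fetch(e->owner, block);                 /* owner supplies data */
        e->sharers |= 1ull << e->owner;              /* owner stays a sharer */
    }
    send_data(requester, block);
    e->sharers |= 1ull << requester;                 /* add to sharing set */
    e->state = SHARED;
}

void handle_write_miss(dir_entry_t *e, uint64_t block, int requester)
{
    if (e->state == SHARED) {                        /* invalidate other copies */
        for (int p = 0; p < 64; p++)
            if (((e->sharers >> p) & 1) && p != requester)
                send_invalidate(p, block, requester);
    } else if (e->state == MODIFIED) {
        send_invalidate(e->owner, block, requester); /* old owner gives up block */
    }
    send_data(requester, block);
    e->sharers = 1ull << requester;                  /* only the new owner */
    e->owner = requester;
    e->state = MODIFIED;
}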

SYNCHRONIZATION

Synchronization mechanisms are built with user-level software routines that rely on hardware-supplied synchronization instructions
Atomic operations: the ability to atomically read and modify a memory location
Atomic exchange: interchanges a value in a register with a value in memory
Locks: 0 indicates that the lock is free; 1 indicates that the lock is unavailable

Test-and-set: tests a value and sets it if the value passes the test
Fetch-and-increment: returns the value of a memory location and atomically increments it
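In C these primitives map directly onto C11 atomics; a minimal sketch (the wrapper names are illustrative):

#include <stdatomic.h>

/* Atomic exchange: swap new_val into *loc, returning the old value. */
int atomic_xchg(atomic_int *loc, int new_val)
{
    return atomic_exchange(loc, new_val);
}

/* Test-and-set: set *loc to 1 and return the previous value; the
 * "test" is whether the old value was 0 (i.e., the lock was free). */
int test_and_set(atomic_int *loc)
{
    return atomic_exchange(loc, 1);
}

/* Fetch-and-increment: return the old value and atomically add 1. */
int fetch_and_increment(atomic_int *loc)
{
    return atomic_fetch_add(loc, 1);
}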

IMPLEMENTING LOCKS USING COHERENCE

Spin locks: locks that a processor continuously tries to acquire, spinning around a loop until it succeeds
They are used when the lock is expected to be held for a very short time and when acquiring it must have low latency

Simple implementation:
A processor repeatedly tries to acquire the lock using an atomic operation, e.g., an exchange, and tests the returned value
To release the lock, the processor stores a 0 to the lock variable

Coherence mechanism:
Use the cache coherence mechanism to maintain the lock value coherently
A processor can then spin on a locally cached copy of the lock rather than going to global memory on each attempt
There is locality in lock access: the processor that used the lock last is likely to use it again in the near future

Spin procedure:
A processor reads the lock variable to test its state
The read is repeated until its value indicates that the lock is unlocked
The processor then races with all the other waiting processors
Each uses a swap that reads the old value and stores a 1 into the lock variable

The single winner sees a 0, and the losers see a 1 that was placed by the winner
The winning processor executes the code after the lock and then releases it by storing a 0 into the lock variable
The race then starts again
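This "spin on a cached read, then race with an exchange" procedure is the classic test-and-test-and-set lock. A minimal C11 sketch (the function names are illustrative):

#include <stdatomic.h>

/* 0 = unlocked, 1 = locked. */
void spin_lock(atomic_int *lock)
{
    for (;;) {
        /* Spin on the locally cached copy; this generates no bus
         * traffic while the lock stays held. */
        while (atomic_load(lock) != 0)
            ;
        /* The lock looks free: race the other waiters with an atomic
         * exchange. The single winner sees the old value 0. */
        if (atomic_exchange(lock, 1) == 0)
            return;
    }
}

void spin_unlock(atomic_int *lock)
{
    atomic_store(lock, 0);        /* release; the race starts again */
}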

MODELS OF MEMORY CONSISTENCY

Consistency answers two questions:
1. When must a processor see a value that has been updated by another processor?
2. In what order must a processor observe the data writes of another processor?

Sequential consistency: requires that the result of any execution be the same as if the memory accesses executed by each processor were kept in order and the accesses among different processors were arbitrarily interleaved
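The standard illustration is two processors that each set a flag and then read the other's. A hedged C11 sketch (sequentially consistent ordering is the default for C11 atomics):

#include <stdatomic.h>

atomic_int A = 0, B = 0;

int p1(void)                      /* runs on processor P1 */
{
    atomic_store(&A, 1);
    return atomic_load(&B);
}

int p2(void)                      /* runs on processor P2 */
{
    atomic_store(&B, 1);
    return atomic_load(&A);
}

/* Under sequential consistency, p1() and p2() cannot both return 0:
 * in every legal interleaving of the four accesses, at least one
 * store precedes the other processor's load. */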

A program is synchronized if all accesses to shared data are ordered by synchronization operations
Data race: variables are updated without being ordered by synchronization, so the execution outcome depends on the relative speed of the processors
What counts as a synchronization operation?

RELAXED CONSISTENCY MODELS

Allow reads and writes to complete out of order, but use synchronization operations to enforce ordering
X -> Y: operation X must complete before operation Y
Four orderings can be relaxed: R -> W, R -> R, W -> R, W -> W

1. Relaxing the W -> R ordering yields the total store ordering or processor consistency model
2. Relaxing the W -> W ordering yields a model known as partial store order
3. Relaxing the R -> W and R -> R orderings yields the weak ordering and release consistency models
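Under a relaxed model, ordering is reintroduced only where the program asks for it. A hedged C11 sketch using release/acquire operations (one common way to express such synchronization):

#include <stdatomic.h>

int data = 0;                     /* ordinary shared data */
atomic_int ready = 0;             /* synchronization flag */

void producer(void)
{
    data = 42;                    /* a plain write that could be reordered... */
    /* ...but the release store forces it to complete first. */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int consumer(void)
{
    /* The acquire load orders the flag check before the data read. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;
    return data;                  /* guaranteed to observe 42 */
}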

REVIEW QUESTIONS

1. Define the four major categories of computer systems
2. List the factors that led to the rise of MIMD multiprocessors
3. Illustrate the basic architecture of a centralized shared memory multiprocessor
4. Illustrate the basic architecture of a distributed memory multiprocessor
5. Distinguish between private data and shared data

6. Define the cache coherence problem
7. List the conditions required for a memory system to be coherent
8. Define the cache coherence protocols
9. Analyze the implementation of a cache coherence protocol
10. Illustrate the performance of symmetric shared memory multiprocessors with a commercial workload application

11. Illustrate the working of a distributed memory multiprocessor
12. Demonstrate the transitions in a directory-based system
13. Define spin locks
14. Define the ordering of a relaxed consistency model
