
UNIT 5

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM

CONTENT

INTRODUCTION

SYMMETRIC AND SHARED MEMORY ARCHITECTURES

PERFORMANCE OF SYMMETRIC SHARED MEMORY ARCHITECTURES

DISTRIBUTED SHARED MEMORY AND DIRECTORY-BASED COHERENCE

BASICS OF SYNCHRONIZATION

MODELS OF MEMORY CONSISTENCY

FACTORS THAT TREND TOWARD MULTIPROCESSORS

1. A growing interest in servers and server performance
2. A growth in data-intensive applications
3. The insight that increasing performance on the desktop is less important
4. An improved understanding of how to use multiprocessors effectively
5. The advantage of leveraging a design investment by replication rather than unique design

A TAXONOMY OF PARALLEL ARCHITECTURES

1. Single instruction stream, single data stream (SISD)
2. Single instruction stream, multiple data streams (SIMD)
3. Multiple instruction streams, single data stream (MISD)
4. Multiple instruction streams, multiple data streams (MIMD)

SISD
Uniprocessor

SIMD
The same instruction is executed by multiple processors using different data streams
Exploits data-level parallelism
Each processor has its own data memory
There is a single instruction memory
A control processor fetches and dispatches instructions

MIMD
Each processor fetches its own instructions and operates on its own data
Exploits thread-level parallelism

FACTORS THAT CONTRIBUTED TO THE RISE OF MIMD

1. Flexibility
Can function as a single-user multiprocessor focusing on high performance for one application
Can run multiple tasks simultaneously

2. Cost-performance
Uses the same microprocessors found in workstations and single-processor servers
Multicore chips leverage the design investment by replication

CLUSTERS
One class of MIMD machines
Built from standard components and network technology
Two types:

Commodity clusters
Custom clusters

COMMODITY CLUSTERS
Rely on third-party processors and interconnect technology
Are often blade or rack-mounted servers
Focus on throughput
Applications involve no communication among threads
Assembled by users rather than vendors

CUSTOM CLUSTERS
The designer customizes the detailed node design, the interconnect design, or both
Exploit large amounts of parallelism
Require a significant amount of communication during computation
More efficient
Ex.: IBM Blue Gene

MULTICORE
Multiple processors placed on a single die
A.k.a. on-chip multiprocessing or single-chip multiprocessing
The cores share resources (caches, I/O buses)
Ex.: IBM Power5

PROCESS
A segment of code that may be run independently
The process state contains all the information necessary to execute that program
Each process is independent of the others: a multiprogramming environment

THREADS
Multiple processors executing a single program, sharing the code and address space
Grain size must be large to exploit parallelism efficiently
Independent threads within a process are identified by the programmer or created by the compiler
Loop iterations within a thread can exploit data-level parallelism

MIMD CLASSIFICATION

1. Centralized shared memory architectures
2. Distributed memory multiprocessors

CENTRALIZED SHARED MEMORY ARCHITECTURES

A few dozen processors share a single centralized memory
Large caches or multiple memory banks are used
Scaling is done using point-to-point connections, switches, and multiple memory banks
Symmetric relationship among the processors
Uniform access time to memory
Called a Symmetric Shared Memory Multiprocessor (SMP) or Uniform Memory Access (UMA) architecture

DISTRIBUTED MEMORY MULTIPROCESSORS

Physically distributed memory
Supports a large number of processors and high memory bandwidth
Raises the need for a high-bandwidth interconnect
Both direct networks (switches) and indirect networks (multidimensional meshes) are used

BENEFITS:
1. Cost-effective way to scale memory bandwidth
2. Reduced latency for accesses to local memory

DRAWBACKS:
1. Communicating data between processors is more complex
2. Software effort is needed to take advantage of the increased memory bandwidth

MODELS FOR COMMUNICATION AND MEMORY ARCHITECTURE

1. Communication occurs through a shared address space
Physically separated memories form one logical shared address space
Called a Distributed Shared Memory (DSM) or Non-Uniform Memory Access (NUMA) architecture
A memory reference can be made by any processor to any memory location
Access time depends on where the data is located in memory

2. The address space consists of multiple private address spaces
The addresses are logically disjoint
Memory cannot be addressed by a remote processor
The same physical address on different processors refers to different memory locations
Each processor-memory module is a separate computer
Communication is done via message passing
A.k.a. message-passing multiprocessors

CHALLENGES OF MULTIPROCESSING

1. Limited parallelism available in programs
2. Relatively high cost of communication
3. Large latency of remote access
4. Difficult to achieve good speedup; performance is measured using Amdahl's law

SOLUTIONS
Limited parallelism: algorithms with better parallel performance
Access latency: better architecture design and programming
Reduce the frequency of remote accesses: hardware and software mechanisms
Tolerate latency: multithreading and prefetching

PROBLEM

Suppose you want to achieve a speedup of 80 with 100 processors. What fraction of the original computation can be sequential?

Assume that the program operates in only two modes:
1. Parallel, with all processors fully used (the enhanced mode)
2. Serial, with only one processor in use

The speedup in enhanced mode equals the number of processors, and the fraction of enhanced mode is the fraction of time spent in parallel mode.

The parallel fraction must be 99.75%, so at most 0.25% of the original computation can be sequential.
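A worked derivation via Amdahl's law (writing F for the fraction of time in the fully parallel mode):

\[
80 = \frac{1}{(1 - F) + \dfrac{F}{100}}
\;\Rightarrow\; (1 - F) + \frac{F}{100} = \frac{1}{80} = 0.0125
\;\Rightarrow\; 0.99\,F = 0.9875
\;\Rightarrow\; F \approx 0.9975
\]

Hence the sequential fraction is \(1 - F \approx 0.0025\), i.e., 0.25%.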

SYMMETRIC SHARED MEMORY ARCHITECTURES

The use of multilevel caches substantially reduces the memory bandwidth demands of a processor
Solution: build small-scale multiprocessors where several processors share a single physical memory connected by a shared bus
Benefit: cost-effective
They support caching of both private and shared data

Private data: used by a single processor
Shared data: shared between multiple processors
How are these cached?

WHAT IS MULTIPROCESSOR CACHE COHERENCE?

A memory system is said to be coherent if:
1. A read by processor P to a location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P
2. A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses
3. Writes to the same location are serialized; two writes to the same location by any two processors are seen in the same order by all processors

Coherence: defines the behavior of reads and writes to the same memory location
Consistency: defines the behavior of reads and writes with respect to accesses to other memory locations

BASIC SCHEMES FOR ENFORCING COHERENCE

Coherent caches provide:
1. Migration: a data item can be moved to a local cache and used there
2. Replication: shared data can be read simultaneously in multiple caches
The protocols that maintain coherence for multiple processors are called cache coherence protocols

1. Directory based: the sharing status of a block of physical memory is kept in just one location, the directory
2. Snooping: every cache that has a copy of the data from a block of physical memory also has a copy of the sharing status of the block; no centralized state is kept

SNOOPING PROTOCOLS

1. Write invalidate
2. Write update

BASIC IMPLEMENTATION TECHNIQUES

1. The processor acquires bus access and broadcasts the address to be invalidated on the bus
2. All processors continuously snoop on the bus, watching for addresses
3. Each processor checks whether the address on the bus is in its cache
4. If so, it invalidates the corresponding data in its cache
5. If two processors attempt to write shared blocks at the same time, their attempts to broadcast an invalidate operation are serialized by the bus

The C sketch after this list illustrates the snooping side (steps 2 through 4).
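A minimal C sketch of the snooping side, assuming a hypothetical direct-mapped cache of tagged lines; snoop_invalidate() is the check each snooping cache would run for every invalidation address observed on the bus:

#include <stdint.h>

#define NUM_LINES 256                 /* assumed direct-mapped cache size */
#define LINE_SHIFT 6                  /* assumed 64-byte blocks */

typedef enum { INVALID, SHARED, EXCLUSIVE } line_state_t;

typedef struct {
    uint64_t     tag;                 /* which block this line holds */
    line_state_t state;
} cache_line_t;

static cache_line_t cache[NUM_LINES];

/* Run by every snooping cache for each invalidation address on the bus. */
void snoop_invalidate(uint64_t addr)
{
    uint64_t block = addr >> LINE_SHIFT;
    cache_line_t *line = &cache[block % NUM_LINES];

    /* Step 3: is the broadcast address present in this cache? */
    if (line->state != INVALID && line->tag == block) {
        /* Step 4: invalidate the local copy. */
        line->state = INVALID;
    }
}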

Write update: broadcasts the written data to all the cached copies of the block
Consumes more bus bandwidth

Write-through cache: written data is always sent to memory
The most recent value of a data item can therefore always be fetched from memory

Write-back cache: every processor snoops each address placed on the bus
If a processor finds that it has a dirty copy of the requested cache block, it provides that cache block in response to the read request
This causes the memory access to be aborted
The cache block is then retrieved from that processor's cache

To track whether a cache block is shared, an extra state bit is associated with each cache block
When a write to a shared block occurs, the cache generates an invalidation on the bus and marks the block as exclusive
The processor with this sole copy of the block is called the owner of the block

When the invalidation is sent, the state of the block in the owner's cache is changed from shared to exclusive
Later, if another processor requests the cache block, the state has to be made shared again

WRITE INVALIDATE FOR A WRITE-BACK CACHE

In the state diagram:
Circles: cache states
Arcs: state transitions
Labels on the arcs: the stimulus that causes each transition
Bold: bus actions caused by transitions
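The diagram itself is not reproduced here, but the processor-side transitions of an MSI-style write-back invalidate protocol can be sketched in C; the state names and bus-action helpers below are illustrative assumptions, not a specific machine's protocol:

#include <stdint.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } state_t;

/* Hypothetical bus actions (the bold labels in the diagram). */
void bus_read_miss(uint64_t block);
void bus_write_miss(uint64_t block);
void bus_invalidate(uint64_t block);

/* Transitions of one cache block in response to its own processor. */
state_t cpu_read(state_t s, uint64_t block)
{
    if (s == INVALID) {               /* read miss: fetch the block */
        bus_read_miss(block);
        return SHARED;
    }
    return s;                         /* SHARED or EXCLUSIVE: read hit */
}

state_t cpu_write(state_t s, uint64_t block)
{
    if (s == INVALID)                 /* write miss: fetch with ownership */
        bus_write_miss(block);
    else if (s == SHARED)             /* write hit on a shared block: */
        bus_invalidate(block);        /* invalidate all other copies */
    return EXCLUSIVE;                 /* this cache is now the owner */
}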

LIMITATIONS
As the number of processors in a multiprocessor grows, or as the memory demands of each processor grow, any centralized resource becomes a bottleneck
A single bus has to carry both the coherence traffic and the normal memory traffic
Designers can instead use multiple buses and interconnection networks
This yields designs midway between a centralized shared memory and a distributed memory

PERFORMANCE OF SYMMETRIC SHARED MEMORY MULTIPROCESSORS

Coherence misses can be broken into two sources:
1. True sharing misses: the first write by a processor to a shared cache block causes an invalidation to establish ownership of the block; a subsequent attempt by another processor to read a modified word in that cache block results in a miss
2. False sharing misses: the block is invalidated because some word in the cache block other than the one being read is written into

PROBLEM 3: Assume that words x1 and x2 are in the same cache block, which is in the shared state in the caches of both P1 and P2. Assuming the following sequence of events, identify each miss as a true sharing miss, a false sharing miss, or a hit. Any miss that would occur if the block size were one word is designated a true sharing miss.
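False sharing is easy to reproduce in practice. The following C sketch is only an illustration (the struct layout stands in for "x1 and x2 in the same cache block," assuming 64-byte lines): each thread writes its own word, yet every write invalidates the block in the other processor's cache.

#include <pthread.h>
#include <stdio.h>

/* x1 and x2 sit in the same cache block (assuming 64-byte lines), so
 * writes to x1 invalidate the cached copy holding x2, and vice versa. */
static struct { long x1; long x2; } block;

static void *writer1(void *arg)
{
    (void)arg;
    for (long i = 0; i < 10000000; i++)
        block.x1 = i;                 /* false sharing misses on P2 */
    return NULL;
}

static void *writer2(void *arg)
{
    (void)arg;
    for (long i = 0; i < 10000000; i++)
        block.x2 = i;                 /* false sharing misses on P1 */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer1, NULL);
    pthread_create(&t2, NULL, writer2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    puts("done");
    return 0;
}

Padding the struct so that x1 and x2 fall in different cache lines removes these misses without changing the program's logic.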

DISTRIBUTED SHARED MEMORY AND DIRECTORY-BASED COHERENCE

A directory keeps the state of every cached block
The information in the directory includes which caches have copies of the block, whether it is dirty, and so on
A directory entry is associated with each memory block
To prevent the directory from becoming a bottleneck, it is distributed along with the memory

DIRECTORY-BASED CACHE COHERENCE PROTOCOLS

The state of each cache block can be one of the following:
1. Shared: one or more processors have the block cached, and the value in memory (and in all the caches) is up to date
2. Uncached: no processor has a copy of the cache block
3. Modified: exactly one processor has a copy of the cache block and it has written the block, so the memory copy is out of date; that processor is the owner of the block

To keep track of each potentially shared block, a bit vector is maintained per block; each bit indicates whether the corresponding processor has a copy of the block
Local node: the node where the request originates
Home node: the node where the memory location and the directory entry of an address reside
Remote node: a node that has a copy of the block, whether exclusive or shared
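A directory entry can be sketched as a small C struct; this is a minimal illustration (the field names and the 64-processor limit are assumptions) combining the three states above with the presence bit vector:

#include <stdint.h>

typedef enum { UNCACHED, SHARED, MODIFIED } dir_state_t;

/* One directory entry per memory block, kept at the block's home node. */
typedef struct {
    dir_state_t state;
    uint64_t    sharers;   /* presence bit vector: bit p is set if
                              processor p has a copy (assumes at most
                              64 processors) */
    int         owner;     /* meaningful only in the MODIFIED state */
} dir_entry_t;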

DIRECTORY-BASED CACHE COHERENCE PROTOCOLS

When a block is in the uncached state, the possible requests for it are:
1. Read miss: the requesting processor is sent the block from memory; the state of the block is made shared
2. Write miss: the requesting processor is sent the value and becomes the sharing node; the block is made exclusive

When a block is in the shared state, the memory value is up to date:
1. Read miss: the requesting processor is sent the requested data from memory, and the requesting processor is added to the sharing set
2. Write miss: the requesting processor is sent the value; all other processors in the sharing set are sent invalidate messages that contain the identity of the requesting processor; the sharing set is made to contain only the requesting processor, and the state of the block is made exclusive

When a block is in the exclusive state, the current value of the block is held in the owner processor's cache:
1. Read miss: the owner processor is sent a data fetch message; the state of the block is made shared; the requesting processor is added to the sharing set, which still contains the identity of the owner

2. Data write-back: the owner processor is replacing the block and therefore must write it back; the memory copy is made up to date, the block becomes uncached, and the sharing set is emptied
3. Write miss: the block has a new owner; a message is sent to the old owner telling it to invalidate the block and supply the value, which is forwarded to the requester; the sharing set is made to contain the new owner, and the state of the block remains exclusive
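The uncached, shared, and exclusive cases above can be folded into two dispatch routines at the home node. The sketch below is illustrative only: it repeats the dir_entry_t from earlier for self-containment (its MODIFIED state corresponds to "exclusive" here), and send_data(), send_fetch(), and send_invalidate() are hypothetical message helpers, not a real interconnect API.

#include <stdint.h>

typedef enum { UNCACHED, SHARED, MODIFIED } dir_state_t;
typedef struct { dir_state_t state; uint64_t sharers; int owner; } dir_entry_t;

void send_data(int dest, uint64_t block);            /* reply with the block */
void send_fetch(int owner, uint64_t block);          /* make owner write back */
void send_invalidate(int dest, uint64_t block, int requester);

void handle_read_miss(dir_entry_t *e, uint64_t block, int requester)
{
    if (e->state == MODIFIED) {
        send_fetch(e->owner, block);                 /* owner supplies data */
        e->sharers |= 1ull << e->owner;              /* owner stays a sharer */
    }
    send_data(requester, block);
    e->sharers |= 1ull << requester;                 /* add to sharing set */
    e->state = SHARED;
}

void handle_write_miss(dir_entry_t *e, uint64_t block, int requester)
{
    if (e->state == SHARED) {                        /* invalidate other copies */
        for (int p = 0; p < 64; p++)
            if (((e->sharers >> p) & 1) && p != requester)
                send_invalidate(p, block, requester);
    } else if (e->state == MODIFIED) {
        send_invalidate(e->owner, block, requester); /* old owner gives up block */
    }
    send_data(requester, block);
    e->sharers = 1ull << requester;                  /* only the new owner */
    e->owner = requester;
    e->state = MODIFIED;
}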

SYNCHRONIZATION

Synchronization mechanisms are built with user-level software routines that rely on hardware-supplied synchronization instructions
Atomic operations: the ability to atomically read and modify a memory location
Atomic exchange: interchanges a value in a register with a value in memory
Locks: 0 indicates that the lock is free; 1 indicates that the lock is unavailable

Test-and-set: tests a value and sets it if the value passes the test
Fetch-and-increment: returns the value of a memory location and atomically increments it
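In C these primitives map directly onto C11 atomics; a minimal sketch (the wrapper names are illustrative):

#include <stdatomic.h>

/* Atomic exchange: swap new_val into *loc, returning the old value. */
int atomic_xchg(atomic_int *loc, int new_val)
{
    return atomic_exchange(loc, new_val);
}

/* Test-and-set: set *loc to 1 and return the previous value; the
 * "test" is whether the old value was 0 (i.e., the lock was free). */
int test_and_set(atomic_int *loc)
{
    return atomic_exchange(loc, 1);
}

/* Fetch-and-increment: return the old value and atomically add 1. */
int fetch_and_increment(atomic_int *loc)
{
    return atomic_fetch_add(loc, 1);
}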

IMPLEMENTING LOCKS USING COHERENCE

Spin locks: locks that a processor continuously tries to acquire, spinning around a loop until it succeeds
They are used when the lock is expected to be held for a very short time and when acquiring it must have low latency

Simple implementation:
A processor repeatedly tries to acquire the lock using an atomic operation, e.g., an exchange, and tests the returned value
To release the lock, the processor stores a 0 to the lock variable

Coherence mechanism:
Use the cache coherence mechanism to maintain the lock value coherently
A processor can then spin on a locally cached copy of the lock rather than going to global memory on each attempt
There is locality in lock access: the processor that used the lock last is likely to use it again in the near future

Spin procedure:
A processor reads the lock variable to test its state
The read is repeated until its value indicates that the lock is unlocked
The processor then races with all the other waiting processors
Each uses a swap that reads the old value and stores a 1 into the lock variable

The single winner sees a 0, and the losers see a 1 that was placed by the winner
The winning processor executes the code after the lock and then releases it by storing a 0 into the lock variable
The race then starts again
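This "spin on a cached read, then race with an exchange" procedure is the classic test-and-test-and-set lock. A minimal C11 sketch (the function names are illustrative):

#include <stdatomic.h>

/* 0 = unlocked, 1 = locked. */
void spin_lock(atomic_int *lock)
{
    for (;;) {
        /* Spin on the locally cached copy; this generates no bus
         * traffic while the lock stays held. */
        while (atomic_load(lock) != 0)
            ;
        /* The lock looks free: race the other waiters with an atomic
         * exchange. The single winner sees the old value 0. */
        if (atomic_exchange(lock, 1) == 0)
            return;
    }
}

void spin_unlock(atomic_int *lock)
{
    atomic_store(lock, 0);        /* release; the race starts again */
}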

MODELS OF MEMORY CONSISTENCY

Consistency answers two questions:
1. When must a processor see a value that has been updated by another processor?
2. In what order must a processor observe the data writes of another processor?

Sequential consistency: requires that the result of any execution be the same as if the memory accesses executed by each processor were kept in order and the accesses among different processors were arbitrarily interleaved
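The standard illustration is two processors that each set a flag and then read the other's. A hedged C11 sketch (sequentially consistent ordering is the default for C11 atomics):

#include <stdatomic.h>

atomic_int A = 0, B = 0;

int p1(void)                      /* runs on processor P1 */
{
    atomic_store(&A, 1);
    return atomic_load(&B);
}

int p2(void)                      /* runs on processor P2 */
{
    atomic_store(&B, 1);
    return atomic_load(&A);
}

/* Under sequential consistency, p1() and p2() cannot both return 0:
 * in every legal interleaving of the four accesses, at least one
 * store precedes the other processor's load. */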

A program is synchronized if all accesses to shared data are ordered by synchronization operations
Data race: variables are updated without being ordered by synchronization, so the execution outcome depends on the relative speed of the processors
What counts as a synchronization operation?

RELAXED CONSISTENCY MODELS

Allow reads and writes to complete out of order, but use synchronization operations to enforce ordering
X -> Y: operation X must complete before operation Y
Four orderings can be relaxed: R -> W, R -> R, W -> R, W -> W

1. Relaxing the W -> R ordering yields the total store ordering or processor consistency model
2. Relaxing the W -> W ordering yields a model known as partial store order
3. Relaxing the R -> W and R -> R orderings yields the weak ordering and release consistency models
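Under a relaxed model, ordering is reintroduced only where the program asks for it. A hedged C11 sketch using release/acquire operations (one common way to express such synchronization):

#include <stdatomic.h>

int data = 0;                     /* ordinary shared data */
atomic_int ready = 0;             /* synchronization flag */

void producer(void)
{
    data = 42;                    /* a plain write that could be reordered... */
    /* ...but the release store forces it to complete first. */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int consumer(void)
{
    /* The acquire load orders the flag check before the data read. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;
    return data;                  /* guaranteed to observe 42 */
}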

REVIEW QUESTIONS

1. Define the four major categories of computer systems
2. List the factors that led to the rise of MIMD multiprocessors
3. Illustrate the basic architecture of a centralized shared memory multiprocessor
4. Illustrate the basic architecture of a distributed memory multiprocessor
5. Distinguish between private data and shared data

6. Define the cache coherence problem
7. List the conditions required for a memory system to be coherent
8. Define the cache coherence protocols
9. Analyze the implementation of a cache coherence protocol
10. Illustrate the performance of symmetric shared memory multiprocessors with a commercial workload application

11. Illustrate the working of a distributed memory multiprocessor
12. Demonstrate the transitions in a directory-based system
13. Define spin locks
14. Define the ordering of a relaxed consistency model
