
High Performance Computing

Jimmy Mathew
jimmym@rajagiritech.ac.in
Department of Information Technology
Rajagiri School of Engineering and Technology, Cochin
High Performance Computing

Module 1 – Parallel Processing



References

[1] Kai Hwang and Faye A. Briggs, "Computer Architecture and Parallel Processing"
[2] V. Rajaraman and C. Siva Ram Murthy, "Parallel Computers"
[3] V. Rajaraman, "Elements of Parallel Computing"
[4] V. Rajaraman, "Supercomputers"



Seminar topics

• Applications of supercomputers
  • 10 applications
• Contribution of India in supercomputers
• Various architectures
  • IBM 360/91, Illiac IV, Univac 1100/80


Seminars:
Seminars run in parallel with the regular classes. All seminars are done in groups of two or
three.
The concept of parallelism

• Evolution of computer science


• Generations of computers
• Need for more computation in lesser time
• Invention of networking
• Independent activities
• Multiple resources
• Parallelism


Introduction – The concept of parallelism

Parallelism was not a sudden idea. The concept grew with advances in technology, and researchers
around the globe are still working to exploit more and more of the parallelism available. In this
module we shall see how the concept evolved and eventually led to the invention of supercomputers.

The computer era began around 1938. First-generation computers were built from electromechanical
relays and vacuum tubes; ENIAC, the first general-purpose electronic digital computer, belongs to
this generation. The era of second-generation computers came by 1952, where transistors and diodes
were the building blocks; TRADIC from Bell Laboratories is one of them. Third-generation computers
(1962) saw the advancement of electronics in silicon-based integrated circuits (small and medium
scale) and multi-layered printed circuit boards; Illiac IV and IBM 360/91 are a few to mention.
The fourth generation began around 1972, when large-scale integration with high-density packaging
was used in processor fabrication. The concept of Intellectual Property (IP), or Virtual
Components (VC), has been used extensively from then onwards. Cray-1 and Univac 1100/80 were early
machines of this era. Refer to [1], pp. 2-4, for more details.

The computational capability of individual processors has been increasing ever since, and networks
helped to interconnect these processors. Various applications that are distributed in nature and
have high computational requirements became more and more dependent on the architecture and
network configurations. High-performance applications were divided into independent tasks that
were distributed among multiple processors, which may themselves be either distributed or
centralized. We aim to study the architectures of such centralized systems.
Trends in parallelism

• Computers are becoming outdated
• Importance of Moore's law
• Processing stages


Gordon Moore is a co-founder of Intel Corporation. In 1965 he published a paper that highlighted
the exponential growth in transistor density, and this prediction about processor capacity later
became known as Moore's law. Today it is understood to mean that densities double roughly every
two years, which amounts to an increase by a factor of about five to six every five years.
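
As arithmetic support for those figures: with a doubling period of two years, the density after t
years is N(t) = N(0) × 2^(t/2), so the growth over five years is 2^(5/2), which is about 5.7.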

Implications of Moore's law include increasing performance, increasing complexity, decreasing die
area, faster obsolescence of technology, and so on. In the earlier days, computers were used only
for data processing. Later, information processing began. From information processing came
knowledge processing, and from knowledge processing today's intelligence processing is derived.
Computer systems are becoming more and more capable of working with their own intelligence, and
such systems find a huge range of applications.
Trends in parallelism

• OS advancement stages
• Levels at which parallelism is exploited


OS advancements:

Advancements in operating system technology have helped greatly to improve the effective use of
processor capacity. The journey from batch processing to today's multi-processing capability marks
a series of significant milestones in OS history. Even though multi-processing capability is built
into the OS, many applications running on such an OS are still not able to exploit the full
parallelism provided by multiple processors. One reason for this is the lack of programming
languages that support parallel programming.
Program level

• Execute programs in
parallel
• Multi-programming
• Multi-processing


Multi-programming:
Concurrently executing multiple programs with one processor. True parallel processing is not
possible with a single processor; co-processors such as a DMA controller may be used to simulate
parallel processing.

Multi-processing:
Parallel execution of multiple programs with multiple processors. The central node (main
processor) acts as a distributor of load to the other processors.
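
To make the distinction concrete, here is a minimal C sketch (assuming a POSIX system; the task
function and its workload are invented for the example). The parent and child processes are two
independent programs: on a single processor the OS interleaves them (multi-programming), while on
a multi-processor machine it can run them truly in parallel (multi-processing).

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void task(const char *name) {
    long sum = 0;
    for (long i = 0; i < 100000000L; i++)  /* stand-in for real work */
        sum += i;
    printf("%s finished, sum = %ld\n", name, sum);
}

int main(void) {
    pid_t pid = fork();                    /* create a second process */
    if (pid < 0) { perror("fork"); return EXIT_FAILURE; }
    if (pid == 0) {                        /* child: one program */
        task("program B");
        return EXIT_SUCCESS;
    }
    task("program A");                     /* parent: another program */
    wait(NULL);                            /* wait for the child to finish */
    return EXIT_SUCCESS;
}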
Procedure level

• Parallelism within a
processor
• Execute functions or tasks in
parallel
• Task scheduling
• Significance of stack?


Procedure level parallelism:


Various task scheduling algorithms are used to exploit parallelism at the procedure level in a
uniprocessor. When one procedure waits for an I/O operation to complete, another procedure
executes. This is a very common phenomenon in multi-tasking systems, such as RTOSs. Round robin,
first come first served, largest job first, smallest job first, etc. are a few algorithms to
mention. In many cases two or more algorithms are used together. The events that drive this
scheduling are priority preemption, interrupts, I/O completion, programmed delays, and so on. A
stack is used to save the context of each procedure.

How will you schedule the following tasks?


Task   Priority       Execution time
T1     1 (highest)    500 ms
T2     2              250 ms
T3     2              100 ms
T4     2               50 ms
T5     3               25 ms
T6     3               25 ms
T7     3               25 ms
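
One reasonable answer (an assumption, since the slide leaves the question open) is fixed-priority
scheduling with round-robin time slicing among tasks of equal priority: T1 runs to completion
first, then T2-T4 share the processor in time slices, then T5-T7. The C sketch below simulates
that policy; the 25 ms time slice is an arbitrary choice.

#include <stdio.h>

#define NTASKS 7
#define SLICE  25                               /* time slice in ms (illustrative) */

typedef struct { const char *name; int prio; int remaining; } Task;

int main(void) {
    Task t[NTASKS] = {
        {"T1", 1, 500}, {"T2", 2, 250}, {"T3", 2, 100}, {"T4", 2, 50},
        {"T5", 3, 25},  {"T6", 3, 25},  {"T7", 3, 25}
    };
    int clock = 0;
    for (;;) {
        int best = 0;                           /* highest priority level with work left */
        for (int i = 0; i < NTASKS; i++)
            if (t[i].remaining > 0 && (best == 0 || t[i].prio < best))
                best = t[i].prio;
        if (best == 0) break;                   /* all tasks finished */
        for (int i = 0; i < NTASKS; i++) {      /* round-robin within that level */
            if (t[i].prio != best || t[i].remaining == 0) continue;
            int run = t[i].remaining < SLICE ? t[i].remaining : SLICE;
            t[i].remaining -= run;
            clock += run;
            if (t[i].remaining == 0)
                printf("%s completes at %d ms\n", t[i].name, clock);
        }
    }
    return 0;
}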
Inter-instruction level

• Parallelism between instructions


• Independent instructions can be
executed in any order
• Fill the pipeline with independent
instructions


Inter-instruction level parallelism:


If the current instruction depends on the result of a previous instruction, a dependency is said
to exist. We cannot reorder the execution of instructions when such a dependency is present. But
if a group of instructions has no dependencies among them, various forms of instruction-level
parallelism can be employed.

ADD R0 R1 R2 // R0 = R1 + R2
SUB R3 R2 R1 // R3 = R2 - R1
ST @R4 R2 // Store the value in R2 to the address in R4
LD R5 @R2 // Load R5 with the value at the address in R2

These instructions have no register dependencies, so they can be executed in any order.


More about such dependencies will be covered in coming modules.
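
As an illustration of how a compiler or hardware scheduler might detect such dependencies, here is
a minimal C sketch (the three-operand register format mirrors the example above; only register
hazards are checked, not memory ones):

#include <stdio.h>

typedef struct { const char *op; int dst; int src1; int src2; } Instr;

/* Report register hazards when 'later' follows 'earlier' in program order. */
static void check(const Instr *earlier, const Instr *later) {
    if (later->src1 == earlier->dst || later->src2 == earlier->dst)
        printf("RAW: %s -> %s\n", earlier->op, later->op);
    if (later->dst == earlier->src1 || later->dst == earlier->src2)
        printf("WAR: %s -> %s\n", earlier->op, later->op);
    if (later->dst == earlier->dst)
        printf("WAW: %s -> %s\n", earlier->op, later->op);
}

int main(void) {
    Instr add = {"ADD", 0, 1, 2};   /* ADD R0 R1 R2 */
    Instr sub = {"SUB", 3, 2, 1};   /* SUB R3 R2 R1: no hazard with ADD */
    check(&add, &sub);

    Instr use = {"SUB", 4, 0, 1};   /* reads R0, so a RAW hazard with ADD */
    check(&add, &use);
    return 0;
}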
Intra-instruction level

• Instruction dependency
causes delays
• Identify independent
instructions
• Reorder instructions
• Keep dependency order
unchanged


Intra-instruction level parallelism:


Instruction reordering can help avoid unwanted delays or stalls in a pipeline. Remember, the speed
of a pipeline is the speed of its slowest stage. If a single instruction is delayed in the
pipeline, all the instructions in the various stages of the pipeline will be delayed.
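
A small illustration in the same register notation used earlier (the sequence is invented for this
example): the SUB needs the value loaded into R1, so an in-order pipeline would stall between
them; moving the independent ADD into that slot hides part of the load delay while keeping the
dependency order unchanged.

Before reordering:
LD R1 @R0 // R1 = value at the address in R0
SUB R2 R1 R3 // must wait for the load to complete
ADD R4 R5 R6 // independent of the load

After reordering:
LD R1 @R0
ADD R4 R5 R6 // useful work while the load is in flight
SUB R2 R1 R3 // the load result is now ready, or closer to ready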
Parallelism in uniprocessor

• Known architecture
• Processor
• ALU, CU, Registers
• Memory
• Segments, hierarchy
• Input / Output

• Where is cache memory?



Parallelism in uniprocessor

1. Multiple functional units
– Coprocessors
– Multi-port RAM
– Instruction cache (I$)
– Data cache (D$)
• CDC 6600


Role of multiple functional units:


With a single copy of a resource we can achieve only partial parallelism (by means of scheduling
and instruction-level techniques). True parallelism can be achieved only by means of multiple
resources. The important resources required to achieve true parallelism are processors or
co-processors, multi-port RAM, cache memory, and an efficient network.

Co-processors can execute in parallel with the main processor through a mechanism called cycle
stealing. An example is a DMA controller. Such a controller is invoked and controlled by the main
processor. The DMA controller performs memory load and store operations without interrupting or
delaying the main processor, but the initialization, such as the starting and ending addresses, is
issued by the main processor. Without the main processor the DMA controller cannot do anything,
and after the data transfer it informs the main processor that its operation is complete.

Multi-port RAMs are RAMs with multiple ports, so more than one processor can access the RAM at the
same time. A multi-port RAM usually acts as a tightly coupled memory used to share information
among processors.

Cache is where the most recently used, most frequently used, and most private data of a processor
are stored. Cache is smaller than RAM and is built into the processor, so very privileged data are
kept there. Cache is considered loosely coupled memory because its contents are shared little, or
not at all, with other processors.
Parallelism in uniprocessor
2. Pipelining
– No fixed number of stages; pipelines with 32 stages exist
– Branch prediction
– Speculation
– Hazards: RAW, WAR, WAW
– Is RAR a hazard?
(Figure courtesy: Prof. Nigel Topham, UoE)
Parallelism in uniprocessor

3. Overlapped CPU and I/O instructions
– CPU bound
– I/O bound
4. Use of memory hierarchy
– Update only the nearest memory
– Periodic updates to lower memory
– What is data contention?
Parallelism in uniprocessor

5. Multi-programming and time sharing


– Concept of concurrency
– Scheduling
– Mix I/O bound and CPU bound instructions
– Time slice – equal opportunities to all programs
– Virtual processors for multiple programs
– Time sharing suitable for interactive multiple terminals
– Performance depends heavily on OS



Parallelism in uniprocessor

6. By balancing bandwidth of subsystems
– Performance varies from device to device
– Bottlenecks
– Bandwidth varies across subsystems
– Memory has the highest BW
– The processor has a lower BW
– Utilized BW is less than the actual BW
– Concept of interleaved memory



Bandwidth (continued)

– But the processor is faster than memory
– BW = (output) / (time)
– BW is the total output produced in unit time
– What is the BW of a processor?
– What is the BW of a memory?
– Can we use the BW fully?
– BWm > BWp
– Need to balance BW for better performance


Bandwidth:
The bandwidth of a processor is the number of instructions executed by that processor in unit
time, say a second.
The bandwidth of a memory is the number of words read from or written to the memory in unit time.
The bandwidth of a network is the number of bytes transferred through it in unit time.

Bandwidth cannot be used in full because of delays and conflicts in the system. The effective
bandwidth is, in most cases, less than the actual bandwidth.
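
A small worked example (the numbers are hypothetical): if one memory module delivers a 64-bit word
every 100 ns, its bandwidth is 1 word / 100 ns = 10 million words per second. A processor that
issues an instruction every 50 ns and needs about one word per instruction demands roughly 20
million words per second. Interleaving the memory across at least two modules raises the total
memory bandwidth above the processor's demand (BWm > BWp), which is the balancing idea referred to
above.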
Lecture 1 summary

– Concept of parallelism
– History
– Trends towards parallelism
– Technology advancement in OS
– Parallelism in uniprocessor



Parallel computer structures

• Pipeline computers
• Array processors
• Multi-processor systems
• Data flow computers

• Will be discussed in detail in coming modules



Pipeline computers

• Pipeline stages, in general


• Instruction Fetch (IF)
• Instruction Decode (ID)
• Operand Fetch (OF)
• Execution (EX)
• Memory Write (MW)
• Write Back (WB)
• What is the speed of the pipeline?


Pipeline speed:
The speed of a pipeline is the speed of its slowest stage. This means that when a single
instruction is delayed, all the instructions in the pipeline suffer the same delay.
Pipeline computers
• Non-pipelined vs pipelined
(Figure: throughput comparison; BW = 2 for the non-pipelined case, 6 for the pipelined case)


Comparison:
If the instruction cycle is taken as 5 stages, the unit time as 10 clock cycles, and no delays are
assumed, the throughput of a non-pipelined processor works out to 2 instructions while that of a
pipelined processor works out to 6. Clearly, pipelined computers are more efficient.
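
The arithmetic behind those figures: a non-pipelined 5-stage processor finishes one instruction
every 5 cycles, so in 10 cycles it completes 10 / 5 = 2 instructions. A 5-stage pipeline finishes
its first instruction after 5 cycles and one more in every cycle after that, so in t = 10 cycles
it completes t - k + 1 = 10 - 5 + 1 = 6 instructions.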
Pipeline computers

• Ideal for repeated operations


• Different instructions have different delays
• A stall in the pipeline delays every instruction behind it
• Good for vector processing – the same instruction is
executed repeatedly on similar sets of data
• Floating point operations
• Matrix multiplication
• Weather processing





Array processors

• Synchronous parallel computer


• Multiple ALUs, called processing elements (PEs)
• Central scalar control unit
• Each PE has an associated memory
• Data routing network
• Vector instructions are broadcast to the PEs for distributed
execution
• PEs are passive devices – no instruction decoding
capability



Array processing

• Associative memory
• Associative processors
• Matrix multiplication
• Merge sort
• Fast Fourier Transform
• Example – Illiac IV
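
These applications suit the lockstep style of an array processor. A minimal C sketch of the idea
(the number of PEs and the data values are arbitrary; a real array processor broadcasts the
instruction to hardware PEs instead of looping):

#include <stdio.h>

#define NPE 4   /* number of processing elements (illustrative) */

int main(void) {
    /* Each PE holds one element of the vectors in its local memory. */
    int a[NPE] = {1, 2, 3, 4};
    int b[NPE] = {10, 20, 30, 40};
    int c[NPE];

    /* The control unit broadcasts a single "add" instruction; every PE
       applies it to its own operands in lockstep. */
    for (int pe = 0; pe < NPE; pe++)
        c[pe] = a[pe] + b[pe];

    for (int pe = 0; pe < NPE; pe++)
        printf("PE%d: %d\n", pe, c[pe]);
    return 0;
}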



Multi-processor systems

• A system with two or more processors


• Share a common main memory, I/O channels, and devices
• Each processor also has private memory
• The system is controlled by a single integrated OS (custom built)
• Interconnections
• Time-shared common bus
• Crossbar switch network
• Multi-port memories
• Significance of OS capability


Dataflow computers

• Von Neumann machines


• Use program counter
• Sequential execution
• Slow processing
• Known as control flow machines
• New concept – data flow machines
• To exploit maximum parallelism
• Execute an instruction as soon as all its operands are
available, irrespective of program order, while still respecting data dependencies
Data flow representation

• Z = ((a+b) – c) * 10;
• Instruction as template: add [DST] [SRC1] [SRC2]
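
Following the template format above, one possible expansion of this statement into dataflow
instruction templates (the temporaries t1 and t2 are illustrative names):

add t1 a b // t1 = a + b; fires as soon as a and b arrive
sub t2 t1 c // t2 = t1 - c; fires as soon as t1 and c arrive
mul Z t2 10 // Z = t2 * 10; fires as soon as t2 arrives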



Lecture 2 summary

Parallel computer structures


– Pipeline computers
– Array processors
– Multi-processor systems
– Data flow computers



Parallel computer performance

• Performance is not linear


– A processor takes 10 seconds to complete a task
– In a multi-processor environment there are 10 identical
processors
– Do the 10 identical processors solve the same task in just 1 second?
– That may be a wrong assumption!
• 'n' processors cannot give 'n' times faster processing



Parallel computer performance

• Lower bound on speedup: log2 n (Minsky's conjecture)
• Upper bound: n / ln n
• Research showed that for n = 4 to n = 16 the actual speedup
was 2 to 4 times (this study was performed on the C.mmp and
S-1 systems)
• Use a small number of fast processors rather than a large
number of slow processors
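
As a quick check of those bounds: for n = 4 the two estimates are log2 4 = 2 and 4 / ln 4 ≈ 2.9,
and for n = 16 they are log2 16 = 4 and 16 / ln 16 ≈ 5.8. The observed speedups of 2 to 4
therefore track the pessimistic log2 n estimate rather than the n / ln n bound.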





Parallel computer performance

• Pipelined uniprocessor systems are still dominating; the cost
is low and the OS is well developed

• Array processors are custom designed and good for specific
applications, but programming is complex

• Multi-processor systems are general purpose and are
flexible in parallel processing; OS capability is important



Architecture classifications

• Three classifications
• Multiplicity of I – D stream (Flynn scheme)
• SISD, SIMD, MISD and MIMD
• Serial Vs parallel processing (Feng scheme)
• WSBS, WPBS, WSBP and WPBP
• Parallelism Vs pipelining (Handler scheme)
• PCU, ALU, BLC



SISD

• Michael J Flynn – American professor emeritus, Stanford University; his classification is
known as Flynn's taxonomy and is based on the multiplicity of hardware
• Instruction stream – sequence of instructions
• Data stream – input, output, temporary results
• SISD – overlapped execution or pipelining is possible
SIMD

• Single instruction stream,


multiple data stream
• Instruction broadcast
• Each PE operates on
different data sets from
same stream
• Shared memory
subsystem



MISD
• Multiple instruction stream, single data stream
• Output of one PE is input to next PE
• No real implementation, yet



MIMD
• Multiple instruction stream, multiple data stream
• Common shared memory
• Multi-processor systems are listed here



Lecture 3 summary

Parallel computer performance analysis


Classifications of computer architecture
Flynn’s taxonomy
– SISD
– SIMD
– MISD
– MIMD



Serial Vs Parallel processing
• Tse-yun Feng suggested four types of processing methods
• Maximum parallelism degree, P(C) = (n, m)
– n is the word length
– m is the bit-slice length (the number of bits, one from each
word in process, present across all pipeline stages)
• Example: the TI ASC has a 64-bit word length and four arithmetic
pipelines, each with eight stages
• P(C) = (64, 4x8) = (64, 32)



Serial Vs Parallel processing

1. WSBS – word serial and bit serial
Bit-serial processing (word length n = 1, bit slice m = 1)

2. WPBS – word parallel and bit serial
Bit-slice processing (n = 1, m > 1); a slice of m bits processed together

3. WSBP – word serial and bit parallel
Word-slice processing (n > 1, m = 1); one n-bit word processed at a time

4. WPBP – word parallel and bit parallel
Fully parallel processing (n > 1, m > 1)
Parallelism Vs Pipelining
• Proposed by Wolfgang Handler
• Parallelism degree and pipelining degree combined
1. Processor control unit (PCU) – corresponds to a processor
2. Arithmetic logic unit (ALU) – the processing elements
3. Bit-level circuit (BLC) – combinational logic circuit
• Triplet: T(C) = <K x K', D x D', W x W'>
• K – no. of PCUs        K' – no. of PCUs that are pipelined
• D – no. of ALUs        D' – no. of ALUs that are pipelined
• W – word length        W' – no. of pipeline stages
Parallelism Vs Pipelining
• Example: Texas Instruments' Advanced Scientific
Computer (TI ASC)
• Has one controller
• Four arithmetic pipelines
• 64-bit word length
• 8 pipeline stages
• T(ASC) = < 1x1, 4x1, 64x8 > = < 1, 4, 64x8 >
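
Another example often quoted for this scheme (stated from memory, so verify against [1]): C.mmp
has 16 PDP-11 processors, each with a single non-pipelined ALU and a 16-bit word, giving
T(C.mmp) = < 16x1, 1x1, 16x1 > = < 16, 1, 16 >.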



Lecture 4 summary

Classifications of computer architecture


Serial Vs Parallel processing
– WSBS
– WPBS
– WSBP
– WPBP
Parallelism Vs Pipelining



Module summary

• The concept of parallelism
• Trends
• Parallelism in uniprocessor
• Parallel computer structures
• Performance
• Architectural classifications
• Applications
• Contribution of India
