
High Performance Computing

Jimmy Mathew
jimmym@rajagiritech.ac.in
Department of Information Technology
Rajagiri School of Engineering and Technology, Cochin
High Performance Computing

Module 1 – Parallel Processing



References

[1] Kai Hwang and Faye A. Briggs, "Computer Architecture and Parallel Processing"
[2] V. Rajaraman and C. Siva Ram Murthy, "Parallel Computers"
[3] V. Rajaraman, "Elements of Parallel Computing"
[4] V. Rajaraman, "Supercomputers"



Seminar topics

• Applications of supercomputers
  • 10 applications
• Contribution of India in supercomputers
• Various architectures
  • IBM 360/91, Illiac IV, Univac 1100/80


Seminars:
Seminars run in parallel with the regular classes. All seminars are done in groups of two or
three.
The concept of parallelism

• Evolution of computer science


• Generations of computers
• Need for more computation in lesser time
• Invention of networking
• Independent activities
• Multiple resources
• Parallelism


Introduction – The concept of parallelism

Parallelism was not a sudden idea. The concept grew with advances in technology, and researchers
around the globe are still working to exploit more and more of the parallelism available. In this
module we shall see how the concept evolved and eventually led to the invention of supercomputers.

The computer era began around 1938. First-generation computers were built from electromechanical
relays and vacuum tubes; ENIAC, the first general-purpose electronic digital computer, belongs to
this generation. The era of second-generation computers came by 1952, where transistors and diodes
were the building blocks; TRADIC from Bell Laboratories is one of them. Third-generation computers
(1962) saw the advancement of electronics in silicon-based integrated circuits (small and medium
scale) and multi-layered printed circuit boards; Illiac IV and IBM 360/91 are a few to mention.
The fourth generation began around 1972, when large-scale integration with high-density packaging
was used in processor fabrication. The concept of Intellectual Property (IP), or Virtual
Components (VC), has been used extensively from then onwards. Cray-1 and Univac 1100/80 were early
machines of this era. Refer to [1], pp. 2-4, for more details.

The computational capability of individual processors has been increasing ever since, and networks
helped to interconnect these processors. Various applications that are distributed in nature and
have high computational requirements became more and more dependent on the architecture and
network configurations. High-performance applications were divided into independent tasks that
were distributed among multiple processors, which may themselves be either distributed or
centralized. We aim to study the architectures of such centralized systems.
Trends in parallelism

• Computers are becoming outdated
• Importance of Moore's law
• Processing stages


Gordon Moore is a co-founder of Intel Corporation. In 1965 he published a paper that highlighted
the exponential growth in transistor density, and this prediction about processor capacity later
became known as Moore's law. Today it is understood to mean that densities double roughly every
two years, which amounts to an increase by a factor of about five to six every five years.
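
As arithmetic support for those figures: with a doubling period of two years, the density after t
years is N(t) = N(0) × 2^(t/2), so the growth over five years is 2^(5/2), which is about 5.7.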

Implications of Moore's law include increasing performance, increasing complexity, decreasing die
area, faster obsolescence of technology, and so on. In the earlier days, computers were used only
for data processing. Later, information processing began. From information processing came
knowledge processing, and from knowledge processing today's intelligence processing is derived.
Computer systems are becoming more and more capable of working with their own intelligence, and
such systems find a huge range of applications.
Trends in parallelism

• OS advancement stages
• Levels at which parallelism is exploited


OS advancements:

Advancements in operating system technology have helped greatly to improve the effective use of
processor capacity. The journey from batch processing to today's multi-processing capability marks
a series of significant milestones in OS history. Even though multi-processing capability is built
into the OS, many applications running on such an OS are still not able to exploit the full
parallelism provided by multiple processors. One reason for this is the lack of programming
languages that support parallel programming.
Program level

• Execute programs in
parallel
• Multi-programming
• Multi-processing


Multi-programming:
Concurrently executing multiple programs with one processor. True parallel processing is not
possible with a single processor; co-processors such as a DMA controller may be used to simulate
parallel processing.

Multi-processing:
Parallel execution of multiple programs with multiple processors. The central node (main
processor) acts as a distributor of load to the other processors.
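
To make the distinction concrete, here is a minimal C sketch (assuming a POSIX system; the task
function and its workload are invented for the example). The parent and child processes are two
independent programs: on a single processor the OS interleaves them (multi-programming), while on
a multi-processor machine it can run them truly in parallel (multi-processing).

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void task(const char *name) {
    long sum = 0;
    for (long i = 0; i < 100000000L; i++)  /* stand-in for real work */
        sum += i;
    printf("%s finished, sum = %ld\n", name, sum);
}

int main(void) {
    pid_t pid = fork();                    /* create a second process */
    if (pid < 0) { perror("fork"); return EXIT_FAILURE; }
    if (pid == 0) {                        /* child: one program */
        task("program B");
        return EXIT_SUCCESS;
    }
    task("program A");                     /* parent: another program */
    wait(NULL);                            /* wait for the child to finish */
    return EXIT_SUCCESS;
}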
Procedure level

• Parallelism within a
processor
• Execute functions or tasks in
parallel
• Task scheduling
• Significance of stack?


Procedure level parallelism:


Various task scheduling algorithms are used to exploit parallelism at the procedure level in a
uniprocessor. When one procedure waits for an I/O operation to complete, another procedure
executes. This is a very common phenomenon in multi-tasking systems, such as RTOSs. Round robin,
first come first served, largest job first, smallest job first, etc. are a few algorithms to
mention. In many cases two or more algorithms are used together. The events that drive this
scheduling are priority preemption, interrupts, I/O completion, programmed delays, and so on. A
stack is used to save the context of each procedure.

How will you schedule the following tasks?


Task   Priority       Execution time
T1     1 (highest)    500 ms
T2     2              250 ms
T3     2              100 ms
T4     2               50 ms
T5     3               25 ms
T6     3               25 ms
T7     3               25 ms
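
One reasonable answer (an assumption, since the slide leaves the question open) is fixed-priority
scheduling with round-robin time slicing among tasks of equal priority: T1 runs to completion
first, then T2-T4 share the processor in time slices, then T5-T7. The C sketch below simulates
that policy; the 25 ms time slice is an arbitrary choice.

#include <stdio.h>

#define NTASKS 7
#define SLICE  25                               /* time slice in ms (illustrative) */

typedef struct { const char *name; int prio; int remaining; } Task;

int main(void) {
    Task t[NTASKS] = {
        {"T1", 1, 500}, {"T2", 2, 250}, {"T3", 2, 100}, {"T4", 2, 50},
        {"T5", 3, 25},  {"T6", 3, 25},  {"T7", 3, 25}
    };
    int clock = 0;
    for (;;) {
        int best = 0;                           /* highest priority level with work left */
        for (int i = 0; i < NTASKS; i++)
            if (t[i].remaining > 0 && (best == 0 || t[i].prio < best))
                best = t[i].prio;
        if (best == 0) break;                   /* all tasks finished */
        for (int i = 0; i < NTASKS; i++) {      /* round-robin within that level */
            if (t[i].prio != best || t[i].remaining == 0) continue;
            int run = t[i].remaining < SLICE ? t[i].remaining : SLICE;
            t[i].remaining -= run;
            clock += run;
            if (t[i].remaining == 0)
                printf("%s completes at %d ms\n", t[i].name, clock);
        }
    }
    return 0;
}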
Inter-instruction level

• Parallelism between instructions


• Independent instructions can be
executed in any order
• Fill the pipeline with independent
instructions


Inter-instruction level parallelism:


If the current instruction depends on the result of a previous instruction, a dependency is said
to exist. We cannot reorder the execution of instructions when such a dependency is present. But
if a group of instructions has no dependencies among them, various forms of instruction-level
parallelism can be employed.

ADD R0 R1 R2 // R0 = R1 + R2
SUB R3 R2 R1 // R3 = R2 - R1
ST @R4 R2 // Store the value in R2 to the address in R4
LD R5 @R2 // Load R5 with the value at the address in R2

These instructions have no register dependencies, so they can be executed in any order.


More about such dependencies will be covered in coming modules.
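
As an illustration of how a compiler or hardware scheduler might detect such dependencies, here is
a minimal C sketch (the three-operand register format mirrors the example above; only register
hazards are checked, not memory ones):

#include <stdio.h>

typedef struct { const char *op; int dst; int src1; int src2; } Instr;

/* Report register hazards when 'later' follows 'earlier' in program order. */
static void check(const Instr *earlier, const Instr *later) {
    if (later->src1 == earlier->dst || later->src2 == earlier->dst)
        printf("RAW: %s -> %s\n", earlier->op, later->op);
    if (later->dst == earlier->src1 || later->dst == earlier->src2)
        printf("WAR: %s -> %s\n", earlier->op, later->op);
    if (later->dst == earlier->dst)
        printf("WAW: %s -> %s\n", earlier->op, later->op);
}

int main(void) {
    Instr add = {"ADD", 0, 1, 2};   /* ADD R0 R1 R2 */
    Instr sub = {"SUB", 3, 2, 1};   /* SUB R3 R2 R1: no hazard with ADD */
    check(&add, &sub);

    Instr use = {"SUB", 4, 0, 1};   /* reads R0, so a RAW hazard with ADD */
    check(&add, &use);
    return 0;
}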
Intra-instruction level

• Instruction dependency
causes delays
• Identify independent
instructions
• Reorder instructions
• Keep dependency order
unchanged


Intra-instruction level parallelism:


Instruction reordering can help avoid unwanted delays or stalls in a pipeline. Remember, the speed
of a pipeline is the speed of its slowest stage. If a single instruction is delayed in the
pipeline, all the instructions in the various stages of the pipeline will be delayed.
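
A small illustration in the same register notation used earlier (the sequence is invented for this
example): the SUB needs the value loaded into R1, so an in-order pipeline would stall between
them; moving the independent ADD into that slot hides part of the load delay while keeping the
dependency order unchanged.

Before reordering:
LD R1 @R0 // R1 = value at the address in R0
SUB R2 R1 R3 // must wait for the load to complete
ADD R4 R5 R6 // independent of the load

After reordering:
LD R1 @R0
ADD R4 R5 R6 // useful work while the load is in flight
SUB R2 R1 R3 // the load result is now ready, or closer to ready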
Parallelism in uniprocessor

• Known architecture
• Processor
• ALU, CU, Registers
• Memory
• Segments, hierarchy
• Input / Output

• Where is cache memory?



Parallelism in uniprocessor

1. Multiple functional units
– Coprocessors
– Multi-port RAM
– Instruction cache (I$)
– Data cache (D$)
• CDC 6600


Role of multiple functional units:


With a single copy of a resource we can achieve only partial parallelism (by means of scheduling
and instruction-level techniques). True parallelism can be achieved only by means of multiple
resources. The important resources required to achieve true parallelism are processors or
co-processors, multi-port RAM, cache memory, and an efficient network.

Co-processors can execute in parallel with the main processor through a mechanism called cycle
stealing. An example is a DMA controller. Such a controller is invoked and controlled by the main
processor. The DMA controller performs memory load and store operations without interrupting or
delaying the main processor, but the initialization, such as the starting and ending addresses, is
issued by the main processor. Without the main processor the DMA controller cannot do anything,
and after the data transfer it informs the main processor that its operation is complete.

Multi-port RAMs are RAMs with multiple ports, so more than one processor can access the RAM at the
same time. A multi-port RAM usually acts as a tightly coupled memory used to share information
among processors.

Cache is where the most recently used, most frequently used, and most private data of a processor
are stored. Cache is smaller than RAM and is built into the processor, so very privileged data are
kept there. Cache is considered loosely coupled memory because its contents are shared little, or
not at all, with other processors.
Parallelism in uniprocessor
2. Pipelining
– No fixed number of stages; pipelines with 32 stages exist
– Branch prediction
– Speculation
– Hazards: RAW, WAR, WAW
– Is RAR a hazard?
(Figure courtesy: Prof. Nigel Topham, UoE)
Parallelism in uniprocessor

3. Overlapped CPU and I/O instructions
– CPU bound
– I/O bound
4. Use of memory hierarchy
– Update only the nearest memory
– Periodic updates to lower memory
– What is data contention?
Parallelism in uniprocessor

5. Multi-programming and time sharing


– Concept of concurrency
– Scheduling
– Mix I/O bound and CPU bound instructions
– Time slice – equal opportunities to all programs
– Virtual processors for multiple programs
– Time sharing suitable for interactive multiple terminals
– Performance depends heavily on OS



Parallelism in uniprocessor

6. By balancing bandwidth of subsystems
– Performance varies from device to device
– Bottlenecks
– Bandwidth varies across subsystems
– Memory has the highest BW
– The processor has a lower BW
– Utilized BW is less than the actual BW
– Concept of interleaved memory



Bandwidth (continued)

– But the processor is faster than memory
– BW = (output) / (time)
– BW is the total output produced in unit time
– What is the BW of a processor?
– What is the BW of a memory?
– Can we use the BW fully?
– BWm > BWp
– Need to balance BW for better performance


Bandwidth:
The bandwidth of a processor is the number of instructions executed by that processor in unit
time, say a second.
The bandwidth of a memory is the number of words read from or written to the memory in unit time.
The bandwidth of a network is the number of bytes transferred through it in unit time.

Bandwidth cannot be used in full because of delays and conflicts in the system. The effective
bandwidth is, in most cases, less than the actual bandwidth.
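
A small worked example (the numbers are hypothetical): if one memory module delivers a 64-bit word
every 100 ns, its bandwidth is 1 word / 100 ns = 10 million words per second. A processor that
issues an instruction every 50 ns and needs about one word per instruction demands roughly 20
million words per second. Interleaving the memory across at least two modules raises the total
memory bandwidth above the processor's demand (BWm > BWp), which is the balancing idea referred to
above.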
Lecture 1 summary

– Concept of parallelism
– History
– Trends towards parallelism
– Technology advancement in OS
– Parallelism in uniprocessor



Parallel computer structures

• Pipeline computers
• Array processors
• Multi-processor systems
• Data flow computers

• Will be discussed in detail in coming modules



Pipeline computers

• Pipeline stages, in general


• Instruction Fetch (IF)
• Instruction Decode (ID)
• Operand Fetch (OF)
• Execution (EX)
• Memory Write (MW)
• Write Back (WB)
• What is the speed of the pipeline?


Pipeline speed:
The speed of a pipeline is the speed of its slowest stage. This means that when a single
instruction is delayed, all the instructions in the pipeline suffer the same delay.
Pipeline computers
• Non-pipelined vs pipelined
(Figure: throughput comparison; BW = 2 for the non-pipelined case, 6 for the pipelined case)


Comparison:
If the instruction cycle is taken as 5 stages, the unit time as 10 clock cycles, and no delays are
assumed, the throughput of a non-pipelined processor works out to 2 instructions while that of a
pipelined processor works out to 6. Clearly, pipelined computers are more efficient.
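
The arithmetic behind those figures: a non-pipelined 5-stage processor finishes one instruction
every 5 cycles, so in 10 cycles it completes 10 / 5 = 2 instructions. A 5-stage pipeline finishes
its first instruction after 5 cycles and one more in every cycle after that, so in t = 10 cycles
it completes t - k + 1 = 10 - 5 + 1 = 6 instructions.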
Pipeline computers

• Ideal for repeated operations


• Different instructions have different delays
• A stall in the pipeline delays every instruction behind it
• Good for vector processing – the same instruction is
executed repeatedly on similar sets of data
• Floating point operations
• Matrix multiplication
• Weather processing





Array processors

• Synchronous parallel computer


• Multiple ALUs, called processing elements (PEs)
• Central scalar control unit
• Each PE has an associated memory
• Data routing network
• Vector instructions are broadcast to the PEs for distributed
execution
• PEs are passive devices – no instruction decoding
capability



Array processing

• Associative memory
• Associative processors
• Matrix multiplication
• Merge sort
• Fast Fourier Transform
• Example – Illiac IV
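
These applications suit the lockstep style of an array processor. A minimal C sketch of the idea
(the number of PEs and the data values are arbitrary; a real array processor broadcasts the
instruction to hardware PEs instead of looping):

#include <stdio.h>

#define NPE 4   /* number of processing elements (illustrative) */

int main(void) {
    /* Each PE holds one element of the vectors in its local memory. */
    int a[NPE] = {1, 2, 3, 4};
    int b[NPE] = {10, 20, 30, 40};
    int c[NPE];

    /* The control unit broadcasts a single "add" instruction; every PE
       applies it to its own operands in lockstep. */
    for (int pe = 0; pe < NPE; pe++)
        c[pe] = a[pe] + b[pe];

    for (int pe = 0; pe < NPE; pe++)
        printf("PE%d: %d\n", pe, c[pe]);
    return 0;
}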



Multi-processor systems

• A system with two or more processors


• Share a common main memory, I/O channels, and devices
• Each processor also has private memory
• The system is controlled by a single integrated OS (custom built)
• Interconnections
• Time-shared common bus
• Crossbar switch network
• Multi-port memories
• Significance of OS capability


Dataflow computers

• Von Neumann machines


• Use program counter
• Sequential execution
• Slow processing
• Known as control flow machines
• New concept – data flow machines
• To exploit maximum parallelism
• Execute an instruction as soon as all its operands are
available, irrespective of program order, while still respecting data dependencies
Data flow representation

• Z = ((a+b) – c) * 10;
• Instruction as template: add [DST] [SRC1] [SRC2]
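
Following the template format above, one possible expansion of this statement into dataflow
instruction templates (the temporaries t1 and t2 are illustrative names):

add t1 a b // t1 = a + b; fires as soon as a and b arrive
sub t2 t1 c // t2 = t1 - c; fires as soon as t1 and c arrive
mul Z t2 10 // Z = t2 * 10; fires as soon as t2 arrives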



Lecture 2 summary

Parallel computer structures


– Pipeline computers
– Array processors
– Multi-processor systems
– Data flow computers



Parallel computer performance

• Performance is not linear


– A processor takes 10 seconds to complete a task
– In a multi-processor environment there are 10 identical
processors
– Do the 10 identical processors solve the same task in just 1 second?
– That may be a wrong assumption!
• 'n' processors cannot give 'n' times faster processing



Parallel computer performance

• Lower bound on speedup: log2 n (Minsky's conjecture)
• Upper bound: n / ln n
• Research showed that for n = 4 to n = 16 the actual speedup
was 2 to 4 times (this study was performed on the C.mmp and
S-1 systems)
• Use a small number of fast processors rather than a large
number of slow processors
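
As a quick check of those bounds: for n = 4 the two estimates are log2 4 = 2 and 4 / ln 4 ≈ 2.9,
and for n = 16 they are log2 16 = 4 and 16 / ln 16 ≈ 5.8. The observed speedups of 2 to 4
therefore track the pessimistic log2 n estimate rather than the n / ln n bound.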





Parallel computer performance

• Pipelined uniprocessor systems are still dominating; the cost
is low and the OS is well developed

• Array processors are custom designed and good for specific
applications, but programming is complex

• Multi-processor systems are general purpose and are
flexible in parallel processing; OS capability is important



Architecture classifications

• Three classifications
• Multiplicity of I – D stream (Flynn scheme)
• SISD, SIMD, MISD and MIMD
• Serial Vs parallel processing (Feng scheme)
• WSBS, WPBS, WSBP and WPBP
• Parallelism Vs pipelining (Handler scheme)
• PCU, ALU, BLC



SISD

• Michael J Flynn – American professor emeritus, Stanford University; his classification is
known as Flynn's taxonomy and is based on the multiplicity of hardware
• Instruction stream – sequence of instructions
• Data stream – input, output, temporary results
• SISD – overlapped execution or pipelining is possible
SIMD

• Single instruction stream,


multiple data stream
• Instruction broadcast
• Each PE operates on
different data sets from
same stream
• Shared memory
subsystem



MISD
• Multiple instruction stream, single data stream
• Output of one PE is input to next PE
• No real implementation, yet



MIMD
• Multiple instruction stream, multiple data stream
• Common shared memory
• Multi-processor systems are listed here



Lecture 3 summary

Parallel computer performance analysis


Classifications of computer architecture
Flynn’s taxonomy
– SISD
– SIMD
– MISD
– MIMD



Serial Vs Parallel processing
• Tse-yun Feng suggested four types of processing methods
• Maximum parallelism degree, P(C) = (n, m)
– n is the word length
– m is the bit-slice length (the number of bits, one from each
word in process, present across all pipeline stages)
• Example: the TI ASC has a 64-bit word length and four arithmetic
pipelines, each with eight stages
• P(C) = (64, 4x8) = (64, 32)



Serial Vs Parallel processing

1. WSBS – word serial and bit serial
Bit-serial processing (word length n = 1, bit slice m = 1)

2. WPBS – word parallel and bit serial
Bit-slice processing (n = 1, m > 1); a slice of m bits processed together

3. WSBP – word serial and bit parallel
Word-slice processing (n > 1, m = 1); one n-bit word processed at a time

4. WPBP – word parallel and bit parallel
Fully parallel processing (n > 1, m > 1)
Parallelism Vs Pipelining
• Proposed by Wolfgang Handler
• Parallelism degree and pipelining degree combined
1. Processor control unit (PCU) – corresponds to a processor
2. Arithmetic logic unit (ALU) – the processing elements
3. Bit-level circuit (BLC) – combinational logic circuit
• Triplet: T(C) = <K x K', D x D', W x W'>
• K – no. of PCUs        K' – no. of PCUs that are pipelined
• D – no. of ALUs        D' – no. of ALUs that are pipelined
• W – word length        W' – no. of pipeline stages
Parallelism Vs Pipelining
• Example: Texas Instruments' Advanced Scientific
Computer (TI ASC)
• Has one controller
• Four arithmetic pipelines
• 64-bit word length
• 8 pipeline stages
• T(ASC) = < 1x1, 4x1, 64x8 > = < 1, 4, 64x8 >
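
Another example often quoted for this scheme (stated from memory, so verify against [1]): C.mmp
has 16 PDP-11 processors, each with a single non-pipelined ALU and a 16-bit word, giving
T(C.mmp) = < 16x1, 1x1, 16x1 > = < 16, 1, 16 >.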



Lecture 4 summary

Classifications of computer architecture


Serial Vs Parallel processing
– WSBS
– WPBS
– WSBP
– WPBP
Parallelism Vs Pipelining



Module summary

• The concept of parallelism
• Trends
• Parallelism in uniprocessor
• Parallel computer structures
• Performance
• Architectural classifications
• Applications
• Contribution of India
