Jimmy Mathew
jimmym@rajagiritech.ac.in
Department of Information Technology
Rajagiri School of Engineering and Technology, Cochin
High Performance Computing
Seminars:
Seminars shall go in parallel with regular classes. All the seminars shall be in groups of two or
three.
The concept of parallelism
Parallelism was not a sudden idea. The concept grew with advances in technology, and researchers around the globe are still exploiting ever more of the parallelism available. In this module, we shall see how this concept evolved and eventually led to the invention of supercomputers.
The growth of the computer era began around 1938, and ENIAC (completed in 1946) was among the first general-purpose electronic digital computers. The technology used in first-generation computers was electromechanical relays and vacuum tubes. The era of second-generation computers came by 1952, when transistors and diodes were the building blocks; TRADIC from Bell Laboratories is one among them. Third-generation (1962) computers saw the advancement of electronics to silicon-based integrated circuits (in small and medium scale) and multi-layered printed circuit boards; the Illiac IV and IBM 360/91 are a few to mention. The fourth generation began around 1972, with large-scale integration used in processor fabrication and high-density packaging. The concepts of intellectual property (IP) cores or virtual components (VC) were used extensively from then onwards; the Cray-1 and UNIVAC 1100/80 were early machines of this era. Refer to [1], pp. 2-4, for more details.
The computational capability of individual processors has been increasing ever since, and networks helped to interconnect these processors. Various applications that are distributed in nature and have high computational requirements became more and more dependent on the architecture and network configuration. High-performance applications were divided into independent tasks and distributed among various processors, which are either distributed or centralised. We aim to study the architectures of such centralised processors.
Trends in parallelism
Gordon Moore is a co-founder of Intel Corporation; his prediction about processor capacity later came to be called Moore's law. In 1965 he published a paper highlighting the exponential growth in transistor densities, and this became known as Moore's Law. Today it is commonly stated as densities doubling every two years, which corresponds to an increase by a factor of roughly five every five years (2^2.5 ≈ 5.7).
The implications of Moore's law include increased performance, increased complexity, decreased die area, faster technology obsolescence, and so on. In the early days, computers were used only for data processing. Later, information processing began; from information, knowledge processing started; and from knowledge processing, today's intelligence processing is derived. Computer systems are becoming more and more capable of working with their own intelligence, and such systems have a huge range of applications.
OS advancements:
Advancements in operating system technology have helped greatly to improve the effective use of processor capacity. The progression from batch processing to today's multi-processing capability marks significant milestones in OS history. Even though multi-processing capability is built into the OS, many applications running on such an OS are still not capable of exploiting the full parallelism provided by multiple processors. One reason for this is the lack of programming languages that support parallel programming.
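To make this concrete, here is a minimal sketch of explicit parallel programming in C with POSIX threads (the array-summing task and all names are our illustration, not from the notes): the application splits the work itself, so the OS can schedule the two threads on different processors.

#include <pthread.h>
#include <stdio.h>

#define N 8

/* Each thread sums its own half of the array. */
typedef struct { const int *data; int count; long sum; } chunk;

static void *partial_sum(void *arg) {
    chunk *c = (chunk *)arg;
    c->sum = 0;
    for (int i = 0; i < c->count; i++)
        c->sum += c->data[i];
    return NULL;
}

int main(void) {
    int a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    chunk lo = {a, N / 2, 0}, hi = {a + N / 2, N / 2, 0};
    pthread_t t1, t2;
    pthread_create(&t1, NULL, partial_sum, &lo);  /* may run on one processor */
    pthread_create(&t2, NULL, partial_sum, &hi);  /* may run on another */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("sum = %ld\n", lo.sum + hi.sum);       /* prints 36 */
    return 0;
}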
Program level
• Execute programs in parallel
• Multi-programming
• Multi-processing
Multi-programming:
Concurrently executing multiple programs with one processor. True parallel processing is not possible with a single processor; we may need to use co-processors, such as a DMA controller, to simulate parallel processing.
Multi-processing :
Parallel execution of multiple programs with multiple processors. The central node (main
processor) acts as a distributor of load to each processor.
Procedure level
• Parallelism within a processor
• Execute functions or tasks in parallel
• Task scheduling
• Significance of stack?
ADD R0 R1 R2 // R0 = R1 + R2
SUB R3 R2 R1 // R3 = R2 - R1
ST @R4 R2 // Store value in R2 to address in R4
LD R5 @R2 // Load R5 with value in the address of R2
• Instruction dependency causes delays
• Identify independent instructions
• Reorder instructions
• Keep dependency order unchanged (see the sketch below)
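As a small illustration of identifying independent instructions, the following C sketch (a toy model; the three-register instruction encoding is an assumption, and memory operands as in ST/LD are not modelled) tests the classic hazards between an earlier and a later instruction:

#include <stdio.h>

/* Three-address register instruction: one destination, two sources. */
typedef struct { int dst, src1, src2; } instr;

/* RAW: the later instruction reads a register the earlier one writes. */
static int raw(instr a, instr b) { return b.src1 == a.dst || b.src2 == a.dst; }
/* WAR: the later instruction writes a register the earlier one reads. */
static int war(instr a, instr b) { return b.dst == a.src1 || b.dst == a.src2; }
/* WAW: both instructions write the same register. */
static int waw(instr a, instr b) { return b.dst == a.dst; }

/* Independent instructions may be reordered without changing the result. */
static int independent(instr a, instr b) {
    return !raw(a, b) && !war(a, b) && !waw(a, b);
}

int main(void) {
    instr add = {0, 1, 2};  /* ADD R0 R1 R2 */
    instr sub = {3, 2, 1};  /* SUB R3 R2 R1 */
    printf("ADD/SUB independent? %s\n", independent(add, sub) ? "yes" : "no");
    return 0;
}

Since SUB neither reads R0 nor writes R1 or R2, the pair is independent and may be reordered.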
• Known architecture
• Processor – ALU, CU, registers
• Memory – segments, hierarchy
• Input / Output
1. Multiple functional units
– Coprocessors
– Multi-port RAM
– Instruction cache (I$)
– Data cache (D$)
• Example – CDC 6600
Co-processors can execute in parallel with the main processor through a process called cycle stealing. An example is a DMA controller. Such a controller is invoked and controlled by the main processor: it performs memory load and store operations without interrupting or delaying the main processor, but the initialisation, such as the starting and ending addresses, is issued by the main processor. Without the main processor, the DMA controller cannot do anything. After the data transfer, the DMA controller informs the main processor that its operation is complete.
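As a rough model of this handshake, the following C sketch (a toy abstraction; the structure and function names are invented for illustration and are not a real DMA API) shows the main processor programming the transfer, the controller copying the data on its own, and the completion flag being reported back:

#include <stdio.h>
#include <string.h>

/* One DMA channel: source, destination, length, completion flag. */
typedef struct {
    const char *src;
    char *dst;
    size_t len;
    int done;   /* stands in for the completion interrupt */
} dma_channel;

/* Main-processor side: issue the initialisation (addresses and length). */
static void dma_program(dma_channel *ch, const char *src, char *dst, size_t len) {
    ch->src = src; ch->dst = dst; ch->len = len; ch->done = 0;
}

/* Controller side: move the data without involving the main processor,
   then report completion. */
static void dma_run(dma_channel *ch) {
    memcpy(ch->dst, ch->src, ch->len);
    ch->done = 1;
}

int main(void) {
    char buf[16] = {0};
    dma_channel ch;
    dma_program(&ch, "hello", buf, 6);  /* main processor initialises */
    dma_run(&ch);                       /* transfer proceeds on its own */
    if (ch.done)
        printf("DMA complete: %s\n", buf);
    return 0;
}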
Multi-port RAMs are RAMs with multiple ports, so more than one processor can access the RAM at the same time. A multi-port RAM usually acts as a tightly coupled memory for sharing information among processors.
Cache is where the most recently used, most frequently used, and most private data of a processor are stored. A cache is smaller than RAM and is built into the processor, so only privileged data are stored there. Cache is considered loosely coupled memory, because its contents are shared little or not at all with other processors.
Parallelism in uniprocessor
2. Pipelining
– No fixed number of stages; pipelines with 32 stages exist
– Branch prediction
– Speculation
– Hazards: RAW, WAR, WAW
– Is RAR a hazard?
(Courtesy: Prof. Nigel Topham, UoE)
Bandwidth:
The bandwidth of a processor is the number of instructions executed by that processor in unit time, say one second.
The bandwidth of a memory is the number of words read from or written to the memory in unit time.
The bandwidth of a network is the number of bytes transferred through it in unit time.
Bandwidth cannot be used in full because of delays and conflicts in the system; the effective bandwidth is, in most cases, less than the peak bandwidth.
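For illustration (the figures here are assumed, not from the slides): a memory that delivers one 8-byte word every 10 ns has a peak bandwidth of 8 B / 10 ns = 800 MB/s. If conflicts waste one access cycle in every four, the effective bandwidth falls to 0.75 × 800 MB/s = 600 MB/s.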
Lecture 1 summary
– Concept of parallelism
– History
– Trends towards parallelism
– Technology advancement in OS
– Parallelism in uniprocessor
• Pipeline computers
• Array processors
• Multi-processor systems
• Data flow computers
Pipeline speed:
The speed of a pipeline is the speed of its slowest stage. Consequently, when a single instruction is delayed, all the instructions behind it in the pipeline suffer the same delay.
Pipeline computers
• Non-pipelined vs pipelined (figure: non-pipelined BW = 2)
Comparison:
If the instruction cycle is taken as 5 stages, the unit time as 10 clock cycles, and no delays are assumed, a non-pipelined processor has a throughput of 2 instructions, while a pipelined processor has a throughput of 6. Clearly, pipelined computers are more efficient.
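These figures follow directly from the stage count (the symbols n and k are introduced here for the derivation). With k = 5 stages and n = 10 clock cycles:
Non-pipelined: one instruction completes every k cycles, so throughput = n / k = 10 / 5 = 2 instructions.
Pipelined: the first instruction completes after k cycles and one more completes every cycle after that, so throughput = n - k + 1 = 10 - 5 + 1 = 6 instructions.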
Array processors
• Associative memory
• Associative processors
• Matrix multiplication
• Merge sort
• Fast Fourier Transform
• Example – Illiac IV
• Z = ((a+b) – c) * 10;
• Instruction as template: add [DST] [SRC1] [SRC2]
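One possible sequence for this expression using the template above (the register numbering, the mul mnemonic, and the immediate operand #10 are assumptions for illustration; a, b and c are presumed already loaded into R1, R2 and R3):

add R4 R1 R2   // R4 = a + b
sub R5 R4 R3   // R5 = (a + b) - c
mul R6 R5 #10  // Z = R6 = R5 * 10

Note that each instruction reads the result of the previous one (RAW dependences on R4 and R5), so this sequence cannot be reordered.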
• Three classifications
• Multiplicity of instruction and data streams (Flynn scheme): SISD, SIMD, MISD and MIMD
• Serial vs parallel processing (Feng scheme): WSBS, WPBS, WSBP and WPBP
• Parallelism vs pipelining (Handler scheme): PCU, ALU, BLC
Flynn's taxonomy (Michael J. Flynn)
• Instruction stream – sequence of instructions
• Data stream – input, output, temporary results
SIMD