Supercomputer Architecture

Presented
by
Vishal Shrivastav
Indian Institute of Technology Kharagpur
OUTLINE
Overview
- What is Supercomputer?
- Where do we use
Supercomputers?
- Dierences between
Supercomputers and PCs
- Brief History of
Supercomputers
- Present Day Supercomputers
System Considerations
- Multi core Computing

- Symmetric Multiprocessing
- Distributed Computing
- Distributed Computing: Cluster
- Distributed Computing: Grid
- Cluster vs Grid
Limitations
- Comparing various classes of

Processors
- Amdahls Law
- Gustafsons Law
- Analogies
- Dierences
Memory Considerations
Case Studies
- Memory Hierarchy
- K Computer
- Blue Gene
- Hopper Cray
Processor Considerations
Overview
What is Supercomputer?
Wikipedia denes a Supercomputer as a computer at the frontline of
current processing capacity, particularly speed of calculation.
Supercomputer is a computer that is only one generation behind what
large-scale users want.

NilLi Neil Lincoln, architect for the CDC Cyber 205 and others
Where do we use Supercomputers?

Engineering (Automotive)
- Crash Simulations
- Aerodynamics
Meteorology
- Weather Forecasts
- Hurricane Warnings
Applied Mathematics
- Lattice Boltzmann Flow Solvers

Biology
- Simulation of HIV protease Dynamics
Dierences between Supercomputers and PCs

Supercomputers dier from PCs in following aspects:
Specialized
Memory Hierarchy
Speed
- Measured in Tera and Peta ops

Cost
- Range from $100,000s to $1,000,000s
Environment
- Require environmentally controlled rooms
Brief History of Supercomputers

In 1960s a series of Supercomputers were designed at Control Data
Corporation(CDC) by Seymour Cray

- CDC 6600, released in 1964, is considered to be the rst Supercomputer
Then arrived the Cray series of Supercomputers
- CRAY 1 : A 80Mhz Supercomputer released in 1976

- CRAY 2 : Released in 1985
An 8 processor liquid cooled computer with Fluorinert pumped
through it as it operated
Performed at 1.9 gigaops and was the world's fastest until 1990
CONTD . . .
In the 1990s machines with thousands of processors began to appear in US and
Japan
Intel Paragon : Ranked fastest in 1993
- A MIMD machine which connected processors via a high speed 2-D mesh
allowing processes to execute on separate nodes; communicating via the
Message Passing Interface
Fujitsu's Numerical Wind Tunnel : Ranked fastest in 1994
- Used 166 vector processors; achieved top speed of 1.7 gigaops / processor
Hitachi SR2201 : Ranked fastest in 1996
- Used 2048 processors connected via 3-D crossbar network

- Achieved peak performance of 600 gigaops
Present Day Supercomputers

Characterized by very high degree of parallelism
- Design of memory hierarchy is such that processor is kept fed with data and
instructions at all time
- The I/O systems have very high bandwidth
Uses various modern processing techniques:
- Vector Processing
- Non-uniform Memory access
- Parallel Filesystems
More than 90% of present day Supercomputers run some form of LINUX as their
operating system
CONTD . . .
The base programming language of Supercomputers is FORTRAN or C
The software tools for distributed processing include:
- APIs such as MPI and PVM, VTL

- Open source-based software solutions such as Beowulf, WareWulf and
openMosix
An easy programming language for Supercomputers remains an open research
topic in computer science
Processor
Considerations
Comparing various classes of Processors

Class of
Processor
Implementation Instruction
scheduling
Inst. Issue
Latency
Speedup (wrt
scalar
processor)
Scalar (static)
hardware
static
Scalar
(dynamic)
hardware
dynamic
Superscalar
(static)
hardware
static
Superscalar
(dynamic)
hardware
dynamic
Super-
pipelined
hardware
static
n
(1 per minor
cycle)
VLIW
software
static
Memory
Considerations
Memory Hierarchy
Shared Memory
Distributed Memory
Memory space is shared between
Many processors, each with local
multiple processors for read & write

operations
Processors interact by modifying
data objects in the shared address
space
Memory-CPU bandwidth limits its
use
memory accessible only to it, are

connected via an interconnection
network (Highly Scalable)
They communicate by passing
messages to each other
Non-uniform memory access model
(NUMA)
CONTD . . .
The largest and fastest computers in the world today employ hybrid of shared
and distributed memory architectures

At low level compute nodes comprise multiple processors sharing same
address space (shared memory)
At higher level Distributed Memory
System
Considerations
Multi core Computing

- A processor that includes multiple execution units (cores) on the same
chip
- Each core in a multi core processor can potentially be superscalar that is,
on
every cycle, each core can issue multiple instructions from one instruction
stream
- IBMs Cell microprocessor, designed for use in the Sony Playstation 3, is a
prominent example of multi core processor
Symmetric Multiprocessing
- A computer system with multiple identical processors that share memory
and connect via bus
- Bus contention prevents bus architectures from scaling. As a result, SMPs
generally do not comprise more than 32 processors.
P
Cache
Cache
Cache
Cache
Main memory
I/O system
Distributed Computing
Individual memory for each processor
Messaging interface for communication
P1
P2
Pn
Cache
Cache
Cache
Main
Memory
Main
Memory
Main
Memory
200+ processors can work on same application

But still not independent and collectively
constitute a computer
Synchronization aspects and communication
overheads are the major bottlenecks

Network
Distributed Computing: Cluster

Cluster dened:
Coordinated use of interconnected autonomous computers in a
machine room
a collection of standalone workstations of PCs that are interconnected
by a high-speed network
work as an integrated collection of resources (unied computing
resource)
have a single system image spanning all its nodes
A cluster consists of:
standalone machines with storage
a fast interconnection network
Low latency communication protocols
software to give Single System Image: Cluster Middleware
Programming tools
Classication of Clusters
Non-dedicated Clusters:
Network of Workstations
(NOW)
Use spare computation cycles
of nodes
Background job distribution
Individual owners of
workstations
Dedicated Clusters:
Joint ownership
Dedicated nodes
Parallel computing
Homogeneous cluster:
Similar processors
Software, etc
Heterogeneous:
Dierent
Architecture
data format
computational speed
system software, etc
Distributed Computing: Grid

Term grid computing originated in the early 1990s as a metaphor for making
computer power as easy to access as an electric power grid.

Makes use of computers communicating over the Internet to work on a given
problem
Deals only with embarrassingly parallel problems, owing to low bandwidth and
extremely high latency available on internet
IBMs Grid computing has put forward:
Use open standards and protocols to gain access to computing resources
over the Internet
Uses a large number of small systems spread across a large geographical region
and presents a unied picture to the users.

Availability of high speed networking surmounts the distance problem.
The Big Question

At some level, all applications share common needs.
how to:
nd resources?
acquire resources?
locate and move data?
start/monitor computation?
all securely and conveniently?
Solution:
Grid middleware: a single software infrastructure supports all of the above.
Grid Middleware Components

Grid Information Service (GIS):
Support registering and querying grid resources.
Grid Resource Broker (GRB):
End users submit their application requirements to GRB.

GRB discovers resources by querying GIS.
Grid fabric: manages resources
computers, storage devices, scientic instruments, etc.
Core grid middleware: Oers services
Process management, allocation of resources, security
User-level grid Middleware: Oers services:
Programming tools, resource brokers, scheduling application tasks for
execution on global resources.
Cluster vs Grid
Cluster computing can be said to be a subset of grid computing.
Cluster nodes are in close proximity and interconnected by LAN
Grid nodes are geographically separate.
Clusters provide guarantee of service

Nodes are expected to give full resource.
Clusters are usually a homogeneous set of nodes.

Availability and performance of grid resources are unpredictable:
Requests from within an administrative domain may gain more
priority over requests from outside.
Limitations
Limitations of Parallelism
Parallelization is the process of formulating a problem in a way that
lends itself to concurrent execution by several execution units of some

kind
Using N processors, we expect the execution time to come down N
times ideally. This is termed as speedup of N

Reality is not so perfect
Limitations:
- Shared resources
- Dependencies between processors
- Communication
- Load imbalance
The serial part limits speedup
Amdahls Law
1 processor : T(1) = s + p = 1 (s: serial part p: parallel part)
n processors : T(n) = s + (p/n)
Scalability (Speed Up) = T(1)/T(n) = 1 /(s + (1-s/n))
Gustafson's Law
- Addresses the shortcomings of Amdahl's law
- Says that problems with large, repetitive data sets can be eciently parallelized
where P is the number of processors, S is the speedup, and the non-
parallelizable part of the process
- Adding due consideration for large scale consideration and tasks
Analogies
Amdahls Law
Suppose two cities are 60 km apart, a car has spent one hour travelling the rst
30 km. No matter how fast it drives the last 30 km, it is impossible to achieve an
average speed of 90 km/h before arriving the destination
Gustafsons Law
Suppose a car has already been travelling for some time at speed of less than
90km/h, and when given enough time and distance to travel, the cars average
speed can reach 90km/h as long as it drives faster than 90 km/h for some time.
And also the average speed can reach 120km/h and even 150km/h as long as it
drives fast enough in the following part
Dierences
Amdahls Law
Gustafsons Law
Does not scale the availability of
Proposes that programmers set the
computing power as the number of

machines increases
Based on xed workload or
xed problem size. It implies that

the sequential part of a program
does not change with respect to
machine size (i.e., the number of
processors). However the parallel
part is evenly distributed
over P processors.
size of problems to use the available

equipment to solve problems within
a practical xed time
Therefore, if faster (more parallel)
equipment is available, larger

problems can be solved in the same
time
Redenes eciency as a need to
minimize the sequential part of a

program, even if it increases the
total amount of computation
Case Studies
K Computer
Produced by Fujitsu at the RIKEN Advanced
Institute for Computational Science campus

in Kobe, Japan
In June of 2011, K became the world's fastest
Supercomputer, as recorded by the TOP500,

with a rating of over 8 petaops (quadrillion
calculations per second)
In November 2011, it became the rst computer
to top 10 petaops
CONTD . . .
Major Features
Uses 68,544 2.0Ghz 8-core SPARC64 VIIIfx processors packed in 672 cabinets,
for a total of 548,352 cores

It uses 45nm CMOS technology
Each cabinet has 96 compute nodes, 6 IO nodes where each node contains a
single processor and 16GB of memory
Uses a 6-D torus network interconnect called Tofu, and a Tofu-optimized
Message Passing Interface based on Open MPI library
Adopts 2-level local/global lesystem

Fujitsu developed an optimized parallel le system based on Lustre, called
Fujitsu Exabyte File System, scalable to several hundred petabytes
Blue Gene
Blue Gene is a computer architecture
project carried out by a cooperative

eorts of IBM, Lawerence Livermore
National Lab, US Department of
Energy and academia
4 Blue Gene projects are in progress:
- Blue Gene/L
- Blue Gene/C
- Blue Gene/P
- Blue Gene/Q
The project was awarded the National Medal of Technology and Innovation
by US President Barack Obama in 2009
CONTD . . .
Major features
Trading the speed of processors for lower power consumption.
Dual processors per node with two working modes: co-processor (1 user
process/node: computation and communication work is shared by two

processors) and virtual mode (2 user processes/node)
A large number of nodes (scalable in increments of 1024 up to at least
65,536)
Three-dimensional torus interconnect with auxiliary networks for global
communications, I/O, and management
Lightweight OS per node for minimum system overhead (computational
noise)
CONTD . . .
The block scheme of the Blue Gene/L ASIC including dual PowerPC 440 cores
Hopper Cray
Hopper is NERSC's rst petaop system
It has 153,216 compute cores
The size of main memory is 217 TB
The secondary storage (disk) size is 2 PB
Hopper placed number 5 on the November 2010 Top500 Supercomputer list.
CONTD . . .
Compute Nodes
6,384 nodes
2 twelve-core AMD 'Magny Cours' 2.1-GHz processors per node
24 cores per node (153,216 total cores)
32 GB DDR3 1333-MHz memory per node (6,000 nodes)
64 GB DDR3 1333-MHz memory per node (384 nodes)
Peak Gop/s rate:
8.4 Gops/core
201.6 Gops/node
1.28 Peta-ops for the entire machine
Each core has its own L1 and L2 caches, with 64 KB and 512KB respectively
One 6-MB L3 cache shared between 6 cores on the Magny Cours processor
Four DDR3 1333-MHz memory channels per twelve-core 'Magny Cours'
processor
CONTD . . .
Magny Cours Processor
CONTD . . .
Interconnect
Hopper's compute nodes are connected via a custom high-bandwidth, low-
latency network provided by Cray
Each network node handles not only data destined for itself, but also data to be
relayed to other nodes
Nodes at the "edges" of the mesh network are connected to nodes at the other
edge to form a 3-D torus
CONTD . . .
The custom chips that route communication
over the network are known as "Gemini" and

the entire network is often referred to as the
"Cray Gemini Network."
Each pair of Hopper nodes (containing 48 cores)
is connected to one Gemini Application-Specic

Integrated Circuit (AISC)
Cray Gemini Interconnect with an "exploded

view of the ASIC
CONTD . . .
Wiring up a Cray XE6
CONTD . . .
File System
All of NERSC's global le systems are available on Hopper. Additionally,
Hopper has 2 PB of locally attached high-performance /scratch disk space
Scratch File Systems

Size (TB)
Aggregate Peak
Performance
# of Disks
$SCRATCH
1 PB
35 GB/sec
13
$SCRATCH2
1 PB
35 GB/sec
13
THANKS !

Supercomputer Architecture

Încărcat de

Informații document

Titlu original

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

Supercomputer Architecture

Încărcat de

Drepturi de autor:

Formate disponibile

Presented

- Multi core Computing

- Comparing various classes of

current processing capacity, particularly speed of calculation.

Supercomputer is a computer that is only one generation behind what

large-scale users want.

Where do we use Supercomputers?

- Lattice Boltzmann Flow Solvers

- Simulation of HIV protease Dynamics

Dierences between Supercomputers and PCs

- Measured in Tera and Peta ops

Brief History of Supercomputers

Corporation(CDC) by Seymour Cray

- CRAY 1 : A 80Mhz Supercomputer released in 1976

- Used 2048 processors connected via 3-D crossbar network

Present Day Supercomputers

- APIs such as MPI and PVM, VTL

topic in computer science

Comparing various classes of Processors

Memory space is shared between

Many processors, each with local

multiple processors for read & write

memory accessible only to it, are

and distributed memory architectures

Multi core Computing

200+ processors can work on same application

overheads are the major bottlenecks

Distributed Computing: Cluster

Distributed Computing: Grid

computer power as easy to access as an electric power grid.

over the Internet

and presents a unied picture to the users.

The Big Question

Grid middleware: a single software infrastructure supports all of the above.

Grid Middleware Components

Support registering and querying grid resources.

Grid Resource Broker (GRB):

End users submit their application requirements to GRB.

Grid fabric: manages resources

computers, storage devices, scientic instruments, etc.

Core grid middleware: Oers services

Process management, allocation of resources, security

User-level grid Middleware: Oers services:

Programming tools, resource brokers, scheduling application tasks for

execution on global resources.

Clusters provide guarantee of service

Clusters are usually a homogeneous set of nodes.

priority over requests from outside.

lends itself to concurrent execution by several execution units of some

Using N processors, we expect the execution time to come down N

times ideally. This is termed as speedup of N

Does not scale the availability of

Proposes that programmers set the

computing power as the number of

xed problem size. It implies that

size of problems to use the available

Therefore, if faster (more parallel)

equipment is available, larger

Redenes eciency as a need to

minimize the sequential part of a

Institute for Computational Science campus

Supercomputer, as recorded by the TOP500,

for a total of 548,352 cores