by Vishal Shrivastav
Indian Institute of Technology Kharagpur
OUTLINE
Overview
- What is a Supercomputer?
- Where do we use Supercomputers?
- Differences between Supercomputers and PCs
- Brief History of Supercomputers
- Present Day Supercomputers
System Considerations
- Amdahl's Law
- Gustafson's Law
- Analogies
- Differences
Memory Considerations
- Memory Hierarchy
Processor Considerations
Case Studies
- K Computer
- Blue Gene
- Hopper Cray
Overview
What is a Supercomputer?
Wikipedia defines a Supercomputer as a computer at the frontline of current processing capacity, particularly speed of calculation.
- Crash Simulations
- Aerodynamics
Meteorology
- Weather Forecasts
- Hurricane Warnings
Applied Mathematics
CONTD...
In the 1990s, machines with thousands of processors began to appear in the US and Japan.
Intel Paragon: Ranked fastest in 1993
- A MIMD machine which connected processors via a high-speed 2-D mesh, allowing processes to execute on separate nodes, communicating via the Message Passing Interface.
Fujitsu's Numerical Wind Tunnel: Ranked fastest in 1994
- Used 166 vector processors; achieved a top speed of 1.7 gigaflops/processor.
Hitachi SR2201: Ranked fastest in 1996
- The memory hierarchy is designed so that the processor is kept fed with data and instructions at all times.
- The I/O systems have very high bandwidth.
Uses various modern processing techniques:
- Vector Processing
- Non-uniform Memory Access
- Parallel Filesystems
More than 90% of present-day Supercomputers run some form of Linux as their operating system.
CONTD...
The base programming language of Supercomputers is FORTRAN or C.
The software tools for distributed processing include:
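The Message Passing Interface named earlier is the de-facto standard for such tools, but running it needs an MPI installation. As a minimal stand-in, the message-passing model can be sketched with only the Python standard library (the worker count and the squaring task are illustrative assumptions, not from the slides):

```python
# Message-passing sketch: worker processes exchange data only through
# explicit messages (queues), never through shared variables -- the model
# that MPI formalizes. Uses the "fork" start method (POSIX only) so no
# __main__ guard is needed.
import multiprocessing as mp

ctx = mp.get_context("fork")

def worker(rank, tasks, results):
    # Each "node" receives work as messages and sends results back.
    while True:
        item = tasks.get()
        if item is None:              # sentinel: no more work
            break
        results.put((rank, item * item))

tasks, results = ctx.Queue(), ctx.Queue()
workers = [ctx.Process(target=worker, args=(r, tasks, results)) for r in range(4)]
for w in workers:
    w.start()
for n in range(8):                    # scatter 8 tasks across 4 workers
    tasks.put(n)
for _ in workers:                     # one sentinel per worker
    tasks.put(None)
squares = sorted(v for _, v in (results.get() for _ in range(8)))
for w in workers:
    w.join()
print(squares)                        # [0, 1, 4, 9, 16, 25, 36, 49]
```

In real MPI the queues become explicit send/receive calls between ranks, but the discipline is the same: no process ever touches another process's memory directly.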
Processor Considerations

Implementation         | Inst. issue | Instruction scheduling | Latency | Speedup (wrt scalar processor)
-----------------------|-------------|------------------------|---------|-------------------------------
Scalar (static)        | hardware    | static                 |         |
Scalar (dynamic)       | hardware    | dynamic                |         |
Superscalar (static)   | hardware    | static                 |         |
Superscalar (dynamic)  | hardware    | dynamic                |         |
Superpipelined         | hardware    | static                 |         | n (1 per minor cycle)
VLIW                   | software    | static                 |         |
Memory Considerations
Memory Hierarchy
Shared Memory
Distributed Memory
CONTD...
The largest and fastest computers in the world today employ a hybrid of shared and distributed memory architectures.
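The shared- and distributed-memory models can be contrasted in miniature using only the Python standard library (the array-summing task and the two-way split are illustrative assumptions):

```python
# Shared memory: threads see one address space. Distributed memory:
# processes get private copies and must exchange explicit messages.
import multiprocessing as mp
from threading import Thread

data = list(range(1000))
half = len(data) // 2

# Shared memory: both threads read the same 'data' list and write into a
# shared 'partial' list directly.
partial = [0, 0]
def shared_sum(tid):
    partial[tid] = sum(data[tid * half:(tid + 1) * half])

threads = [Thread(target=shared_sum, args=(t,)) for t in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
shared_total = sum(partial)

# Distributed memory: each process receives a private copy of its chunk
# and must send its partial sum back as a message.
ctx = mp.get_context("fork")          # POSIX only; avoids a __main__ guard
def dist_sum(chunk, out):
    out.put(sum(chunk))

q = ctx.Queue()
procs = [ctx.Process(target=dist_sum, args=(data[i * half:(i + 1) * half], q))
         for i in range(2)]
for p in procs:
    p.start()
dist_total = q.get() + q.get()
for p in procs:
    p.join()
print(shared_total, dist_total)       # both equal 499500
```

The hybrid design mentioned above combines exactly these two layers: shared memory among the cores of one node, message passing between nodes.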
System Considerations
- Each core in a multi-core processor can potentially be superscalar; that is, on every cycle, each core can issue multiple instructions from one instruction stream.
- IBM's Cell microprocessor, designed for use in the Sony PlayStation 3, is a prominent example of a multi-core processor.
Symmetric Multiprocessing
- A computer system with multiple identical processors that share memory and connect via a bus.
- Bus contention prevents bus architectures from scaling. As a result, SMPs generally do not comprise more than 32 processors.
[Diagram: processors, each with a private cache, sharing a bus to main memory and the I/O system]
Distributed Computing
- Individual memory for each processor
- Messaging interface for communication
[Diagram: processors P1, P2, ..., Pn, each with its own cache and main memory]
Cluster Computing
A cluster is a collection of standalone workstations or PCs that:
- are interconnected by a high-speed network and together constitute a computer (typically within one machine room)
- work as an integrated collection of resources (a unified computing resource)
- have a single system image spanning all its nodes
Synchronization aspects and communication are handled over the interconnection network.
A cluster consists of:
- standalone machines with storage
- a fast interconnection network
- low-latency communication protocols
- software to give a Single System Image: Cluster Middleware
- programming tools
Classification of Clusters
Non-dedicated Clusters: Network of Workstations (NOW)
- Use spare computation cycles of nodes
- Background job distribution
- Individual owners of workstations
Dedicated Clusters:
- Joint ownership
- Dedicated nodes
- Parallel computing
Homogeneous cluster: similar processors, software, etc.
Heterogeneous cluster: different architecture, data format, computational speed, system software, etc.
Cluster vs Grid
- Grid computing uses a large number of small systems spread across a large geographical region.
- Cluster computing can be said to be a subset of grid computing.
- Cluster nodes are in close proximity and interconnected by a LAN; grid nodes are geographically separate.
Limitations of Parallelism
Parallelization is the process of formulating a problem in a way that it can be solved concurrently by multiple processors.
Limitations:
- Shared resources
- Dependencies between processors
- Communication
- Load imbalance
The serial part limits speedup.
Amdahl's Law
1 processor: T(1) = s + p = 1   (s: serial part, p: parallel part)
n processors: T(n) = s + p/n
Scalability (Speedup) = T(1)/T(n) = 1/(s + (1 - s)/n)
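The formula above can be checked numerically; the serial fraction s = 0.1 below is an illustrative choice:

```python
# Amdahl's Law: speedup = T(1)/T(n) = 1 / (s + (1 - s)/n) for serial
# fraction s on n processors.
def amdahl_speedup(s, n):
    return 1.0 / (s + (1.0 - s) / n)

# Even with only 10% serial code, speedup saturates far below n:
for n in (2, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.1, n), 2))
# As n grows without bound, the speedup approaches the limit 1/s = 10.
```

With 1024 processors the speedup is still only about 9.9, which is the "serial part limits speedup" point made above.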
Gustafson's Law
- Addresses the shortcomings of Amdahl's law.
- Says that problems with large, repetitive data sets can be efficiently parallelized.
- Scaled speedup: S(P) = P - α(P - 1), where P is the number of processors, S is the speedup, and α is the non-parallelizable part of the process.
- Gives due consideration to large-scale computations and tasks.
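The scaled-speedup formula can likewise be evaluated numerically (α = 0.1 is an illustrative serial fraction):

```python
# Gustafson's Law: scaled speedup S(P) = P - alpha*(P - 1), where P is the
# number of processors and alpha is the non-parallelizable part of the
# process.
def gustafson_speedup(p, alpha):
    return p - alpha * (p - 1)

# With a fixed 10% serial fraction the speedup keeps growing with P,
# because the problem size is assumed to grow with the machine:
for p in (2, 8, 64, 1024):
    print(p, gustafson_speedup(p, 0.1))
```

Unlike Amdahl's fixed-problem-size view, the speedup here does not saturate: at P = 1024 it is roughly 922 rather than being capped near 10.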
Analogies
Amdahl's Law: Suppose two cities are 60 km apart, and a car has spent one hour travelling the first 30 km. No matter how fast it drives the last 30 km, it is impossible to achieve an average speed of 90 km/h before arriving at the destination.
Gustafson's Law: Suppose a car has already been travelling for some time at a speed of less than 90 km/h. Given enough time and distance to travel, the car's average speed can reach 90 km/h, as long as it drives faster than 90 km/h for some time. The average speed can even reach 120 km/h or 150 km/h, as long as it drives fast enough in the following part.
Differences
Amdahl's Law vs Gustafson's Law
Case Studies
K Computer
Produced by Fujitsu at the RIKEN Advanced Institute for Computational Science; the first supercomputer to top 10 petaflops.
CONTD...
Major Features
- Uses 68,544 2.0 GHz 8-core SPARC64 VIIIfx processors packed in 672 cabinets.
Blue Gene
Blue Gene is a computer architecture project with several designs:
- Blue Gene/L
- Blue Gene/C
- Blue Gene/P
- Blue Gene/Q
The project was awarded the National Medal of Technology and Innovation.
CONTD...
Major features
- Trading the speed of processors for lower power consumption.
- Dual processors per node with two working modes, including a co-processor mode (1 user process per node); the full system has 65,536 compute nodes.
- Three-dimensional torus interconnect with auxiliary networks for global communication and I/O.
CONTD...
[Figure: block scheme of the Blue Gene/L ASIC, including dual PowerPC 440 cores]
Hopper Cray
- Hopper is NERSC's first petaflop system.
- It has 153,216 compute cores.
- The size of main memory is 217 TB.
- The secondary storage (disk) size is 2 PB.
- Hopper placed number 5 on the November 2010 Top500 Supercomputer list.
CONTD...
Compute Nodes
- 6,384 nodes
- 2 twelve-core AMD 'Magny-Cours' 2.1-GHz processors per node
- 24 cores per node (153,216 total cores)
- 32 GB DDR3 1333-MHz memory per node (6,000 nodes)
- 64 GB DDR3 1333-MHz memory per node (384 nodes)
Peak Gflop/s rate:
- 8.4 Gflops/core
- 201.6 Gflops/node
- 1.28 Petaflops for the entire machine
Caches and memory channels:
- Each core has its own L1 and L2 caches, of 64 KB and 512 KB respectively
- One 6-MB L3 cache shared between 6 cores on the Magny-Cours processor
- Four DDR3 1333-MHz memory channels per twelve-core 'Magny-Cours' processor
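The peak rates quoted for Hopper follow from straightforward multiplication of the per-core rate, cores per node, and node count (all taken from the slide); the exact product is 1.2870144 PFlop/s, which the slide rounds to 1.28:

```python
# Checking Hopper's peak-performance arithmetic.
cores_per_node = 24            # two 12-core Magny-Cours processors
gflops_per_core = 8.4
nodes = 6384

gflops_per_node = gflops_per_core * cores_per_node
peak_pflops = gflops_per_node * nodes / 1e6    # 1 PFlop/s = 1e6 GFlop/s

print(round(gflops_per_node, 1))   # 201.6
print(round(peak_pflops, 2))       # 1.29
```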
CONTD...
Magny-Cours Processor
CONTD...
Interconnect
Hopper's compute nodes are connected via a custom high-bandwidth, low-latency network.
- Each network node handles not only data destined for itself, but also data to be routed through it to other nodes.
- Nodes at the "edges" of the mesh network are connected to nodes at the other edge, forming a torus.
CONTD...
[Figure: the custom chips that route communication]
CONTD...
Wiring up a Cray XE6
CONTD...
File System
All of NERSC's global file systems are available on Hopper. Additionally, Hopper has two local scratch file systems:

File system | Capacity | Aggregate Peak Performance | # of Disks
------------|----------|----------------------------|-----------
$SCRATCH    | 1 PB     | 35 GB/sec                  | 13
$SCRATCH2   | 1 PB     | 35 GB/sec                  | 13
THANKS !