
An Introduction to High Performance Computing

Stuart Rankin
sjr20@cam.ac.uk
High Performance Computing Service (http://www.hpc.cam.ac.uk/)
University Information Services (http://www.uis.cam.ac.uk/)
29th June 2015 / UIS Training for the MRC Cancer Unit
Health and Safety

Welcome

I Please sign in on the attendance sheet.


I Please fill in the online feedback at the end of the course:
http://feedback.training.cam.ac.uk/ucs/form.php
I Keep your belongings with you.
I The printer will not work.
I Please ask questions and let us know if you need assistance.

Plan of the Course

Part 1: Basics
Part 2: High Performance Computing Service
Part 3: Using a HPC Facility
11:00-11:30 Practical and break
12:00-13:30 Practical and break for lunch
14:00-14:30 Practical
14:45-15:15 Practical
15:30-CLOSE Further discussion


Part I: Basics
Basics: Outline

Why Buy a Big Computer?

Inside a Modern Computer

How to Build a Supercomputer

Programming a Multiprocessor Machine

Basics: Why Buy a Big Computer?

What types of big problem might require a Big Computer?


Compute Intensive: A single problem requiring a large amount of
computation.
Memory Intensive: A single problem requiring a large amount of
memory.
Data Intensive: Operation on a large amount of data.
High Throughput: Many unrelated problems to be executed over a
long period.

Basics: Compute Intensive Problems
I Distribute the work across multiple CPUs to reduce the execution
time as far as possible.
I Program workload must be parallelised.
Parallel programs split into copies (threads).
Each thread performs a part of the work on its own
CPU, concurrently with the others.
A well-parallelised program will fully exercise as
many CPUs as it has threads.
I The CPUs may need to exchange data rapidly, using specialized
hardware.
I Large systems running multiple parallel jobs also need fast access
to storage.
I Many use cases from Physics, Chemistry, Engineering, Astronomy,
Biology...
I The traditional domain of HPC and the Supercomputer.

Basics: Scaling & Amdahl's Law

I Using more CPUs is not necessarily faster.

I Typically parallel codes have a scaling limit.
I Partly due to the system overhead of managing more threads, but
also to more basic constraints:
I Amdahl's Law (slightly simplistic model)

S(N) = 1 / ((1 - p) + p/N)

where
p is the fraction of the program which can be parallelized
N is the number of processors
S(N) is the factor by which the program has sped up
relative to N = 1.
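A quick worked example of what the formula implies: with p = 0.95 and N = 16, S(16) = 1 / ((1 - 0.95) + 0.95/16) ≈ 9.1, and however many processors are added the speedup can never exceed 1/(1 - p) = 20.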
Basics: Amdahl's Law

http://en.wikipedia.org/wiki/File:AmdahlsLaw.svg

The Bottom Line

I Parallelisation requires effort:


I First optimise performance on one CPU.
I Then make p as large as possible.
I Eventually using more CPUs is detrimental.

Basics: Data Intensive Problems

I Distribute the data across multiple CPUs to process it in a reasonable time.
I Note that the same work may be done on each data segment.
I Rapid movement of data in and out of (disk) storage becomes
important.
NB Memory and storage are usually different things.
I Big Data and how to efficiently process it currently occupies much
thought.
I Life Sciences (genomics).

Basics: High Throughput

I Distribute work across multiple CPUs to reduce the overall execution time as far as possible.
I Workload is trivially (or embarrassingly) parallel.
Workload breaks up naturally into independent pieces.
Each piece is performed by a separate process on a separate CPU
(concurrently).
I Emphasis is on throughput over a period, rather than on
performance on a single problem.
I Obviously a supercomputer can do this too.

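A minimal sketch of the embarrassingly parallel pattern (illustrative only: task, input_1 ... input_16 and output_1 ... output_16 are made-up names, and on a shared cluster this would normally be launched through the job scheduler rather than by hand):

for i in $(seq 1 16); do
    ./task input_$i > output_$i &   # one independent process per CPU core
done
wait                                # wait for all 16 processes to finish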
Basics: Memory Intensive Problems

I Aggregate sufficient memory to enable solution at all.


I Technically more challenging if the program cannot be parallelised
efficiently.
I Historically, the arena of large SGI systems.

Basics: Inside a Modern Computer

I Even small computers now have multiple CPU cores per socket
⇒ each socket contains a Symmetric Multi-Processor (SMP).
I Larger computers have multiple sockets (each with local memory)
⇒ Non-Uniform Memory Architecture (NUMA).
I CPU cores also have vector (data-parallel) acceleration (SSE/AVX).
I Today's ordinary computer is yesterday's supercomputer (with
much of the same complication).
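You can see this layout for yourself on any Linux machine with standard tools (nothing HPC-specific; numactl may need to be installed):

lscpu                # sockets, cores per socket, NUMA nodes, and the SSE/AVX flags
numactl --hardware   # NUMA nodes and the memory local to each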
Basics: How to Build a Supercomputer

I A supercomputer aggregates contemporary CPUs to obtain increased computing power.
I Usually today these are clusters.

Basics: How to Build a Supercomputer

1. Take some (multicore) CPUs and add some memory.


I Could be an off-the-shelf server, or something more special.
I A NUMA multiprocessor building block: a node.
I All CPU cores (unequally) share the node memory
⇒ the node is a shared memory multiprocessor.
Basics: How to Build a Supercomputer

2. Connect the nodes with a network or networks:
Gbit Ethernet: 100 MB/sec
FDR Infiniband: 5 GB/sec

Faster network is for inter-CPU communication across nodes.
Slower network is for
management and provisioning.
Storage may use either.

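(Rough arithmetic behind those figures: 1 Gbit/s divided by 8 is 125 MB/s, of which roughly 100 MB/s is achievable in practice; 4X FDR Infiniband runs 4 lanes at 14 Gbit/s each, i.e. 56 Gbit/s, which after encoding and protocol overheads leaves on the order of 5 GB/s of usable bandwidth.)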
Basics: How to Build a Supercomputer

3. Logically bind the nodes


I Clusters consist of distinct nodes (i.e. separate Linux computers)
on common private network(s) and managed centrally.
Clusters are distributed memory machines.
Each task sees only its local node (without help).
Each task must fit within a single node's memory.
I More expensive machines logically bind nodes into a single Linux
system.
E.g. SGI UV.
These are shared memory machines.
Logically one big node (but very non-uniform).

Basics: Programming a Multiprocessor Machine

I Non-parallel (serial) code


For a single node as for a workstation.
Typically run as many copies per node as cores, assuming node
memory is sufficient.
Replicate across multiple nodes.

Basics: Programming a Multiprocessor Machine

I Parallel code
Shared memory methods within a node.
E.g. pthreads, OpenMP.
Distributed memory methods between nodes.
Message Passing Interface (MPI).

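A rough illustration of how the two models are typically launched (a sketch only: omp_prog and mpi_prog are hypothetical executables, and the exact commands depend on the MPI library and scheduler in use):

export OMP_NUM_THREADS=16    # shared memory: 16 threads within one node
./omp_prog

mpirun -np 64 ./mpi_prog     # distributed memory: 64 MPI ranks spread across nodes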
Basics: Summary

I Why have a supercomputer?


I Big problems, long problems, many problems, Big Data.
I Most current supercomputers are clusters of separate nodes.
I Each node has multiple cores and non-uniform shared memory.
I Parallel code uses shared memory (pthreads/OpenMP) within a
node, distributed memory (MPI) between nodes.
I Non-parallel code uses the memory of one node, but may be
copied across many.


Part II: The High Performance Computing Service


HPCS: Outline

A Brief History

Darwin - an Infiniband CPU Cluster

Wilkes - a Dual-Rail Infiniband GPU Cluster

The CU Cluster

Other Activities

Recent Developments

HPCS: A Brief History

Created: 1996 (as the HPCF).


Mission: Delivery and support of a large HPC resource for use by
the University of Cambridge research community.
Self-funding: Paying and non-paying service levels.
User base: Includes DiRAC (STFC) and industrial users.
Plus: Hosted clusters and research projects.

Absorbed into the UIS in 2014 (part of Research & Institutional Services).

HPCS: A Brief History

1997 76.8 Gflop/s
2002 1.4 Tflop/s
2006 18.27 Tflop/s
2010 30 Tflop/s
2012 183.38 Tflop/s
2013 183.38 CPU + 239.90 GPU Tflop/s

http://www.top500.org

Darwin1 (2006–2012)

Darwin3 (2012)(b) & Wilkes (2013)(f)

Darwin: an Infiniband CPU Cluster

I Each compute node:


2x8 cores, Intel Sandy Bridge 2.6 GHz.
64 GB RAM (63900 MB usable).
56 Gb/sec (4X FDR) Infiniband (for MPI and storage).
I 600 compute nodes (300 belong to Cambridge).
I 8 login nodes (login.hpc.cam.ac.uk).

Wilkes: a Dual-Rail Infiniband GPU Cluster

I Each compute node:


2 NVIDIA Tesla K20c GPUs.
2x6 cores, Intel Ivy Bridge 2.6 GHz.
64 GB RAM (63900 MB usable).
2 × 56 Gb/sec (4X FDR) Infiniband (for MPI and storage).
I 128 compute nodes.
I 8 login nodes (login.hpc.cam.ac.uk).
I Environment shared with Darwin (same filesystems, user
environment, scheduler).

HPCS Production Cluster Schematic

The CU Cluster

I Each compute node:


2x8 cores, Intel Sandy Bridge 2.6 GHz.
64 GB RAM (63900 MB usable).
56 Gb/sec (4X FDR) Infiniband (for MPI).
1 Gb/sec Ethernet (for storage).
I 15 compute nodes.
I 1 login node (192.168.43.182).
I 14 TB NFS storage (2 TB /home + 12 TB /scratch)

HPCS: Other Activities

I Hosted clusters
e.g. MRC BSU, Cardiovascular Epidemiology, Whittle Lab.
I Research projects
e.g. Square Kilometre Array, Jaguar Land Rover.
I Integration and consultancy services
I Industrial services

HPCS: Recent Developments

I Closer integration with UIS services (e.g. UIS Password).

I Relocated during February 2015 to the West Cambridge Data Centre.
I New UIS organisational structure in May:
HPCS is a subdivision of Research & Institutional Services.
I New services trialled soon as part of the CBC Pilot Project:
I Virtual Server Service

The West Cambridge Data Centre

The West Cambridge Data Centre: Hall 1


Part III: Using HPC


Using HPC: Outline

Security

Connecting

User Environment

Software

Job Submission

Using HPC: Security

I Cambridge IT is under constant attack by would-be intruders.


I Big systems are big, juicy targets.
I Your data and research career are threatened by intruders.
I Don't let intruders in.

Using HPC: Security

1. Keep your password (or private key passphrase) safe.


2. Keep the software on your laptops and PCs up to date.
3. Don't share accounts.
4. Never connect from untrusted machines (e.g. internet cafes).
5. Always use SSH (never rlogin or telnet).
6. Never, ever do "xhost +".

Using HPC: Connecting

I SSH secure protocol only.


Supports login, file transfer, remote desktop...
I The CU cluster currently follows the same pattern as the HPCS
(but you have an extra firewall).
I HPCS allows access from registered IP addresses only.
Almost all Cambridge University addresses already registered.
Connection from home possible via the VPN service
http://www.ucs.cam.ac.uk/vpn
or SSH tunnel through a departmental gateway.

Connecting: Windows Clients

I putty, pscp, psftp


http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
I WinSCP
http://winscp.net/eng/download.php
I TurboVNC (remote desktop, 3D optional)
http://sourceforge.net/projects/turbovnc/files/
I Cygwin (provides an application environment similar to Linux)
http://cygwin.com/install.html
Includes X server for displaying graphical applications running remotely.
I MobaXterm
http://mobaxterm.mobatek.net/

Connecting: Linux/MacOSX/UNIX Clients

I ssh, scp, sftp, rsync


Installed (or installable).
I TurboVNC (remote desktop, 3D optional)
http://sourceforge.net/projects/turbovnc/files/
I On MacOSX, install XQuartz to display remote graphical
applications.
http://xquartz.macosforge.org/landing/

Connecting: Login

I For the CU cluster, replace login.hpc.cam.ac.uk with 192.168.43.182.
I From Linux/MacOSX/UNIX (or Cygwin):
ssh -Y abc123@login.hpc.cam.ac.uk
I From graphical clients:
Host: login.hpc.cam.ac.uk
Username: abc123 (your local account name)
I Don't connect to the head node (darwin.hpc in our case).
I Non-registered addresses will fail with "Connection refused".

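If you connect frequently, an entry in ~/.ssh/config on your own machine saves typing. A sketch (the alias hpcs is made up; abc123 stands for your username):

Host hpcs
    HostName login.hpc.cam.ac.uk
    User abc123
    ForwardX11 yes
    ForwardX11Trusted yes   # together these two give the effect of ssh -Y

After this, "ssh hpcs" is enough.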
Connecting: First time login

I The first connection to a particular hostname produces the following:
The authenticity of host 'login-sand2.hpc.cam.ac.uk (131.111.1.214)'
can't be established.
RSA key fingerprint is
0b:ef:59:90:fb:13:4a:c9:56:82:7b:cd:4b:2b:e1:3b.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'login-sand2.hpc.cam.ac.uk' (RSA) to the
list of known hosts.

I One should always check the fingerprint before typing yes.


I Graphical SSH clients should ask a similar question.
I Designed to detect fraudulent servers.
I Exercise 1 - Log into your HPCS training account.

MobaXterm SSH (Windows)

Connecting: File Transfer

I From Linux/MacOSX/UNIX (or Cygwin):


rsync -av old_directory/ abc123@login.hpc.cam.ac.uk:scratch/new_directory
copies contents of old_directory to /scratch/new_directory.
rsync -av old_directory abc123@login.hpc.cam.ac.uk:scratch/new_directory
copies old_directory (and contents) to
/scratch/new_directory/old_directory.
Rerun to update or resume after interruption.
All transfers are checksummed.
For transfers in the opposite direction, place the remote machine as
the first argument.
I With graphical clients, connect as before and drag and drop.
I Exercise 2 - File transfer.

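For example, to pull results back in the opposite direction (directory names are again just placeholders):

rsync -av abc123@login.hpc.cam.ac.uk:scratch/new_directory/ results/
# copies the contents of the remote new_directory into the local directory results/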
Connecting: Remote Desktop

I First time starting a remote desktop:

[sjr20@login-sand2 ~]$ vncserver

You will require a password to access your desktops.

Password:
Verify:
Would you like to enter a view-only password (y/n)? n

New X desktop is login-sand2:8

Starting applications specified in /home/sjr20/.vnc/xstartup.turbovnc


Log file is /home/sjr20/.vnc/login-sand2:8.log

I For 3D graphics sessions, use login-gfx1.

Connecting: Remote Desktop

I Remote desktop already running:

[sjr20@login-sand2 ~]$ vncserver -list

TurboVNC server sessions:

X DISPLAY # PROCESS ID
:8 12745

I Kill it:

[sjr20@login-sand2 ~]$ vncserver -kill :8


Killing Xvnc process ID 12745

I Typically you only need one remote desktop.


I Keeps running until killed, or the node reboots.

Connecting: Remote Desktop

I To connect to the desktop from Linux:

[sjr20@themis ~]$ vncviewer -via sjr20@login-sand2.hpc.cam.ac.uk localhost:8


Connected to RFB server, using protocol version 3.8
Enabling TightVNC protocol extensions
Performing standard VNC authentication
Password:

I Press F8 to bring up the control panel.


I Exercise 3 - Remote desktop.

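If your VNC viewer has no -via option (e.g. on Windows), the same effect can be obtained with an SSH tunnel. A sketch, assuming the desktop is display :8 (VNC display :N listens on TCP port 5900+N):

ssh -L 5908:localhost:5908 abc123@login-sand2.hpc.cam.ac.uk
# then point the viewer at localhost:8 (i.e. port 5908) on your own machine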
HPCS TurboVNC Session

Linux TurboVNC Control Panel

Connecting: Remote Desktop (MobaXterm)

3D Remote Visualization

I Choose login-gfx1.
I Launch any application requiring 3D (OpenGL) with vglrun.
I May need to adjust the compression level for your network
connection.

Using HPC: User Environment

I The CU cluster is based on the HPCS 22nd April 2015 image.


I Scientific Linux 6.6 (Red Hat Enterprise Linux 6.6 rebuild)
I bash
I GNOME2 desktop (if you want)
I Lustre 2.4.1 (patched), Mellanox OFED 2.4, CUDA 6.5
I But you don't need to know that. (Probably...)

User Environment: Filesystems

I /home/abc123
I 40GB soft quota (45GB hard).
I Visible equally from all nodes.
I Single storage server.
I Backed up nightly to tape.
I Not intended for job outputs or large/many input files.
I /scratch/abc123
I Visible equally from all nodes.
I Larger and faster.
I Intended for job inputs and outputs.
I Not backed up.

Filesystems: Quotas

I quota
====================================================================================
Usage on /home (lfs quota -u abc123 /home):
====================================================================================
Disk quotas for user abc123 (uid 456):
Filesystem kbytes quota limit grace files quota limit grace
/home 24513908 41943040 47185920 - 75364 0 0 -
====================================================================================
Usage on /scratch (lfs quota -u abc123 /scratch):
====================================================================================
Disk quotas for user abc123 (uid 456):
Filesystem kbytes quota limit grace files quota limit grace
/lustre1 5467644384 0 0 - 3864823 0 0 -
...

I Aim to stay below the soft limit (quota). When usage exceeds it, the
kbytes figure is flagged with an asterisk (e.g. *43567687).
I Once over the soft limit, you have 7 days grace to return below.
I When the grace period expires, or you reach the hard limit (limit),
no more data can be written.
I It is important to rectify an out-of-quota condition ASAP.

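To track down what is taking up the space, standard commands are enough (nothing HPCS-specific):

du -sh ~/* | sort -h              # size of each item in your home directory, largest last
du -sh /scratch/abc123/* | sort -h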
Filesystems: Backups

I Tape backups normally commence at 22:00 every night.


I Backups are not an undelete facility - take care when deleting.
I Successful restoration depends on:
I The file having existed long enough to have been backed up at all.
I The last good version existing in a current backup.
I Request restoration as soon as possible, giving the location and exact
time of loss.
I Scratch files are not backed up.

56 of 82
Filesystems: Permissions

I Be careful; if unsure, please ask support@hpc.
I Incorrect permissions can lead to accidental destruction of your data or
account compromise.
I Avoid changing the permissions on your home directory.
I Files under /home are particularly security sensitive.
I It is easy to break passwordless communication between nodes
(a quick check is sketched below).
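A quick way to check (rather than change) the permissions on the security-sensitive locations mentioned above - a sketch only, exact requirements may vary:

ls -ld $HOME $HOME/.ssh              # neither should be group- or world-writable
ls -l $HOME/.ssh/authorized_keys     # should be accessible only by you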

57 of 82
Using HPC: Software

I Free software accompanying Red Hat Enterprise Linux 6 is (or can be)
provided.
I Other software (free and non-free) is available via modules.
I Some proprietary software may not be generally accessible.
I See http://www.hpc.cam.ac.uk/using-clusters/software.
I New software may be provided on request where possible.
I Self-installed software must be properly licensed.

58 of 82
User Environment: Environment Modules

I Modules load or unload additional software packages.
I Some are required and automatically loaded on login.
I Others are optional extras, or possible replacements for other
modules.
I Beware unloading default modules in ~/.bashrc.
I Beware overwriting environment variables such as PATH and
LD_LIBRARY_PATH in ~/.bashrc. If necessary, append or prepend
(see the sketch below).
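A minimal sketch of appending rather than overwriting in ~/.bashrc (the directories are only examples):

# Safe: extends the values already set up by the modules system
export PATH="$PATH:$HOME/bin"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$HOME/lib"
# Unsafe: would discard whatever the modules have already set
# export PATH="$HOME/bin"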

59 of 82
User Environment: Environment Modules

I Currently loaded:

module list
Currently Loaded Modulefiles:
1) dot 6) intel/impi/4.1.3.045 11) default-impi
2) scheduler 7) global
3) java/jdk1.7.0_60 8) intel/cce/12.1.10.319
4) turbovnc/1.1 9) intel/fce/12.1.10.319
5) vgl/2.3.1/64 10) intel/mkl/10.3.10.319

I Available:

module av

60 of 82
User Environment: Environment Modules

I Currently loaded:

module list
Currently Loaded Modulefiles:
1) dot 4) turbovnc/1.1 7) global
2) scheduler 5) vgl/2.3.1/64 8) use.own
3) java/jdk1.7.0_60 6) openmpi/gcc/1.8.6 9) default-ompi

I Available:

module av

60 of 82
User Environment: Environment Modules

I Show:
module show castep/impi/7.0.3
-------------------------------------------------------------------
/usr/local/Cluster-Config/modulefiles/castep/impi/7.0.3:

module-whatis adds CASTEP 7.0.3 (Intel MPI) to your environment

Note that this software is restricted to registered users.

prepend-path PATH /usr/local/Cluster-Apps/castep/impi/7.0.3/bin:/usr/local/...


-------------------------------------------------------------------

I Load:

module load castep/impi/7.0.3

I Unload:

module unload castep/impi/7.0.3

61 of 82
User Environment: Environment Modules

I Purge:

module purge

I Defaults:

module show default-impi


module unload default-impi
module load default-impi-LATEST

I The run-time environment must match the compile-time environment.

62 of 82
User Environment: Compilers

Intel: icc, icpc, ifort (recommended)

icc -O3 -xHOST -ip code.c -o prog


mpicc -O3 -xHOST -ip mpi_code.c -o mpi_prog

GCC: gcc, g++, gfortran

gcc -O3 -mtune=native code.c -o prog


mpicc -cc=gcc -O3 -mtune=native mpi_code.c -o mpi_prog

PGI: pgcc, pgCC, pgf90

pgcc -O3 -tp=sandybridge code.c -o prog


mpicc -cc=pgcc -O3 -tp=sandybridge mpi_code.c -o mpi_prog

Exercise 4: Modules and Compilers

63 of 82
Using HPC: Job Submission

64 of 82
Using HPC: Job Submission

I Compute resources are managed by a scheduler:
SLURM/PBS/SGE/LSF/. . .
I Jobs are submitted to the scheduler,
analogous to submitting jobs to a print queue.

65 of 82
Using HPC: Job Submission

I Jobs are submitted from the login nodes,
which are not themselves managed by the scheduler.
I Jobs may be either non-interactive (batch) or interactive.
I Batch jobs run a shell script on the first of a list of allocated nodes.
I Interactive jobs provide a command line on the first of a list of
allocated nodes.

66 of 82
Using HPC: Job Submission

I The HPCS moved away from Torque (a form of PBS) to SLURM
in February 2014.
I The CU cluster scheduler currently imitates the HPCS clusters;
the single partition is blade (replacing sandybridge/tesla).
I The HPCS dedicates entire nodes to each job;
the job owner receives exclusive access.
I Template submission scripts are available.

67 of 82
Job Submission: Using SLURM or PBS

I SLURM

[abc123@login]$ sbatch slurm_submission_script


Submitted batch job 790299

I PBS (Torque, OpenPBS, PBS Pro)

[abc123@login]$ qsub pbs_submission_script


790299.master.cluster

68 of 82
Job Submission: Show Queue

I SLURM
[abc123@login]$ squeue -u abc123
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
790299 sandybrid Test3 abc123 PD 0:00 8 (Priority)
790290 sandybrid Test2 abc123 R 27:56:10 8 sand-6-[38-40],sand-7-[27-31]

I PBS (Torque, OpenPBS, PBS Pro)


[abc123@login]$ qstat -u abc123
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
-------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
790290.master.cl abc123 tesla Test2 5519 8 32 248000 36:00 R 27:56
790281.master.cl abc123 tesla Test1 31905 4 16 124000 36:00 C 26:17
790299.master.cl abc123 tesla Test3 -- 8 32 248000 36:00 Q --

69 of 82
Job Submission: Show Queue

I SLURM
[abc123@login]$ squeue -u abc123
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
790299 sandybrid Test3 abc123 PD 0:00 8 (Resources)
790290 sandybrid Test2 abc123 R 27:56:10 8 sand-6-[38-40],sand-7-[27-31]

I PBS (Torque, OpenPBS, PBS Pro)


[abc123@login]$ qstat -u abc123
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
-------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
790290.master.cl abc123 tesla Test2 5519 8 32 248000 36:00 R 27:56
790281.master.cl abc123 tesla Test1 31905 4 16 124000 36:00 C 26:17
790299.master.cl abc123 tesla Test3 -- 8 32 248000 36:00 Q --

69 of 82
Job Submission: Monitor Job

I SLURM

[abc123@login]$ scontrol show job=790299

I PBS (Torque, OpenPBS, PBS Pro)

[abc123@login]$ qstat -f 790299

70 of 82
Job Submission: Cancel Job

I SLURM

[abc123@login]$ scancel 790299

I PBS (Torque, OpenPBS, PBS Pro)

[abc123@login]$ qdel 790299

71 of 82
Job Submission: Scripts

I SLURM
See slurm_submit.darwin, slurm_submit.wilkes.
#!/bin/bash
#! Name of the job:
#SBATCH -J darwinjob
#! Which project should be charged:
#SBATCH -A CHANGEME
#! How many whole nodes should be allocated?
#SBATCH --nodes=2
#! How many (MPI) tasks will there be in total? (<= nodes*16)
#SBATCH --ntasks=32
#! How much wallclock time will be required?
#SBATCH --time=02:00:00
#! Select partition:
#SBATCH -p sandybridge
...

I #SBATCH lines are structured comments
that correspond to sbatch command line options.

72 of 82
Job Submission: Scripts

I PBS (Torque, OpenPBS, PBS Pro)


#!/bin/bash
#! Name of the job:
#PBS -N darwinjob
#! Which project should be charged:
#PBS -A CHANGEME
#! How many nodes, cores per node, memory and wall-clock time should be allocated?
#PBS -l nodes=8:ppn=16,mem=512000mb,walltime=02:00:00
#! Select queue:
#PBS -q sandybridge
...

I #PBS lines are structured comments
that correspond to qsub command line options.

73 of 82
Job Submission: Accounting Commands [HPCS]
I How many core hours do I have available?
mybalance

User Usage | Account Usage | Account Limit Available (CPU hrs)
---------- --------- + -------------- --------- + ------------- ---------
abc123 18 | STARS 171 | 100,000 99,829
abc123 18 | STARS-SL2 35 | 101,000 100,965
abc123 925 | BLACKH 10,634 | 166,667 156,033

I How many core hours does some other project or user have?
gbalance -p HALOS

User Usage | Account Usage | Account Limit Available (CPU hrs)
---------- --------- + --------- --------- + ------------- ---------
pq345 0 | HALOS 317,656 | 600,000 282,344
xyz10 11,880 | HALOS 317,656 | 600,000 282,344

(Use -u for user.)

I List all jobs charged to a project/user between certain times:


gstatement -p HALOS -u xyz10 -s "2014-01-01-00:00:00" -e "2014-01-20-00:00:00"
JobID User Account JobName Partition End NCPUS CPUTimeRAW ExitCode State
------------ --------- ---------- -------- ---------- ------------------- ---------- ---------- -------- ----------
14505 xyz10 halos help sandybrid+ 2014-01-07T12:59:40 16 32 0:9 COMPLETED
14506 xyz10 halos help sandybrid+ 2014-01-07T13:00:11 16 48 2:0 FAILED
...

74 of 82
Job Submission: Single Node Jobs

I Serial jobs requiring large memory, or OpenMP codes.

#!/bin/bash
...
#SBATCH --nodes=1
...
export OMP_NUM_THREADS= # For OpenMP across cores.
options=<specific option for multithreading>
$application $options
...

75 of 82
Job Submission: Single Node Jobs

I Serial jobs requiring large memory, or OpenMP codes.

#!/bin/bash
...
#SBATCH --nodes=1
...
export OMP_NUM_THREADS=16 # For OpenMP across 16 cores.
options=<specific option for multithreading>
$application $options
...

75 of 82
Job Submission: Single Node Jobs

I Serial jobs requiring large memory, or OpenMP codes.

#!/bin/bash
...
#SBATCH --nodes=1
...
export OMP_NUM_THREADS=8 # For OpenMP across 8 cores.
options=<specific option for multithreading>
$application $options
...

75 of 82
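Putting the fragments above together, a complete minimal single-node OpenMP submission script might look like the sketch below (the job name, application, project and wall time are placeholders; the partition follows the earlier template):

#!/bin/bash
#SBATCH -J omp_example          # placeholder job name
#SBATCH -A CHANGEME             # project to be charged
#SBATCH -p sandybridge
#SBATCH --nodes=1
#SBATCH --time=01:00:00
export OMP_NUM_THREADS=16       # one thread per core on a 16-core node
./my_openmp_prog                # placeholder application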
Job Submission: MPI Jobs

I Parallel job across multiple nodes.

#!/bin/bash
...
#SBATCH --nodes=4
#SBATCH --ntasks=64 # i.e. 16x4 MPI tasks in total.
...
mpirun -np 64 $application $options
...
I SLURM-aware MPI launches remote tasks via SLURM.
I The template script uses $SLURM_TASKS_PER_NODE to set PPN.

76 of 82
Job Submission: MPI Jobs

I Parallel job across multiple nodes.

#!/bin/bash
...
#SBATCH --nodes=4
#SBATCH --ntasks=32 # i.e. 8x4 MPI tasks in total.
...
mpirun -ppn 8 -np 32 $application $options
...
I SLURM-aware MPI launches remote tasks via SLURM.
I The template script uses $SLURM_TASKS_PER_NODE to set PPN.

76 of 82
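A sketch of how the per-node task count might be derived from $SLURM_TASKS_PER_NODE (its value can take forms such as "8(x4)" when tasks are spread evenly; this is an assumption based on standard SLURM behaviour and the actual template may differ):

ppn=${SLURM_TASKS_PER_NODE%%(*}              # e.g. "8(x4)" becomes "8"
mpirun -ppn $ppn -np $SLURM_NTASKS $application $options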
Job Submission: Hybrid Jobs

I Parallel jobs using both MPI and OpenMP.

#!/bin/bash
...
#SBATCH --nodes=4
#SBATCH --ntasks=32 # i.e. 8x4 MPI tasks in total.
...
export OMP_NUM_THREADS=2 # i.e. 2 threads per MPI task.
mpirun -ppn 8 -np 32 $application $options
...
I This job uses 64 cores (each MPI task splits into 2 OpenMP threads).

77 of 82
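As a quick check of the resource arithmetic above:

4 nodes x 8 MPI tasks per node = 32 MPI tasks
32 MPI tasks x 2 OpenMP threads each = 64 cores in total (16 per node)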
Job Submission: High Throughput Jobs

I Multiple serial jobs across multiple nodes.


I Use srun to launch tasks (job steps) within a job.

#!/bin/bash
...
#SBATCH --nodes=4
...
cd directory_for_job1
srun --exclusive -N 1 -n 1 $application $options_for_job1 > output 2> error &
cd directory_for_job2
srun --exclusive -N 1 -n 1 $application $options_for_job2 > output 2> error &
...
cd directory_for_job64
srun --exclusive -N 1 -n 1 $application $options_for_job64 > output 2> error
wait
I Exercise 5 & 6 - Submitting Jobs.

78 of 82
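Rather than writing out all 64 srun lines by hand, the same pattern can be generated with a shell loop - a sketch only; the per-job directory layout is hypothetical and the per-job options would normally differ:

#!/bin/bash
#SBATCH --nodes=4
for i in $(seq 1 64); do
  cd $HOME/jobs/job$i                      # hypothetical directory for job $i
  srun --exclusive -N 1 -n 1 $application $options > output 2> error &
done
wait                                        # wait for all background job steps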
Job Submission: Interactive [HPCS]

I Compute nodes are accessible via SSH while you have a job
running on them.
I Alternatively, submit an interactive job:
sintr -A MYPROJECT -p sandybridge -N2 -t 2:0:0

I Within the window (screen session):


Launches a shell on the first node (when the job starts).
Graphical applications should display correctly.
Create new shells with ctrl-a c, navigate with ctrl-a n and ctrl-a p.
ssh or srun can be used to start processes on any nodes in the job.
SLURM-aware MPI will do this automatically.

79 of 82
Job Submission: Array Jobs
I This feature varies between SLURM versions.
I http://slurm.schedmd.com/job_array.html
I Used for submitting and managing large sets of similar jobs.
I Each job in the array has the same initial options.
I SLURM
[abc123@login]$ sbatch --array=1-7:2 -A STARS-SL2 submission_script
Submitted batch job 791609
[abc123@login-sand2]$ squeue -u abc123
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
791609_1 sandybrid hpl abc123 R 0:06 1 sand-6-32
791609_3 sandybrid hpl abc123 R 0:06 1 sand-6-37
791609_5 sandybrid hpl abc123 R 0:06 1 sand-6-59
791609_7 sandybrid hpl abc123 R 0:06 1 sand-7-27

I The array elements are 791609_1, 791609_3, 791609_5, 791609_7,
i.e. ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}.
I SLURM_ARRAY_JOB_ID = SLURM_JOBID for the first element.

80 of 82
Job Submission: Array Jobs
I This feature varies between SLURM versions.
I http://slurm.schedmd.com/job_array.html
I Used for submitting and managing large sets of similar jobs.
I Each job in the array has the same initial options.
I SLURM
[abc123@login]$ sbatch --array=1,3,5,7 -A STARS-SL2 submission_script
Submitted batch job 791609
[abc123@login-sand2]$ squeue -u abc123
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
791609_1 sandybrid hpl abc123 R 0:06 1 sand-6-32
791609_3 sandybrid hpl abc123 R 0:06 1 sand-6-37
791609_5 sandybrid hpl abc123 R 0:06 1 sand-6-59
791609_7 sandybrid hpl abc123 R 0:06 1 sand-7-27

I The array elements are 791609_1, 791609_3, 791609_5, 791609_7,
i.e. ${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}.
I SLURM_ARRAY_JOB_ID = SLURM_JOBID for the first element.

80 of 82
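Inside the submission script, each array element can select its own input via $SLURM_ARRAY_TASK_ID - a sketch; the program and file names are illustrative:

#!/bin/bash
#SBATCH --nodes=1
# Each array element processes a different numbered input file.
./my_prog input_${SLURM_ARRAY_TASK_ID}.dat > output_${SLURM_ARRAY_TASK_ID}.log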
Job Submission: Array Jobs (ctd)

I Updates can be applied to specific array elements using
${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}.
I Alternatively, operate on the entire array via
${SLURM_ARRAY_JOB_ID}.
I Some commands still require the SLURM_JOB_ID (sacct, sreport,
sshare, sstat and a few others).
I Exercise 7 - Array Jobs.

81 of 82
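For example, individual elements or the whole array can be cancelled (job IDs as in the earlier example):

scancel 791609_3    # cancel a single array element
scancel 791609      # cancel every element of the array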
Job Submission: Scheduling Top Dos & Don'ts

I Do . . .
I Give reasonably accurate wall times (allows backfilling).
I Check your balance occasionally (mybalance).
I Test on a small scale first.
I Implement checkpointing if possible (reduces resource wastage).

I Don't . . .
I Request more cores than you need;
you will wait longer and use more credits.
I Cancel jobs unnecessarily;
priority increases over time.

82 of 82
