Stuart Rankin
sjr20@cam.ac.uk
High Performance Computing Service (http://www.hpc.cam.ac.uk/)
University Information Services (http://www.uis.cam.ac.uk/)
29th June 2015 / UIS Training for the MRC Cancer Unit
Health and Safety
Welcome
Plan of the Course
Part 1: Basics
Part 2: High Performance Computing Service
Part 3: Using a HPC Facility

11:00-11:30 Practical and break
12:00-13:30 Practical and break for lunch
14:00-14:30 Practical
14:45-15:15 Practical
15:30-CLOSE Further discussion

Part I: Basics
Basics: Outline
Basics: Why Buy a Big Computer?

Basics: Compute Intensive Problems
- Distribute the work across multiple CPUs to reduce the execution time as far as possible.
- Program workload must be parallelised.
  - Parallel programs split into copies (threads).
  - Each thread performs a part of the work on its own CPU, concurrently with the others.
  - A well-parallelised program will fully exercise as many CPUs as it has threads.
- The CPUs may need to exchange data rapidly, using specialized hardware.
- Large systems running multiple parallel jobs also need fast access to storage.
- Many use cases from Physics, Chemistry, Engineering, Astronomy, Biology...
- The traditional domain of HPC and the Supercomputer.

Basics: Scaling & Amdahl's Law

Basics: Amdahl's Law
http://en.wikipedia.org/wiki/File:AmdahlsLaw.svg
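For reference (the standard statement of the law, not preserved on the slide as extracted): if a fraction P of a program's work parallelises perfectly across N CPUs, the overall speedup is

    speedup(N) = 1 / ((1 - P) + P / N)

so even with unlimited CPUs the speedup can never exceed 1 / (1 - P); the serial fraction dominates at scale.
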
The Bottom Line
Basics: Data Intensive Problems

Basics: High Throughput

Basics: Memory Intensive Problems

Basics: Inside a Modern Computer
- Even small computers now have multiple CPU cores per socket
  = each socket contains a Symmetric Multi-Processor (SMP).
- Larger computers have multiple sockets (each with local memory)
  = Non-Uniform Memory Architecture (NUMA).
- CPU cores also have vector (data-parallel) acceleration (SSE/AVX).
- Today's ordinary computer is yesterday's supercomputer (with much of the same complication).

Basics: How to Build a Supercomputer

Basics: Programming a Multiprocessor Machine
- Parallel code
  - Shared memory methods within a node, e.g. pthreads, OpenMP.
  - Distributed memory methods between nodes: Message Passing Interface (MPI).

Basics: Summary

Part II: High Performance Computing Service
A Brief History
The CU Cluster
Other Activities
Recent Developments

HPCS: A Brief History
http://www.top500.org

Darwin 1 (2006-2012)
Darwin 3 (2012) (b) & Wilkes (2013) (f)
Darwin: an InfiniBand CPU Cluster
Wilkes: a Dual-Rail InfiniBand GPU Cluster
HPCS Production Cluster Schematic
The CU Cluster

HPCS: Other Activities
- Hosted clusters, e.g. MRC BSU, Cardiovascular Epidemiology, Whittle Lab.
- Research projects, e.g. Square Kilometre Array, Jaguar Land Rover.
- Integration and consultancy services
- Industrial services

HPCS: Recent Developments

The West Cambridge Data Centre
The West Cambridge Data Centre: Hall 1

Part III: Using a HPC Facility
Security
Connecting
User Environment
Software
Job Submission

Using HPC: Security

Using HPC: Connecting

Connecting: Windows Clients
- MobaXterm
  http://mobaxterm.mobatek.net/

Connecting: Linux/MacOSX/UNIX Clients
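On these systems the standard OpenSSH client is all that is needed. A minimal sketch (the login hostname here is illustrative; use the address given in your welcome email):

# Log in to the cluster with your user ID:
ssh abc123@login.hpc.cam.ac.uk

# Add -X to forward X11, allowing graphical applications to display locally:
ssh -X abc123@login.hpc.cam.ac.uk
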
Connecting: Login

Connecting: First time login

MobaXterm SSH (Windows)

Connecting: File Transfer
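The slide's examples did not survive extraction; as a sketch, the usual tools are scp and rsync (hostname and paths illustrative):

# Copy a local file to your cluster home directory:
scp results.tar.gz abc123@login.hpc.cam.ac.uk:

# rsync suits large or repeated transfers: it shows progress and skips unchanged files.
rsync -av --progress data/ abc123@login.hpc.cam.ac.uk:/scratch/abc123/data/
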
Connecting: Remote Desktop
Password:
Verify:
Would you like to enter a view-only password (y/n)? n
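Those prompts appear the first time a VNC session is created. A sketch of starting one on the login node (the geometry value is just an example):

# Start a TurboVNC session; note the :N display number it reports.
vncserver -geometry 1440x900
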
Connecting: Remote Desktop
- List your sessions (vncserver -list):
  X DISPLAY #   PROCESS ID
  :8            12745
- Kill it:
  vncserver -kill :8

HPCS TurboVNC Session

Linux TurboVNC Control Panel

Connecting: Remote Desktop (MobaXterm)

3D Remote Visualization
- Choose login-gfx1.
- Launch any application requiring 3D (OpenGL) with vglrun.
- May need to adjust the compression level for your network connection.
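For example, inside the VNC desktop (glxgears is merely a convenient OpenGL test program):

# Run an OpenGL application through VirtualGL so it renders on the server GPU:
vglrun glxgears
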
Using HPC: User Environment
CUDA 6
- But you don't need to know that. (Probably...)

User Environment: Filesystems
- /home/abc123
  - 40GB soft quota (45GB hard).
  - Visible equally from all nodes.
  - Single storage server.
  - Backed up nightly to tape.
  - Not intended for job outputs or large/many input files.
- /scratch/abc123
  - Visible equally from all nodes.
  - Larger and faster.
  - Intended for job inputs and outputs.
  - Not backed up.
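In practice that means jobs should run from /scratch. A minimal sketch of the pattern (paths are illustrative; $application is the placeholder used throughout these slides):

# Work in scratch, not home:
mkdir -p /scratch/abc123/myjob && cd /scratch/abc123/myjob
cp ~/inputs/config.dat .     # small input file kept in backed-up home
$application config.dat      # job reads and writes here, not in /home
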
Filesystems: Quotas
- quota
====================================================================================
Usage on /home (lfs quota -u abc123 /home):
====================================================================================
Disk quotas for user abc123 (uid 456):
Filesystem     kbytes     quota     limit grace  files quota limit grace
/home        24513908  41943040  47185920     -  75364     0     0     -
====================================================================================
Usage on /scratch (lfs quota -u abc123 /scratch):
====================================================================================
Disk quotas for user abc123 (uid 456):
Filesystem       kbytes quota limit grace   files quota limit grace
/lustre1     5467644384     0     0     - 3864823     0     0     -
...
- An asterisk against the kbytes figure (e.g. *43567687 on /home) means the soft quota has been exceeded; once the grace period expires or the hard limit is hit, further writes fail.

Filesystems: Backups

Filesystems: Permissions

Using HPC: Software

User Environment: Environment Modules
- Currently loaded:
  module list
  Currently Loaded Modulefiles:
    1) dot                6) intel/impi/4.1.3.045   11) default-impi
    2) scheduler          7) global
    3) java/jdk1.7.0_60   8) intel/cce/12.1.10.319
    4) turbovnc/1.1       9) intel/fce/12.1.10.319
    5) vgl/2.3.1/64      10) intel/mkl/10.3.10.319
  (With the OpenMPI stack loaded instead, the list shows e.g. openmpi/gcc/1.8.6 and default-ompi.)
- Available:
  module av

User Environment: Environment Modules
- Show:
  module show castep/impi/7.0.3
  -------------------------------------------------------------------
  /usr/local/Cluster-Config/modulefiles/castep/impi/7.0.3:
- Load:
  module load castep/impi/7.0.3
- Unload:
  module unload castep/impi/7.0.3

User Environment: Environment Modules
- Purge:
  module purge
- Defaults:
  module load default-impi (for example; default-ompi selects the OpenMPI stack instead)

User Environment: Compilers
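The slide's examples were lost in extraction; as a sketch, with the Intel modules listed earlier loaded, compilation typically looks like this (source file names are illustrative):

icc -O3 -o myprog myprog.c       # Intel C compiler (module intel/cce)
ifort -O3 -o myprog myprog.f90   # Intel Fortran compiler (module intel/fce)
mpicc -O3 -o mympi mympi.c       # MPI compiler wrapper for C code
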
Using HPC: Job Submission

Job Submission: Using SLURM or PBS
- SLURM
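The command examples on this slide did not survive extraction; the basic SLURM submission command, using the template script named later, is:

# Submit a batch script to the scheduler; SLURM prints the new job's ID:
sbatch slurm_submit.darwin
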
Job Submission: Show Queue
- SLURM
[abc123@login]$ squeue -u abc123
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
790299 sandybrid Test3 abc123 PD 0:00 8 (Priority)
790290 sandybrid Test2 abc123 R 27:56:10 8 sand-6-[38-40],sand-7-[27-31]
- The REASON column shows why a pending job is waiting, e.g. (Priority) or (Resources).

Job Submission: Monitor Job
- SLURM
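The original examples were lost; standard SLURM commands for inspecting a job (using the job ID from the queue listing above) include:

# Show one job's queue entry:
squeue -j 790290

# Full details of a pending or running job:
scontrol show job 790290
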
Job Submission: Cancel Job
- SLURM
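Again the slide's example was lost; the standard SLURM command is scancel:

# Cancel a job by ID:
scancel 790299

# Cancel all of your own jobs:
scancel -u abc123
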
Job Submission: Scripts
- SLURM: see slurm_submit.darwin, slurm_submit.wilkes.
#!/bin/bash
#! Name of the job:
#SBATCH -J darwinjob
#! Which project should be charged:
#SBATCH -A CHANGEME
#! How many whole nodes should be allocated?
#SBATCH --nodes=2
#! How many (MPI) tasks will there be in total? (<= nodes*16)
#SBATCH --ntasks=32
#! How much wallclock time will be required?
#SBATCH --time=02:00:00
#! Select partition:
#SBATCH -p sandybridge
...

Job Submission: Accounting Commands [HPCS]
- How many core hours do I have available?
  mybalance
- How many core hours does some other project or user have?
  gbalance -p HALOS

Job Submission: Single Node Jobs
#!/bin/bash
...
#SBATCH --nodes=1
...
export OMP_NUM_THREADS=16 # For OpenMP across 16 cores.
options=<specific option for multithreading>
$application $options
...

Job Submission: MPI Jobs
#!/bin/bash
...
#SBATCH --nodes=4
#SBATCH --ntasks=64 # i.e. 16x4 MPI tasks in total.
...
mpirun -np 64 $application $options
...
- To underpopulate nodes, request fewer tasks and set the per-node count explicitly, e.g. --ntasks=32 with mpirun -ppn 8 -np 32.
- SLURM-aware MPI launches remote tasks via SLURM.
- The template script uses $SLURM_TASKS_PER_NODE to set PPN.

Job Submission: Hybrid Jobs
#!/bin/bash
...
#SBATCH --nodes=4
#SBATCH --ntasks=32 # i.e. 8x4 MPI tasks in total.
...
export OMP_NUM_THREADS=2 # i.e. 2 threads per MPI task.
mpirun -ppn 8 -np 32 $application $options
...
- This job uses 64 cores (each MPI task splits into 2 OpenMP threads).

Job Submission: High Throughput Jobs
#!/bin/bash
...
#SBATCH --nodes=4
...
cd directory_for_job1
srun --exclusive -N 1 -n 1 $application $options_for_job1 > output 2> error &
cd directory_for_job2
srun --exclusive -N 1 -n 1 $application $options_for_job2 > output 2> error &
...
cd directory_for_job64
srun --exclusive -N 1 -n 1 $application $options_for_job64 > output 2> error &
wait
- Each srun is backgrounded with & so the 64 single-core tasks run concurrently across the 4 nodes; wait holds the job open until they all finish.
- Exercise 5 & 6 - Submitting Jobs.

Job Submission: Interactive [HPCS]
- Compute nodes are accessible via SSH while you have a job running on them.
- Alternatively, submit an interactive job:
  sintr -A MYPROJECT -p sandybridge -N2 -t 2:0:0

Job Submission: Array Jobs
- This feature varies between versions.
- http://slurm.schedmd.com/job_array.html
- Used for submitting and managing large sets of similar jobs.
- Each job in the array has the same initial options.
- SLURM accepts index lists, ranges and strides: --array=1,3,5,7 is equivalent to --array=1-7:2.
[abc123@login]$ sbatch --array=1,3,5,7 -A STARS-SL2 submission_script
Submitted batch job 791609
[abc123@login-sand2]$ squeue -u abc123
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
791609_1 sandybrid hpl abc123 R 0:06 1 sand-6-32
791609_3 sandybrid hpl abc123 R 0:06 1 sand-6-37
791609_5 sandybrid hpl abc123 R 0:06 1 sand-6-59
791609_7 sandybrid hpl abc123 R 0:06 1 sand-7-27
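Inside the script, each array element can select its own work item via the SLURM_ARRAY_TASK_ID environment variable. A minimal sketch (the input_N directory layout is illustrative; $application and $options are the placeholders used above):

#!/bin/bash
#SBATCH -J myarray
#SBATCH -A CHANGEME
#SBATCH --nodes=1
#SBATCH --time=01:00:00
#SBATCH -p sandybridge

# Submitted with: sbatch --array=1,3,5,7 this_script
# Each element of the array processes its own input directory:
cd input_${SLURM_ARRAY_TASK_ID}
$application $options
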
Job Submission: Array Jobs (ctd)

Job Submission: Scheduling Top Dos & Don'ts
- Do...
  - Give reasonably accurate wall times (allows backfilling).
  - Check your balance occasionally (mybalance).
  - Test on a small scale first.
  - Implement checkpointing if possible (reduces resource wastage).
- Don't...
  - Request more cores than you need: you will wait longer and use more credits.
  - Cancel jobs unnecessarily: priority increases over time.