
MULTI GPU TRAINING

PART 3: ENGINEERING CHALLENGES OF MULTI GPU TRAINING


Speaker, Date
COURSE AGENDA

9:00 am — 9:30 am Welcome Presenter Name


9:30 am — 11:30 am Part 1: Theory of Data Parallelism Presenter Name
11:30 am — 12:30 pm Break All
12:30 pm — 2:30 pm Part 2: Algorithmic Challenges of Multi GPU training Presenter Name
2:30 pm — 3:30 pm Break All
3:30 pm — 5:00 pm Part 3: Engineering Challenges of Multi GPU training Presenter Name
• Keeping up with the GPU (storage, networking, CPU, PCIe, memory)
• Job Scheduling
• Overview of the wider AI system design
• Other

LAB AGENDA

KEEPING UP WITH THE GPU
• Data Input Pipeline
• Storage
• Networking
• Augmentation
• Communication
• Reference Architecture
• Other
DATA INPUT PIPELINE
DATA INPUT PIPELINE
Overview

[Diagram: multiple parallel Download → Decode → Augment pipelines feed a shared batch queue; augmentation examples include a 352x352 random crop, transformations (e.g. flip, colour), and added noise]
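The overlapped pipeline above (download → decode → augment feeding a batch queue) can be sketched in plain Python. This is a minimal illustration assuming toy integer "samples" and trivial stage functions, not a real data loader:

```python
import queue
import threading

def run_pipeline(num_samples, batch_size, queue_depth=4):
    """Toy input pipeline: a worker thread downloads/decodes/augments
    samples and pushes finished batches into a bounded queue that the
    training loop drains, so data preparation overlaps with compute."""
    batches = queue.Queue(maxsize=queue_depth)  # bounded: back-pressure

    def producer():
        batch = []
        for i in range(num_samples):
            sample = i            # "download"
            sample = sample * 2   # "decode"
            sample = sample + 1   # "augment" (crop/flip would go here)
            batch.append(sample)
            if len(batch) == batch_size:
                batches.put(batch)
                batch = []
        batches.put(None)  # sentinel: end of epoch (partial batch dropped)

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while True:
        batch = batches.get()  # training loop consumes ready batches
        if batch is None:
            break
        results.append(sum(batch))  # stand-in for a training step
    return results
```

The bounded queue is the key design choice: it lets the producer run ahead of the consumer, but only by `queue_depth` batches, which caps memory use.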
DATA INPUT PIPELINE
Simple pipeline

Even when implemented correctly, you will need a Broadwell-class CPU just to saturate an 8-GPU Pascal node when training ResNet-50 on ImageNet.
OPTIMIZING THE I/O PIPELINE
Overlapping data communication
INCREASING I/O CHALLENGES
GPU speed continues to increase (not matched by the CPU)
COMPUTE TO DATA RATIO
More compute equals more time

• The more compute you have to execute on a unit of data, the more time you have to deliver a new sample

• Model choices are frequently requirements-driven (e.g. a self-driving car might not be able to use a large model, as it has a strict latency, compute, and power budget)
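The compute-to-data ratio can be made concrete: the input bandwidth a node must sustain is simply sample throughput times sample size. The figures below are illustrative assumptions, not measurements:

```python
def required_input_bandwidth_gbs(samples_per_sec, bytes_per_sample):
    """Sustained input bandwidth (GB/s) needed to keep the GPUs fed."""
    return samples_per_sec * bytes_per_sample / 1e9

# Assumed figures: ~3000 images/s across 8 GPUs, ~110 KB per JPEG.
bw = required_input_bandwidth_gbs(3000, 110_000)  # ~0.33 GB/s
# A smaller model consumes samples faster, so for the same hardware the
# bandwidth requirement rises even though the model is "cheaper".
```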
PROFILING YOUR MODEL
CNTK training ResNet-50 on DGX-1V
DOES LATENCY MATTER?
DOES THROUGHPUT MATTER?
ROOFLINE ANALYSIS
Understand your constraints
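Roofline analysis makes the constraint question quantitative: attainable throughput is the lesser of the compute roof and the memory roof at a given arithmetic intensity. The peak numbers below are assumptions for illustration only:

```python
def roofline(peak_flops, mem_bw_bytes, arithmetic_intensity):
    """Classic roofline model: attainable FLOP/s is capped either by
    peak compute or by memory traffic (bandwidth * FLOP-per-byte),
    whichever roof is lower at this arithmetic intensity."""
    return min(peak_flops, mem_bw_bytes * arithmetic_intensity)

# Assumed machine: 125 TFLOP/s peak, 900 GB/s memory bandwidth.
low = roofline(125e12, 900e9, 10)    # memory bound: 9e12 FLOP/s
high = roofline(125e12, 900e9, 200)  # compute bound: 125e12 FLOP/s
```

Kernels left of the ridge point are memory bound and gain nothing from more FLOPS; kernels right of it are compute bound and gain nothing from more bandwidth.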
DATA INPUT PIPELINE
Overview

[Diagram repeated from earlier: parallel Download → Decode → Augment pipelines feeding a shared batch queue; augmentation examples include a 352x352 random crop, transformations (e.g. flip, colour), and added noise]
AUGMENTATION
Consuming computational resources

• Is critical to a wide range of problems

• Will consume computational resources (which, as we have discussed, is not always a problem)

• Can rarely be done as a pre-processing step

• Can be migrated or partially migrated to the GPU (doing so consumes GPU compute and memory, so there is a trade-off)
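As a concrete toy example of CPU-side augmentation work, here is a random crop followed by a random horizontal flip, with the image modelled as a list of rows. Real pipelines run this (or a GPU equivalent) for every sample, every epoch:

```python
import random

def random_crop_and_flip(image, crop_h, crop_w, rng=None):
    """Random crop to (crop_h, crop_w), then a 50% chance of a
    horizontal flip. `image` is a list of rows (lists of pixels)."""
    rng = rng or random.Random()
    top = rng.randrange(len(image) - crop_h + 1)
    left = rng.randrange(len(image[0]) - crop_w + 1)
    crop = [row[left:left + crop_w] for row in image[top:top + crop_h]]
    if rng.random() < 0.5:          # horizontal flip
        crop = [row[::-1] for row in crop]
    return crop
```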
OPTIMIZE I/O
Take action

• Optimize I/O
  • Use the deep learning framework's own data storage and loading mechanism:
    • Multithreaded
    • Highly optimised storage format
    • Highly optimised code
  • Use fast decoding libraries (for images, consider libjpeg-turbo)
  • Make sure you are using the system caching mechanism
  • Be cautious about third-party libraries (for example, using OpenCV does not let you optimise performance)
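The "highly optimised storage format" point is easy to illustrate: instead of millions of small files, pack samples into one length-prefixed record stream that can be read sequentially (the idea behind TFRecord/RecordIO-style formats). A minimal sketch, not a production format:

```python
import struct

def write_recordfile(path, samples):
    """Write each sample (bytes) as a little-endian uint32 length prefix
    followed by the payload: one large sequential file instead of many
    small ones, which is far friendlier to storage and the page cache."""
    with open(path, "wb") as f:
        for s in samples:
            f.write(struct.pack("<I", len(s)))
            f.write(s)

def read_recordfile(path):
    """Stream the records back in order with purely sequential reads."""
    out = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            (n,) = struct.unpack("<I", header)
            out.append(f.read(n))
    return out
```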
OPTIMIZE AUGMENTATION
Take action

• Optimize augmentation
  • Understand the performance of your augmentation pipeline
  • Optimise the parts of the code that matter
  • If necessary, consider augmentation on the GPU
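Optimising the parts of the code that matter starts with measuring them. A minimal per-stage timer (a sketch; a real profiler such as cProfile or nvprof gives far more detail):

```python
import time

def time_stage(fn, *args, repeats=1000):
    """Average wall-clock seconds per call of one pipeline stage."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats

# Compare stages to find the bottleneck before optimising anything:
decode_cost = time_stage(lambda: sum(range(1000)))
augment_cost = time_stage(lambda: None)
```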


NON TRIVIAL IMPLICATIONS
We will observe them during the Lab
OPTIMIZE STORAGE
Handling large, read-dominated DL datasets

• DNN datasets are large
• Read-dominated at the beginning of each epoch (dataset pass-through)
• Large groups of random reads, repeated over and over
• Recommend an NFS appliance for long-term storage
• DGX-1 SSDs in RAID 0 as an NFS cache
• Option to copy all data to each node at the beginning of a job
DATA STORAGE ORGANISATION
Multi-tier approach

Level 4 (cloud): 50PB+ cold storage for archival [all raw data ever collected]
Level 3 (cloud): 10PB+ highly available, replicated storage [all active raw data]
Level 2 (data center): 5PB low-bandwidth storage [large-scale data cache]
Level 1 (data center): 100TB high-bandwidth storage [labeled data cache]
Level 0 (in DGX-1): 7TB SSD high-performance storage [training data cache]
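The tiering above behaves like a read-through cache hierarchy: a sample is served from the fastest tier that holds it and promoted into faster tiers as it is read. A toy sketch, with each tier modelled as a set of sample ids:

```python
def locate(sample_id, tiers):
    """Walk the storage tiers from fastest (index 0) to slowest,
    promoting the sample into every faster tier on a hit
    (read-through caching). Returns the depth it was found at."""
    for depth, tier in enumerate(tiers):
        if sample_id in tier:
            for faster in tiers[:depth]:  # promote into faster tiers
                faster.add(sample_id)
            return depth
    raise KeyError(sample_id)
```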
LAB
Data Input Pipeline

- Overall impact of correct input pipeline

- Impact of inefficiencies:
- File format and caching

- Decode logic

- Augmentation logic
IN NODE COMMUNICATION
DL DATA PARALLELISM – PCIE BASED
[Diagram: two CPUs connected by a QPI link; each CPU hosts a PCIe switch attaching four GPUs (0–3 and 4–7)]

Data loading over PCIe

Gradient averaging over PCIe and QPI

Data loading and gradient averaging share communication resources: congestion


COMPETING FOR RESOURCES
Degradation of performance

• Data delivery

• Weight exchange between the GPUs

• Weight exchange between the nodes


DL DATA PARALLELISM – NVLINK
[Diagram: the same two-CPU PCIe topology, with the eight GPUs additionally interconnected by NVLink]

Data loading over PCIe

Gradient averaging over NVLink

No sharing of communication resources: no congestion


SUBSTANTIALLY MORE BANDWIDTH
NVLink operates at 300 GB/s

• For data-parallel implementations, gradients are exchanged over NVLink

• For model-parallel implementations, activations are exchanged over NVLink

• PCIe is reserved for data delivery (and, for multi-node training, for external communication)
PERFORMANCE
Intra-node performance

[Chart: AllReduce bandwidth (OMB, size=128MB, in GB/s), 0–60 GB/s, comparing 4-GPU QPI, 4-GPU CPU, and 4-GPU PCIe configurations with DGX-1; the NVLink-based DGX-1 delivers the highest bandwidth]
NVSWITCH: ALL-TO-ALL CONNECTIVITY

[Diagram: 16 GPUs (0–15) fully connected through an NVSwitch fabric]
16-WAY ALL-REDUCE PERFORMANCE

• 8x smaller packet size with the same performance
• 4x higher performance for a given packet size
EVALUATION
Calculating the bandwidth requirements

efficiency = (Tf + Tb) / (Tf + max(Tb, M/B))

• Tf – time required for the forward pass
• Tb – time required for the backward pass
• M – amount of data that needs transporting
• B – available bandwidth
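The formula translates directly into code: communication is free as long as the gradient exchange hides under the backward pass. The workload numbers in the example are illustrative assumptions (roughly ResNet-50-sized gradients):

```python
def scaling_efficiency(t_f, t_b, m_bytes, b_bytes_per_s):
    """efficiency = (Tf + Tb) / (Tf + max(Tb, M/B)): as long as M/B fits
    under Tb, the exchange overlaps with backward compute entirely."""
    return (t_f + t_b) / (t_f + max(t_b, m_bytes / b_bytes_per_s))

# Assumed per-iteration times: 20 ms forward, 40 ms backward, 100 MB gradients.
full = scaling_efficiency(0.020, 0.040, 100e6, 5e9)  # M/B = 20 ms <= Tb -> 1.0
slow = scaling_efficiency(0.020, 0.040, 100e6, 1e9)  # M/B = 100 ms     -> 0.5
```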
EVALUATION
Calculating the bandwidth requirements

efficiency = (Tf + Tb) / (Tf + max(Tb, M/B))

[Chart: efficiency (0 to 1.2) vs synchronization time; efficiency stays at 1 while M/B ≤ Tb, then degrades as synchronization time grows]
IMPLICATIONS
Requirements highly dependent on the workload
10X PERFORMANCE GAIN IN LESS THAN A YEAR

Time to train (days):
• DGX-1 with V100 (Sep '17): 15 days
• DGX-2 (Q3 '18): 1.5 days, 10 times faster

Includes software improvements across the stack: NCCL, cuDNN, etc.

Workload: FairSeq, 55 epochs to solution. PyTorch training performance.
DOES THROUGHPUT MATTER?
DOES LATENCY MATTER?
COMMUNICATION REQUIREMENTS
Latency vs Throughput
BETWEEN NODE COMMUNICATION

MULTI-NODE DGX-1
Networking Topology

• Ingest data as fast as possible
• Pass data rapidly between nodes across the cluster
• Similar to HPC networking architecture
• InfiniBand = ultra-high bandwidth, low latency, collision free
• Two-tier network with root and leaf switches
• Any-to-any connectivity with full bi-section bandwidth and minimum contention between nodes
MULTI-NODE DGX-1 SMALL CLUSTER
Up to 12 DGX-1 nodes

• Assume growth for up to 12 nodes
• 2 racks, 2 IB switches (36 ports)
• 19.2 kW per rack, but can split across racks if necessary
• Full bi-section bandwidth for each group of 6 nodes
• 2:1 oversubscription between groups of 6
MULTI-NODE DGX-1 MEDIUM CLUSTER
Up to 36 DGX-1 nodes

• Defines a DGX-1 "POD"
• Can be replicated for greater scale, e.g. the large cluster configuration
• 6 racks, 6 nodes per rack
• Larger IB director switch (216 ports), with capacity for more pods via unused ports

[Diagram legend: Main Ethernet, Management Ethernet]
MULTI-NODE DGX-1 LARGE CLUSTER
Up to 144 DGX-1 nodes (4 "PODs")

• Implements 4 DGX-1 pods
• Distributed across 24 racks
• Full bi-section bandwidth within a pod, 2:1 between pods
• Training jobs ideally scheduled within a pod, to minimize inter-pod traffic
DGX-1 DEEP LEARNING DATA CENTER
Reference Architecture

• Full system design
• Login and management servers
• Storage and networking

[Diagram legend: Main Ethernet, Management Ethernet]
COMMUNICATION SOFTWARE

DESIGN
What is NCCL?

Optimized collective communication library between CUDA devices. Implements Allreduce, Reduce, Broadcast, Reduce-scatter, Allgather.

Easy to integrate into any DL framework, as well as traditional HPC apps using MPI.

Runs on the GPU using asynchronous CUDA kernels, for faster access to GPU memory, parallel reductions, NVLink usage.

Operates on CUDA pointers. Operations are tied to a CUDA stream.

Uses as few threads as possible to permit other computation to progress simultaneously.
NCCL
Architecture

[Stack diagram, top to bottom:]
Deep Learning Frameworks (Caffe, Caffe2, Torch, TF, MXNET, CNTK)
NCCL | CUDNN | CUBLAS
CUDA
NVIDIA GPUs
DESIGN
Rings

NCCL uses rings to move data across all GPUs and perform reductions.

PCIe / QPI: 1 unidirectional ring
DGX-1: 4 unidirectional rings


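The ring pattern can be simulated in plain Python to see why it works: a reduce-scatter phase followed by an allgather phase, with each rank exchanging one chunk per step with its ring neighbour. This is a toy single-threaded model of the data movement, not NCCL's actual implementation:

```python
def ring_allreduce(buffers):
    """Toy simulation of ring AllReduce on n 'GPUs', each holding a list
    of numbers. After reduce-scatter, each rank owns the full sum of one
    chunk; allgather then circulates those chunks so every rank ends
    with the complete elementwise sum. 2*(n-1) neighbour exchanges."""
    n = len(buffers)
    assert len(buffers[0]) % n == 0, "toy version: length divisible by n"
    chunk = len(buffers[0]) // n
    data = [list(b) for b in buffers]

    def idx(c):
        c %= n
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. In step s, rank r sends its running sum
    # of chunk (r - s) to rank r+1, which accumulates it.
    for step in range(n - 1):
        for r in range(n):
            dst = (r + 1) % n
            for i in idx(r - step):
                data[dst][i] += data[r][i]

    # Phase 2: allgather. The fully reduced chunks travel once around
    # the ring, overwriting stale copies.
    for step in range(n - 1):
        for r in range(n):
            dst = (r + 1) % n
            for i in idx(r + 1 - step):
                data[dst][i] = data[r][i]
    return data
```

Each rank only ever talks to its neighbour, so the algorithm's bandwidth cost per rank is independent of the number of GPUs, which is why rings scale well over NVLink.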
DESIGN
Kernels

[Diagram: within the ring, each GPU's kernel receives data from the previous GPU into a FIFO, reduces it with its local sendbuff, and forwards the result to the next GPU, finally writing into recvbuff]
NCCL 2.0
Inter-node communication

Inter-node communication using Sockets or InfiniBand verbs, with multi-rail support, topology detection, and automatic use of GPU Direct RDMA.

Optimal combination of NVLink, PCIe, and network interfaces to maximize bandwidth and create rings across nodes.

PCIe + InfiniBand | DGX-1: NVLink + 4x InfiniBand
PERFORMANCE
Inter-node performance

[Chart: AllReduce bandwidth (OMB, size=128MB, in GB/s), 0–45 GB/s, comparing MPI, Baidu Allreduce, and NCCL on 2 nodes x 4 GPUs (IB EDR, PCIe switch) and 4 nodes x 8 GPUs (DGX-1: 4x IB EDR, 4x NVLink); NCCL achieves the highest bandwidth]
PERFORMANCE
Deep Learning – CNTK

[Chart: CNTK ResNet-50 scaling (images/s) vs number of GPUs (0–32) for Ideal, MPI, and NCCL; NCCL tracks ideal scaling closely, with the top configuration reaching 6569 images/s at 32 GPUs, while MPI lags behind]


LAB
In node communication

- NVLINK + NCCL TensorFlow example

- In the next Lab we will see how to use Horovod to hide the explicit use of NCCL
REFERENCE ARCHITECTURE
BALANCED HARDWARE
DGX-1 as a reference point for solution design
MULTI-NODE SCALING WHITEPAPER

Use this asset to aid the design process, ensuring you develop an optimized architecture for your multi-node cluster, following NVIDIA best practices learned from our customer deployments and our own DGX SATURNV.
NGC – EFFICIENT SOFTWARE

Regardless of the hardware solution you choose, whether in the datacenter or in the cloud, make sure you use NGC. Every month, free of charge, we provide a set of Docker containers with the latest, highly optimized versions of all of the major deep learning frameworks.
INNOVATE IN MINUTES, NOT WEEKS WITH DEEP LEARNING CONTAINERS

Benefits of containers:
• Simplify deployment of GPU-accelerated applications, eliminating time-consuming software integration work
• Isolate individual frameworks or applications
• Share, collaborate, and test applications
END-TO-END PRODUCT FAMILY

TRAINING
• Fully integrated DL supercomputer: DGX Family (DGX Station, DGX-1, DGX-2, Cloud Service Provider)
• Desktop: TITAN V
• Data center: Tesla P100/V100

INFERENCE
• Data center: Tesla P4, Tesla V100, Tesla P100
• Automotive: Drive PX2
• Embedded: Jetson TX2
MANAGING RESOURCES

JOB SCHEDULING
The role of a job scheduler

Reuther, Albert, et al. "Scalable system scheduling for HPC and big data." Journal of Parallel and Distributed Computing 111 (2018): 76-92.
JOB SCHEDULING
Feature comparison (this is not an exhaustive list)

Reuther, Albert, et al. "Scalable system scheduling for HPC and big data." Journal of Parallel and Distributed Computing 111 (2018): 76-92.
NGC VISION
AI platform of choice
MORE THAN JOB SCHEDULING
Wider set of Machine Learning activities (Uber’s Michelangelo)

https://eng.uber.com/michelangelo/
DL IS AN HPC WORKLOAD
HPC expertise is important for success

It makes sense to build an AI team and a separate systems/HPC team and have the two teams sit next to each other.

That is because solving some of the problems discussed in this lecture requires very specialised systems/HPC knowledge, and it is incredibly difficult for any single person to acquire both the AI and the systems/HPC knowledge.

For a detailed discussion, see Andrew Ng's "Nuts and Bolts of Applying Deep Learning": https://www.youtube.com/watch?v=F1ka6a13S9I&t=120s
OTHER
"A-HA" MOMENTS IN DL CLUSTER DESIGN
Additional design insights to get you started

Overall Cluster:
• HPC similar to DL
• HPC expertise can help in design
• Even with HPC, the similarities are limited

Rack Design:
• DL drives close to operational limits; assume less headroom
• Proper airflow is crucial to cluster performance

Networking:
• Like HPC, InfiniBand is preferred
• Requires high bandwidth, low latency
• Maximize per-node IB connections

Storage:
• DGX-1 read cache is critical
• Datasets range from 10k's to millions of objects
• Terabyte levels of storage, with large variance

Facilities:
• GPU data center operates at near-max power
• Assume higher watts per rack
• Dramatically higher FLOPS/watt = floor space saved

Software:
• Scale requires "cluster-aware" software
• NCCL2 = GPU/multi-node acceleration
• Automatic topology detection
• DL framework optimizations
TALK TO US
www.nvidia.com/dli
