
MULTI GPU TRAINING

PART 3: ENGINEERING CHALLENGES OF MULTI GPU TRAINING


Speaker, Date
COURSE AGENDA

9:00 am — 9:30 am Welcome Presenter Name


9:30 am — 11:30 am Part 1: Theory of Data Parallelism Presenter Name
11:30 am — 12:30 pm Break All
12:30 pm — 2:30 pm Part 2: Algorithmic Challenges of Multi GPU training Presenter Name
2:30 pm — 3:30 pm Break All
3:30 pm — 5:00 pm Part 3: Engineering Challenges of Multi GPU training Presenter Name
• Keeping up with the GPU (storage, networking, CPU, PCIe, memory)
• Job Scheduling
• Overview of the wider AI system design
• Other

LAB AGENDA

KEEPING UP WITH THE GPU
• Data Input Pipeline
• Storage
• Networking
• Augmentation
• Communication
• Reference Architecture
• Other
DATA INPUT PIPELINE
DATA INPUT PIPELINE
Overview

[Diagram: multiple parallel Download → Decode → Augment pipelines feed a shared batch queue; augmentation examples include a 352x352 random crop, transformations (e.g. flip, colour), and added noise]
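The overlapped pipeline above (download → decode → augment feeding a batch queue) can be sketched in plain Python. This is a minimal illustration assuming toy integer "samples" and trivial stage functions, not a real data loader:

```python
import queue
import threading

def run_pipeline(num_samples, batch_size, queue_depth=4):
    """Toy input pipeline: a worker thread downloads/decodes/augments
    samples and pushes finished batches into a bounded queue that the
    training loop drains, so data preparation overlaps with compute."""
    batches = queue.Queue(maxsize=queue_depth)  # bounded: back-pressure

    def producer():
        batch = []
        for i in range(num_samples):
            sample = i            # "download"
            sample = sample * 2   # "decode"
            sample = sample + 1   # "augment" (crop/flip would go here)
            batch.append(sample)
            if len(batch) == batch_size:
                batches.put(batch)
                batch = []
        batches.put(None)  # sentinel: end of epoch (partial batch dropped)

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while True:
        batch = batches.get()  # training loop consumes ready batches
        if batch is None:
            break
        results.append(sum(batch))  # stand-in for a training step
    return results
```

The bounded queue is the key design choice: it lets the producer run ahead of the consumer, but only by `queue_depth` batches, which caps memory use.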
DATA INPUT PIPELINE
Simple pipeline

Even when implemented correctly, you will need a Broadwell-class CPU just to saturate an 8-GPU Pascal node when training ResNet-50 on ImageNet.
OPTIMIZING THE I/O PIPELINE
Overlapping data communication
INCREASING I/O CHALLENGES
GPU speed continues to increase (not matched by the CPU)
COMPUTE TO DATA RATIO
More compute equals more time

• The more compute you have to execute on a unit of data, the more time you have to deliver a new sample

• Model choices are frequently requirements-driven (e.g. a self-driving car might not be able to use a large model, as it has a strict latency, compute, and power budget)
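The compute-to-data ratio can be made concrete: the input bandwidth a node must sustain is simply sample throughput times sample size. The figures below are illustrative assumptions, not measurements:

```python
def required_input_bandwidth_gbs(samples_per_sec, bytes_per_sample):
    """Sustained input bandwidth (GB/s) needed to keep the GPUs fed."""
    return samples_per_sec * bytes_per_sample / 1e9

# Assumed figures: ~3000 images/s across 8 GPUs, ~110 KB per JPEG.
bw = required_input_bandwidth_gbs(3000, 110_000)  # ~0.33 GB/s
# A smaller model consumes samples faster, so for the same hardware the
# bandwidth requirement rises even though the model is "cheaper".
```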
PROFILING YOUR MODEL
CNTK training ResNet-50 on DGX-1V
DOES LATENCY MATTER?
DOES THROUGHPUT MATTER?
ROOFLINE ANALYSIS
Understand your constraints
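Roofline analysis makes the constraint question quantitative: attainable throughput is the lesser of the compute roof and the memory roof at a given arithmetic intensity. The peak numbers below are assumptions for illustration only:

```python
def roofline(peak_flops, mem_bw_bytes, arithmetic_intensity):
    """Classic roofline model: attainable FLOP/s is capped either by
    peak compute or by memory traffic (bandwidth * FLOP-per-byte),
    whichever roof is lower at this arithmetic intensity."""
    return min(peak_flops, mem_bw_bytes * arithmetic_intensity)

# Assumed machine: 125 TFLOP/s peak, 900 GB/s memory bandwidth.
low = roofline(125e12, 900e9, 10)    # memory bound: 9e12 FLOP/s
high = roofline(125e12, 900e9, 200)  # compute bound: 125e12 FLOP/s
```

Kernels left of the ridge point are memory bound and gain nothing from more FLOPS; kernels right of it are compute bound and gain nothing from more bandwidth.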
DATA INPUT PIPELINE
Overview

[Diagram repeated from earlier: parallel Download → Decode → Augment pipelines feeding a shared batch queue; augmentation examples include a 352x352 random crop, transformations (e.g. flip, colour), and added noise]
AUGMENTATION
Consuming computational resources

• Is critical to a wide range of problems

• Will consume computational resources (which, as we have discussed, is not always a problem)

• Can rarely be done as a pre-processing step

• Can be migrated or partially migrated to the GPU (doing so consumes GPU compute and memory, so there is a trade-off)
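As a concrete toy example of CPU-side augmentation work, here is a random crop followed by a random horizontal flip, with the image modelled as a list of rows. Real pipelines run this (or a GPU equivalent) for every sample, every epoch:

```python
import random

def random_crop_and_flip(image, crop_h, crop_w, rng=None):
    """Random crop to (crop_h, crop_w), then a 50% chance of a
    horizontal flip. `image` is a list of rows (lists of pixels)."""
    rng = rng or random.Random()
    top = rng.randrange(len(image) - crop_h + 1)
    left = rng.randrange(len(image[0]) - crop_w + 1)
    crop = [row[left:left + crop_w] for row in image[top:top + crop_h]]
    if rng.random() < 0.5:          # horizontal flip
        crop = [row[::-1] for row in crop]
    return crop
```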
OPTIMIZE I/O
Take action

• Optimize I/O
  • Use the deep learning framework's own data storage and loading mechanism:
    • Multithreaded
    • Highly optimised storage format
    • Highly optimised code
  • Use fast decoding libraries (for images, consider libjpeg-turbo)
  • Make sure you are using the system caching mechanism
  • Be cautious about third-party libraries (for example, using OpenCV does not let you optimise performance)
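The "highly optimised storage format" point is easy to illustrate: instead of millions of small files, pack samples into one length-prefixed record stream that can be read sequentially (the idea behind TFRecord/RecordIO-style formats). A minimal sketch, not a production format:

```python
import struct

def write_recordfile(path, samples):
    """Write each sample (bytes) as a little-endian uint32 length prefix
    followed by the payload: one large sequential file instead of many
    small ones, which is far friendlier to storage and the page cache."""
    with open(path, "wb") as f:
        for s in samples:
            f.write(struct.pack("<I", len(s)))
            f.write(s)

def read_recordfile(path):
    """Stream the records back in order with purely sequential reads."""
    out = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            (n,) = struct.unpack("<I", header)
            out.append(f.read(n))
    return out
```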
OPTIMIZE AUGMENTATION
Take action

• Optimize augmentation
  • Understand the performance of your augmentation pipeline
  • Optimise the parts of the code that matter
  • If necessary, consider augmentation on the GPU
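Optimising the parts of the code that matter starts with measuring them. A minimal per-stage timer (a sketch; a real profiler such as cProfile or nvprof gives far more detail):

```python
import time

def time_stage(fn, *args, repeats=1000):
    """Average wall-clock seconds per call of one pipeline stage."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats

# Compare stages to find the bottleneck before optimising anything:
decode_cost = time_stage(lambda: sum(range(1000)))
augment_cost = time_stage(lambda: None)
```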


NON TRIVIAL IMPLICATIONS
We will observe them during the Lab
OPTIMIZE STORAGE
Handling large, read-dominated DL datasets

• DNN datasets are large
• Read-dominated at the beginning of each epoch (dataset pass-through)
• Large groups of random reads, repeated over and over
• Recommend an NFS appliance for long-term storage
• DGX-1 SSDs in RAID 0 as an NFS cache
• Option to copy all data to each node at the beginning of a job
DATA STORAGE ORGANISATION
Multi-tier approach

Level 4 (cloud): 50PB+ cold storage for archival [all raw data ever collected]
Level 3 (cloud): 10PB+ highly available, replicated storage [all active raw data]
Level 2 (data center): 5PB low-bandwidth storage [large-scale data cache]
Level 1 (data center): 100TB high-bandwidth storage [labeled data cache]
Level 0 (in DGX-1): 7TB SSD high-performance storage [training data cache]
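The tiering above behaves like a read-through cache hierarchy: a sample is served from the fastest tier that holds it and promoted into faster tiers as it is read. A toy sketch, with each tier modelled as a set of sample ids:

```python
def locate(sample_id, tiers):
    """Walk the storage tiers from fastest (index 0) to slowest,
    promoting the sample into every faster tier on a hit
    (read-through caching). Returns the depth it was found at."""
    for depth, tier in enumerate(tiers):
        if sample_id in tier:
            for faster in tiers[:depth]:  # promote into faster tiers
                faster.add(sample_id)
            return depth
    raise KeyError(sample_id)
```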
LAB
Data Input Pipeline

- Overall impact of correct input pipeline

- Impact of inefficiencies:
- File format and caching

- Decode logic

- Augmentation logic
IN NODE COMMUNICATION
DL DATA PARALLELISM – PCIE BASED
[Diagram: two CPUs connected by a QPI link; each CPU hosts a PCIe switch attaching four GPUs (0–3 and 4–7)]

Data loading over PCIe

Gradient averaging over PCIe and QPI

Data loading and gradient averaging share communication resources: congestion


COMPETING FOR RESOURCES
Degradation of performance

• Data delivery

• Weight exchange between the GPUs

• Weight exchange between the nodes


DL DATA PARALLELISM – NVLINK
[Diagram: the same two-CPU PCIe topology, with the eight GPUs additionally interconnected by NVLink]

Data loading over PCIe

Gradient averaging over NVLink

No sharing of communication resources: no congestion


SUBSTANTIALLY MORE BANDWIDTH
NVLink operates at 300 GB/s

• For data-parallel implementations, gradients are exchanged over NVLink

• For model-parallel implementations, activations are exchanged over NVLink

• PCIe is reserved for data delivery (and, for multi-node training, for external communication)
PERFORMANCE
Intra-node performance

[Chart: AllReduce bandwidth (OMB, size=128MB, in GB/s), 0–60 GB/s, comparing 4-GPU QPI, 4-GPU CPU, and 4-GPU PCIe configurations with DGX-1; the NVLink-based DGX-1 delivers the highest bandwidth]
NVSWITCH: ALL-TO-ALL CONNECTIVITY

[Diagram: 16 GPUs (0–15) fully connected through an NVSwitch fabric]
16-WAY ALL-REDUCE PERFORMANCE

• 8x smaller packet size with the same performance
• 4x higher performance for a given packet size
EVALUATION
Calculating the bandwidth requirements

efficiency = (Tf + Tb) / (Tf + max(Tb, M/B))

• Tf – time required for the forward pass
• Tb – time required for the backward pass
• M – amount of data that needs transporting
• B – available bandwidth
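The formula translates directly into code: communication is free as long as the gradient exchange hides under the backward pass. The workload numbers in the example are illustrative assumptions (roughly ResNet-50-sized gradients):

```python
def scaling_efficiency(t_f, t_b, m_bytes, b_bytes_per_s):
    """efficiency = (Tf + Tb) / (Tf + max(Tb, M/B)): as long as M/B fits
    under Tb, the exchange overlaps with backward compute entirely."""
    return (t_f + t_b) / (t_f + max(t_b, m_bytes / b_bytes_per_s))

# Assumed per-iteration times: 20 ms forward, 40 ms backward, 100 MB gradients.
full = scaling_efficiency(0.020, 0.040, 100e6, 5e9)  # M/B = 20 ms <= Tb -> 1.0
slow = scaling_efficiency(0.020, 0.040, 100e6, 1e9)  # M/B = 100 ms     -> 0.5
```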
EVALUATION
Calculating the bandwidth requirements

efficiency = (Tf + Tb) / (Tf + max(Tb, M/B))

[Chart: efficiency (0 to 1.2) vs synchronization time; efficiency stays at 1 while M/B ≤ Tb, then degrades as synchronization time grows]
IMPLICATIONS
Requirements highly dependent on the workload
10X PERFORMANCE GAIN IN LESS THAN A YEAR

Time to train (days):
• DGX-1 with V100 (Sep '17): 15 days
• DGX-2 (Q3 '18): 1.5 days, 10 times faster

Includes software improvements across the stack: NCCL, cuDNN, etc.

Workload: FairSeq, 55 epochs to solution. PyTorch training performance.
DOES THROUGHPUT MATTER?
DOES LATENCY MATTER?
COMMUNICATION REQUIREMENTS
Latency vs Throughput
BETWEEN NODE COMMUNICATION

MULTI-NODE DGX-1
Networking Topology

• Ingest data as fast as possible
• Pass data rapidly between nodes across the cluster
• Similar to HPC networking architecture
• InfiniBand = ultra-high bandwidth, low latency, collision free
• Two-tier network with root and leaf switches
• Any-to-any connectivity with full bi-section bandwidth and minimum contention between nodes
MULTI-NODE DGX-1 SMALL CLUSTER
Up to 12 DGX-1 nodes

• Assume growth for up to 12 nodes
• 2 racks, 2 IB switches (36 ports)
• 19.2 kW per rack, but can split across racks if necessary
• Full bi-section bandwidth for each group of 6 nodes
• 2:1 oversubscription between groups of 6
MULTI-NODE DGX-1 MEDIUM CLUSTER
Up to 36 DGX-1 nodes

• Defines a DGX-1 "POD"
• Can be replicated for greater scale, e.g. the large cluster configuration
• 6 racks, 6 nodes per rack
• Larger IB director switch (216 ports), with capacity for more pods via unused ports

[Diagram legend: Main Ethernet, Management Ethernet]
MULTI-NODE DGX-1 LARGE CLUSTER
Up to 144 DGX-1 nodes (4 "PODs")

• Implements 4 DGX-1 pods
• Distributed across 24 racks
• Full bi-section bandwidth within a pod, 2:1 between pods
• Training jobs ideally scheduled within a pod, to minimize inter-pod traffic
DGX-1 DEEP LEARNING DATA CENTER
Reference Architecture

• Full system design
• Login and management servers
• Storage and networking

[Diagram legend: Main Ethernet, Management Ethernet]
COMMUNICATION SOFTWARE

DESIGN
What is NCCL?

Optimized collective communication library between CUDA devices. Implements Allreduce, Reduce, Broadcast, Reduce-scatter, Allgather.

Easy to integrate into any DL framework, as well as traditional HPC apps using MPI.

Runs on the GPU using asynchronous CUDA kernels, for faster access to GPU memory, parallel reductions, NVLink usage.

Operates on CUDA pointers. Operations are tied to a CUDA stream.

Uses as few threads as possible to permit other computation to progress simultaneously.
NCCL
Architecture

[Stack diagram, top to bottom:]
Deep Learning Frameworks (Caffe, Caffe2, Torch, TF, MXNET, CNTK)
NCCL | CUDNN | CUBLAS
CUDA
NVIDIA GPUs
DESIGN
Rings

NCCL uses rings to move data across all GPUs and perform reductions.

PCIe / QPI: 1 unidirectional ring
DGX-1: 4 unidirectional rings


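The ring pattern can be simulated in plain Python to see why it works: a reduce-scatter phase followed by an allgather phase, with each rank exchanging one chunk per step with its ring neighbour. This is a toy single-threaded model of the data movement, not NCCL's actual implementation:

```python
def ring_allreduce(buffers):
    """Toy simulation of ring AllReduce on n 'GPUs', each holding a list
    of numbers. After reduce-scatter, each rank owns the full sum of one
    chunk; allgather then circulates those chunks so every rank ends
    with the complete elementwise sum. 2*(n-1) neighbour exchanges."""
    n = len(buffers)
    assert len(buffers[0]) % n == 0, "toy version: length divisible by n"
    chunk = len(buffers[0]) // n
    data = [list(b) for b in buffers]

    def idx(c):
        c %= n
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. In step s, rank r sends its running sum
    # of chunk (r - s) to rank r+1, which accumulates it.
    for step in range(n - 1):
        for r in range(n):
            dst = (r + 1) % n
            for i in idx(r - step):
                data[dst][i] += data[r][i]

    # Phase 2: allgather. The fully reduced chunks travel once around
    # the ring, overwriting stale copies.
    for step in range(n - 1):
        for r in range(n):
            dst = (r + 1) % n
            for i in idx(r + 1 - step):
                data[dst][i] = data[r][i]
    return data
```

Each rank only ever talks to its neighbour, so the algorithm's bandwidth cost per rank is independent of the number of GPUs, which is why rings scale well over NVLink.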
DESIGN
Kernels

[Diagram: within the ring, each GPU's kernel receives data from the previous GPU into a FIFO, reduces it with its local sendbuff, and forwards the result to the next GPU, finally writing into recvbuff]
NCCL 2.0
Inter-node communication

Inter-node communication using Sockets or InfiniBand verbs, with multi-rail support, topology detection, and automatic use of GPU Direct RDMA.

Optimal combination of NVLink, PCIe, and network interfaces to maximize bandwidth and create rings across nodes.

PCIe + InfiniBand | DGX-1: NVLink + 4x InfiniBand
PERFORMANCE
Inter-node performance

[Chart: AllReduce bandwidth (OMB, size=128MB, in GB/s), 0–45 GB/s, comparing MPI, Baidu Allreduce, and NCCL on 2 nodes x 4 GPUs (IB EDR, PCIe switch) and 4 nodes x 8 GPUs (DGX-1: 4x IB EDR, 4x NVLink); NCCL achieves the highest bandwidth]
PERFORMANCE
Deep Learning – CNTK

[Chart: CNTK ResNet-50 scaling (images/s) vs number of GPUs (0–32) for Ideal, MPI, and NCCL; NCCL tracks ideal scaling closely, with the top configuration reaching 6569 images/s at 32 GPUs, while MPI lags behind]


LAB
In node communication

- NVLINK + NCCL TensorFlow example

- In the next Lab we will see how to use Horovod to hide the explicit use of NCCL
REFERENCE ARCHITECTURE
BALANCED HARDWARE
DGX-1 as a reference point for solution design
MULTI-NODE SCALING WHITEPAPER

Use this asset to aid the design process, ensuring you develop an optimized architecture for your multi-node cluster, following NVIDIA best practices learned from our customer deployments and our own DGX SATURNV.
NGC – EFFICIENT SOFTWARE

Regardless of the hardware solution you choose, whether in the datacenter or in the cloud, make sure you use NGC. Every month, free of charge, we provide a set of Docker containers with the latest, highly optimized versions of all of the major deep learning frameworks.
INNOVATE IN MINUTES, NOT WEEKS WITH DEEP LEARNING CONTAINERS

Benefits of containers:
• Simplify deployment of GPU-accelerated applications, eliminating time-consuming software integration work
• Isolate individual frameworks or applications
• Share, collaborate, and test applications
END-TO-END PRODUCT FAMILY

TRAINING
• Fully integrated DL supercomputer: DGX Family (DGX Station, DGX-1, DGX-2, Cloud Service Provider)
• Desktop: TITAN V
• Data center: Tesla P100/V100

INFERENCE
• Data center: Tesla P4, Tesla V100, Tesla P100
• Automotive: Drive PX2
• Embedded: Jetson TX2
MANAGING RESOURCES

JOB SCHEDULING
The role of a job scheduler

Reuther, Albert, et al. "Scalable system scheduling for HPC and big data." Journal of Parallel and Distributed Computing 111 (2018): 76-92.
JOB SCHEDULING
Feature comparison (this is not an exhaustive list)

Reuther, Albert, et al. "Scalable system scheduling for HPC and big data." Journal of Parallel and Distributed Computing 111 (2018): 76-92.
NGC VISION
AI platform of choice
MORE THAN JOB SCHEDULING
Wider set of Machine Learning activities (Uber’s Michelangelo)

https://eng.uber.com/michelangelo/
DL IS AN HPC WORKLOAD
HPC expertise is important for success

It makes sense to build an AI team and a separate systems/HPC team and have the two teams sit next to each other.

That is because solving some of the problems discussed in this lecture requires very specialised systems/HPC knowledge, and it is incredibly difficult for any single person to acquire both the AI and the systems/HPC knowledge.

For a detailed discussion, see Andrew Ng's "Nuts and Bolts of Applying Deep Learning": https://www.youtube.com/watch?v=F1ka6a13S9I&t=120s
OTHER
"A-HA" MOMENTS IN DL CLUSTER DESIGN
Additional design insights to get you started

Overall Cluster:
• HPC similar to DL
• HPC expertise can help in design
• Even with HPC, the similarities are limited

Rack Design:
• DL drives close to operational limits; assume less headroom
• Proper airflow is crucial to cluster performance

Networking:
• Like HPC, InfiniBand is preferred
• Requires high bandwidth, low latency
• Maximize per-node IB connections

Storage:
• DGX-1 read cache is critical
• Datasets range from 10k's to millions of objects
• Terabyte levels of storage, with large variance

Facilities:
• GPU data center operates at near-max power
• Assume higher watts per rack
• Dramatically higher FLOPS/watt = floor space saved

Software:
• Scale requires "cluster-aware" software
• NCCL2 = GPU/multi-node acceleration
• Automatic topology detection
• DL framework optimizations
TALK TO US
www.nvidia.com/dli
