EfficientML.ai Lec 18: Distributed Training
boardking_
Edited on 2024-04-25 08:34

TLDR

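This lecture covers hybrid parallelism, where several parallelization strategies run at the same time, and the Alpa paper, which shows how to search for the parallelization hyperparameters automatically. It then turns to the bandwidth and latency problems in distributed training. The bandwidth problem can be addressed with gradient compression: gradient pruning (Deep Gradient Compression with momentum correction), low-rank compression (PowerSGD), and gradient quantization (1-bit SGD or TernGrad). The latency problem can be addressed with delayed gradient updates, which let the computation time on multiple GPUs hide the communication time.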

Overview

03:06 Review Parallelization Methods

05:19 How to auto-parallelize

10:06 Bandwidth and latency bottleneck

17:07 Gradient compression

37:02 Gradient Quantization

40:46 Delayed gradient update

Paper List

CMU 15-849: Machine Learning Systems by Zhihao Jia

Alpa: Automating inter- and intra-operator parallelism for distributed deep learning [Zheng et al., 2022]

Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training [Lin et al., 2017]

PowerSGD: Low-Rank Gradient Compression for Distributed Optimization [Vogels et al., 2019]

TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning [Wen et al., 2017]

Delayed Gradient Averaging: Tolerate the Communication Latency in Federated Learning [Zhu et al., 2021]

Lecture Notes

03:06 Review Parallelization Methods

DP (data parallelism): high utilization, high memory cost, low communication

PP (pipeline parallelism): low utilization, low memory cost, medium communication

TP (tensor parallelism): high utilization, low memory cost, high communication

Hybrid: DP+PP

PP+TP

3D Parallelism: PP+TP+DP

05:19 How to auto-parallelize

Alpa: Automating inter- and intra-operator parallelism for distributed deep learning [Zheng et al., 2022]

Define the search space for parallel strategies

Search for the inter-op plan: split the model into stages

Keep the workload of each stage roughly equal to avoid starvation

Search for the intra-op plan within each stage
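To make the "roughly equal workload" goal concrete, here is a minimal sketch of splitting a chain of layers into contiguous pipeline stages so that the slowest stage is as fast as possible. This is a toy dynamic program, not Alpa's actual algorithm (which also searches device meshes and intra-op plans); `balanced_stages` and the per-layer costs are hypothetical.

```python
from itertools import accumulate


def balanced_stages(layer_costs, num_stages):
    """Split layers into contiguous stages, minimizing the cost of the slowest stage."""
    n = len(layer_costs)
    prefix = [0] + list(accumulate(layer_costs))
    inf = float("inf")
    # best[s][i]: smallest bottleneck when the first i layers form s stages.
    best = [[inf] * (n + 1) for _ in range(num_stages + 1)]
    cut = [[0] * (n + 1) for _ in range(num_stages + 1)]
    best[0][0] = 0
    for s in range(1, num_stages + 1):
        for i in range(s, n + 1):
            for j in range(s - 1, i):
                bottleneck = max(best[s - 1][j], prefix[i] - prefix[j])
                if bottleneck < best[s][i]:
                    best[s][i], cut[s][i] = bottleneck, j
    # Walk the cut table backwards to recover (start, end) layer ranges.
    bounds, i = [], n
    for s in range(num_stages, 0, -1):
        j = cut[s][i]
        bounds.append((j, i))
        i = j
    return best[num_stages][n], bounds[::-1]


if __name__ == "__main__":
    # Hypothetical per-layer costs (e.g. milliseconds per micro-batch).
    print(balanced_stages([4, 2, 3, 1, 5, 3], num_stages=3))
```

The bottleneck stage sets the pipeline's throughput, which is why the objective is the maximum stage cost rather than the total.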

10:06 Bandwidth and latency bottleneck

Communication is essential

Parameter server: requires sync

Bandwidth requirement calculation

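Communication overhead is paid on every step, so the speedup does not scale linearly with the number of GPUs

Bucket effect: the slowest worker or link limits the overall throughput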

Latency analysis on networks
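The bandwidth and latency points above can be made concrete with a rough alpha-beta cost estimate. This is a sketch under stated assumptions: a ring all-reduce with 2(N-1) steps, each moving 1/N of the gradient, no overlap of communication with computation, and hypothetical function names and numbers.

```python
def ring_allreduce_seconds(num_params, num_gpus, bandwidth_gbps, latency_s, bytes_per_param=4):
    """Alpha-beta estimate of one ring all-reduce over `num_gpus` workers."""
    if num_gpus == 1:
        return 0.0
    steps = 2 * (num_gpus - 1)  # reduce-scatter followed by all-gather
    chunk_bytes = num_params * bytes_per_param / num_gpus
    bytes_per_second = bandwidth_gbps * 1e9 / 8
    # Every step pays a fixed latency plus the time to push one chunk.
    return steps * (latency_s + chunk_bytes / bytes_per_second)


def speedup(compute_s, comm_s, num_gpus):
    """Throughput speedup over one GPU when communication is not overlapped."""
    return num_gpus * compute_s / (compute_s + comm_s)


if __name__ == "__main__":
    # Hypothetical setup: 25M FP32 parameters, 10 Gb/s links, 1 ms latency,
    # 0.1 s of compute per step.
    for n in (2, 8, 64):
        comm = ring_allreduce_seconds(25_000_000, n, 10, 1e-3)
        print(n, round(comm, 4), round(speedup(0.1, comm, n), 2))
```

In this model the bandwidth term stays close to twice the gradient size no matter how many GPUs join, while the latency term grows linearly with the GPU count. Compression shrinks the first term; only overlapping or delaying communication helps with the second.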

17:07 Gradient compression

Paper: Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training [Lin et al., 2017]

When sending gradients, there is no need to send them in FP32

sparse communication: send top-k gradients by magnitude, keep the error (residual) locally

About 1% accuracy drop on ResNet

Reason: momentum

Locally accumulating raw gradients does not match what momentum SGD would have applied

Fix: accumulate the velocity instead of the gradient, i.e. Momentum Correction
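A minimal worker-side sketch of top-k sparsification with local accumulation and momentum correction may help here. It follows the description in the DGC paper (accumulate the velocity, send the largest entries, clear what was sent), but the function name `dgc_step`, the list-based tensors, and the dict wire format are assumptions for illustration.

```python
def dgc_step(grad, velocity, residual, momentum=0.9, k=1):
    """One worker-side step; returns the sparse update {index: value} to send.

    `velocity` and `residual` are per-worker state lists, updated in place.
    """
    for i, g in enumerate(grad):
        velocity[i] = momentum * velocity[i] + g  # momentum correction: accumulate velocity
        residual[i] += velocity[i]                # local accumulation of unsent updates
    # Select the k entries with the largest magnitude.
    top = sorted(range(len(residual)), key=lambda i: abs(residual[i]), reverse=True)[:k]
    sparse = {}
    for i in top:
        sparse[i] = residual[i]
        residual[i] = 0.0  # sent, so nothing is left to carry over
        velocity[i] = 0.0  # momentum factor masking, as described in the DGC paper
    return sparse


if __name__ == "__main__":
    velocity, residual = [0.0] * 3, [0.0] * 3
    print(dgc_step([1.0, 0.1, -3.0], velocity, residual))
    print(dgc_step([1.0, 0.1, 0.0], velocity, residual))
```

Small gradients are never dropped; they wait in `residual` until their accumulated value is large enough to be selected.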

Warm-up training

Warm up the sparsity: increase it gradually during the first epochs
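As a sketch of what a sparsity warm-up could look like, the schedule below ramps the sparsity exponentially and then holds it at the final value. The constants (start at 75%, keep a quarter as many gradients each epoch, cap at 99.9%) are assumptions chosen to resemble the ramp described in the DGC paper, and `warmup_sparsity` is a hypothetical helper.

```python
def warmup_sparsity(epoch, final_sparsity=0.999, keep_ratio=0.25):
    """Sparsity to use at a given epoch (0-based) during and after warm-up."""
    # The kept fraction shrinks by `keep_ratio` each epoch until the cap is hit.
    return min(1.0 - keep_ratio ** (epoch + 1), final_sparsity)


if __name__ == "__main__":
    for epoch in range(6):
        print(epoch, warmup_sparsity(epoch))
```

Starting with a lower sparsity lets the network settle while gradients are still changing quickly, before most of them are withheld.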

Performance

Matches baseline accuracy with 99.9% gradient sparsity

Language modeling: 462x compression ratio

All-reduce for sparse gradients: sparse tensors become denser as they are summed across workers

Solutions: 1) coarse-grained sparsity

2) prune in the middle of ring all-reduce
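A quick back-of-the-envelope calculation shows why summed sparse gradients densify. It assumes each worker picks its nonzero positions independently and uniformly at random, which is a simplification (real top-k selections are correlated across workers); `union_density` is a hypothetical helper.

```python
def union_density(density, num_workers):
    """Expected fraction of nonzero entries after summing independent sparse gradients."""
    # An entry stays zero only if every worker left it zero.
    return 1.0 - (1.0 - density) ** num_workers


if __name__ == "__main__":
    # 99.9% sparsity per worker, i.e. density 0.001.
    for workers in (1, 8, 64, 512):
        print(workers, union_density(0.001, workers))
```

Under this assumption the density grows almost linearly with the worker count at first. That is the motivation for making workers agree on coarse-grained blocks or re-pruning partway through the ring.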

PowerSGD: Low-Rank Gradient Compression for Distributed Optimization [Vogels et al., 2019]
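The core idea of low-rank compression can be sketched with a single power-iteration step in the rank-1 special case: instead of an n x m gradient matrix, a worker communicates two vectors of n + m numbers in total. This is a simplified illustration, not the paper's full algorithm, which uses rank r with orthogonalization, all-reduces the factors between the two products, reuses the previous `q` as a warm start, and applies error feedback. The function names are hypothetical.

```python
import math


def power_sgd_rank1(M, q):
    """One power-iteration step; returns (p, new_q) so that outer(p, new_q) approximates M."""
    rows, cols = len(M), len(M[0])
    # p = M q
    p = [sum(M[i][j] * q[j] for j in range(cols)) for i in range(rows)]
    # Orthonormalization reduces to normalization in the rank-1 case.
    norm = math.sqrt(sum(x * x for x in p)) or 1.0
    p = [x / norm for x in p]
    # new_q = M^T p
    new_q = [sum(M[i][j] * p[i] for i in range(rows)) for j in range(cols)]
    return p, new_q


def outer(p, q):
    """Outer product p q^T as a list of rows."""
    return [[pi * qj for qj in q] for pi in p]


if __name__ == "__main__":
    M = [[3.0, 4.0, 5.0], [6.0, 8.0, 10.0]]  # an exactly rank-1 matrix
    p, q = power_sgd_rank1(M, [1.0, 1.0, 1.0])
    print(outer(p, q))
```

For a matrix that is not exactly low rank, the difference between `M` and the reconstruction would be kept locally and added to the next gradient.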

37:02 Gradient Quantization

1-bit SGD

Compare each entry with zero to get a 1-bit matrix plus a scaling factor

Threshold quantization
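A minimal sketch of sign-based 1-bit quantization with error feedback follows. It assumes a single shared scaling factor equal to the mean absolute value; the original 1-bit SGD work reconstructs positive and negative entries with separate values, so treat this as an illustration of the idea rather than that exact scheme. Function names are hypothetical.

```python
def one_bit_quantize(grad, error):
    """Return (bits, scale); `error` is updated in place with the quantization residual."""
    corrected = [g + e for g, e in zip(grad, error)]  # add back what was lost last step
    scale = sum(abs(x) for x in corrected) / len(corrected)  # one shared scaling factor
    bits = [x >= 0 for x in corrected]  # compare with zero: 1 bit per entry
    for i, (x, b) in enumerate(zip(corrected, bits)):
        error[i] = x - (scale if b else -scale)  # error feedback for the next step
    return bits, scale


def one_bit_dequantize(bits, scale):
    """Rebuild an approximate gradient from the sign bits and the scale."""
    return [scale if b else -scale for b in bits]


if __name__ == "__main__":
    error = [0.0, 0.0, 0.0]
    bits, scale = one_bit_quantize([0.5, -1.5, 1.0], error)
    print(bits, scale, one_bit_dequantize(bits, scale), error)
```

Each entry costs one bit instead of 32, and the `error` buffer ensures the information removed by quantization is delayed rather than lost.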

TernGrad

Scale by max(|g|) and quantize each gradient to one of {-1, 0, +1}
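The ternarization step can be sketched as follows: each entry keeps its sign with probability |g| / max|g| and becomes zero otherwise, so the scaled ternary vector equals the original gradient in expectation. This is a simplified per-tensor version; the `ternarize` name and the `rng` parameter are hypothetical.

```python
import random


def ternarize(grad, rng=random):
    """Stochastically map each gradient to -1, 0 or +1; returns (ternary, scale)."""
    scale = max(abs(g) for g in grad)
    if scale == 0:
        return [0] * len(grad), 0.0
    ternary = []
    for g in grad:
        keep = rng.random() < abs(g) / scale  # Bernoulli with P(keep) = |g| / scale
        sign = 1 if g > 0 else -1
        ternary.append(sign * int(keep))
    return ternary, scale


if __name__ == "__main__":
    t, s = ternarize([0.5, -1.0, 0.0, 0.25], random.Random(0))
    print(t, s)  # the receiver reconstructs the gradient as s * t
```

Because the rounding is random rather than deterministic, small gradients still contribute on average instead of always being rounded to zero.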

40:46 Delayed gradient update

Bandwidth vs Latency

Bandwidth can be improved by gradient compression and quantization techniques, or by better hardware

Workers have to wait for the transmission to finish

Delayed Gradient Averaging: Tolerate the Communication Latency in Federated Learning [Zhu et al., 2021]

Update with the averaged gradients from D steps earlier

Communication is hidden behind computation
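A toy simulation can illustrate the delayed update. Each worker applies its own gradient immediately, and when the averaged gradient from D steps earlier arrives, it swaps the stale local gradient for that average. This is a simplified sketch on a one-dimensional quadratic loss, with no local multi-step training and no real network; the function name and the setup are assumptions, not the paper's implementation.

```python
from collections import deque


def delayed_gradient_averaging(targets, delay, steps, lr=0.1):
    """Worker j minimizes 0.5 * (w - targets[j]) ** 2; returns the final local weights."""
    weights = [0.0] * len(targets)
    in_flight = deque()  # local gradients whose average is still being communicated
    for _ in range(steps):
        grads = [w - t for w, t in zip(weights, targets)]
        in_flight.append(grads)
        # Apply the fresh local gradient right away; no waiting on the network.
        weights = [w - lr * g for w, g in zip(weights, grads)]
        if len(in_flight) > delay:
            # The average sent `delay` steps ago has arrived: replace the stale
            # local gradient with the global average.
            old = in_flight.popleft()
            avg = sum(old) / len(old)
            weights = [w - lr * (avg - g) for w, g in zip(weights, old)]
    return weights


if __name__ == "__main__":
    print(delayed_gradient_averaging([1.0, 3.0], delay=0, steps=200))
    print(delayed_gradient_averaging([1.0, 3.0], delay=3, steps=200))
```

With `delay=0` this reduces to synchronous gradient averaging. With a positive delay, the mean of the workers' weights follows the same trajectory, while each worker's copy differs slightly because its most recent local gradients have not yet been corrected.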

Comparison of methods

Real-world benchmark 7.5x on Language Task