MLSYS 2021

Total: 1

#1 Scaling Distributed Training with Adaptive Summation

Authors: Saeed Maleki; Madan Musuvathi; Todd Mytkowicz; Olli Saarikivi; Tianju Xu; Vadim Eksarevskiy; Jaliya Ekanayake; Emad Barsoum

Data parallelism is a common way to parallelize stochastic gradient descent (SGD). However, degraded convergence at large minibatch sizes limits the scalability of data parallelism. This paper introduces Adasum, a novel method for combining gradients that significantly improves convergence when training with large minibatches. The paper provides the intuition behind Adasum and its formal justification, along with a convergence proof. Additionally, it describes an efficient implementation of Adasum and its integration into the open-source toolkit Horovod for use with both TensorFlow and PyTorch.
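To make the idea concrete, below is a minimal NumPy sketch of a pairwise Adasum-style combination of two gradient vectors. The scaling rule shown here (damping each gradient by the component it shares with the other, so orthogonal gradients add while identical gradients average) is an illustrative assumption based on the paper's description, not the authors' reference implementation; the function name `adasum_pair` is hypothetical.

```python
import numpy as np

def adasum_pair(g1: np.ndarray, g2: np.ndarray) -> np.ndarray:
    """Illustrative pairwise Adasum-style gradient combination (sketch).

    Each gradient is scaled down by the fraction of it that overlaps with
    the other, so orthogonal gradients behave like a sum and parallel
    gradients behave like an average.
    """
    dot = np.dot(g1, g2)
    scale1 = 1.0 - dot / (2.0 * np.dot(g1, g1))
    scale2 = 1.0 - dot / (2.0 * np.dot(g2, g2))
    return scale1 * g1 + scale2 * g2

# Orthogonal gradients: combination acts like a sum -> [1. 1.]
print(adasum_pair(np.array([1.0, 0.0]), np.array([0.0, 1.0])))

# Identical gradients: combination acts like an average -> [1. 0.]
print(adasum_pair(np.array([1.0, 0.0]), np.array([1.0, 0.0])))
```

In practice, Horovod exposes this reduction through its `op` argument (for example, `hvd.DistributedOptimizer(optimizer, op=hvd.Adasum)`), per Horovod's documentation of the Adasum integration described in the paper.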