MLSYS 2023


#1 Breadth-First Pipeline Parallelism

Author: Joel Lamy-Poirier

We introduce Breadth-First Pipeline Parallelism, a novel training schedule that optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost, and memory usage by combining high GPU utilization with a small batch size per GPU, and by making use of fully sharded data parallelism. Experimentally, we observed up to a 43% increase in training throughput for a 52 billion-parameter model trained with a small batch size per GPU, compared to Megatron-LM, which would reduce training time and cost by the same proportion on a large GPU cluster.
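Below is a minimal illustrative sketch, not the paper's implementation, contrasting the micro-batch execution order of a depth-first (interleaved, Megatron-style) schedule with a breadth-first one on a single GPU that hosts several pipeline stages. The parameters `num_local_stages`, `num_microbatches`, and `group` are hypothetical names chosen for illustration.

```python
# Hypothetical sketch of the two scheduling orders; parameter names are
# illustrative and do not come from the paper's code.

def depth_first_order(num_local_stages: int, num_microbatches: int, group: int):
    """Depth-first: run a small group of micro-batches through every local
    stage before starting the next group (Megatron-style interleaving)."""
    order = []
    for start in range(0, num_microbatches, group):
        end = min(start + group, num_microbatches)
        for stage in range(num_local_stages):
            for mb in range(start, end):
                order.append((stage, mb))
    return order


def breadth_first_order(num_local_stages: int, num_microbatches: int):
    """Breadth-first: run *all* micro-batches through one local stage before
    advancing to the next, which keeps only one stage's activations hot and
    leaves room to overlap sharded-weight communication with compute."""
    order = []
    for stage in range(num_local_stages):
        for mb in range(num_microbatches):
            order.append((stage, mb))
    return order


if __name__ == "__main__":
    # (stage, micro-batch) pairs in execution order for 2 local stages,
    # 4 micro-batches:
    print("depth-first :", depth_first_order(2, 4, group=2))
    print("breadth-first:", breadth_first_order(2, 4))
```

Under this sketch, the breadth-first order finishes every micro-batch on a stage before touching the next stage's weights, which is what makes it compatible with small per-GPU batch sizes and fully sharded data parallelism as described in the abstract.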