unger@osdi22@USENIX


#1 Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization

Authors: Colin Unger ; Zhihao Jia ; Wei Wu ; Sina Lin ; Mandeep Baines ; Carlos Efrain Quintero Narvaez ; Vinay Ramakrishnaiah ; Nirmal Prajapati ; Pat McCormick ; Jamaludin Mohd-Yusof ; Xi Luo ; Dheevatsa Mudigere ; Jongsoo Park ; Misha Smelyanskiy ; Alex Aiken

This paper presents Unity, the first system that jointly optimizes algebraic transformations and parallelization in distributed DNN training. Unity represents both parallelization and algebraic transformations as substitutions on a unified parallel computation graph (PCG), which simultaneously expresses the computation, parallelization, and communication of a distributed DNN training procedure. Optimizations, in the form of graph substitutions, are automatically generated given a list of operator specifications, and are formally verified correct using an automated theorem prover. Unity then uses a novel hierarchical search algorithm to jointly optimize algebraic transformations and parallelization while maintaining scalability. The combination of these techniques provides a generic and extensible approach to optimizing distributed DNN training, capable of integrating new DNN operators, parallelization strategies, and model architectures with minimal manual effort. We evaluate Unity on seven real-world DNNs running on up to 192 GPUs on 32 nodes and show that Unity outperforms existing DNN training frameworks by up to 3.6× while keeping optimization times under 20 minutes. Unity is available to use as part of the open-source DNN training framework FlexFlow at https://github.com/flexflow/flexflow.
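The idea of treating both kinds of optimization as equivalence-preserving rewrites can be illustrated with a toy sketch. The code below is not Unity's or FlexFlow's API; every function name is hypothetical, and plain Python lists stand in for tensors. It shows one algebraic rewrite (two matrix multiplications that share an input are fused into one multiplication followed by a split) and one parallelization rewrite (the batch dimension is partitioned, each shard is computed independently, and the results are combined). The assertions check equivalence on a single input only, whereas the abstract describes verification with an automated theorem prover.

```python
# Toy illustration only: hypothetical helpers, not Unity's or FlexFlow's API.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def concat_cols(a, b):
    """Concatenate two matrices along the column dimension."""
    return [ra + rb for ra, rb in zip(a, b)]

def split_cols(m, k):
    """Split a matrix into its first k columns and the remaining columns."""
    return [r[:k] for r in m], [r[k:] for r in m]

def two_matmuls(x, w1, w2):
    """Original graph: two separate matmuls that share the input x."""
    return matmul(x, w1), matmul(x, w2)

def fused_matmul(x, w1, w2):
    """Algebraic rewrite: one matmul over concatenated weights, then a split."""
    return split_cols(matmul(x, concat_cols(w1, w2)), len(w1[0]))

def data_parallel_matmul(x, w, num_shards):
    """Parallelization rewrite: partition the batch, compute per shard, combine."""
    size = -(-len(x) // num_shards)  # ceiling division
    shards = [x[i:i + size] for i in range(0, len(x), size)]
    out = []
    for shard in shards:  # each shard could run on a different device
        out.extend(matmul(shard, w))
    return out

x = [[1, 2], [3, 4], [5, 6], [7, 8]]
w1 = [[1, 0], [0, 1]]
w2 = [[2], [3]]

# Both rewrites leave the computed result unchanged on this input.
assert fused_matmul(x, w1, w2) == two_matmuls(x, w1, w2)
assert data_parallel_matmul(x, w1, 2) == matmul(x, w1)
```

In this sketch the two rewrites compose freely, which hints at why a joint search over both kinds of substitution can find strategies that optimizing each in isolation would miss.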