MLSYS.2023

Total: 46

#1 XRBench: An Extended Reality (XR) Machine Learning Benchmark Suite for the Metaverse

Authors: Hyoukjun Kwon, Krishnakumar Nair, Jamin Seo, Jason Yik, Debabrata Mohapatra, Dongyuan Zhan, Jinook Song, Peter Capak, Peizhao Zhang, Peter Vajda, Colby Banbury, Mark Mazumder, Liangzhen Lai, Ashish Sirasao, Tushar Krishna, Harshit Khaitan, Vikas Chandra, Vijay Janapa Reddi

Real-time multi-task multi-model (MTMM) workloads, a new form of deep learning inference workloads, are emerging for application areas like extended reality (XR) to support metaverse use cases. These workloads combine user interactivity with computationally complex machine learning (ML) activities. Compared to standard ML applications, these ML workloads present unique difficulties and constraints. Real-time MTMM workloads impose heterogeneity and concurrency requirements on future ML systems and devices, necessitating the development of new capabilities. This paper begins with a discussion of the various characteristics of these real-time MTMM ML workloads and presents an ontology for evaluating the performance of future ML hardware for XR systems. Next, we present XRBench, a collection of MTMM ML tasks, models, and usage scenarios that execute these models in three representative ways: cascaded, concurrent, and cascaded-concurrent for XR use cases. Finally, we emphasize the need for new metrics that capture the requirements properly. We hope that our work will stimulate research and lead to the development of a new generation of ML systems for XR use cases. XRBench is available as an open-source project: https://github.com/XRBench


#2 FLINT: A Platform for Federated Learning Integration

Authors: Ewen Wang, Boyi Chen, Mosharaf Chowdhury, Ajay Kannan, Franco Liang

Cross-device federated learning (FL) has been well-studied from algorithmic, system scalability, and training speed perspectives. Nonetheless, moving from centralized training to cross-device FL for millions or billions of devices presents many risks, including performance loss, developer inertia, poor user experience, and unexpected application failures. In addition, the corresponding infrastructure, development costs, and return on investment are difficult to estimate. In this paper, we present a device-cloud collaborative FL platform that integrates with an existing machine learning platform, providing tools to measure real-world constraints, assess infrastructure capabilities, evaluate model training performance, and estimate system resource requirements to responsibly bring FL into production. We also present a decision workflow that leverages the FL-integrated platform to comprehensively evaluate the trade-offs of cross-device FL and share our empirical evaluations of business-critical machine learning applications that impact hundreds of millions of users.


#3 Practical Edge Kernels for Integer-Only Vision Transformers Under Post-training Quantization

Authors: Zining Zhang, Bingsheng He, Zhenjie Zhang

In the domain of computer vision, transformer models have shown noteworthy success, prompting extensive research on optimizing their inference, particularly concerning their deployment on edge devices. While quantization has emerged as a viable solution for enabling energy efficiency in Convolutional Neural Networks (CNNs), achieving direct quantization of complex activation and normalization operators in transformer models proves to be a challenging task. Existing methods that rely on 64-bit integers often suffer from data truncation issues when deployed to energy-constrained edge devices, resulting in a significant loss of model accuracy. In this paper, we propose a range-constrained quantization technique for activation and normalization operators in transformers that addresses the dilemma between data range and precision. Our approach is the first 32-bit integer-based edge kernel implementation for vision transformers with post-training integer-only quantization, ensuring both efficiency and accuracy. Experimental results demonstrate a remarkable 5 times kernel speedup when deployed on two different ARM CPUs, with negligible accuracy loss in comparison to full-precision vision transformers. This innovative work is poised to significantly impact the deployment of transformer models on energy-efficient edge devices.
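To make the range-versus-precision dilemma concrete, here is a small, hedged arithmetic sketch (illustrative only, not the paper's kernels): a LayerNorm-style sum of squares over a 1024-wide hidden dimension overflows int32 if activations may use the full int16 range, but fits once the quantization range is constrained. The variable names and the choice of hidden size are ours, not the paper's.

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max
hidden = 1024

# Full int16 range: the sum of squares can reach ~1.1e12, far beyond int32.
worst_case_full_range = hidden * np.int64(32767) ** 2

# Range-constrained alternative: largest magnitude r with hidden * r^2 <= INT32_MAX.
r = int(np.floor(np.sqrt(INT32_MAX / hidden)))        # 1448 for hidden = 1024

x = np.random.randn(hidden).astype(np.float32)
scale = np.abs(x).max() / r                           # quantization step
q = np.clip(np.round(x / scale), -r, r).astype(np.int32)

sum_sq = int(np.sum(q.astype(np.int64) ** 2))         # reference accumulation
assert sum_sq <= INT32_MAX                            # safe to accumulate in int32
mean_sq = (sum_sq / hidden) * scale ** 2              # dequantized second moment
print(worst_case_full_range > INT32_MAX, r, mean_sq, float((x ** 2).mean()))
```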


#4 Breadth-First Pipeline Parallelism

Author: Joel Lamy-Poirier

We introduce Breadth-First Pipeline Parallelism, a novel training schedule which optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost and memory usage by combining a high GPU utilization with a small batch size per GPU, and by making use of fully sharded data parallelism. Experimentally, we observed an increase of up to 43% in training throughput for a 52 billion-parameter model using a small batch size per GPU compared to Megatron-LM, which would reduce the training time and cost by the same amount on a large GPU cluster.


#5 Pre-train and Search: Efficient Embedding Table Sharding with Pre-trained Neural Cost Models

Authors: Daochen Zha, Louis Feng, Liang Luo, Bhargav Bhushanam, Zirui Liu, Yusuo Hu, Jade Nie, Yuzhen Huang, Yuandong Tian, Arun Kejariwal, Xia Hu

Sharding a large machine learning model across multiple devices to balance the costs is important in distributed training. This is challenging because partitioning is NP-hard, and estimating the costs accurately and efficiently is difficult. In this work, we explore a "pre-train, and search" paradigm for efficient sharding. The idea is to pre-train a universal and once-for-all neural network to predict the costs of all the possible shards, which serves as an efficient sharding simulator. Built upon this pre-trained cost model, we then perform an online search to identify the best sharding plans given any specific sharding task. We instantiate this idea in deep learning recommendation models (DLRMs) and propose NeuroShard for embedding table sharding. NeuroShard pre-trains neural cost models on augmented tables to cover various sharding scenarios. Then it identifies the best column-wise and table-wise sharding plans with beam search and greedy grid search, respectively. Experiments show that NeuroShard significantly and consistently outperforms the state-of-the-art on the benchmark sharding dataset, achieving up to 23.8% improvement. When deployed in an ultra-large production DLRM with multi-terabyte embedding tables, NeuroShard achieves 11.6% improvement in embedding costs over the state-of-the-art, which translates to 6.6% end-to-end training throughput improvement. To facilitate future research of the "pre-train, and search" paradigm in ML for Systems, we open-source our code at https://github.com/daochenzha/neuroshard
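As a rough illustration of the "pre-train, and search" idea, the sketch below pairs a stand-in neural cost model with a simple greedy table-wise search (the paper uses beam search and greedy grid search; TableCostModel, greedy_shard, and the feature layout here are hypothetical).

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained cost model: maps per-table features
# (e.g. hash size, pooling factor, embedding dim) to a predicted latency cost.
class TableCostModel(nn.Module):
    def __init__(self, n_features: int = 3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, table_feats: torch.Tensor) -> torch.Tensor:
        return self.net(table_feats).squeeze(-1)  # predicted cost per table

def greedy_shard(table_feats: torch.Tensor, n_devices: int, cost_model: nn.Module):
    """Greedy table-wise sharding: place each table (most expensive first)
    on the device with the lowest accumulated predicted cost."""
    with torch.no_grad():
        costs = cost_model(table_feats)
    order = torch.argsort(costs, descending=True)
    device_load = [0.0] * n_devices
    plan = {}
    for t in order.tolist():
        d = min(range(n_devices), key=lambda i: device_load[i])
        plan[t] = d
        device_load[d] += costs[t].item()
    return plan, device_load

# Usage: 100 tables described by 3 features, sharded across 8 devices.
feats = torch.rand(100, 3)
plan, load = greedy_shard(feats, n_devices=8, cost_model=TableCostModel())
```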


#6 FedTree: A Federated Learning System For Trees

Authors: Qinbin Li, Zhaomin Wu, Yanzheng Cai, Yuxuan Han, Ching Man Yung, Tianyuan Fu, Bingsheng He

While the quality of machine learning services largely relies on the volume of training data, data regulations such as the General Data Protection Regulation (GDPR) impose stringent requirements on data transfer. Federated learning has emerged as a popular approach for enabling collaborative machine learning without sharing raw data. To facilitate the rapid development of federated learning, efficient and user-friendly federated learning systems are essential. Despite many existing federated learning systems designed for deep learning, tree-based federated learning systems have not been well exploited. This paper presents a tree-based federated learning system under a histogram-sharing scheme, named FedTree, that supports both horizontal and vertical federated training of GBDTs with configurable privacy protection techniques. Our extensive experiments show that FedTree achieves competitive accuracy to centralized training while incurring much less computational cost than the other generic federated learning systems.
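For readers unfamiliar with histogram sharing, the sketch below shows the core of horizontal federated GBDT split finding: each party sends per-bin gradient/hessian sums, and the aggregator scans the merged histogram for the best split. This is a generic illustration under our own simplifications (no privacy protection, constant-factor terms omitted), not FedTree's implementation.

```python
import numpy as np

def local_histograms(x_binned, grad, hess, n_bins):
    """Per-party sums of gradients/hessians over one feature's bins."""
    g = np.bincount(x_binned, weights=grad, minlength=n_bins)
    h = np.bincount(x_binned, weights=hess, minlength=n_bins)
    return g, h

def best_split_from_histograms(hists, lam=1.0):
    """Aggregator side: sum the party histograms and scan bins for the best gain."""
    G = sum(g for g, _ in hists)
    H = sum(h for _, h in hists)
    G_tot, H_tot = G.sum(), H.sum()
    best_gain, best_bin = 0.0, None
    G_left = H_left = 0.0
    for b in range(len(G) - 1):
        G_left += G[b]; H_left += H[b]
        G_right, H_right = G_tot - G_left, H_tot - H_left
        gain = (G_left**2 / (H_left + lam) + G_right**2 / (H_right + lam)
                - G_tot**2 / (H_tot + lam))
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain

# Two parties, one feature quantized into 16 bins.
rng = np.random.default_rng(0)
parties = [(rng.integers(0, 16, 1000), rng.normal(size=1000), np.ones(1000)) for _ in range(2)]
hists = [local_histograms(xb, g, h, 16) for xb, g, h in parties]
print(best_split_from_histograms(hists))
```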


#7 Hotline Profiler: Automatic Annotation and A Multi-Scale Timeline for Visualizing Time-Use in DNN Training

Authors: Daniel Snider, Fanny Chevalier, Gennady Pekhimenko

Profiling is a standard practice used to investigate the efficiency of software and hardware operation at runtime and is a crucial part of proving new concepts, debugging problems, and optimizing performance. However, most machine learning (ML) developers find profiling secondary to their goal of improving model accuracy or just too difficult (especially with existing ML tools). As a result, profiling is frequently an afterthought, and so many ML developers rely on opaque metrics such as iteration time and GPU utilization which give little insight into why ML training may be slow. This leads developers to spend excessive time investigating performance issues. In this work, we aim to provide better tools to the large group of ML developers who currently do not profile their deep neural network (DNN) training workloads or are not happy with existing tools. To help ML developers investigate and understand time-use in DNN training, we propose Hotline, a novel profiler designed specifically for runtime bottleneck identification. Hotline is the first profiler to automatically annotate a standard data format for program runtime traces with DNN concepts that most ML developers are familiar with, i.e. the DNN training loop and model architecture. Hotline does so without modifying DNN libraries or making use of vendor-specific tools and introduces no additional overhead on measurements. We further introduce noise reduction techniques and a multi-scale timeline visualization to make the presentation of DNN runtime data more insightful, familiar, and easy to navigate. We demonstrate Hotline’s utility through in-depth case studies of finding bottlenecks in real-world DNN applications and we report on a user study with 17 software developers in which most participants were able to perform common performance investigation tasks in under 30 seconds (avg = 26 sec) and further commented that Hotline’s visualization “takes less time to find insights compared to existing approaches”. Source code: https://github.com/UofT-EcoSystem/hotline


#8 On Noisy Evaluation in Federated Hyperparameter Tuning

Authors: Kevin Kuo, Pratiksha Thaker, Mikhail Khodak, John Nguyen, Daniel Jiang, Ameet Talwalkar, Virginia Smith

Hyperparameter tuning is critical to the success of federated learning applications. Unfortunately, appropriately selecting hyperparameters is challenging in federated networks, as issues of scale, privacy, and heterogeneity introduce noise in the tuning process and make it difficult to faithfully evaluate the performance of various hyperparameters. In this work we perform the first systematic study on the effect of noisy evaluation in federated hyperparameter tuning. We first identify and rigorously explore key sources of noise, including client subsampling, data and systems heterogeneity, and data privacy. Surprisingly, our results indicate that even small amounts of noise can have a significant impact on tuning methods—reducing the performance of state-of-the-art approaches to that of naive baselines. To address noisy evaluation in such scenarios, we propose a simple and effective approach that leverages public proxy data to boost evaluation signal. Our work establishes general challenges, baselines, and best practices for future work in federated hyperparameter tuning.
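A toy simulation of one noise source, client subsampling, illustrates the effect: the fewer clients used per evaluation, the more often tuning picks a suboptimal configuration. The setup (score model, noise magnitudes) is invented for illustration and is not the paper's benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)
n_configs, n_clients = 20, 500
true_score = rng.uniform(0.6, 0.9, size=n_configs)                 # per-config quality
client_shift = rng.normal(0.0, 0.08, size=(n_configs, n_clients))  # data/system heterogeneity

def tune(subsample: int) -> float:
    """Pick the config with the best score averaged over `subsample` clients,
    and report the regret versus the truly best config."""
    clients = rng.choice(n_clients, size=subsample, replace=False)
    observed = true_score[:, None] + client_shift[:, clients]
    chosen = int(np.argmax(observed.mean(axis=1)))
    return float(true_score.max() - true_score[chosen])

for k in (5, 20, 100, 500):
    regrets = [tune(k) for _ in range(200)]
    print(f"{k:>3} clients sampled per evaluation -> mean regret {np.mean(regrets):.3f}")
```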


#9 Be Careful with PyPI Packages: You May Unconsciously Spread Backdoor Model Weights

Authors: Tianhang Zheng, Hao Lan, Baochun Li

To facilitate deep learning project development, some popular platforms provide model (sub)packages for developers to import and instantiate a deep learning model with few lines of code. For example, PyTorch provides torchvision.models for developers to instantiate models such as VGG and ResNet. Although those model packages are easy to install and use, their integrity may not be well-protected locally. In this paper, we show that an adversary can manipulate the .py files in the developers' locally installed model packages, if the developers install the adversary's PyPI package for using its claimed features. When installing the adversary's package, the system does not report any warning or error related to the manipulation. Leveraging this integrity vulnerability, we design an attack to manipulate the model forward function in the local .py files, such as resnet.py in the local torchvision.models subpackage. With our attack, the adversary can implant a backdoor into the developers' trained model weights, even supposing that the developers use seemingly clean training data and seemingly normal training code.


#10 GiPH: Generalizable Placement Learning for Adaptive Heterogeneous Computing

Authors: Yi Hu, Chaoran Zhang, Edward Andert, Harshul Singh, Aviral Shrivastava, James Laudon, Yanqi Zhou, Bob Iannucci, Carlee Joe-Wong

Careful placement of a distributed computational application within a target device cluster is critical for achieving low application completion time. The problem is challenging due to its NP-hardness and combinatorial nature. In recent years, learning-based approaches have been proposed to learn a placement policy that can be applied to unseen applications, motivated by the problem of placing a neural network across cloud servers. These approaches, however, generally assume the device cluster is fixed, which is not the case in mobile or edge computing settings, where heterogeneous devices move in and out of range for a particular application. To address the challenge of scaling to different-sized device clusters and adapting to the addition of new devices, we propose a new learning approach called GiPH, which learns policies that generalize to dynamic device clusters via a novel graph representation gpNet that efficiently encodes the information needed for choosing a good placement, and a scalable graph neural network (GNN) that learns a summary of the gpNet information. GiPH turns the placement problem into that of finding a sequence of placement improvements, learning a policy for selecting this sequence that scales to problems of arbitrary size. We evaluate GiPH with a wide range of task graphs and device clusters and show that our learned policy rapidly finds good placements for new problem instances. GiPH finds placements that achieve up to 30.5% better makespan, searching up to 3 times faster than other search-based placement policies.


#11 SIRIUS: Harvesting Whole-Program Optimization Opportunities for DNNs

Authors: Yijin Li, Jiacheng Zhao, Sun Qianqi, Haohui Mai, Lei Chen, Wanlu Cao, Yanfan Chen, Li Zhicheng, Ying Liu, Xinyuan Zhang, Xiyu Shi, Jie Zhao, Jingling Xue, Huimin Cui, Xiaobing Feng

As emerging applications are rapidly moving to accelerators, a great deal of research has been proposed to improve the performance of the accelerators. For AI applications, fruitful software-driven research has focused on proposing new programming languages, new kernel fusion heuristics, new optimization tuning approaches, and new software execution engines. However, how to leverage classical compiler optimizations to generate efficient code is an overlooked aspect of performance. In this paper, we propose a whole-program analysis and optimization compiler framework, SIRIUS, to uniformly model the host and kernel computations in a unified polyhedral representation and, further, seek maximal fusion opportunities from the global view so that the fused kernel can benefit from classical optimizations. Evaluations over representative DNN models demonstrate that SIRIUS can achieve up to 11.98x speedup over TensorRT and 154.84x speedup over TensorFlow. In particular, for BERT, SIRIUS achieves a 1.46x speedup over TensorRT.


#12 Adaptive Message Quantization and Parallelization for Distributed Full-graph GNN Training

Authors: Borui Wan, Juntao Zhao, Chuan Wu

Distributed full-graph training of Graph Neural Networks (GNNs) over large graphs is bandwidth-demanding and time-consuming. Frequent exchanges of node features, embeddings and embedding gradients (all referred to as messages) across devices bring significant communication overhead for nodes with remote neighbors on other devices (marginal nodes) and unnecessary waiting time for nodes without remote neighbors (central nodes) in the training graph. This paper proposes an efficient GNN training system, AdaQP, to expedite distributed full-graph GNN training. We stochastically quantize messages transferred across devices to lower-precision integers for communication traffic reduction and advocate communication-computation parallelization between marginal nodes and central nodes. We provide theoretical analysis to prove fast training convergence (at the rate of O(T^{-1}) with T being the total number of training epochs) and design an adaptive quantization bit-width assignment scheme for each message based on the analysis, targeting a good trade-off between training convergence and efficiency. Extensive experiments on mainstream graph datasets show that AdaQP substantially improves distributed full-graph training's throughput (up to 3.01 X) with negligible accuracy drop (at most 0.30%) or even accuracy improvement (up to 0.19%) in most cases, showing significant advantages over the state-of-the-art works.
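The sketch below shows the kind of stochastic quantization being described: messages are mapped to low-bit integers with randomized rounding so the dequantized values are unbiased estimates of the originals. Function names and the wire format are ours, not AdaQP's.

```python
import torch

def stochastic_quantize(msg: torch.Tensor, bits: int):
    """Unbiased stochastic quantization of a message tensor to `bits`-wide
    unsigned integer levels (a sketch of the idea, not AdaQP's wire format)."""
    lo, hi = msg.min(), msg.max()
    levels = (1 << bits) - 1
    scale = (hi - lo).clamp_min(1e-12) / levels
    # Stochastic rounding keeps the estimate unbiased: E[q] == (msg - lo) / scale.
    q = torch.floor((msg - lo) / scale + torch.rand_like(msg))
    q = q.clamp_(0, levels).to(torch.int32)          # what actually goes on the wire
    return q, lo, scale

def dequantize(q, lo, scale):
    return lo + q.to(torch.float32) * scale

# Example: marginal-node embeddings quantized to 4 bits before communication.
h = torch.randn(4096, 256)
q, lo, scale = stochastic_quantize(h, bits=4)
h_hat = dequantize(q, lo, scale)
print((h_hat - h).abs().mean().item())
```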


#13 PyTorch RPC: Distributed Deep Learning Built on Tensor-Optimized Remote Procedure Calls

Authors: Pritam Damania, Shen Li, Alban Desmaison, Alisson Azzolini, Brian Vaughan, Edward Yang, Gregory Chanan, Guoqiang Jerry Chen, Hongyi Jia, Howard Huang, Joseph Spisak, Luca Wehrstedt, Lucas Hosseini, Manoj Krishnan, Omkar Salpekar, Pavel Belevich, Rohan Varma, Satendra Gera, Wanchao Liang, Shihao Xu, Soumith Chintala, Chaoyang He, Amir Ziashahabi, Salman Avestimehr, Zachary DeVito

Distributed training technologies have advanced rapidly in the past few years and have unlocked unprecedented scalability with increasingly complex solutions. These technologies have made distributed training much more efficient and accessible, though they impose specific constraints on the training paradigm or the model structure. As a result, applications that fail to meet these constraints must rely on general-purpose distributed computing frameworks to scale out. However, without access to the internal components of deep learning frameworks, these distributed computing frameworks usually significantly fall short in terms of efficiency and usability. To address these problems, we propose PyTorch RPC as a generic and high-performance solution for distributed deep learning. Compared to generic distributed computing frameworks, PyTorch RPC natively provides essential features for implementing training applications in a distributed environment, including optimized tensor communications, remote memory management, and distributed autograd. Evaluations show that PyTorch RPC attains up to two orders of magnitude faster tensor communication compared to gRPC with one-tenth of the user code. Case studies further demonstrate that users can easily employ PyTorch RPC to build efficient reinforcement learning applications (video game solver), implement large language models (GPT3), train recommendation models (DLRM), and scale federated learning tasks (FedML).
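For context, here is a minimal sketch of the programming model using the public torch.distributed.rpc API (two local processes; the address, port, and worker names are placeholders): worker0 calls torch.add remotely on worker1 and receives the resulting tensor.

```python
import os
import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

def run(rank, world_size):
    # Placeholder rendezvous settings for a single-machine demo.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    if rank == 0:
        # Tensor arguments and results travel over the tensor-optimized RPC layer.
        result = rpc.rpc_sync("worker1", torch.add,
                              args=(torch.ones(2, 2), torch.ones(2, 2)))
        print(result)
    rpc.shutdown()  # blocks until outstanding RPC work on all workers is done

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2, join=True)
```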


#14 Virtual Machine Allocation with Lifetime Predictions

Authors: Hugo Barbalho, Patricia Kovaleski, Beibin Li, Luke Marshall, Marco Molinaro, Abhisek Pan, Eli Cortez, Matheus Leao, Harsh Patwari, Zuzu Tang, Larissa Rozales Gonçalves, David Dion, Thomas Moscibroda, Ishai Menache

The emergence of machine learning technology has motivated the use of ML-based predictors in computer systems to improve their efficiency and robustness. However, there are still numerous algorithmic and systems challenges in effectively utilizing ML models in large-scale resource management services that require high throughput and response latency of milliseconds. In this paper, we describe the design and implementation of a VM allocation service that uses ML predictions of the VM lifetime to improve packing efficiencies. We design lifetime-aware placement algorithms that are provably robust to prediction errors and demonstrate their merits in extensive real-trace simulations. We significantly upgraded the VM allocation infrastructure of Microsoft Azure to support such algorithms that require ML inference in the critical path. A robust version of our algorithms has been recently deployed in production, and obtains efficiency improvements expected from simulations.
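To convey the intuition behind lifetime-aware placement, here is a toy greedy heuristic (not Azure's production algorithm): VMs are steered toward hosts whose current occupants have similar predicted exit times, so hosts drain together and can be reclaimed or repacked. Capacities, inputs, and the tie-breaking rule are assumptions for illustration.

```python
def place(vms, host_capacity):
    """vms: list of (cores, predicted_exit_time). Returns (host id per VM, #hosts)."""
    hosts = []       # each host: {"free": remaining cores, "exit": latest predicted exit}
    placement = []
    for cores, exit_t in sorted(vms, key=lambda v: v[1]):
        candidates = [i for i, h in enumerate(hosts) if h["free"] >= cores]
        if candidates:
            # Prefer the feasible host whose lifetime profile is closest to this VM's.
            best = min(candidates, key=lambda i: abs(hosts[i]["exit"] - exit_t))
        else:
            hosts.append({"free": host_capacity, "exit": exit_t})
            best = len(hosts) - 1
        hosts[best]["free"] -= cores
        hosts[best]["exit"] = max(hosts[best]["exit"], exit_t)
        placement.append(best)
    return placement, len(hosts)

# Short-lived and long-lived VMs end up on different hosts.
vms = [(4, 10), (8, 12), (2, 500), (4, 490), (8, 11), (2, 505)]
print(place(vms, host_capacity=16))
```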


#15 Edge Impulse: An MLOps Platform for Tiny Machine Learning

Authors: Colby Banbury, Vijay Janapa Reddi, Alexander Elium, Shawn Hymel, David Tischler, Daniel Situnayake, Carl Ward, Louis Moreau, Jenny Plunkett, Matthew Kelcey, Mathijs Baaijens, Alessandro Grande, Dmitry Maslov, Arthur Beavis, Jan Jongboom, Jessica Quaye

Edge Impulse is a cloud-based machine learning operations (MLOps) platform for developing embedded and edge ML (TinyML) systems that can be deployed to a wide range of hardware targets. Current TinyML workflows are plagued by fragmented software stacks and heterogeneous deployment hardware, making ML model optimizations difficult and unportable. We present Edge Impulse, a practical MLOps platform for developing TinyML systems at scale. Edge Impulse addresses these challenges and streamlines the TinyML design cycle by supporting various software and hardware optimizations to create an extensible and portable software stack for a multitude of embedded systems. As of Oct. 2022, Edge Impulse hosts 118,185 projects from 50,953 developers.


#16 Tutel: Adaptive Mixture-of-Experts at Scale

Authors: Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, HoYuen Chau, Peng Cheng, Fan Yang, Mao Yang, Yongqiang Xiong

Sparsely-gated mixture-of-experts (MoE) has been widely adopted to scale deep learning models to trillion-plus parameters with fixed computational cost. The algorithmic performance of MoE relies on its token routing mechanism that forwards each input token to the right sub-models or experts. While token routing dynamically determines the amount of expert workload at runtime, existing systems suffer inefficient computation due to their static execution, namely static parallelism and pipelining, which does not adapt to the dynamic workload. We present Tutel, a highly scalable stack design and implementation for MoE with dynamically adaptive parallelism and pipelining. Tutel designs an identical layout for distributing MoE model parameters and input data, which can be leveraged by switchable parallelism and dynamic pipelining methods without mathematical inequivalence or tensor migration overhead. This enables adaptive parallelism/pipelining optimization at zero cost during runtime. Based on this key design, Tutel also implements various MoE acceleration techniques including Flexible All-to-All, two-dimensional hierarchical (2DH) All-to-All, fast encode/decode, etc. Aggregating all techniques, Tutel delivers 4.96x and 5.75x speedup of a single MoE layer on 16 and 2,048 A100 GPUs, respectively, over the previous state-of-the-art. Our evaluation shows that Tutel efficiently and effectively runs a real-world MoE-based model named SwinV2-MoE, built upon Swin Transformer V2, a state-of-the-art computer vision architecture. On efficiency, Tutel accelerates SwinV2-MoE, achieving up to 1.55x and 2.11x speedup in training and inference over Fairseq, respectively. On effectiveness, the SwinV2-MoE model achieves higher accuracy than the counterpart dense model in both pre-training and downstream computer vision tasks such as COCO object detection, indicating the readiness of Tutel for end-to-end real-world model training and inference.
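The dynamism Tutel adapts to comes from token routing: per-expert workload changes at every step. The sketch below shows plain top-k gating and the resulting uneven expert counts (illustrative PyTorch, not Tutel's kernels or distributed layout).

```python
import torch
import torch.nn.functional as F

def route(tokens: torch.Tensor, gate_w: torch.Tensor, k: int = 2):
    """Top-k gating: each token picks its k best experts with softmax weights."""
    logits = tokens @ gate_w                              # [n_tokens, n_experts]
    weights, experts = torch.topk(F.softmax(logits, dim=-1), k, dim=-1)
    # Per-expert token counts change every step, so a static capacity or a
    # static parallelism plan either over-provisions or drops tokens.
    counts = torch.bincount(experts.flatten(), minlength=gate_w.shape[1])
    return experts, weights, counts

tokens = torch.randn(8192, 512)
gate_w = torch.randn(512, 16)
experts, weights, counts = route(tokens, gate_w, k=2)
print(counts)   # uneven and input-dependent
```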


#17 MegaBlocks: Efficient Sparse Training with Mixture-of-Experts

Authors: Trevor Gale, Deepak Narayanan, Cliff Young, Matei Zaharia

We present MegaBlocks, a system for efficient Mixture-of-Experts (MoE) training on GPUs. Our system is motivated by the limitations of current frameworks, which restrict the dynamic routing in MoE layers to satisfy the constraints of existing software and hardware. These formulations force a tradeoff between model quality and hardware efficiency, as users must choose between dropping tokens from the computation or wasting computation and memory on padding. To address these limitations, we reformulate MoE computation in terms of block-sparse operations and develop new block-sparse GPU kernels that efficiently handle the dynamism present in MoEs. Our approach never drops tokens and maps efficiently to modern hardware, enabling end-to-end training speedups of up to 40% over MoEs trained with the state-of-the-art Tutel library and 2.4× over dense DNNs trained with the highly-optimized Megatron-LM framework.
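A plain-PyTorch sketch of the "dropless" idea: group tokens by expert and run one variable-sized matmul per expert, so nothing is dropped and nothing is padded. MegaBlocks realizes this with block-sparse GPU kernels; the grouping below (top-1 routing for simplicity) only illustrates the reformulation.

```python
import torch

def dropless_moe(tokens, expert_ids, expert_weights):
    """Group tokens by their (top-1) expert and apply each expert to its own
    variable-sized slice -- no token dropping, no capacity padding."""
    order = torch.argsort(expert_ids)
    grouped = tokens[order]
    counts = torch.bincount(expert_ids, minlength=len(expert_weights)).tolist()
    outputs, start = [], 0
    for e, n in enumerate(counts):
        if n:
            outputs.append(grouped[start:start + n] @ expert_weights[e])
        start += n
    out = torch.cat(outputs)
    return out[torch.argsort(order)]        # restore the original token order

tokens = torch.randn(4096, 256)
expert_ids = torch.randint(0, 8, (4096,))   # top-1 routing decision per token
expert_weights = [torch.randn(256, 256) for _ in range(8)]
y = dropless_moe(tokens, expert_ids, expert_weights)
```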


#18 Safe Optimized Static Memory Allocation for Parallel Deep Learning

Authors: Ioannis Lamprou, Zhen Zhang, Javier de Juan, Hang Yang, Yongqiang Lai, Etienne Filhol, Cedric Bastoul

Parallel training is mandatory in order to maintain performance efficiency and tackle memory constraints for deep neural network (DNN) models. For this purpose, a critical optimization when tuning a parallelism strategy is to schedule tensors onto device memory at compilation time. In this paper, we present a safe and optimized solver for this problem capturing a general parallel scenario to enable execution in the open-source MindSpore framework. The input is a computational graph and a partition of its operators into streams of execution, which may run in parallel. First, we design algorithms to efficiently and provably decide if it is safe, for any two tensors, to reuse memory. Second, given such a set of reuse constraints, as well as a set of contiguous constraints to enable bulk communication among processing elements, we design algorithms to assign an offset to each tensor, such that all constraints are satisfied and total memory is minimized. Our experiments in parallel training of a variety of DNNs demonstrate memory consumption that is nearly optimal, and in some cases improved, compared to the state-of-the-art (adapted for our setting) and to a sequential-execution lower bound. Our algorithms show compilation time gains of up to 44% in determining safety and up to 70% in tensor offset assignment.
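The offset-assignment step can be pictured with a toy greedy allocator (not the paper's solver, and ignoring contiguous constraints): given tensor sizes and pairwise may-not-reuse conflicts, place each tensor at the lowest offset that avoids its conflicting neighbors.

```python
def assign_offsets(sizes, conflicts):
    """sizes: {tensor: bytes}; conflicts: set of frozenset({a, b}) pairs that
    must not share memory. Returns per-tensor offsets and total footprint."""
    offsets = {}
    for t in sorted(sizes, key=sizes.get, reverse=True):          # largest first
        taken = sorted((offsets[u], offsets[u] + sizes[u])
                       for u in offsets if frozenset({t, u}) in conflicts)
        off = 0
        for lo, hi in taken:                                      # first gap that fits
            if off + sizes[t] <= lo:
                break
            off = max(off, hi)
        offsets[t] = off
    total = max((offsets[t] + sizes[t] for t in offsets), default=0)
    return offsets, total

sizes = {"a": 512, "b": 256, "c": 256, "d": 128}
conflicts = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("a", "d")]}
print(assign_offsets(sizes, conflicts))   # "c" reuses "a"'s space; total < sum of sizes
```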


#19 X-RLflow: Graph Reinforcement Learning for Neural Network Subgraphs Transformation

Authors: Guoliang He, Sean Parker, Eiko Yoneki

Tensor graph superoptimisation systems perform a sequence of subgraph substitutions on neural networks to find the optimal computation graph structure. Such a graph transformation process naturally falls into the framework of sequential decision-making, and existing systems typically employ a greedy search approach, which cannot explore the whole search space as it cannot tolerate a temporary loss of performance. In this paper, we address the tensor graph superoptimisation problem by exploring an alternative search approach, reinforcement learning (RL). Our proposed approach, X-RLflow, can learn to perform neural network dataflow graph rewriting, which substitutes a subgraph one at a time. X-RLflow is based on a model-free RL agent that uses a graph neural network (GNN) to encode the target computation graph and outputs a transformed computation graph iteratively. We show that our approach can outperform state-of-the-art superoptimisation systems over a range of deep learning models, achieving speedups of up to 40% on those that are based on transformer-style architectures.


#20 Reducing Activation Recomputation in Large Transformer Models

Authors: Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, Bryan Catanzaro

Training large transformer models is one of the most important computational challenges of modern AI. In this paper, we show how to significantly accelerate the training of large transformer models by reducing activation recomputation. Activation recomputation is commonly used to work around memory capacity constraints. Rather than storing activations for backpropagation, they are traditionally recomputed, which saves memory but adds redundant compute. In this work, we show most of this redundant compute is unnecessary because we can reduce memory consumption sufficiently without it. We present two novel yet very simple techniques: sequence parallelism and selective activation recomputation. In conjunction with tensor parallelism, these techniques almost eliminate the need to recompute activations. We evaluate our approach on language models up to one trillion parameters in scale and show that our method reduces activation memory by 5x, while reducing execution time overhead from activation recomputation by over 90%. For example, when training a 530B parameter GPT-3 style model on 2240 NVIDIA A100 GPUs, we achieve a Model Flops Utilization of 54.2%, which is 29% faster than the 42.1% we achieve using recomputation.
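The selective part of the idea can be sketched with stock PyTorch: checkpoint only the attention computation, whose activations are large but cheap to recompute, and store the rest normally. This uses torch.utils.checkpoint as a stand-in; Megatron-LM applies the technique inside its own fused layers, and the module sizes here are arbitrary.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self, d=1024, heads=16):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = torch.nn.Sequential(torch.nn.Linear(d, 4 * d), torch.nn.GELU(),
                                       torch.nn.Linear(4 * d, d))

    def forward(self, x):
        # Selective recomputation: attention activations are recomputed in the
        # backward pass instead of being stored.
        attn_out = checkpoint(lambda t: self.attn(t, t, t, need_weights=False)[0],
                              x, use_reentrant=False)
        x = x + attn_out
        return x + self.mlp(x)            # MLP activations are stored as usual

x = torch.randn(2, 128, 1024, requires_grad=True)
Block()(x).sum().backward()
```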


#21 SysNoise: Exploring and Benchmarking Training-Deployment System Inconsistency

Authors: Yan Wang, Yuhang Li, Ruihao Gong, Aishan Liu, Yanfei Wang, Jian Hu, Yongqiang Yao, Yunchen Zhang, Tianzi Xiaotian, Fengwei Yu, Xianglong Liu

Extensive studies have shown that deep learning models are vulnerable to adversarial and natural noises, yet little is known about model robustness on noises caused by different system implementations. In this paper, we for the first time introduce SysNoise, a frequently occurring but often overlooked noise in the deep learning training-deployment cycle. In particular, SysNoise happens when the source training system switches to a disparate target system in deployments, where various tiny system mismatches add up to a non-negligible difference. We first identify and classify SysNoise into three categories based on the inference stage; we then build a holistic benchmark to quantitatively measure the impact of SysNoise on 20+ models, covering image classification, object detection, instance segmentation and natural language processing tasks. Our extensive experiments reveal that SysNoise has a measurable impact on model robustness across different tasks, and that common mitigations like data augmentation and adversarial training show limited effect on it. Together, our findings open a new research topic and we hope this work will raise research attention to deep learning deployment systems accounting for model performance.


#22 Cupcake: A Compression Scheduler for Scalable Communication-Efficient Distributed Training

Authors: Zhuang Wang, Xinyu Wu, Zhaozhuo Xu, T. S. Eugene Ng

Data-parallel distributed training (DDT) is the de facto way to accelerate deep learning on multiple GPUs. In DDT, communication for gradient synchronization is the major efficiency bottleneck. Many gradient compression (GC) algorithms have been proposed to address this communication bottleneck by reducing the amount of communicated data. Unfortunately, it has been observed that GC only achieves moderate performance improvement in DDT, or even harms the performance. In this paper, we argue that the current way of deploying GC in a layer-wise fashion reduces communication time at the cost of non-negligible compression overheads. To address this problem, we propose Cupcake, a compression optimizer to fully unleash GC algorithms' advantages in accelerating DDT. It applies GC algorithms in a fusion fashion and determines the provably optimal fusion strategy to maximize the training throughput of compression-enabled DDT jobs. Experimental evaluations show that GC algorithms with Cupcake can achieve up to 2.03x speedup in training throughput over training without GC, and up to 1.79x speedup over the state-of-the-art approaches of applying GC to DDT in a layer-wise fashion.
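The layer-wise versus fused distinction can be sketched as follows: top-k sparsification applied once per tensor pays a fixed per-call (and per-message) overhead for every small layer, while compressing one fused buffer amortizes it. The compressor and tensor sizes below are illustrative; Cupcake's contribution is choosing the fusion strategy, which is not shown.

```python
import torch

def topk_compress(flat: torch.Tensor, ratio: float = 0.01):
    """Keep the largest-magnitude `ratio` fraction of entries (values + indices)."""
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices               # what would be communicated

grads = [torch.randn(n) for n in (2_500_000, 1_000, 4_096, 512)]  # mixed layer sizes

# Layer-wise: one compression call (and one collective) per tensor, so small
# layers pay the fixed per-call overhead repeatedly.
per_layer = [topk_compress(g) for g in grads]

# Fused: concatenate into one buffer, compress once, communicate once.
fused_values, fused_indices = topk_compress(torch.cat(grads))
```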


#23 HyperGef: A Framework Enabling Efficient Fusion for Hypergraph Neural Network on GPUs

Authors: Zhongming Yu, Guohao Dai, Shang Yang, Genghan Zhang, Hengrui Zhang, Feiwen Zhu, June Yang, Jishen Zhao, Yu Wang

Hypergraph Neural Network (HyperGNN) is an emerging type of Graph Neural Networks (GNNs) that can utilize hyperedges to model high-order relationships among vertices. Current GNN frameworks fail to fuse two message-passing steps from vertices to hyperedges and hyperedges to vertices, leading to high latency and redundant memory consumption. The following challenges need to be solved for efficient fusion in HyperGNNs: (1) Inefficient partition: hardware-efficient and workload-balanced partitions are required for parallel workers to process two consecutive message passing steps after fusion. (2) Workload-Agnostic Format: current data formats like Compressed Sparse Row (CSR) fail to represent a two-step computation workload. (3) Heavy writing conflicts: partitioning leads to heavy writing conflicts when updating the same vertex. To enable efficient fusion for HyperGNNs, we present HyperGef. HyperGef proposes an edge-split workload balance partition scheme to achieve higher efficiency and better workload balancing. To represent the workload after fusion and partition, HyperGef introduces a novel fusion workload-aware format. HyperGef also introduces a shared memory-aware grouping scheme to reduce writing conflicts. Extensive experiments demonstrate that our fused kernel outperforms the NVIDIA cuSPARSE kernel by 3.31x. By enabling efficient fusion for HyperGNNs, HyperGef achieves 2.25x to 3.99x end-to-end speedup on various HyperGNN models compared with state-of-the-art frameworks like DGL and PyG.
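The two message-passing steps being fused can be written down with a dense incidence matrix for clarity (degree normalization omitted): unfused, an intermediate hyperedge-feature tensor is materialized between the two products; the fused form computes the same result without writing it out. This is only a mathematical illustration, not HyperGef's sparse kernels.

```python
import torch

# Incidence matrix: H[v, e] = 1 if vertex v belongs to hyperedge e.
n_vertices, n_edges, dim = 1000, 300, 64
H = (torch.rand(n_vertices, n_edges) < 0.01).float()
X = torch.randn(n_vertices, dim)

edge_feat = H.t() @ X        # step 1: vertices -> hyperedges (materialized)
X_out = H @ edge_feat        # step 2: hyperedges -> vertices

# Fused view of the same computation, which a fused kernel can evaluate
# without writing edge_feat to global memory:
X_out_fused = H @ (H.t() @ X)
assert torch.allclose(X_out, X_out_fused)
```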


#24 ApproxCaliper: A Programmable Framework for Application-aware Neural Network Optimization

Authors: Yifan Zhao, Hashim Sharif, Peter Pao-Huang, Vatsin Shah, Arun Narenthiran Sivakumar, Mateus Valverde Gasparino, Abdulrahman Mahmoud, Nathan Zhao, Sarita Adve, Girish Chowdhary, Sasa Misailovic, Vikram Adve

To deploy compute-intensive neural networks on resource-constrained edge systems, developers use model optimization techniques that reduce model size and computational cost. Existing optimization tools are application-agnostic -- they optimize model parameters solely in view of the neural network accuracy -- and can thus miss optimization opportunities. We propose ApproxCaliper, the first programmable framework for application-aware neural network optimization. By incorporating application-specific goals, ApproxCaliper facilitates more aggressive optimization of the neural networks compared to application-agnostic techniques. We perform experiments on five different neural networks used in two real-world robotics systems: a commercial agriculture robot and a simulation of an autonomous electric cart. Compared to Learning Rate Rewinding (LRR), a state-of-the-art structured pruning tool used in an application-agnostic setting, ApproxCaliper achieves 5.3x higher speedup and 2.9x lower GPU resource utilization, and 36x and 6.1x additional model size reduction, respectively.


#25 Transcending Runtime-Memory Tradeoffs in Checkpointing by being Fusion Aware

Authors: Horace He, Shangdi Yu

Gradient checkpointing is an optimization that reduces the memory footprint by re-computing some operations instead of saving their activations. Previous works on checkpointing have viewed this as a tradeoff between peak memory and performance. However, we argue that this framing does not account for a key aspect of modern deep learning systems -- operator fusion. In this work, we demonstrate that with a fusion-aware checkpointing algorithm, we can transcend the runtime-memory tradeoffs of traditional checkpointing and improve both memory and runtime simultaneously. We evaluate our algorithm on a wide range of standard neural network models as well as some novel patterns. We achieve a geomean of 12% throughput improvement over an existing compiled baseline, and the maximum batch size that can be attained is up to 1.75 times larger on standard models. On novel patterns, we achieve up to a 10x improvement, along with a 5x reduction in peak memory.
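A minimal sketch of the intuition with stock torch.utils.checkpoint: a chain of pointwise operations is cheap to recompute once a compiler fuses it, so checkpointing exactly that region frees its intermediates at little runtime cost. The example region and shapes are ours; the paper's contribution is choosing such regions automatically inside a compiler.

```python
import torch
from torch.utils.checkpoint import checkpoint

def pointwise_chain(x, bias):
    # A fusable region: bias add, GELU, constant scale -- cheap to recompute.
    return torch.nn.functional.gelu(x + bias) * 0.9

x = torch.randn(4096, 4096, requires_grad=True)
bias = torch.randn(4096, requires_grad=True)

y_saved = pointwise_chain(x, bias)                                   # stores intermediates
y_ckpt = checkpoint(pointwise_chain, x, bias, use_reentrant=False)   # recomputes them in backward
y_ckpt.sum().backward()
```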