MLSYS.2024

| Total: 37

#1 Punica: Multi-Tenant LoRA Serving [PDF5] [Copy] [Kimi14] [REL]

Authors: Lequn Chen ; Zihao Ye ; Yongji Wu ; Danyang Zhuo ; Luis Ceze ; Arvind Krishnamurthy

Low-rank adaptation (LoRA) has become an important and popular method to adapt pre-trained models to specific domains.We present Punica, a system to serve multiple LoRA models in a shared GPU cluster. Punica contains a new CUDA kernel design that allows batching of GPU operations for different LoRA models. This allows a GPU to hold only a single copy of the underlying pre-trained model when serving multiple, different LoRA models, significantly enhancing GPU efficiency in terms of both memory and computation. Our scheduler consolidates multi-tenant LoRA serving workloads in a shared GPU cluster. With a fixed-sized GPU cluster, our evaluations show that Punica achieves 12x higher throughput in serving multiple LoRA models compared to state-of-the-art LLM serving systems while only adding 2ms latency per token.

#2 ACROBAT: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time [PDF1] [Copy] [Kimi6] [REL]

Authors: Pratik Fegade ; Tianqi Chen ; Phillip Gibbons ; Todd Mowry

Dynamic control flow is an important technique often used to design expressive and efficient deep learning computations for applications such as text parsing, machine translation, exiting early out of deep models and so on. However, the resulting control flow divergence makes batching, an important performance optimization, difficult to perform manually. In this paper, we present ACRoBat, a framework that enables efficient automatic batching for dynamic deep learning computations by performing hybrid static+dynamic compiler optimizations and end-to-end tensor code generation. ACRoBat performs up to 8.5 better than DyNet, a state-of-the-art framework for automatic batching, on an Nvidia GeForce RTX 3070 GPU.

#3 HeteroSwitch: Characterizing and Taming System-Induced Data Heterogeneity in Federated Learning [PDF2] [Copy] [Kimi6] [REL]

Authors: Gyudong Kim ; Mehdi Ghasemi ; Soroush Heidari ; Seungryong Kim ; Young Geun Kim ; Sarma Vrudhula ; Carole-Jean Wu

Federated Learning (FL) is a practical approach to train deep learning models collaboratively across user-end devices, protecting user privacy by retaining raw data on-device. In FL, participating user-end devices are highly fragmented in terms of hardware and software configurations. Such fragmentation introduces a new type of data heterogeneity in FL, namely system-induced data heterogeneity, as each device generates distinct data depending on its hardware and software configurations. In this paper, we first characterize the impact of system-induced data heterogeneity on FL model performance. We collect a dataset using heterogeneous devices with variations across vendors and performance tiers. By using this dataset, we demonstrate that system-induced data heterogeneity negatively impacts accuracy, and deteriorates fairness and domain generalization problems in FL. To address these challenges, we propose HeteroSwitch, which adaptively adopts generalization techniques (i.e., ISP transformation and SWAD) depending on the level of bias caused by varying HW and SW configurations. In our evaluation with a realistic FL dataset (FLAIR), HeteroSwitch reduces the variance of averaged precision by 6.3% across device types.

#4 JIT-Q: Just-in-time Quantization with Processing-In-Memory for Efficient ML Training [PDF1] [Copy] [Kimi3] [REL]

Authors: Mohamed Ibrahim ; Shaizeen Aga ; Ada Li ; Suchita Pati ; Mahzabeen Islam

Data format innovations have been critical for machine learning (ML) scaling, which in turn fuels ground-breaking ML capabilities. However, even in the presence of low-precision formats, model weights are often stored in both high-precision and low-precision during training. Furthermore, with emerging directional data-formats (e.g., MX9, MX6, etc.) multiple low-precision weight copies can be required. To lower memory capacity needs of weights, we explore just-in-time quantization (JIT-Q) where we only store high-precision weights in memory and generate low-precision weights only when needed. To perform JIT-Q efficiently, in this work, we evaluate emerging processing-in-memory (PIM) technology to execute quantization. With PIM, we can offload quantization to in-memory compute units enabling quantization to be performed without incurring costly data-movement while allowing quantization to be concurrent with accelerator computation. Our proposed PIM-offloaded quantization keeps up with GPU compute and delivers considerable capacity savings (up to 24\%) at marginal throughput loss (up to 2.4\%). Said memory capacity savings can unlock several benefits such as fitting larger model in the same system, reducing model parallelism requirement, and improving overall ML training efficiency.

#5 Schrodinger's FP Training Neural Networks with Dynamic Floating-Point Containers [PDF] [Copy] [Kimi2] [REL]

Authors: Milos Nikolic ; Enrique Torres Sanchez ; Jiahui Wang ; Ali Hadi Zadeh ; Mostafa Mahmoud ; Ameer Abdelhadi ; Kareem Ibrahim ; Andreas Moshovos

No summary was provided.

#6 Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping [PDF] [Copy] [Kimi1] [REL]

Authors: Chenyu Jiang ; Ye Tian ; Zhen Jia ; Shuai Zheng ; Chuan Wu ; Yida Wang

The Mixture-of-Expert (MoE) technique plays a crucial role in expanding the size of DNN model parameters, but it grapples with the challenge of prolonged all-to-all communication latency during training. Existing methods attempt to mitigate this issue by overlapping all-to-all with expert computation. However, this approach often falls short of achieving sufficient overlap, thereby limiting potential performance improvements. In our study, we extend the scope of this challenge by considering overlap at the broader training graph level. During the forward pass, we enable non-MoE computations to overlap with all-to-all through careful partitioning and pipelining. In the backward pass, we achieve overlap with all-to-all by scheduling gradient weight computations. We implement these techniques in Lancet, an optimization system for DNN compilers designed to automatically enhance MoE model training. Our extensive evaluation reveals that Lancet significantly reduces the time devoted to non-overlapping communication, by as much as 77%. Moreover, it achieves a notable end-to-end speedup of up to 1.3 times when compared to the state-of-the-art solutions.

#7 AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration [PDF3] [Copy] [Kimi5] [REL]

Authors: Ji Lin ; Jiaming Tang ; Haotian Tang ; Shang Yang ; Wei-Ming Chen ; Wei-Chen Wang ; Guangxuan Xiao ; Xingyu Dang ; Chuang Gan ; Song Han

Large language models (LLMs) have shown excellent performance on various tasks, but the astronomical model size raises the hardware barrier for serving (memory size) and slows down token generation (memory bandwidth). In this paper, we propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. Our method is based on the observation that weights are not equally important: protecting 1% of salient weights can greatly reduce quantization error. We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activation, not weights. AWQ does not rely on any backpropagation or reconstruction, so it can well preserve LLMs' generalization ability on different domains and modalities, without overfitting to the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks. Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement an efficient and flexible inference framework tailored for LLMs on the edge, offering more than 3x speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B LLaMA-2 model on mobile GPUs.

#8 DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines [PDF4] [Copy] [Kimi1] [REL]

Authors: Ye Tian ; Zhen Jia ; Ziyue Luo ; Yida Wang ; Chuan Wu

Diffusion models have emerged as dominant performers for image generation. To support training large diffusion models, this paper studies pipeline parallel training of diffusion models and proposes DiffusionPipe, a synchronous pipeline training system that advocates innovative pipeline bubble filling technique, catering to structural characteristics of diffusion models. State-of-the-art diffusion models typically include trainable (the backbone) and non-trainable (e.g., frozen input encoders) parts. We first unify optimal stage partitioning and pipeline scheduling of single and multiple backbones in representative diffusion models with a dynamic programming approach. We then propose to fill the computation of non-trainable model parts into idle periods of the pipeline training of the backbones by an efficient greedy algorithm, thus achieving high training throughput. Extensive experiments show that DiffusionPipe can achieve up to 1.41x speedup over pipeline parallel methods and 1.28x speedup over data parallel training on popular diffusion models.

#9 Keyformer: KV Cache reduction through key tokens selection for Efficient Generative Inference [PDF3] [Copy] [Kimi2] [REL]

Authors: Muhammad Adnan ; Akhil Arunkumar ; Gaurav Jain ; Prashant Nair ; Ilya Soloveychik ; Purushotham Kamath

No summary was provided.

#10 Accelerating ReLU for MPC-Based Private Inference with a Communication-Efficient Sign Estimation [PDF] [Copy] [Kimi1] [REL]

Authors: Kiwan Maeng ; G. Edward Suh

No summary was provided.

#11 FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics [PDF3] [Copy] [Kimi3] [REL]

Authors: Ke Hong ; Guohao Dai ; Jiaming Xu ; Qiuli Mao ; Xiuhong Li ; Jun Liu ; kangdi chen ; Yuhan Dong ; Yu Wang

As the Large Language Model (LLM) becomes increasingly important in various domains, the performance of LLM inference is crucial to massive LLM applications. However, the following challenges still remain unsolved in accelerating LLM inference: (1) Synchronized partial softmax update. The softmax operation requires a synchronized update operation among each partial softmax result, leading to ∼20% overheads for the attention computation in LLMs. (2) Under-utilized computation of flat GEMM. The shape of matrices performing GEMM in LLM inference is flat, leading to under-utilized computation and 50% performance loss after padding zeros in previous designs (e.g., cuBLAS, CUTLASS, etc.). (3) Performance loss to static dataflow. Kernel performance in LLM depends on varied input data features, hardware configurations, etc. A single and static dataflow may lead to 50.25% performance loss for GEMMs of different shapes in LLM inference.We present FlashDecoding++, a fast LLM inference engine supporting mainstream LLMs and hardware back- ends. To tackle the above challenges, FlashDecoding++ creatively proposes: (1) Asynchronized softmax with unified max value. FlashDecoding++ introduces a unified max value technique for different partial softmax computations to avoid synchronization. Based on this, the fine-grained pipelining is proposed, leading to 1.05× and 1.14× for the prefill and decoding stage in LLM inference, respectively. (2) Flat GEMM optimization with double buffering. FlashDecoding++ points out that flat GEMMs with different shapes face varied bottlenecks. Then, techniques like double buffering are introduced, leading up to 52% speedup for the flat GEMM operation. (3) Heuristic dataflow with hardware resource adaption. FlashDecoding++ heuristically optimizes dataflow using different hardware resource (e.g., Tensor Core or CUDA core) considering input dynamics. The design leads to up to 29% speedup compared with the static dataflow. Due to the versatility of optimizations in FlashDecoding++, FlashDecoding++ can achieve up to 4.86× and 2.18× speedup on both NVIDIA and AMD GPUs compared with Hugging Face implementations. FlashDecoding++ also achieves an average of 1.37× speedup compared with state-of-the-art LLM inference engines, FlashDecoding, on various LLMs (e.g., Llama2, ChatGLM2, etc.).

#12 HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices [PDF5] [Copy] [Kimi4] [REL]

Authors: ZHAO XUANLEI ; Bin Jia ; Haotian Zhou ; Ziming Liu ; Shenggan Cheng ; Yang You

In recent times, the emergence of Large Language Models (LLMs) has resulted in increasingly larger model size, posing challenges for inference on low-resource devices. Prior approaches have explored offloading to facilitate low-memory inference but often suffer from efficiency due to I/O bottlenecks. To achieve low-latency LLMs inference on resource-constrained devices, we introduce HeteGen, a novel approach that presents a principled framework for heterogeneous parallel computing using CPUs and GPUs. Based on this framework, HeteGen further employs heterogeneous parallel computing and asynchronous overlap for LLMs to mitigate I/O bottlenecks. Our experiments demonstrate a substantial improvement in inference speed, surpassing state-of-the-art methods by over 317\% at most.

#13 CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation [PDF] [Copy] [Kimi] [REL]

Authors: Yifei Xu ; Yuning Chen ; Xumiao Zhang ; Xianshang Lin ; Pan Hu ; Yunfei Ma ; Songwu Lu ; Wan Du ; Zhuoqing Mao ; Ennan Zhai ; Dennis Cai

Among the thriving ecosystem of cloud computing and the proliferation of Large Language Model (LLM)-based code generation tools, there is a lack of benchmarking for code generation in cloud-native applications. In response to this need, we present CloudEval-YAML, a practical benchmark for cloud configuration generation. CloudEval-YAML tackles the diversity challenge by focusing on YAML, the de facto standard of numerous cloud-native tools. We develop the CloudEval-YAML benchmark with practicality in mind: the dataset consists of hand-written problems with unit tests targeting practical scenarios. We further enhanced the dataset to meet practical needs by rephrasing questions in a concise, abbreviated, and bilingual manner. The dataset consists of 1011 problems that take more than 1200 human hours to complete. To improve practicality during evaluation, we build a scalable evaluation platform for CloudEval-YAML that achieves a 20 times speedup over a single machine. To the best of our knowledge, the CloudEval-YAML dataset is the first hand-written dataset targeting cloud-native applications. We present an in-depth evaluation of 12 LLMs, leading to a deeper understanding of the problems and LLMs, as well as effective methods to improve task performance and reduce cost.

#14 Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving [PDF2] [Copy] [Kimi2] [REL]

Authors: Yilong Zhao ; Chien-Yu Lin ; Kan Zhu ; Zihao Ye ; Lequn Chen ; Size Zheng ; Luis Ceze ; Arvind Krishnamurthy ; Tianqi Chen ; Baris Kasikci

The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligentchatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently useGPU resources and boost throughput, batching multiple requests has emerged as a popular paradigm; to furtherspeed up batching, LLM quantization techniques reduce memory consumption and increase computing capacity.However, prevalent quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully leverage thecapabilities of modern GPUs, such as 4-bit integer operators, resulting in sub-optimal performance.To maximize LLMs’ serving throughput, we introduce Atom, a low-bit quantization method that achieves highthroughput improvements with negligible accuracy loss. Atom significantly boosts serving throughput by usinglow-bit operators and considerably reduces memory consumption via low-bit quantization. It attains high accuracyby applying a novel mixed-precision and fine-grained quantization process. We evaluate Atom on 4-bit weight-activation quantization setups in the serving context. Atom improves end-to-end throughput by up to 7.73×compared to the FP16 and by 2.53× compared to INT8 quantization, while maintaining the same latency target.

#15 ACCURATE LOW-DEGREE POLYNOMIAL APPROXIMATION OF NON-POLYNOMIAL OPERATORS FOR FAST PRIVATE INFERENCE IN HOMOMORPHIC ENCRYPTION [PDF] [Copy] [Kimi] [REL]

Authors: Jingtian Dang ; Jianming Tong ; Anupam Golder ; Cong "Callie" Hao ; Arijit Raychowdhury ; Tushar Krishna

As machine learning (ML) permeates fields like healthcare, facial recognition, and blockchain, the need to protect sensitive data intensifies. Fully Homomorphic Encryption (FHE) allows inference on encrypted data, preserving the privacy of both data and the ML model. However, it slows down non-secure inference by up to five magnitudes, with a root cause of replacing non-polynomial operators (ReLU and MaxPooling) with high-degree Polynomial Approximated Function (PAF).We propose SmartPAF, a framework to replace non-polynomial operators with low-degree PAF and then recover the accuracy of PAF-approximated model through four techniques: (1) Coefficient Tuning (CT) -- adjust PAF coefficients based on the input distributions before training, (2) Progressive Approximation (PA) -- progressively replace one non-polynomial operator at a time followed by a fine-tuning, (3) Alternate Training (AT) -- alternate the training between PAFs and other linear operators in the decoupled manner, and (4) Dynamic Scale (DS) / Static Scale (SS) -- dynamically scale PAF input value within [-1, 1] in training, and fix the scale as the running max value in FHE deployment.The synergistic effect of CT, PA, AT, and DS/SS enables SmartPAF to enhance the accuracy of the various models approximated by PAFs with various low degrees under multiple datasets. For ResNet-18 under ImageNet-1k, the Pareto-frontier spotted by SmartPAF in latency-accuracy tradeoff space achieves 1.42X ~ 13.64X accuracy improvement and 6.79X~14.9X speedup than prior works. Further, SmartPAF enables a 14-degree PAF to achieve a 7.81X speedup compared to the 27-degree PAF obtained by minimax approximation with the same 69.4% post-replacement accuracy. Our code is available at https://anonymous.4open.science/r/SmartPAF-64E1

#16 SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models [PDF1] [Copy] [Kimi1] [REL]

Authors: Zhixu Du ; Shiyu Li ; Yuhao Wu ; Xiangyu Jiang ; Jingwei Sun ; Qilin Zheng ; Yongkai Wu ; Ang Li ; Hai Li ; Yiran Chen

Mixture-of-Experts (MoE) has emerged as a favorable architecture in the era of large models due to its inherent advantage, i.e., enlarging model capacity without incurring notable computational overhead. Yet, the realization of such benefits often results in ineffective GPU memory utilization, as large portions of the model parameters remain dormant during inference. Moreover, the memory demands of large models consistently outpace the memory capacity of contemporary GPUs. Addressing this, we introduce SiDA-MoE (Sparsity-inspired Data-Aware), an efficient inference approach tailored for large MoE models. SiDA-MoE judiciously exploits both the system's main memory, which is now abundant and readily scalable, and GPU memory by capitalizing on the inherent sparsity on expert activation in MoE models. By adopting a data-aware perspective, SiDA-MoE achieves enhanced model efficiency with a neglectable performance drop. Specifically, SiDA-MoE attains a remarkable speedup in MoE inference with up to 3.93x throughput increasing, up to 72% latency reduction, and up to 80% GPU memory saving with down to 1% performance drop. This work paves the way for scalable and efficient deployment of large MoE models, even with constrained resources. Code is available at: https://github.com/timlee0212/SiDA-MoE.

#17 Does Compressing Activations Help Model Parallel Training? [PDF1] [Copy] [Kimi2] [REL]

Authors: Song Bian ; Dacheng Li ; Hongyi Wang ; Eric Xing ; Shivaram Venkataraman

Foundation models have superior performance across a wide array of machine learning tasks. The training of these models typically involves model parallelism (MP) to navigate the constraints of GPU memory capacity. However, MP strategies involve transmitting model activations between GPUs, which can hinder training speed in large clusters. Previous research has examined gradient compression in data-parallel contexts, but its applicability in MP settings remains largely unexplored. In this paper, we investigate the unique characteristics of compression in MP and study why strategies from gradient compression might not be directly applicable to MP scenarios. Subsequently, to systematically understand the capabilities and limitations of \underline{M}odel Parallelism \underline{C}ompression, we present a benchmarking framework \textbf{MCBench}. MCBench not only includes four major categories of compression algorithms but also includes several widely used models spanning language and vision tasks on a well-established distributed training framework, Megatron-LM. We initiate the first comprehensive empirical study by using MCBench. Our empirical study encompasses both the fine-tuning and pre-training of FMs. We probe over 200 unique training configurations and present results using 10 widely used datasets. To comprehend the scalability of compression advantages with the expansion of model size and cluster size, we propose a novel cost model designed specifically for training with MP compression. The insights derived from our findings can help direct the future development of new MP compression algorithms for distributed training.

#18 Distributed Matrix-Based Sampling for Graph Neural Network Training [PDF1] [Copy] [Kimi1] [REL]

Authors: Alok Tripathy ; Katherine Yelick ; Aydin Buluc

No summary was provided.

#19 Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large Scale Recommendation [PDF] [Copy] [Kimi] [REL]

Authors: Liang Luo ; Buyun Zhang ; Michael Tsang ; Yinbin Ma ; Ching-Hsiang Chu ; Yuxin Chen ; Shen Li ; Yuchen Hao ; Yanli Zhao ; Guna Lakshminarayanan ; Ellie Wen ; Jongsoo Park ; Dheevatsa Mudigere ; Maxim Naumov

We study a mismatch between the deep learning recommendation models’ flat architecture, common distributedtraining paradigm and hierarchical data center topology. To address the associated inefficiencies, we proposeDisaggregated Multi-Tower (DMT), a modeling technique that consists of (1) semantic-preserving tower transform(SPTT), a novel training paradigm that decomposes the monolithic global embedding lookup process into disjointtowers to exploit data center locality; (2) Tower Module (TM), a synergistic dense component attached to eachtower to reduce model complexity and communication volume through hierarchical feature interaction; and (3)Tower Partitioner (TP), a feature partitioner to systematically create towers with meaningful feature interactionsand load balanced assignments to preserve model quality and training throughput via learned embeddings. Weshow that DMT can achieve up to 1.9× speedup compared to the state-of-the-art baselines without losing accuracyacross multiple generations of hardware at large data center scales.

#20 VQPy: An Object-Oriented Approach to Modern Video Analytics [PDF] [Copy] [Kimi] [REL]

Authors: Shan Yu ; Zhenting Zhu ; Yu Chen ; Hanchen Xu ; Pengzhan Zhao ; Yang Wang ; Arthi Padmanabhan ; Hugo Latapie ; Harry Xu

Video analytics is widely used in contemporary systems and services. At the forefront of video analytics are video queries that users develop to find objects of particular interest. Building upon the insight that video objects (e.g., human, animals, cars, etc.), the center of video analytics, are similar in spirit to objects modeled by traditional object-oriented languages, we propose to develop an object-oriented approach to video analytics. This approach, named VQPy, consists of a front-end— a Python variant with constructs that make it easy for users to express video objects and their interactions—as well as an extensible backend that can automatically construct and optimize pipelines based on video objects. We have implemented and open-sourced VQPy, which is currently used in a major tech company as part of their DeepVision framework.

#21 SLoRA: Scalable Serving of Thousands of LoRA Adapters [PDF1] [Copy] [Kimi1] [REL]

Authors: Ying Sheng ; Shiyi Cao ; Dacheng Li ; Coleman Hooper ; Nicholas Lee ; Shuo Yang ; Christopher Chou ; Banghua Zhu ; Lianmin Zheng ; Kurt Keutzer ; Joseph Gonzalez ; Ion Stoica

The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present SLoRA, a system designed for the scalable serving of many LoRA adapters. SLoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, SLoRA proposes a unified memory pool. This memory pool uses a unified paging mechanism to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths.Additionally, SLoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for batched LoRA computation. Collectively, these features enable SLoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), SLoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, SLoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services.

#22 L-GreCo: Layerwise-adaptive Gradient Compression For Efficient Data-parallel Deep Learning [PDF2] [Copy] [Kimi] [REL]

Authors: Ilia Markov ; Kaveh Alim ; Elias Frantar ; Dan Alistarh

No summary was provided.

#23 Prompt Cache: Modular Attention Reuse for Low-Latency Inference [PDF1] [Copy] [Kimi1] [REL]

Authors: In Gim ; Guojun Chen ; Seung-seob Lee ; Nikhil Sarda ; Anurag Khandelwal ; Lin Zhong

We present Prompt Cache, an approach for accelerating inference for large language models (LLM) by reusing attention states across different LLM prompts. Many input prompts have overlapping text segments, such as system messages, prompt templates, and documents provided for context.Our key insight is that by precomputing and storing the attention states of these frequently occurring text segments on the inference server, we can efficiently reuse them when these segments appear in user prompts. Prompt Cache employs a schema to explicitly define such reusable text segments, called prompt modules. The schema ensures positional accuracy during attention state reuse and provides users with an interface to access cached states in their prompt.Using a prototype implementation, we evaluate Prompt Cache across several LLMs. We show that Prompt Cache significantly reduce latency in time-to-first-token, especially for longer prompts such as document-based question answering and recommendations. The improvements range from 8x for GPU-based inference to 60x for CPU-based inference, all while maintaining output accuracy and without the need for model parameter modifications.

#24 Fine-Tuning Language Models Using Formal Methods Feedback: A Use Case in Autonomous Systems [PDF] [Copy] [Kimi] [REL]

Authors: Yunhao Yang ; Neel P. Bhatt ; Tyler Ingebrand ; William Ward ; Steven Carr ; Atlas Wang ; Ufuk Topcu

Although pre-trained language models encode generic knowledge beneficial for planning and control, they may fail to generate appropriate control policies for domain-specific tasks. Existing fine-tuning methods use human feedback to address this limitation, however, sourcing human feedback is labor intensive and costly. We present a fully automated approach to fine-tune pre-trained language models for applications in autonomous systems, bridging the gap between generic knowledge and domain-specific requirements while reducing cost. The method synthesizes automaton-based controllers from pre-trained models guided by natural language task descriptions. These controllers are verifiable against independently provided specifications within a world model, which can be abstract or obtained from a high-fidelity simulator. Controllers with high compliance with the desired specifications receive higher ranks, guiding the iterative fine-tuning process. We provide quantitative evidences, primarily in autonomous driving, to demonstrate the method's effectiveness across multiple tasks. The results indicate an improvement in percentage of specifications satisfied by the controller from 60\% to 90\%.

#25 VIDUR: A LARGE-SCALE SIMULATION FRAMEWORK FOR LLM INFERENCE [PDF2] [Copy] [Kimi1] [REL]

Authors: Amey Agrawal ; Nitin Kedia ; Jayashree Mohan ; Ashish Panwar ; Nipun Kwatra ; Bhargav Gulavani ; Ramachandran Ramjee ; Alexey Tumanov

Large language models (LLMs) are widely used in various domains for their ability to perform tasks that requirehuman-like skills. However, LLM inference is expensive today. Furthermore, optimizing LLM inference ischallenging, as its performance depends on many configuration options such as model parallelization strategy, thebatching algorithm, scheduling policy, maximum batch size allowed, etc. Identifying the optimal configuration fora large-scale cluster by experimentally running hundreds of configuration combinations is impractical due to theexorbitant time and monetary cost involved. To tackle this challenge, we present VIDUR and VIDUR-BENCH,the first large-scale, high-fidelity, collaborative, and easily extensible simulation framework for LLM inferencealongside a benchmark suite. VIDUR carefully models the performance of various operators involved in LLMinference using a combination of experimental profiling and predictive modeling, and evaluates the end-to-endmodel inference performance for different workloads by estimating several key performance metrics such aslatency, throughput, and time-to-first-byte. We experimentally validate our simulator on several LLMs and showthat it can estimate metrics such as inference latency and throughput with less than 5% error rate. VIDUR alsohelps answer large-scale deployment related what-if questions such as what is the best tensor-parallel dimension tomaximize serving throughput of the LlaMa-7B model across 32 A100 GPUs? We will open-source the simulatorcode, along with the workload benchmark suite, so that researchers and practitioners can collaboratively exploremodel and systems optimizations for efficient deployment of LLMs.