Punica: Multi-Tenant LoRA Serving

054de805fcceb78a201f5e9d53c85908@2024@MLSYS


Authors: Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy

Low-rank adaptation (LoRA) has become an important and popular method to adapt pre-trained models to specific domains. We present Punica, a system to serve multiple LoRA models in a shared GPU cluster. Punica contains a new CUDA kernel design that allows batching of GPU operations for different LoRA models. This allows a GPU to hold only a single copy of the underlying pre-trained model when serving multiple, different LoRA models, significantly enhancing GPU efficiency in terms of both memory and computation. Our scheduler consolidates multi-tenant LoRA serving workloads in a shared GPU cluster. With a fixed-size GPU cluster, our evaluations show that Punica achieves 12x higher throughput in serving multiple LoRA models compared to state-of-the-art LLM serving systems while adding only 2ms latency per token.
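
The idea of batching requests that target different LoRA models over one shared copy of the base weights can be illustrated with a small reference computation. The sketch below is not Punica's CUDA kernel; it is a plain-Python rendering of the standard LoRA formulation y = xW + xAB, in which each request selects its own low-rank pair (A, B) while W is stored once. All names (`matvec`, `batched_lora_forward`, `separate_model_forward`, the tenant keys) are hypothetical and chosen for illustration only.

```python
# Illustrative reference only; names are hypothetical, not Punica's API.

def matvec(x, m):
    """Multiply row vector x (length d_in) by matrix m (d_in x d_out)."""
    return [sum(xi * row[j] for xi, row in zip(x, m)) for j in range(len(m[0]))]

def batched_lora_forward(xs, adapter_ids, w, adapters):
    """Compute y_i = x_i W + (x_i A_k) B_k with k = adapter_ids[i].

    One shared W serves every request; only the small (A, B) pairs differ.
    """
    ys = []
    for x, k in zip(xs, adapter_ids):
        a, b = adapters[k]
        base = matvec(x, w)              # shared pre-trained weight
        delta = matvec(matvec(x, a), b)  # low-rank, per-adapter update
        ys.append([p + q for p, q in zip(base, delta)])
    return ys

def separate_model_forward(x, w, a, b):
    """Baseline: build a full merged weight W + AB for a single adapter."""
    rank = len(b)
    merged = [[w[i][j] + sum(a[i][t] * b[t][j] for t in range(rank))
               for j in range(len(w[0]))] for i in range(len(w))]
    return matvec(x, merged)

w = [[1.0, 0.0], [0.0, 1.0]]  # shared 2x2 base weight (toy values)
adapters = {
    "tenant_a": ([[1.0], [0.0]], [[0.0, 2.0]]),  # rank 1: A is 2x1, B is 1x2
    "tenant_b": ([[0.0], [1.0]], [[3.0, 0.0]]),
}
xs = [[1.0, 2.0], [1.0, 2.0]]      # one input per request
ids = ["tenant_a", "tenant_b"]     # each request names its adapter
out = batched_lora_forward(xs, ids, w, adapters)
print(out)
```

In this toy setting the batched path produces the same outputs as building a separate merged weight per tenant, while keeping only one copy of `w` in memory. A GPU kernel would perform the per-request adapter lookup in parallel rather than in a Python loop.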

Subject: MLSYS.2024
