
HyC-LoRA: Memory Efficient LoRA Fine-tuning with Hybrid Activation Compression

Authors: Yujin Wang, Shunan Dong, Zongle Huang, Yichen You, Liu He, Huazhong Yang, Yongpan Liu, Hongyang Jia

Large Language Models (LLMs) are widely used in applications such as conversation and text summarization. With growing demand for model customization and privacy, lightweight fine-tuning methods for large models have attracted widespread attention. Low-Rank Adaptation (LoRA) is one of the most widely used fine-tuning algorithms; it significantly reduces the number of tunable weights and the associated optimizer memory when transferring pre-trained LLMs to downstream tasks. However, prior work has paid little attention to the overhead of buffered activations in low-rank adaptation, leading to suboptimal system memory usage. To reduce the memory consumed by buffered activations and thereby enable on-device memory-efficient fine-tuning, we propose \textbf{HyC-LoRA}, a variant of LoRA training that uses a hybrid compression framework to enable nearly 2-bit quantization of buffered activations across all operators. HyC-LoRA observes that the activations temporarily buffered for backpropagation dominate the memory consumption of the LoRA fine-tuning process, and that those in non-linear modules are the dominant memory consumers and are harder to quantize. Based on this observation, HyC-LoRA proposes a hybrid compression mechanism with two tiers: \textbf{(1)} \textit{\textbf{Intra-operator hybrid compression}}: HyC-LoRA detects extreme outliers in buffered activations and mitigates quantization error through structured outlier storage; \textbf{(2)} \textit{\textbf{Inter-operator hybrid compression}}: HyC-LoRA leverages the LoRA adapter to compensate for quantization errors and performs selective recomputation via inter-operator reordering and fusion. Finally, HyC-LoRA implements a buffered-activation compression system and integrates it with an existing machine learning framework, completing the last mile of lightweight storage for fine-tuning algorithms. Evaluations on widely used downstream tasks with multiple LLMs, such as the Llama series, show that the proposed HyC-LoRA framework achieves up to 3.97× end-to-end memory reduction compared to the baseline, with negligible accuracy degradation. The code is available at \url{https://github.com/thu-ee-acts-lab/HyC-LoRA-release}.
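To make the intra-operator idea concrete, below is a minimal PyTorch sketch (not the authors' implementation; see the repository for the real system) of buffering an activation for backward in low-bit form while keeping a small set of extreme outliers in full precision. The bit width, outlier ratio, and per-tensor scaling are illustrative assumptions, and the 2-bit codes are left unpacked in a uint8 tensor for clarity.

```python
import torch

def quantize_with_outliers(x, bits=2, outlier_ratio=0.005):
    """Sketch: extract the largest-magnitude entries as outliers, then
    uniformly quantize the rest to `bits` bits (codes kept unpacked)."""
    flat = x.reshape(-1).float().clone()
    k = max(1, int(outlier_ratio * flat.numel()))
    outlier_idx = flat.abs().topk(k).indices          # structured outlier positions
    outlier_val = flat[outlier_idx].clone()           # exact outlier values
    flat[outlier_idx] = 0.0                           # remove outliers before quantizing
    qmax = 2 ** bits - 1
    lo, hi = flat.min(), flat.max()
    scale = (hi - lo).clamp(min=1e-8) / qmax
    q = ((flat - lo) / scale).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, lo, outlier_idx, outlier_val, x.shape

def dequantize_with_outliers(q, scale, lo, outlier_idx, outlier_val, shape):
    flat = q.float() * scale + lo
    flat[outlier_idx] = outlier_val                   # restore exact outliers
    return flat.view(shape)

class LowBitBufferedLinear(torch.autograd.Function):
    """Frozen linear op that saves its input for backward in compressed form."""
    @staticmethod
    def forward(ctx, x, weight):
        ctx.packed = quantize_with_outliers(x.detach())
        ctx.save_for_backward(weight)
        return x @ weight.t()

    @staticmethod
    def backward(ctx, grad_out):
        (weight,) = ctx.saved_tensors
        x_hat = dequantize_with_outliers(*ctx.packed).to(grad_out.dtype)
        grad_x = grad_out @ weight
        grad_w = torch.einsum('...o,...i->oi', grad_out, x_hat)
        return grad_x, grad_w

# Usage sketch: y = LowBitBufferedLinear.apply(x, W) inside a frozen base-model
# layer, while the small trainable LoRA adapters keep their own exact buffers.
```

In this sketch the compressed buffer replaces the full-precision activation that autograd would otherwise retain; the paper's inter-operator tier (LoRA-based error compensation and selective recomputation via reordering and fusion) is not shown here.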

Subject: MLSYS.2025