
MVA: Linear Attention with High-order Query-Keys Integration and Multi-level Vocabulary Decomposition

Authors: Ning Wang, Zekun Li, Tongxin Bai, Man Yao, Zhen Qin, Guoqi Li

Linear attention offers the advantages of linear inference time and fixed memory usage compared to Softmax attention. However, training large-scale language models with linear attention from scratch remains prohibitively expensive and exhibits significant performance gaps compared to Softmax-based models. To address these challenges, we focus on transforming pre-trained Softmax-based language models into linear attention models. We unify mainstream linear attention methods using a **high-order QK integration theory** and a **multi-level vocabulary decomposition**. Specifically, the QK integration theory explains the efficacy of combining linear and sparse attention from the perspective of information collection across different frequency bands. The multi-level vocabulary decomposition exponentially expands memory capacity by recursively exploiting the compression loss of compressed states. Through detailed error analysis, we demonstrate that our approach achieves a superior approximation of Softmax attention. To further improve performance and reduce training costs, we adopt a **soft integration strategy** over attention scores, effectively combining our method with a sliding-window mechanism. With fewer than 100M tokens, our method fine-tunes models to achieve linear complexity while retaining 99% of their original performance. Compared to state-of-the-art linear attention models and methods, our approach improves MMLU scores by 1.2 percentage points with minimal fine-tuning. Furthermore, even without the sliding-window mechanism, our method achieves state-of-the-art performance on all test sets with 10B tokens.
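
The abstract's combination of a fixed-size linear-attention recurrence with local sliding-window Softmax attention can be illustrated with a minimal sketch. This is not the authors' implementation: the feature map `phi`, the fixed mixing weight `alpha`, and all shapes are illustrative assumptions standing in for the paper's learned, attention-score-based soft integration.

```python
# Minimal sketch of a "linear + sliding-window" hybrid attention (single head,
# single sequence). Illustrative only; feature map and mixing are assumptions.
import torch
import torch.nn.functional as F

def phi(x):
    # A simple positive feature map (choice varies across linear-attention papers).
    return F.elu(x) + 1.0

def hybrid_attention(q, k, v, window=64, alpha=0.5):
    """q, k, v: (T, d). Mixes a linear-attention recurrence (constant-size state)
    with local sliding-window Softmax attention."""
    T, d = q.shape
    qf, kf = phi(q), phi(k)

    # Linear branch: maintain S = sum_t phi(k_t) v_t^T and z = sum_t phi(k_t),
    # so memory stays fixed regardless of sequence length.
    S = torch.zeros(d, d)
    z = torch.zeros(d)
    lin_out = torch.empty(T, d)
    for t in range(T):
        S = S + torch.outer(kf[t], v[t])
        z = z + kf[t]
        lin_out[t] = (qf[t] @ S) / (qf[t] @ z + 1e-6)

    # Sliding-window branch: each query attends with Softmax to its last `window` keys.
    win_out = torch.empty(T, d)
    for t in range(T):
        s = max(0, t - window + 1)
        scores = (q[t] @ k[s:t + 1].T) / d ** 0.5
        win_out[t] = F.softmax(scores, dim=-1) @ v[s:t + 1]

    # A fixed convex combination stands in for the paper's soft integration of
    # attention scores between the two branches.
    return alpha * lin_out + (1 - alpha) * win_out

out = hybrid_attention(torch.randn(128, 32), torch.randn(128, 32), torch.randn(128, 32))
```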

Subject: ICML.2025 - Poster