Transformer-based Large Language Models (LLMs) have become increasingly important. However, scaling LLMs to longer contexts leads to slow inference and high GPU memory consumption for caching key-value (KV) vectors. This paper presents RetrievalAttention, a training-free approach that both accelerates the decoding phase and reduces GPU memory consumption by pre-building KV vector indexes for fixed contexts and maintaining them in CPU memory for efficient retrieval. Unlike conventional KV cache methods, RetrievalAttention integrates approximate nearest neighbor search (ANNS) indexes into attention computation. We observe that off-the-shelf ANNS techniques often fail due to the out-of-distribution (OOD) nature of query and key vectors in attention mechanisms. RetrievalAttention overcomes this with an attention-aware vector index. Our evaluation shows that RetrievalAttention achieves near full attention accuracy while accessing only 1--3\% of the data, significantly reducing inference costs. Remarkably, RetrievalAttention enables LLMs with 8B parameters to handle 128K tokens on a single NVIDIA RTX4090 (24GB), achieving a decoding speed of 0.107 seconds per token.
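To illustrate the core idea of computing attention over only the retrieved keys, the following is a minimal sketch, not the paper's implementation: it assumes an exact top-$k$ search as a stand-in for the attention-aware ANNS index, and the function name and parameters are placeholders for illustration.

\begin{verbatim}
# Minimal sketch: per-query attention restricted to the k most relevant
# cached keys, approximating full attention with a small fraction of the KV cache.
import numpy as np

def retrieval_attention(q, K, V, k=32):
    """Approximate single-query attention using only the k top-scoring keys.

    q: (d,) query vector; K, V: (n, d) cached key/value vectors.
    """
    scores = K @ q                             # relevance of every cached key
                                               # (stand-in for an ANNS lookup)
    topk = np.argpartition(-scores, k)[:k]     # indices of the k highest-scoring keys
    s = scores[topk] / np.sqrt(K.shape[1])     # scaled dot-product scores over retrieved keys
    w = np.exp(s - s.max())
    w /= w.sum()                               # softmax restricted to the retrieved subset
    return w @ V[topk]                         # weighted sum of the corresponding values

# Toy usage: 4096 cached tokens, 64-dim heads, attending over ~1% of the cache.
rng = np.random.default_rng(0)
K, V = rng.standard_normal((4096, 64)), rng.standard_normal((4096, 64))
q = rng.standard_normal(64)
out = retrieval_attention(q, K, V, k=41)
\end{verbatim}

In the paper's setting, the exact top-$k$ scan above would be replaced by a lookup in a CPU-resident vector index built over the fixed context's key vectors, which is what allows only a small fraction of the KV data to be accessed per decoding step.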