
EPIC: Efficient Position-Independent Caching for Serving Large Language Models

Authors: Junhao Hu, Wenrui Huang, Weidong Wang, Haoyi Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie

Large Language Models (LLMs) show great capabilities in a wide range of applications, but serving them efficiently becomes increasingly challenging as requests (prompts) grow more complex. Context caching improves serving performance by reusing Key-Value (KV) vectors, the intermediate representations of tokens that are repeated across requests. However, existing context caching requires exact prefix matches across requests, limiting reuse in settings such as few-shot learning and retrieval-augmented generation, where immutable content (e.g., documents) remains unchanged across requests but is preceded by varying prefixes. Position-Independent Caching (PIC) addresses this issue by enabling modular reuse of KV vectors regardless of prefixes. We formalize PIC and advance prior work by introducing EPIC, a serving system incorporating our new LegoLink algorithm, which mitigates the spurious “attention sink” effect at the beginning of every document to maintain accuracy with minimal computation. Experiments show that EPIC achieves up to 8× improvement in Time-To-First-Token (TTFT) and 7× throughput gains over existing systems, with negligible or no accuracy loss.
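To make the contrast between exact-prefix caching and PIC concrete, below is a minimal sketch in Python. It is an illustration under assumptions, not EPIC's actual implementation: the cache is keyed by chunk content rather than by the full request prefix, and the hypothetical helpers `compute_kv` and `relink_chunk` (as well as the parameter `k`) stand in for a real prefill kernel and for a LegoLink-style step that recomputes a few tokens at the start of each reused chunk to suppress its stale attention sink.

```python
# Sketch of position-independent KV-cache lookup vs. exact-prefix matching.
# ChunkCache, compute_kv, relink_chunk, and k are hypothetical names used
# for illustration only; they are not EPIC's API.
import hashlib


class ChunkCache:
    """KV store keyed by chunk content, not by the full request prefix,
    so a document's KV vectors can be reused under any preceding prompt."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(chunk_tokens):
        return hashlib.sha256(str(chunk_tokens).encode("utf-8")).hexdigest()

    def get(self, chunk_tokens):
        return self._store.get(self._key(chunk_tokens))

    def put(self, chunk_tokens, kv):
        self._store[self._key(chunk_tokens)] = kv


def serve_prompt(chunks, cache, compute_kv, relink_chunk, k=2):
    """Assemble KV vectors for a prompt composed of reusable chunks.

    compute_kv(tokens, offset): prefill a chunk from scratch at `offset`.
    relink_chunk(kv, offset, k): place cached KV at a new position and
        recompute only the first k tokens of the chunk, neutralizing the
        spurious "attention sink" each cached chunk carries at its start
        (the role LegoLink plays in EPIC).
    """
    kvs, offset = [], 0
    for chunk in chunks:
        cached = cache.get(chunk)
        if cached is None:
            kv = compute_kv(chunk, offset)        # miss: full prefill
            cache.put(chunk, kv)
        else:
            kv = relink_chunk(cached, offset, k)  # hit: cheap re-link
        kvs.append(kv)
        offset += len(chunk)
    return kvs
```

Because the lookup key ignores position, a document cached under one prompt is a hit under any other, which is exactly the reuse case (few-shot examples, retrieved documents behind varying prefixes) that exact-prefix matching misses.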

Subject: ICML.2025 - Poster