YdggdEL41C@OpenReview

Total: 1

#1 Vision-centric Token Compression in Large Language Model

Authors: Ling Xing, Alex Jinpeng Wang, Rui Yan, Xiangbo Shu, Jinhui Tang

Real-world applications are stretching context windows to hundreds of thousands of tokens while Large Language Models (LLMs) swell from billions to trillions of parameters. This dual expansion sends compute and memory costs skyrocketing, making $\textit{token compression}$ indispensable. We introduce Vision-centric Token Compression ($\textbf{Vist}$), a $\textit{slow–fast}$ compression framework that mirrors human reading: the $\textit{fast}$ path renders distant tokens into images, letting a $\textbf{frozen, lightweight vision encoder}$ skim the low-salience context; the $\textit{slow}$ path feeds the proximal window into the LLM for fine-grained reasoning. A Probability-Informed Visual Enhancement (PVE) objective masks high-frequency tokens during training, steering the Resampler to concentrate on semantically rich regions, just as skilled readers gloss over function words. On eleven in-context learning benchmarks, $\textbf{Vist}$ achieves the same accuracy with 2.3$\times$ fewer tokens, cutting FLOPs by 16\% and memory by 50\%. It outperforms the strongest text-encoder-based compression method, CEPE, by $\textbf{7.6}$\% on average across benchmarks such as TriviaQA, NQ, PopQA, NLUI, and CLIN, setting a new standard for token efficiency in LLMs. The source code will be released.
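The slow–fast split described above is concrete enough to sketch. Below is a minimal, illustrative PyTorch version of the fast path, assuming a toy text rasterizer and a stub patch encoder; the names `render_text_to_image`, `FrozenVisionEncoder`, `Resampler`, and the 16-query budget are hypothetical stand-ins, not the paper's released implementation.

```python
import torch
import torch.nn as nn
from PIL import Image, ImageDraw


def render_text_to_image(text: str, width: int = 224, height: int = 224) -> torch.Tensor:
    """Rasterize distant-context text into a grayscale image (input to the fast path)."""
    img = Image.new("L", (width, height), color=255)
    draw = ImageDraw.Draw(img)
    # Naive fixed-width wrapping; a real renderer would control font, size, and layout.
    lines, line = [], ""
    for word in text.split():
        if len(line) + len(word) + 1 > 40:
            lines.append(line)
            line = word
        else:
            line = (line + " " + word).strip()
    lines.append(line)
    for i, ln in enumerate(lines[: height // 12]):
        draw.text((2, i * 12), ln, fill=0)
    pixels = torch.tensor(list(img.getdata()), dtype=torch.float32) / 255.0
    return pixels.view(1, 1, height, width)  # (batch, channel, H, W)


class FrozenVisionEncoder(nn.Module):
    """Stand-in for a lightweight pretrained vision encoder, kept frozen."""

    def __init__(self, dim: int = 256, patch: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        for p in self.parameters():
            p.requires_grad_(False)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.proj(images)                # (B, dim, H/16, W/16)
        return feats.flatten(2).transpose(1, 2)  # (B, num_patches, dim)


class Resampler(nn.Module):
    """Cross-attention that compresses many patch features into a few visual tokens."""

    def __init__(self, dim: int, num_queries: int = 16, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        out, _ = self.attn(q, patch_feats, patch_feats)
        return out  # (B, num_queries, dim)


dim = 256
encoder, resampler = FrozenVisionEncoder(dim), Resampler(dim, num_queries=16)

# Fast path: skim distant, low-salience context as a rendered image.
distant_context = "a very long stretch of low-salience history " * 50
patch_feats = encoder(render_text_to_image(distant_context))  # (1, 196, 256)
visual_tokens = resampler(patch_feats)                        # (1, 16, 256)

# Slow path: the proximal window stays as ordinary text embeddings
# (placeholder tensor here) and is prepended with the compressed context.
proximal_embeds = torch.randn(1, 128, dim)
llm_input = torch.cat([visual_tokens, proximal_embeds], dim=1)
print(llm_input.shape)  # torch.Size([1, 144, 256]): 16 visual + 128 text tokens
```

The point of the sketch is the token arithmetic: hundreds of rendered patches collapse into a handful of query tokens, so the LLM's attention cost scales with the proximal window plus a small constant rather than with the full context length.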

Subject: NeurIPS.2025 - Spotlight