Stepwise Token Selection for Efficient Multimodal Large Language Models

Authors: Landi He, Shawn Young, Lijian Xu

In multimodal large language models (MLLMs), inference cost is largely dominated by the visual token prefix rather than the language backbone, making token reduction a key factor for improving efficiency. Existing approaches typically assign independent importance scores to visual tokens and retain a fixed number of top-ranked tokens, implicitly assuming token independence and a uniform compression ratio across inputs. In this work, we reformulate visual token pruning as a sequential decision-making process. Specifically, we introduce a pointer-style selection mechanism that iteratively chooses informative tokens, conditioning each decision on previously selected ones, and dynamically determines when to stop via a learned termination action. This enables joint optimization of both the selected subset and its size. To enable end-to-end training under standard language modeling objectives, we design a differentiable relaxation based on a variance-preserving noise interpolation scheme, allowing gradients to propagate through the discrete selection process. Extensive experiments on LLaVA-v1.5-7B and Qwen2.5-VL-7B demonstrate that our approach consistently outperforms fixed-ratio baselines across different compression levels. Under aggressive pruning that removes 88.9% of visual tokens, our method preserves 94.6% of the original accuracy while achieving a 1.88x speed-up in prefill latency.
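The sequential selection described in the abstract can be illustrated with a small inference-time sketch. This is not the authors' implementation: the function name `stepwise_select`, the projection `w_query`, the `stop_embedding` vector, and the way the selection state is summarized (global mean minus the mean of already selected tokens) are all hypothetical choices made for illustration. The sketch covers only greedy decoding with a stop action; the differentiable relaxation used for training is not reproduced here.

```python
import numpy as np


def stepwise_select(tokens, w_query, stop_embedding, max_steps=None):
    """Greedy pointer-style token selection with a stop action.

    Illustrative only; all parameter names are assumptions, not the paper's.
    tokens:         (n, d) array of visual token features
    w_query:        (d, d) projection that turns the state into a query
    stop_embedding: (d,) key vector representing the "stop" action
    Returns the list of selected token indices, in selection order.
    """
    n, d = tokens.shape
    max_steps = n if max_steps is None else min(max_steps, n)

    # Candidate keys: every visual token plus one extra row for "stop".
    keys = np.vstack([tokens, stop_embedding[None, :]])
    global_summary = tokens.mean(axis=0)

    selected = []
    taken = np.zeros(n + 1, dtype=bool)  # last slot (stop) is never masked

    for _ in range(max_steps):
        # Condition each decision on what has already been selected
        # (assumed state: global mean minus mean of selected tokens).
        context = tokens[selected].mean(axis=0) if selected else np.zeros(d)
        query = w_query @ (global_summary - context)

        # Pointer scores over all candidates; chosen tokens are masked out.
        scores = keys @ query / np.sqrt(d)
        scores[taken] = -np.inf

        choice = int(np.argmax(scores))
        if choice == n:  # the stop action won: subset size is decided here
            break
        selected.append(choice)
        taken[choice] = True

    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(16, 8))
    kept = stepwise_select(feats, np.eye(8), rng.normal(size=8))
    print(len(kept), "of", len(feats), "tokens kept:", kept)
```

Because the stop action competes with the remaining tokens at every step, the number of retained tokens varies per input instead of being fixed by a preset ratio, which is the property the abstract contrasts with top-k pruning.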

Subject: Computer Vision and Pattern Recognition

Publish: 2026-06-14 23:50:08 UTC

2606.16067
