SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Authors: Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang

In vision-language models (VLMs), visual tokens usually incur significant computational overhead, despite their sparser information density compared to text tokens. To address this, most existing methods learn a network to prune redundant visual tokens and require additional training data. In contrast, we propose an efficient training-free token optimization mechanism dubbed **SparseVLM** that needs no extra parameters or fine-tuning costs. Concretely, given that visual tokens complement text tokens in VLMs for linguistic reasoning, we select visual-relevant text tokens to rate the significance of vision tokens within the self-attention matrix extracted from the VLMs. We then progressively prune irrelevant tokens. To maximize sparsity while retaining essential information, we introduce a rank-based strategy to adaptively determine the sparsification ratio for each layer, alongside a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks. In particular, when LLaVA is equipped with SparseVLM, it achieves a 54% reduction in FLOPs, lowers CUDA time by 37%, and maintains an accuracy rate of 97%. Our code is available at https://github.com/Gumpest/SparseVLMs.
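The pruning idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`select_raters`, `score_visual_tokens`, `sparsify`) are hypothetical, the attention matrix is assumed to be given as text-token rows over visual-token columns, the number of kept tokens is a fixed `keep` budget rather than the rank-based adaptive ratio, and recycling is simplified to averaging all pruned tokens into one vector.

```python
from typing import List, Tuple

Vector = List[float]


def select_raters(attn: List[Vector]) -> List[int]:
    """Pick text tokens whose mean attention to visual tokens is at least average.

    Assumption: attn[i][j] is the attention from text token i to visual token j.
    """
    row_means = [sum(row) / len(row) for row in attn]
    threshold = sum(row_means) / len(row_means)
    return [i for i, mean in enumerate(row_means) if mean >= threshold]


def score_visual_tokens(attn: List[Vector], raters: List[int]) -> Vector:
    """Score each visual token by the mean attention it gets from the raters."""
    n_visual = len(attn[0])
    return [sum(attn[r][j] for r in raters) / len(raters) for j in range(n_visual)]


def sparsify(
    tokens: List[Vector], attn: List[Vector], keep: int
) -> Tuple[List[Vector], List[int]]:
    """Keep the `keep` highest-scoring visual tokens and recycle the rest.

    Simplification: all pruned tokens are averaged into a single extra token.
    Returns the new token list and the original indices of the kept tokens.
    """
    raters = select_raters(attn)
    scores = score_visual_tokens(attn, raters)
    # Stable sort: ties keep their original order.
    order = sorted(range(len(tokens)), key=lambda j: scores[j], reverse=True)
    kept_idx = sorted(order[:keep])
    pruned_idx = order[keep:]
    result = [tokens[j] for j in kept_idx]
    if pruned_idx:
        dim = len(tokens[0])
        merged = [
            sum(tokens[j][d] for j in pruned_idx) / len(pruned_idx)
            for d in range(dim)
        ]
        result.append(merged)
    return result, kept_idx


if __name__ == "__main__":
    # Two text tokens, four visual tokens with 2-dimensional features (toy data).
    attn = [[0.4, 0.1, 0.4, 0.1], [0.1, 0.1, 0.1, 0.1]]
    tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0], [0.0, 4.0]]
    new_tokens, kept = sparsify(tokens, attn, keep=2)
    print(kept, new_tokens)
```

In this toy setup only the first text token attends strongly enough to act as a rater, so visual tokens 0 and 2 are kept and tokens 1 and 3 are merged into one recycled vector. A per-layer adaptive budget, as the abstract describes, would replace the fixed `keep` argument.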

Subject: ICML.2025 - Poster

80faIPZ67S@OpenReview
