Vision language models have received increasing attention for their ability to integrate visual and textual understanding, with some capable of processing native-resolution images and long videos. While the capacity to process large amounts of visual data unlocks numerous downstream applications, it often introduces significant latency, as visual tokens dominate resource consumption. In this work, we introduce SparseVILA, a query-aware token retrieval method that dynamically accelerates the underlying LLM by pruning visual tokens in the prefill stage and attending to only a sparse subset of visual tokens during decoding. By decoupling context compression from generation compression, we can migrate most of the sparsity into the generation stage, enabling query-aware support for multi-turn conversation while achieving a 1.4x speedup on image benchmarks. This approach also yields a +5.9% average accuracy improvement on image-centric benchmarks over prior work. Finally, SparseVILA enables efficient long-context and long-generation tasks, achieving 3.6x and 1.7x speedups in prefill and decoding, respectively.
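To make the core idea concrete, the following is a minimal sketch of query-aware visual token selection: visual tokens are ranked by their similarity to the text-query tokens and only the most relevant fraction is kept before the expensive LLM stages. The function name, the dot-product scoring rule, and the keep_ratio parameter are illustrative assumptions, not SparseVILA's actual retrieval criterion.

```python
import torch

def select_visual_tokens(visual_tokens, query_tokens, keep_ratio=0.25):
    """Keep the visual tokens most relevant to the query.

    Assumed scoring rule: dot-product similarity between each visual token
    and every query token, reduced by taking the strongest match.
    """
    scores = visual_tokens @ query_tokens.T            # (num_visual, num_query)
    relevance = scores.max(dim=-1).values               # best match per visual token
    k = max(1, int(keep_ratio * visual_tokens.shape[0]))
    keep_idx = relevance.topk(k).indices.sort().values  # preserve original token order
    return visual_tokens[keep_idx], keep_idx

# Toy usage: 576 visual tokens, 32 query (text) tokens, hidden size 64.
visual = torch.randn(576, 64)
query = torch.randn(32, 64)
pruned, kept = select_visual_tokens(visual, query, keep_ratio=0.25)
print(pruned.shape, kept.shape)  # torch.Size([144, 64]) torch.Size([144])
```

In this toy setup, pruning to a quarter of the visual tokens shrinks the prefill context proportionally; the abstract's decoupling of prefill pruning from decoding-time sparse attention is not reproduced here.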