Vision language models have received increasing attention for their ability to integrate visual and textual understanding, with some capable of processing native-resolution images and long videos. While the capacity to process large amounts of visual data unlocks numerous downstream applications, it often introduces significant latency, as visual tokens dominate resource consumption. In this work, we introduce SparseVILA, a query-aware token retrieval method that dynamically accelerates the underlying LLM by pruning visual tokens in the prefill stage and attending to only a sparse subset of visual tokens during decoding. By decoupling context compression from generation compression, we can migrate most of the sparsity into the generation stage, enabling query-aware support for multi-turn conversation while achieving a 1.4x speedup on image benchmarks. This approach also yields a +5.9% average accuracy improvement on image-centric benchmarks over prior work. Finally, SparseVILA enables efficient long-context and long-generation tasks, achieving 3.6x and 1.7x speedups in prefill and decoding, respectively.
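To make the core idea concrete, the following is a minimal sketch of query-aware visual token selection: visual tokens are ranked by their similarity to the text-query tokens and only the most relevant fraction is kept before the expensive LLM stages. The function name, the dot-product scoring rule, and the keep_ratio parameter are illustrative assumptions, not SparseVILA's actual retrieval criterion.

```python
import torch

def select_visual_tokens(visual_tokens, query_tokens, keep_ratio=0.25):
    """Keep the visual tokens most relevant to the query.

    Assumed scoring rule: dot-product similarity between each visual token
    and every query token, reduced by taking the strongest match.
    """
    scores = visual_tokens @ query_tokens.T            # (num_visual, num_query)
    relevance = scores.max(dim=-1).values               # best match per visual token
    k = max(1, int(keep_ratio * visual_tokens.shape[0]))
    keep_idx = relevance.topk(k).indices.sort().values  # preserve original token order
    return visual_tokens[keep_idx], keep_idx

# Toy usage: 576 visual tokens, 32 query (text) tokens, hidden size 64.
visual = torch.randn(576, 64)
query = torch.randn(32, 64)
pruned, kept = select_visual_tokens(visual, query, keep_ratio=0.25)
print(pruned.shape, kept.shape)  # torch.Size([144, 64]) torch.Size([144])
```

In this toy setup, pruning to a quarter of the visual tokens shrinks the prefill context proportionally; the abstract's decoupling of prefill pruning from decoding-time sparse attention is not reproduced here.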