Vision-language models (VLMs) rely heavily on visual representations, yet making these representations efficient remains a critical challenge. Most existing approaches reduce the number of visual tokens either in the visual encoder or in the LLM decoder. Inspired by human visual cognition, in which an initial global glance precedes focused attention on semantically salient regions, we introduce Glance2Gaze, a framework that mimics this two-stage attention process. The framework consists of two key components: a Glance Fusion module, which integrates multi-layer vision transformer features with text-aware attention to produce a semantically enriched global representation, and a Gaze Compression module, which uses a novel query-guided mechanism to selectively compress visual tokens according to their semantic relevance. Experimental results on widely adopted benchmarks show that Glance2Gaze outperforms existing methods, achieving superior performance at equal or lower computational cost. It also generalizes well to high-resolution and video settings, demonstrating robust and scalable efficiency gains for VLMs.
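To make the two-stage idea concrete, the following is a minimal, illustrative sketch of a "glance then gaze" pipeline under our own assumptions: a glance step that fuses multi-layer ViT features with text-conditioned layer weights, followed by a gaze step that keeps only the visual tokens most relevant to the text query. Module names, projections, and hyperparameters here are hypothetical and are not the paper's actual implementation.

```python
# Illustrative sketch only; all module names and shapes are assumptions,
# not the Glance2Gaze reference implementation.
import torch
import torch.nn as nn


class GlanceFusion(nn.Module):
    """Fuse features from several ViT layers, weighting each layer by its
    relevance to a pooled text embedding (text-aware attention)."""

    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        # Maps the text embedding to one weight per ViT layer.
        self.layer_scorer = nn.Linear(dim, num_layers)

    def forward(self, vit_layers: list, text_emb: torch.Tensor) -> torch.Tensor:
        # vit_layers: list of [B, N, D] features from selected ViT layers
        # text_emb:   [B, D] pooled text representation
        stacked = torch.stack(vit_layers, dim=1)               # [B, L, N, D]
        weights = self.layer_scorer(text_emb).softmax(dim=-1)  # [B, L]
        fused = (weights[:, :, None, None] * stacked).sum(dim=1)  # [B, N, D]
        return fused


class GazeCompression(nn.Module):
    """Keep only the visual tokens most attended by a text-conditioned query."""

    def __init__(self, dim: int, keep_tokens: int):
        super().__init__()
        self.keep_tokens = keep_tokens
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)

    def forward(self, visual: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # visual: [B, N, D] fused visual tokens, text_emb: [B, D]
        q = self.q_proj(text_emb).unsqueeze(1)            # [B, 1, D]
        k = self.k_proj(visual)                           # [B, N, D]
        scores = (q @ k.transpose(1, 2)).squeeze(1)       # [B, N] relevance per token
        idx = scores.topk(self.keep_tokens, dim=-1).indices
        idx = idx.unsqueeze(-1).expand(-1, -1, visual.size(-1))
        return visual.gather(1, idx)                      # [B, keep_tokens, D]


if __name__ == "__main__":
    B, N, D, L = 2, 576, 1024, 4
    layers = [torch.randn(B, N, D) for _ in range(L)]
    text = torch.randn(B, D)
    fused = GlanceFusion(D, L)(layers, text)
    compressed = GazeCompression(D, keep_tokens=64)(fused, text)
    print(compressed.shape)  # torch.Size([2, 64, 1024])
```

In this sketch the compressed token set (64 tokens instead of 576) is what would be passed to the LLM decoder, which is where the computational savings would come from; the actual fusion and compression mechanisms in Glance2Gaze are described in the body of the paper.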