GLIMPSE: Holistic Cross-Modal Explainability for Large Vision-Language Models

Recent large vision-language models (LVLMs) have advanced capabilities in visual question answering (VQA). Interpreting where LVLMs direct their visual attention is essential for understanding model behavior, yet it remains a significant challenge. We introduce GLIMPSE (Gradient-Layer Importance Mapping for Prompted Visual Saliency Explanation), a lightweight, model-agnostic framework that jointly attributes LVLM outputs to the most relevant visual evidence and textual signals that support open-ended generation. GLIMPSE fuses gradient-weighted attention, adaptive layer propagation, and relevance-weighted token aggregation to produce holistic response-level heat maps for interpreting cross-modal reasoning, outperforming prior methods in faithfulness and advancing the state of the art in human-attention alignment. We demonstrate an analytic approach to uncover fine-grained insights into LVLM cross-modal attribution, trace reasoning dynamics, analyze systematic misalignment, diagnose hallucination and bias, and ensure transparency.

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Published: 2025-06-23 18:00:04 UTC

2506.18985
