SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs


Authors: Bo Yin, Xiaobin Hu, Chengming Xu, Ruolin Shen, Mo Yang, Jiangning Zhang, Peng-Tao Jiang, Cheng Tan, Shuicheng YAN

Vision-language models (VLMs) often underperform on evidence-intensive tasks because the decisive visual evidence is small, localized, and easy to overlook, leading to failures in evidence readout even when high-level reasoning is intact. Prior inference-time visual interventions can improve grounding without retraining, but they are largely open-loop and lack a mechanism to verify whether the highlighted evidence is actually used. We study answer-span prediction entropy as a model-internal feedback signal and show that naive entropy minimization is ambiguous, since low entropy may arise from either evidence-grounded confidence or shortcut collapse. To resolve this ambiguity, we introduce low-entropy anchors and an entropy-shaping objective that reduces answer uncertainty while preserving baseline high-confidence tokens. We instantiate this principle in SPOT-E, a plug-and-play test-time method that produces question-conditioned spotlights, optimized per instance via lightweight tuning based on Group Relative Policy Optimization (GRPO). Across all benchmarks and different VLM families, SPOT-E yields consistent gains and improved robustness under visual corruptions. Code is publicly available at: https://github.com/YinBo0927/SPOT-E
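
The ambiguity of naive entropy minimization described in the abstract can be illustrated with a small sketch. This is not the authors' objective, whose exact form the abstract does not give. It is a toy version under stated assumptions: the answer span is scored at the same token positions with and without the spotlight, anchors are the positions where baseline entropy falls below a hypothetical threshold `tau`, and "preserving" an anchor means keeping its top-1 token, weighted by a hypothetical `lam`. All function names are illustrative, and spotlight generation and the GRPO-based tuning are not covered.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def span_entropy(dists):
    """Mean token entropy over an answer span (a list of distributions)."""
    return sum(token_entropy(d) for d in dists) / len(dists)

def anchor_positions(baseline_dists, tau=0.3):
    """Positions where the baseline model (no spotlight) is already confident.
    The threshold tau is an assumption made for this sketch."""
    return [i for i, d in enumerate(baseline_dists) if token_entropy(d) < tau]

def top1(dist):
    """Index of the most probable token."""
    return max(range(len(dist)), key=dist.__getitem__)

def shaped_score(baseline_dists, spot_dists, tau=0.3, lam=1.0):
    """Toy shaped objective (lower is better): answer-span entropy under the
    spotlight plus a penalty for anchors whose top-1 token changed."""
    anchors = anchor_positions(baseline_dists, tau)
    changed = sum(top1(baseline_dists[i]) != top1(spot_dists[i]) for i in anchors)
    penalty = changed / len(anchors) if anchors else 0.0
    return span_entropy(spot_dists) + lam * penalty

# Two-token answer span over a three-token vocabulary (made-up numbers).
baseline = [[0.98, 0.01, 0.01], [0.40, 0.30, 0.30]]   # token 0 is an anchor
grounded = [[0.97, 0.02, 0.01], [0.90, 0.05, 0.05]]   # anchor kept, token 1 sharpened
collapsed = [[0.01, 0.98, 0.01], [0.02, 0.97, 0.01]]  # confident, but anchor flipped

naive_prefers_collapse = span_entropy(collapsed) < span_entropy(grounded)
shaped_prefers_grounded = (
    shaped_score(baseline, grounded) < shaped_score(baseline, collapsed)
)
print(naive_prefers_collapse, shaped_prefers_grounded)
```

In this toy case the collapsed prediction has the lower raw entropy, so naive minimization would favor it, while the anchor penalty makes the shaped score favor the prediction that leaves the baseline's confident token intact.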

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2026-06-18 13:56:30 UTC

2606.20244
