LOCUS: Local Visual Cue Search for Enhancing Fine-Grained Perception in Multimodal Large Language Models

Authors: Zhou Tao, Fang Zhang, Zewen Ding, Shida Wang, Xiaokun Sun, YongXiang Hua, Haoyu Cao, Linli Xu

Multimodal Large Language Models (MLLMs) remain unreliable on fine-grained visual perception, even when high-resolution inputs preserve the necessary local details. We identify this limitation as visual context rot: decisive evidence may exist in the full image, yet fail to be reliably selected and used amid redundant visual context. We propose LOCUS (LOcal visual CUe Search), a training framework that teaches MLLMs to internalize local evidence search through a verifiable proxy task. During training, LOCUS provides a local crop as a visual cue and optimizes the model to recover its spatial support in the full image using an IoU-based reward. The visual cue is used only during training, leaving the standard image-question inference interface unchanged. Experiments across fine-grained perception, hallucination, general understanding, and reasoning benchmarks show that LOCUS improves localization-sensitive visual understanding while preserving broad capabilities. Attention analyses further indicate stronger focus on task-relevant evidence regions, suggesting that training-time visual cue search provides an effective route to internalized fine-grained evidence selection.
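The abstract describes the training signal only at a high level: the model predicts where a local crop sits in the full image and is scored with an IoU-based reward. The sketch below illustrates what such a reward could look like. It assumes axis-aligned boxes in `(x1, y1, x2, y2)` pixel coordinates and uses plain IoU as the reward; the names `Box`, `box_area`, and `iou_reward` are hypothetical, and the paper's actual reward shaping and output format are not stated in this listing.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate or inverted boxes count as 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou_reward(pred: Box, target: Box) -> float:
    """Intersection-over-union between a predicted box and the crop's true box.

    Returns a value in [0, 1]; 0.0 when the boxes do not overlap or when
    both boxes are empty (an illustrative convention, not from the paper).
    """
    # Corners of the intersection rectangle (may be inverted if disjoint).
    ix1 = max(pred[0], target[0])
    iy1 = max(pred[1], target[1])
    ix2 = min(pred[2], target[2])
    iy2 = min(pred[3], target[3])

    inter = box_area((ix1, iy1, ix2, iy2))
    union = box_area(pred) + box_area(target) - inter
    return inter / union if union > 0 else 0.0
```

Because the crop's location in the full image is known when the crop is cut, a reward of this kind can be checked automatically, which is what makes the proxy task verifiable without extra annotation.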

Subject: Computer Vision and Pattern Recognition

Publish: 2026-06-15 11:30:56 UTC

arXiv: 2606.16586
