Pixel-Grounded Retrieval for Knowledgeable Large Multimodal Models


Authors: Jeonghwan Kim, Renjie Tao, Sanat Sharma, Jiaqi Wang, Kai Sun, Zhaojiang Lin, Seungwhan Moon, Lambert Mathias, Anuj Kumar, Heng Ji, Xin Luna Dong

Visual Question Answering (VQA) often requires coupling fine-grained perception with factual knowledge beyond the input image. Prior multimodal Retrieval-Augmented Generation (MM-RAG) systems improve factual grounding but lack an internal policy for when and how to retrieve. We propose PixSearch, the first end-to-end Segmenting Large Multimodal Model (LMM) that unifies region-level perception and retrieval-augmented reasoning. During encoding, PixSearch emits <search> tokens to trigger retrieval, selects query modalities (text, image, or region), and generates pixel-level masks that directly serve as visual queries, eliminating reliance on modular pipelines (detectors, segmenters, captioners, etc.). A two-stage supervised fine-tuning regimen with search-interleaved supervision teaches retrieval timing and query selection while preserving segmentation ability. On egocentric and entity-centric VQA benchmarks, PixSearch substantially improves factual consistency and generalization, yielding a 19.7% relative gain in accuracy on CRAG-MM compared to whole-image retrieval, while retaining competitive reasoning performance on various VQA and text-only QA tasks.
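The retrieval policy described in the abstract (emit a <search> token, pick a query modality, optionally pass a pixel mask as the query) can be illustrated with a toy control loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: `Query`, `answer_with_retrieval`, `toy_step`, `toy_retrieve`, the `<eos>` token, and the `[retrieved:...]` context format are all hypothetical names introduced here for illustration, and the model and retriever are replaced by scripted stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

SEARCH = "<search>"  # trigger token named in the abstract
EOS = "<eos>"        # assumed end-of-sequence marker (not from the abstract)


@dataclass
class Query:
    """Hypothetical retrieval query: modality is 'text', 'image', or 'region'."""
    modality: str
    text: Optional[str] = None
    mask: Optional[List[List[int]]] = None  # binary pixel mask for region queries


def answer_with_retrieval(
    step: Callable[[List[str]], Tuple[str, Optional[Query]]],
    retrieve: Callable[[Query], str],
    max_steps: int = 8,
) -> List[str]:
    """Interleave generation and retrieval.

    `step` maps the current context to the next token plus an optional query;
    whenever the token is SEARCH, the retrieved text is appended to the context.
    """
    context: List[str] = []
    for _ in range(max_steps):
        token, query = step(context)
        if token == SEARCH and query is not None:
            context.append(f"[retrieved:{query.modality}] {retrieve(query)}")
            continue
        if token == EOS:
            break
        context.append(token)
    return context


def toy_step(context: List[str]) -> Tuple[str, Optional[Query]]:
    # Scripted stand-in for the model: search once with a region mask, then answer.
    if not any(c.startswith("[retrieved") for c in context):
        return SEARCH, Query("region", mask=[[0, 1], [1, 1]])
    if context[-1].startswith("[retrieved"):
        return "answer", None
    return EOS, None


def toy_retrieve(query: Query) -> str:
    # Stand-in retriever: reports the masked-pixel count so the mask visibly matters.
    pixels = sum(map(sum, query.mask)) if query.mask else 0
    return f"fact for {pixels}-pixel region"


result = answer_with_retrieval(toy_step, toy_retrieve)
print(result)
```

In this sketch the decision of when to retrieve lives inside `step`, mirroring the abstract's point that the policy is internal to the model rather than imposed by an external pipeline; a real system would replace both stand-ins with the model's decoder and an actual retrieval index.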

Subjects: Computer Vision and Pattern Recognition, Artificial Intelligence

Publish: 2026-01-27 00:46:08 UTC

2601.19060
