Video Large Language Models (VideoLLMs) have recently demonstrated remarkable progress in general video understanding. However, existing models primarily focus on high-level comprehension and are limited to text-only responses, restricting their flexibility for object-centric, multi-round interactions. In this paper, we make three contributions: (i) we address these limitations by introducing a VideoLLM, termed **RGA3**, capable of performing both object referring and grounding for video reasoning tasks in a multi-round conversational manner, i.e., allowing users to iteratively interact with videos using both textual and visual queries; (ii) we propose **STOM** (Spatial-Temporal Overlay Module), a novel approach that allows arbitrary visual prompts to be processed at any timestamp within a video; (iii) we present **VideoInfer**, a manually curated object-centric video instruction dataset featuring question-answering pairs that require reasoning. We conduct comprehensive experiments on VideoInfer and other existing benchmarks across video question answering and referring video object segmentation. The results on 12 benchmarks spanning 6 tasks show that RGA3 consistently outperforms baseline models in both video question answering and segmentation, underscoring its robustness in multimodal, object-centric video and image understanding. The code, dataset, and web demo will be publicly released.