2025.findings-emnlp.50@ACL

#1 Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

Authors: Haibo Wang, Zhiyang Xu, Yu Cheng, Shizhe Diao, Yufan Zhou, Yixin Cao, Qifan Wang, Weifeng Ge, Lifu Huang

Despite their impressive performance in coarse-grained video understanding, Video Large Language Models (Video-LLMs) still struggle with fine-grained temporal grounding, owing to ineffective temporal modeling and inadequate timestamp representations. In this work, we introduce Grounded-VideoLLM, a novel Video-LLM designed to perceive and reason over specific video moments with fine-grained temporal precision. Our model features (1) a two-stream encoder that explicitly captures inter-frame relationships while preserving intra-frame visual details and (2) discrete temporal tokens enriched with structured time knowledge for timestamp representation. We also propose a multi-stage training strategy tailored to this grounding-specific architecture. The model is first trained on simple video-caption tasks and then progressively introduced to more complex video temporal grounding tasks, ensuring a smooth learning curve and temporal alignment. We further strengthen Grounded-VideoLLM’s temporal reasoning by constructing a VideoQA dataset with grounded information using an automated annotation pipeline. Extensive experiments demonstrate that Grounded-VideoLLM not only surpasses existing models on fine-grained grounding tasks but also exhibits strong potential as a general video understanding assistant.
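
To make the idea of discrete temporal tokens concrete, the following is a minimal sketch of one way a continuous timestamp could be quantized into a token and decoded back. It is an illustration under stated assumptions, not the paper's implementation: the token format `<T_k>`, the bin count `num_bins`, and both function names are hypothetical, and the abstract does not specify how the tokens are defined or how "structured time knowledge" is injected.

```python
def timestamp_to_token(t_seconds: float, duration: float, num_bins: int = 300) -> str:
    """Map a timestamp to a discrete temporal token (hypothetical '<T_k>' format).

    Assumption: timestamps are expressed relative to the video duration and
    quantized uniformly into `num_bins` bins.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    if num_bins < 2:
        raise ValueError("num_bins must be at least 2")
    # Clamp so that out-of-range timestamps still map to a valid token.
    frac = min(max(t_seconds / duration, 0.0), 1.0)
    # Scale to [0, num_bins - 1] and round to the nearest bin index.
    index = round(frac * (num_bins - 1))
    return f"<T_{index}>"


def token_to_timestamp(token: str, duration: float, num_bins: int = 300) -> float:
    """Decode a hypothetical '<T_k>' token back to seconds (bin position)."""
    index = int(token[len("<T_"):-1])
    return index / (num_bins - 1) * duration


if __name__ == "__main__":
    # A moment from 12.4 s to 18.9 s in a 60 s video, expressed as two tokens.
    start = timestamp_to_token(12.4, 60.0)
    end = timestamp_to_token(18.9, 60.0)
    print(start, end)
    print(token_to_timestamp(start, 60.0), token_to_timestamp(end, 60.0))
```

Under this scheme the quantization error is bounded by half a bin width, so a larger `num_bins` trades vocabulary size for temporal precision.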

Subject: EMNLP.2025 - Findings