VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning

Authors: Zishan Xu, Yifu Guo, Yuquan Lu, Fengyu Yang, Junxin Li

Traditional video reasoning segmentation methods rely on supervised fine-tuning, which limits generalization to out-of-distribution scenarios and lacks explicit reasoning. To address this, we propose VideoSeg-R1, the first framework to introduce reinforcement learning into video reasoning segmentation. It adopts a decoupled architecture that formulates the task as joint referring image segmentation and video mask propagation. It comprises three stages: (1) a hierarchical text-guided frame sampler that emulates human attention; (2) a reasoning model that produces spatial cues along with explicit reasoning chains; and (3) a segmentation-propagation stage using SAM2 and XMem. A task difficulty-aware mechanism adaptively controls reasoning length for better efficiency and accuracy. Extensive evaluations on multiple benchmarks demonstrate that VideoSeg-R1 achieves state-of-the-art performance in complex video reasoning and segmentation tasks. The code will be publicly available at https://github.com/euyis1019/VideoSeg-R1.
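
The decoupled three-stage design described in the abstract can be pictured with a minimal Python sketch. This is only an illustration under stated assumptions, not the authors' implementation: `SpatialCue`, `run_pipeline`, and the four stage callables are hypothetical names, the spatial cue is assumed to be a bounding box, and the difficulty-aware control of reasoning length is not modeled. In the paper's setting, the `segment` and `propagate` stages would be backed by SAM2 and XMem respectively; here they are plain callables so that no real library API is implied.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Frame = List[List[int]]          # placeholder type for one video frame
Mask = List[List[int]]           # placeholder type for one binary mask
Box = Tuple[int, int, int, int]  # assumed cue format: (x0, y0, x1, y1)


@dataclass
class SpatialCue:
    """Hypothetical output of the reasoning stage for one key frame."""
    frame_index: int
    box: Box
    reasoning: str  # the explicit reasoning chain, kept for inspection


def run_pipeline(
    frames: Sequence[Frame],
    query: str,
    sample_frames: Callable[[Sequence[Frame], str], List[int]],
    reason: Callable[[Frame, int, str], SpatialCue],
    segment: Callable[[Frame, Box], Mask],
    propagate: Callable[[Sequence[Frame], Dict[int, Mask]], List[Mask]],
) -> List[Mask]:
    """Chain the three stages; every stage is injected as a callable."""
    # Stage 1: text-guided sampling picks the key-frame indices.
    key_indices = sample_frames(frames, query)

    # Stage 2: the reasoning model returns a spatial cue per key frame.
    cues = [reason(frames[i], i, query) for i in key_indices]

    # Stage 3a: turn each cue into a key-frame mask (image segmentation).
    key_masks = {c.frame_index: segment(frames[c.frame_index], c.box) for c in cues}

    # Stage 3b: spread the key-frame masks over the whole video.
    return propagate(frames, key_masks)
```

Passing the stages in as callables mirrors the decoupling the abstract emphasizes: the sampler, the reasoning model, and the segmentation and propagation back ends can each be swapped or tested in isolation.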

Subject: Computer Vision and Pattern Recognition

Publish: 2025-11-20 06:12:25 UTC

arXiv: 2511.16077
