Recent advances in reinforcement learning (RL) have substantially improved the reasoning capabilities of multimodal large language models (MLLMs). While methods such as Group Relative Policy Optimization (GRPO) and rule-based reward mechanisms show promise in text and image domains, their application to video understanding remains underexplored. This paper systematically investigates RL with GRPO for video MLLMs, proposing a framework that enhances spatio-temporal perception without compromising general capabilities. Using a reward mechanism that integrates format and spatio-temporal rewards to guide optimization effectively with limited samples, we develop STAR-R1, a framework that achieves state-of-the-art performance on video understanding tasks and exhibits strong spatio-temporal reasoning on video perception benchmarks. Compared to Qwen2.5-VL-7B, STAR-R1 improves performance on temporal grounding and on general video QA benchmarks such as VideoMME and MVBench. It also substantially improves performance on perception tasks such as video reasoning segmentation, demonstrating that STAR-R1 generalizes across video perception domains while producing an explicit reasoning process. Our findings underscore the potential of reinforcement learning for improving the video perception capacity of video MLLMs.
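
To make the reward design concrete, the sketch below shows one plausible way to combine a rule-based format reward with a spatio-temporal reward (here, temporal IoU against a ground-truth segment) into a single scalar used to score GRPO rollouts. The function names, the `<think>/<answer>` template, the answer parsing, and the weights are illustrative assumptions, not the paper's actual implementation.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response follows the expected <think>...</think><answer>...</answer>
    template, else 0.0. (The exact template is an assumption for illustration.)"""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0

def temporal_iou(pred: tuple, gt: tuple) -> float:
    """Temporal IoU between predicted and ground-truth (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def spatio_temporal_reward(response: str, gt_segment: tuple) -> float:
    """Parse a predicted time span from the <answer> block and score it by temporal IoU.
    The 'START to END seconds' answer format is an illustrative assumption."""
    match = re.search(r"<answer>.*?([\d.]+)\s*to\s*([\d.]+).*?</answer>",
                      response, flags=re.DOTALL)
    if not match:
        return 0.0
    pred = (float(match.group(1)), float(match.group(2)))
    return temporal_iou(pred, gt_segment)

def total_reward(response: str, gt_segment: tuple,
                 w_format: float = 0.5, w_st: float = 1.0) -> float:
    """Weighted sum of format and spatio-temporal rewards, used as the scalar GRPO reward."""
    return w_format * format_reward(response) + w_st * spatio_temporal_reward(response, gt_segment)

# Example: score one sampled rollout for a temporal-grounding query.
rollout = ("<think>The action starts right after the door opens.</think>"
           "<answer>12.0 to 25.5 seconds</answer>")
print(total_reward(rollout, gt_segment=(11.5, 26.0)))
```

In GRPO, each sampled response in a group would be scored with such a rule-based reward and the group-normalized scores would serve as advantages, so no learned reward model is required.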