🤖 AI Summary
Traditional video reasoning segmentation methods suffer from reliance on supervised fine-tuning, poor out-of-distribution generalization, and absence of explicit reasoning. To address these limitations, this paper proposes the first reinforcement learning (RL)-based video reasoning segmentation framework. Methodologically: (1) it decouples referring expression grounding from video mask propagation, jointly optimizing spatial localization and temporal consistency; (2) it introduces a hierarchical text-guided frame sampling mechanism to enhance robustness in keyframe selection; (3) it incorporates a task-difficulty-aware adaptive reasoning chain length control strategy to balance efficiency and accuracy; and (4) it integrates SAM2 and XMem into a segmentation propagation module, augmented with an explicit reasoning model. Evaluated on multiple benchmarks, the framework achieves state-of-the-art performance, significantly improving both out-of-distribution generalization and accuracy on complex, multi-step reasoning tasks.
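The summary does not specify how the hierarchical text-guided frame sampler works, so the following is only a minimal sketch of one plausible coarse-to-fine reading. It assumes each frame already has a text-relevance score (for example, a similarity between a frame embedding and the expression embedding). The function name `hierarchical_sample` and all of its parameters are hypothetical and not taken from the paper.

```python
from typing import List, Sequence, Tuple


def hierarchical_sample(
    scores: Sequence[float],
    num_segments: int,
    top_segments: int,
    frames_per_segment: int,
) -> List[int]:
    """Coarse-to-fine keyframe selection from per-frame text-relevance scores.

    Hypothetical illustration: `scores[i]` is assumed to measure how well
    frame i matches the referring expression.
    """
    n = len(scores)
    if n == 0:
        return []
    # Never create more segments than frames, so every segment is non-empty.
    num_segments = max(1, min(num_segments, n))

    # Coarse stage: split frame indices into contiguous, near-equal segments.
    bounds: List[Tuple[int, int]] = [
        (i * n // num_segments, (i + 1) * n // num_segments)
        for i in range(num_segments)
    ]

    # Rank segments by their mean relevance to the text query.
    ranked = sorted(
        bounds,
        key=lambda b: sum(scores[b[0]:b[1]]) / (b[1] - b[0]),
        reverse=True,
    )

    # Fine stage: inside each retained segment, keep the best-scoring frames.
    selected: List[int] = []
    for start, end in ranked[:top_segments]:
        best = sorted(range(start, end), key=lambda i: scores[i], reverse=True)
        selected.extend(best[:frames_per_segment])

    # Return keyframes in temporal order.
    return sorted(selected)
```

With scores `[0.1, 0.2, 0.9, 0.8, 0.1, 0.0, 0.3, 0.7]`, four segments, two retained segments, and one frame per segment, the sketch selects frames 2 and 7: the coarse stage keeps the second and fourth segments, and the fine stage picks the top frame in each.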
📝 Abstract
Traditional video reasoning segmentation methods rely on supervised fine-tuning, which limits their generalization to out-of-distribution scenarios, and they lack explicit reasoning. To address this, we propose VideoSeg-R1, the first framework to introduce reinforcement learning into video reasoning segmentation. It adopts a decoupled architecture that formulates the task as joint referring image segmentation and video mask propagation. It comprises three stages: (1) a hierarchical text-guided frame sampler that emulates human attention; (2) a reasoning model that produces spatial cues along with explicit reasoning chains; and (3) a segmentation-propagation stage using SAM2 and XMem. A task-difficulty-aware mechanism adaptively controls reasoning length to balance efficiency and accuracy. Extensive evaluations on multiple benchmarks demonstrate that VideoSeg-R1 achieves state-of-the-art performance in complex video reasoning and segmentation tasks. The code will be publicly available at https://github.com/euyis1019/VideoSeg-R1.
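The abstract states that reasoning length is controlled adaptively by task difficulty but gives no formula. As a hedged illustration only, the sketch below assumes a difficulty estimate in [0, 1], maps it linearly to a token budget, and scores a reasoning chain with a length term that decays once the budget is exceeded. The names `reasoning_budget` and `length_reward`, the default bounds, and the linear shapes are all assumptions rather than the paper's actual mechanism.

```python
def reasoning_budget(
    difficulty: float, min_tokens: int = 32, max_tokens: int = 512
) -> int:
    """Map a difficulty estimate in [0, 1] to a reasoning-token budget.

    Hypothetical: easy samples get a short budget, hard samples a long one.
    """
    d = min(max(difficulty, 0.0), 1.0)  # clamp to [0, 1]
    return int(round(min_tokens + d * (max_tokens - min_tokens)))


def length_reward(num_tokens: int, budget: int) -> float:
    """Return 1.0 within budget, decaying linearly to 0.0 at twice the budget.

    Hypothetical reward-shaping term that could be combined with an accuracy
    reward during RL training.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    if num_tokens <= budget:
        return 1.0
    return max(0.0, 1.0 - (num_tokens - budget) / budget)
```

Under these assumptions, a sample with difficulty 0.5 receives a budget of 272 tokens. A 408-token chain on that sample earns a length reward of 0.5, and a chain of 544 tokens or more earns 0.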