VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning

📅 2025-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional video reasoning segmentation methods rely on supervised fine-tuning, generalize poorly to out-of-distribution data, and lack explicit reasoning. To address these limitations, this paper proposes VideoSeg-R1, which it presents as the first reinforcement learning (RL)-based framework for video reasoning segmentation. The method (1) decouples referring expression grounding from video mask propagation, jointly optimizing spatial localization and temporal consistency; (2) introduces a hierarchical text-guided frame sampling mechanism to make keyframe selection more robust; (3) uses a task-difficulty-aware strategy that adapts the length of the reasoning chain to balance efficiency and accuracy; and (4) pairs an explicit reasoning model with a segmentation-propagation module built on SAM2 and XMem. On multiple benchmarks, the framework is reported to achieve state-of-the-art performance, with improved out-of-distribution generalization and higher accuracy on complex, multi-step reasoning tasks.
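
To make the idea of hierarchical text-guided frame sampling concrete, here is a minimal coarse-to-fine sketch. It is not the paper's sampler: the summary does not say how frames are scored or how many levels are used, so the per-frame `scores`, the two-level scheme, and the function name `hierarchical_sample` are all assumptions made for illustration.

```python
from typing import List, Sequence


def hierarchical_sample(scores: Sequence[float], num_segments: int, top_segments: int) -> List[int]:
    """Pick keyframes coarse-to-fine from per-frame text-relevance scores.

    scores[i] is a hypothetical relevance of frame i to the query text
    (for example a text-image similarity). How the paper computes it is
    not stated in this summary.
    """
    n = len(scores)
    if n == 0:
        return []
    num_segments = max(1, min(num_segments, n))
    # Coarse level: split the video into contiguous, near-equal segments.
    bounds = [(s * n // num_segments, (s + 1) * n // num_segments) for s in range(num_segments)]
    # Rank segments by mean relevance and keep the best ones.
    ranked = sorted(bounds, key=lambda b: sum(scores[b[0]:b[1]]) / (b[1] - b[0]), reverse=True)
    kept = ranked[:max(1, top_segments)]
    # Fine level: inside each kept segment, take the single most relevant frame.
    keyframes = [max(range(lo, hi), key=lambda i: scores[i]) for lo, hi in kept]
    return sorted(keyframes)
```

For example, with eight frames split into four segments and the two best segments kept, the function returns one frame index per kept segment, in temporal order.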

📝 Abstract
Traditional video reasoning segmentation methods rely on supervised fine-tuning, which limits generalization to out-of-distribution scenarios, and they lack explicit reasoning. To address this, we propose VideoSeg-R1, the first framework to introduce reinforcement learning into video reasoning segmentation. It adopts a decoupled architecture that formulates the task as joint referring image segmentation and video mask propagation. It comprises three stages: (1) a hierarchical text-guided frame sampler that emulates human attention; (2) a reasoning model that produces spatial cues along with explicit reasoning chains; and (3) a segmentation-propagation stage using SAM2 and XMem. A task-difficulty-aware mechanism adaptively controls reasoning length for better efficiency and accuracy. Extensive evaluations on multiple benchmarks demonstrate that VideoSeg-R1 achieves state-of-the-art performance in complex video reasoning and segmentation tasks. The code will be publicly available at https://github.com/euyis1019/VideoSeg-R1.
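
The three-stage decoupled design described in the abstract can be sketched as a small orchestration function. This is only an outline under stated assumptions: the four callables are hypothetical stand-ins (they are not the SAM2 or XMem APIs), the box-shaped spatial cue and the single reference frame are assumed interfaces, and masks are treated as opaque objects.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2): an assumed cue format


@dataclass
class ReasoningOutput:
    chain: str   # explicit reasoning text
    frame: int   # keyframe on which the target is localized (assumption)
    box: Box     # spatial cue for the target on that keyframe


def run_pipeline(
    num_frames: int,
    query: str,
    sample_keyframes: Callable[[int, str], List[int]],
    reason: Callable[[List[int], str], ReasoningOutput],
    segment: Callable[[int, Box], Any],
    propagate: Callable[[int, Any, int], Dict[int, Any]],
) -> Tuple[Dict[int, Any], str]:
    # Stage 1: text-guided frame sampling.
    keyframes = sample_keyframes(num_frames, query)
    # Stage 2: reasoning model yields a spatial cue plus a reasoning chain.
    out = reason(keyframes, query)
    # Stage 3a: image-level segmentation on the reference keyframe.
    key_mask = segment(out.frame, out.box)
    # Stage 3b: propagate that mask through the rest of the video.
    masks = propagate(out.frame, key_mask, num_frames)
    return masks, out.chain


def demo() -> Tuple[Dict[int, Any], str]:
    # Toy stand-ins so the sketch runs end to end without any model.
    sampler = lambda n, q: [0, n // 2]
    reasoner = lambda frames, q: ReasoningOutput(f"looked at {frames}", frames[-1], (0, 0, 4, 4))
    segmenter = lambda frame, box: f"mask@{frame}"
    propagator = lambda frame, mask, n: {i: mask for i in range(n)}
    return run_pipeline(6, "the dog", sampler, reasoner, segmenter, propagator)
```

Keeping grounding and propagation behind separate callables mirrors the decoupling the abstract describes: the reasoning stage can be changed or retrained without touching the mask propagation stage.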
Problem

Research questions and friction points this paper is trying to address.

Enhancing video object segmentation generalization beyond supervised training scenarios
Introducing explicit reasoning chains for improved segmentation decision transparency
Adaptively controlling reasoning length to balance efficiency and accuracy
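
One simple way to picture difficulty-aware reasoning length is a token budget that grows with an estimated difficulty score. The sketch below is an illustration only: the page does not describe how difficulty is estimated or how length is controlled, so the linear mapping, the default bounds, and both function names are assumptions.

```python
def reasoning_budget(difficulty: float, min_tokens: int = 32, max_tokens: int = 512) -> int:
    """Map a difficulty score in [0, 1] to a reasoning-token budget (assumed linear)."""
    d = min(1.0, max(0.0, difficulty))  # clamp out-of-range scores
    return round(min_tokens + d * (max_tokens - min_tokens))


def over_budget_penalty(used_tokens: int, budget: int) -> float:
    """Hypothetical penalty: fraction by which a reasoning chain exceeds its budget."""
    return max(0, used_tokens - budget) / budget
```

Under this sketch, easy queries get short chains and hard ones get longer chains, and a penalty of this kind could in principle be subtracted from a training reward to discourage needlessly long reasoning.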
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning for video segmentation
Decoupled architecture combining referring image segmentation with video mask propagation
Adaptive reasoning length for efficiency
Authors

Zishan Xu, Tsinghua University
Yifu Guo, Sun Yat-sen University
Yuquan Lu, Sun Yat-sen University
Fengyu Yang, South China Normal University
Junxin Li, Sun Yat-sen University