Training-Free Spatio-temporal Decoupled Reasoning Video Segmentation with Adaptive Object Memory

📅 2026-03-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the temporal instability in video object segmentation guided by complex textual instructions, which often stems from reliance on fine-tuning and tight spatiotemporal coupling. To overcome these limitations, we propose a training-free inference framework that decouples spatial and temporal processing and incorporates an adaptive object memory mechanism. By leveraging motion cues to drive key object selection, our approach significantly enhances cross-frame propagation stability and target localization accuracy. Notably, this is the first method to achieve high-performance referring video object segmentation without any fine-tuning, outperforming existing fine-tuned approaches across five benchmarks—Ref-YouTubeVOS, Ref-DAVIS17, MeViS, ReasonVOS, and ReVOS—while markedly improving both temporal consistency and segmentation accuracy.

Technology Category

Application Category

📝 Abstract
Reasoning Video Object Segmentation (ReasonVOS) is a challenging task that requires stable object segmentation across video sequences using implicit and complex textual inputs. Previous methods fine-tune Multimodal Large Language Models (MLLMs) to produce segmentation outputs, which demand substantial resources. Additionally, some existing methods are coupled in the processing of spatio-temporal information, which affects the temporal stability of the model to some extent. To address these issues, we propose Training-Free \textbf{S}patio-temporal \textbf{D}ecoupled Reasoning Video Segmentation with \textbf{A}daptive Object \textbf{M}emory (SDAM). We aim to design a training-free reasoning video segmentation framework that outperforms existing methods requiring fine-tuning, using only pre-trained models. Meanwhile, we propose an Adaptive Object Memory module that selects and memorizes key objects based on motion cues in different video sequences. Finally, we propose Spatio-temporal Decoupling for stable temporal propagation. In the spatial domain, we achieve precise localization and segmentation of target objects, while in the temporal domain, we leverage key object temporal information to drive stable cross-frame propagation. Our method achieves excellent results on five benchmark datasets, including Ref-YouTubeVOS, Ref-DAVIS17, MeViS, ReasonVOS, and ReVOS.
Problem

Research questions and friction points this paper is trying to address.

Reasoning Video Object Segmentation
Spatio-temporal Decoupling
Training-Free
Temporal Stability
Multimodal Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-Free
Spatio-temporal Decoupling
Adaptive Object Memory
Reasoning Video Object Segmentation
Temporal Propagation
🔎 Similar Papers
No similar papers found.
Z
Zhengtong Zhu
School of Computer Science & Technology, Soochow University, Suzhou, China
J
Jiaqing Fan
School of Computer Science & Technology, Soochow University, Suzhou, China
Zhixuan Liu
Zhixuan Liu
PhD student at Shanghai Jiaotong University
deep learningreinforcement learning
F
Fanzhang Li
School of Computer Science & Technology, Soochow University, Suzhou, China