Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder

📅 2025-06-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video segmentation methods (e.g., Sa2VA) fuse visual features directly, entangling dynamic motion with static semantic information and thereby degrading segmentation accuracy. To address this, we propose DeSa2VA, a novel framework that introduces a linear disentanglement module to project large language model hidden states into orthogonal text and vision subspaces, making the modality separation explicit. Built on the SAM-2 architecture, DeSa2VA further incorporates point-level prompt generation, dynamic mask fusion, and a hybrid triple-supervision loss, yielding a disentangle-then-recompose multimodal reasoning pipeline. Extensive experiments demonstrate state-of-the-art performance across video segmentation, localization, and cross-modal question answering, with notable gains in semantic grounding accuracy and temporal reasoning. The code is publicly available.

📝 Abstract
Existing video segmenter and grounder approaches, exemplified by Sa2VA, directly fuse features within segmentation models. This often entangles dynamic visual information with static semantics, degrading segmentation accuracy. To mitigate this issue, we propose DeSa2VA, a decoupling-enhanced prompting scheme that integrates text pre-training and a linear decoupling module to address the information-processing limitations inherent in SAM-2. Specifically, we first devise a pre-training paradigm that converts textual ground-truth labels into point-level prompts while generating corresponding text masks; these masks are refined through a hybrid loss function to strengthen the model's semantic grounding. Next, we employ linear projection to disentangle the hidden states generated by a large language model into distinct textual and visual feature subspaces. Finally, a dynamic mask fusion strategy combines the decoupled features under triple supervision from the predicted text mask, the predicted visual mask, and ground-truth annotations. Extensive experiments demonstrate state-of-the-art performance across diverse tasks, including image segmentation, image question answering, video segmentation, and video question answering. Our code is available at https://github.com/longmalongma/DeSa2VA.
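The linear decoupling step described above can be sketched as two projection heads over the LLM hidden states, with a penalty that discourages overlap between the text and visual subspaces. This is a minimal PyTorch sketch under stated assumptions: the module name, dimensions, and orthogonality penalty are illustrative, not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class LinearDecoupler(nn.Module):
    """Hypothetical sketch: project LLM hidden states into separate
    textual and visual feature subspaces via two linear heads."""

    def __init__(self, hidden_dim: int, sub_dim: int):
        super().__init__()
        self.text_proj = nn.Linear(hidden_dim, sub_dim)  # static-semantic branch
        self.vis_proj = nn.Linear(hidden_dim, sub_dim)   # dynamic-visual branch

    def forward(self, h: torch.Tensor):
        t = self.text_proj(h)  # (batch, seq, sub_dim) text features
        v = self.vis_proj(h)   # (batch, seq, sub_dim) visual features
        # Illustrative disentanglement penalty: push the two projected
        # feature sets toward mutual orthogonality per token.
        ortho_loss = (t * v).sum(dim=-1).pow(2).mean()
        return t, v, ortho_loss

# Toy usage with an assumed 4096-dim LLM hidden size.
dec = LinearDecoupler(hidden_dim=4096, sub_dim=256)
h = torch.randn(2, 8, 4096)  # (batch, seq, hidden)
t, v, ortho_loss = dec(h)
```

The two outputs would then feed the text-mask and visual-mask prediction branches respectively, while the penalty term keeps the subspaces from collapsing into one another.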
Problem

Research questions and friction points this paper is trying to address.

Decoupling dynamic and static features in video segmentation
Improving semantic grounding via text pre-training and masks
Enhancing segmentation accuracy with disentangled visual-text features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text pre-training for point-level prompts
Linear projection for feature disentanglement
Dynamic mask fusion with triple supervision
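The fusion-and-supervision idea in the bullets above can be sketched as a gated combination of the two branch masks, with all three masks (text, visual, fused) supervised against the same ground truth. This is a hedged sketch, not the paper's actual loss: the function name, gating form, BCE choice, and equal weights are all assumptions.

```python
import torch
import torch.nn.functional as F

def triple_supervision_loss(text_mask, vis_mask, gate, gt_mask,
                            w_text=1.0, w_vis=1.0, w_fused=1.0):
    """Hypothetical triple-supervision objective: supervise the text-branch
    mask, the visual-branch mask, and a dynamically fused mask against the
    same ground truth. Mask tensors are raw logits of shape (B, H, W);
    `gate` in [0, 1] weights the fusion (all names/weights are assumed)."""
    fused = gate * text_mask + (1.0 - gate) * vis_mask
    loss_text = F.binary_cross_entropy_with_logits(text_mask, gt_mask)
    loss_vis = F.binary_cross_entropy_with_logits(vis_mask, gt_mask)
    loss_fused = F.binary_cross_entropy_with_logits(fused, gt_mask)
    return w_text * loss_text + w_vis * loss_vis + w_fused * loss_fused

# Toy usage on random 64x64 masks.
gt = (torch.rand(2, 64, 64) > 0.5).float()
loss = triple_supervision_loss(
    text_mask=torch.randn(2, 64, 64),
    vis_mask=torch.randn(2, 64, 64),
    gate=torch.sigmoid(torch.randn(2, 64, 64)),  # per-pixel fusion gate
    gt_mask=gt,
)
```

Supervising the fused mask alongside both branch masks keeps each branch independently meaningful while still training the fusion to outperform either branch alone.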
Jisheng Dang
Lanzhou University
Xudong Wu
Sun Yat-sen University
Bimei Wang
National University of Singapore
Ning Lv
Lanzhou University
Jiayu Chen
Lanzhou University
Jingwen Zhao
Sun Yat-sen University
Yichu Liu
South China University of Technology
Jizhao Liu
Associate Professor @ Lanzhou University
Chaos, Nonlinear Dynamics, Brain-inspired Computing, Visual Cognition, Computational Neuroscience
Juncheng Li
East China Normal University
Super Resolution, Image Restoration, Computer Vision, Medical Image Analysis
Teng Wang
The University of Hong Kong