Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder

📅 2025-06-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video segmentation methods (e.g., Sa2VA) fuse visual features directly, entangling dynamic motion with static semantic information and thereby degrading segmentation accuracy. To address this, we propose DeSa2VA, a novel framework that introduces a linear disentanglement module to project large language model hidden states into orthogonal text and vision subspaces, making the modality separation explicit. Built on the SAM-2 architecture, DeSa2VA further incorporates point-level prompt generation, dynamic mask fusion, and a hybrid triple-supervision loss, yielding a disentangle-then-recompose multimodal reasoning pipeline. Extensive experiments demonstrate state-of-the-art performance across video segmentation, localization, and cross-modal question answering, with notable gains in semantic grounding accuracy and temporal reasoning. The code is publicly available.

📝 Abstract
Existing video segmenter and grounder approaches, exemplified by Sa2VA, directly fuse features within segmentation models. This often entangles dynamic visual information with static semantics, degrading segmentation accuracy. To mitigate this issue, we propose DeSa2VA, a decoupling-enhanced prompting scheme that integrates text pre-training and a linear decoupling module to address the information-processing limitations inherent in SAM-2. Specifically, we first devise a pre-training paradigm that converts textual ground-truth labels into point-level prompts while generating corresponding text masks; these masks are refined through a hybrid loss function to strengthen the model's semantic grounding. Next, we employ linear projection to disentangle the hidden states generated by a large language model into distinct textual and visual feature subspaces. Finally, a dynamic mask fusion strategy combines the decoupled features under triple supervision from the predicted text mask, the predicted visual mask, and ground-truth annotations. Extensive experiments demonstrate state-of-the-art performance across diverse tasks, including image segmentation, image question answering, video segmentation, and video question answering. Our code is available at https://github.com/longmalongma/DeSa2VA.
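The linear decoupling step described above can be sketched as two projection heads over the LLM hidden states, with a penalty that discourages overlap between the text and visual subspaces. This is a minimal PyTorch sketch under stated assumptions: the module name, dimensions, and orthogonality penalty are illustrative, not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class LinearDecoupler(nn.Module):
    """Hypothetical sketch: project LLM hidden states into separate
    textual and visual feature subspaces via two linear heads."""

    def __init__(self, hidden_dim: int, sub_dim: int):
        super().__init__()
        self.text_proj = nn.Linear(hidden_dim, sub_dim)  # static-semantic branch
        self.vis_proj = nn.Linear(hidden_dim, sub_dim)   # dynamic-visual branch

    def forward(self, h: torch.Tensor):
        t = self.text_proj(h)  # (batch, seq, sub_dim) text features
        v = self.vis_proj(h)   # (batch, seq, sub_dim) visual features
        # Illustrative disentanglement penalty: push the two projected
        # feature sets toward mutual orthogonality per token.
        ortho_loss = (t * v).sum(dim=-1).pow(2).mean()
        return t, v, ortho_loss

# Toy usage with an assumed 4096-dim LLM hidden size.
dec = LinearDecoupler(hidden_dim=4096, sub_dim=256)
h = torch.randn(2, 8, 4096)  # (batch, seq, hidden)
t, v, ortho_loss = dec(h)
```

The two outputs would then feed the text-mask and visual-mask prediction branches respectively, while the penalty term keeps the subspaces from collapsing into one another.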
Problem

Research questions and friction points this paper is trying to address.

Decoupling dynamic and static features in video segmentation
Improving semantic grounding via text pre-training and masks
Enhancing segmentation accuracy with disentangled visual-text features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text pre-training for point-level prompts
Linear projection for feature disentanglement
Dynamic mask fusion with triple supervision
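The fusion-and-supervision idea in the bullets above can be sketched as a gated combination of the two branch masks, with all three masks (text, visual, fused) supervised against the same ground truth. This is a hedged sketch, not the paper's actual loss: the function name, gating form, BCE choice, and equal weights are all assumptions.

```python
import torch
import torch.nn.functional as F

def triple_supervision_loss(text_mask, vis_mask, gate, gt_mask,
                            w_text=1.0, w_vis=1.0, w_fused=1.0):
    """Hypothetical triple-supervision objective: supervise the text-branch
    mask, the visual-branch mask, and a dynamically fused mask against the
    same ground truth. Mask tensors are raw logits of shape (B, H, W);
    `gate` in [0, 1] weights the fusion (all names/weights are assumed)."""
    fused = gate * text_mask + (1.0 - gate) * vis_mask
    loss_text = F.binary_cross_entropy_with_logits(text_mask, gt_mask)
    loss_vis = F.binary_cross_entropy_with_logits(vis_mask, gt_mask)
    loss_fused = F.binary_cross_entropy_with_logits(fused, gt_mask)
    return w_text * loss_text + w_vis * loss_vis + w_fused * loss_fused

# Toy usage on random 64x64 masks.
gt = (torch.rand(2, 64, 64) > 0.5).float()
loss = triple_supervision_loss(
    text_mask=torch.randn(2, 64, 64),
    vis_mask=torch.randn(2, 64, 64),
    gate=torch.sigmoid(torch.randn(2, 64, 64)),  # per-pixel fusion gate
    gt_mask=gt,
)
```

Supervising the fused mask alongside both branch masks keeps each branch independently meaningful while still training the fusion to outperform either branch alone.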
Jisheng Dang
Lanzhou University
Xudong Wu
Sun Yat-sen University
Bimei Wang
National University of Singapore
Ning Lv
Lanzhou University
Jiayu Chen
Lanzhou University
Jingwen Zhao
Sun Yat-sen University
Yichu Liu
South China University of Technology
Jizhao Liu
Associate Professor @ Lanzhou University
Chaos, Nonlinear Dynamics, Brain-inspired Computing, Visual Cognition, Computational Neuroscience
Juncheng Li
East China Normal University
Super Resolution, Image Restoration, Computer Vision, Medical Image Analysis
Teng Wang
The University of Hong Kong