🤖 AI Summary
Existing video segmentation methods (e.g., Sa2VA) directly fuse visual features, causing entanglement between dynamic motion and static semantic information, thereby degrading segmentation accuracy. To address this, we propose DeSa2VA—a novel framework that introduces, for the first time, a linear disentanglement module to project large language model hidden states into orthogonal text and vision subspaces, enabling explicit modality disentanglement. Integrated within the SAM-2 architecture, DeSa2VA further incorporates point-level prompt generation, dynamic mask fusion, and a hybrid triple-supervision loss to realize disentangled–recomposed–collaborative multimodal reasoning. Extensive experiments demonstrate state-of-the-art performance across video segmentation, localization, and cross-modal question answering, with significant improvements in semantic grounding accuracy and temporal reasoning capability. The code is publicly available.
📝 Abstract
Existing video segmenter and grounder approaches, exemplified by Sa2VA, directly fuse features within segmentation models. This often results in an undesirable entanglement of dynamic visual information and static semantics, thereby degrading segmentation accuracy. To systematically mitigate this issue, we propose DeSa2VA, a decoupling-enhanced prompting scheme integrating text pre-training and a linear decoupling module to address the information processing limitations inherent in SAM-2. Specifically, first, we devise a pre-training paradigm that converts textual ground-truth labels into point-level prompts while generating corresponding text masks. These masks are refined through a hybrid loss function to strengthen the model's semantic grounding capabilities. Next, we employ linear projection to disentangle hidden states that generated by a large language model into distinct textual and visual feature subspaces. Finally, a dynamic mask fusion strategy synergistically combines these decoupled features through triple supervision from predicted text/visual masks and ground-truth annotations. Extensive experiments demonstrate state-of-the-art performance across diverse tasks, including image segmentation, image question answering, video segmentation, and video question answering. Our codes are available at https://github.com/longmalongma/DeSa2VA.