🤖 AI Summary
Referring Audio-Visual Segmentation (Ref-AVS) aims to localize and segment a specific object in video based on natural language descriptions by jointly leveraging audio and visual cues, posing dual challenges of cross-modal alignment and fine-grained spatial localization. To address these, we propose a multimodal collaborative framework: (1) a multimodal large language model (MLLM) generates semantically rich tokens encoding audio-visual-language context; (2) a target-consistent semantic alignment loss enforces representation consistency for the same entity across modalities and linguistic expressions; and (3) the aligned semantic tokens are injected into the Segment Anything Model (SAM) to enable frame-level precise segmentation. Evaluated on the Ref-AVS benchmark, our method significantly outperforms existing state-of-the-art approaches, demonstrating superior cross-modal semantic understanding and object-level localization capability.
📝 Abstract
Referring Audio-Visual Segmentation (Ref-AVS) aims to segment specific objects in videos based on natural language expressions involving audio, visual, and textual information. This task poses significant challenges in cross-modal reasoning and fine-grained object localization. In this paper, we propose a simple framework, SimToken, that integrates a multimodal large language model (MLLM) with the Segment Anything Model (SAM). The MLLM is guided to generate a special semantic token representing the referred object. This compact token, enriched with contextual information from all modalities, acts as a prompt to guide SAM to segment objects across video frames. To further improve semantic learning, we introduce a novel target-consistent semantic alignment loss that aligns token embeddings derived from different expressions that refer to the same object. Experiments on the Ref-AVS benchmark demonstrate that our approach achieves superior performance compared to existing methods. Code will be available at https://github.com/DianJin-HFUT/SimToken
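To make the target-consistent semantic alignment idea concrete, here is a minimal, dependency-free sketch: token embeddings produced for different expressions that refer to the same object are pulled together by penalizing their cosine dissimilarity. All function names and the loss form here are illustrative assumptions for exposition; the paper's actual loss may be defined differently (e.g., as a contrastive objective over a batch).

```python
def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def target_consistent_alignment_loss(token_embeddings):
    """Hypothetical alignment loss: average (1 - cosine similarity)
    over all pairs of semantic tokens that refer to the SAME object.
    A value near 0 means the expressions map to consistent tokens."""
    n = len(token_embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    return sum(
        1.0 - cosine_similarity(token_embeddings[i], token_embeddings[j])
        for i, j in pairs
    ) / len(pairs)

# Two expressions for the same object ("the dog that is barking",
# "the animal making sound") should yield near-identical tokens:
same_object = [[0.6, 0.8], [0.6, 0.8]]
print(target_consistent_alignment_loss(same_object))   # ~0.0, well aligned

# Orthogonal tokens incur the maximum pairwise penalty of 1.0:
different = [[1.0, 0.0], [0.0, 1.0]]
print(target_consistent_alignment_loss(different))
```

In a real training loop, this penalty would be computed on the MLLM's special-token embeddings and backpropagated alongside the segmentation loss, so that linguistically distinct references to one entity converge to a shared prompt for SAM.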