🤖 AI Summary
Referring Audio-Visual Segmentation (Ref-AVS) aims to localize and segment a specific object in video based on natural language descriptions by jointly leveraging audio and visual cues, posing dual challenges of cross-modal alignment and fine-grained spatial localization. To address these, we propose a multimodal collaborative framework: (1) a multimodal large language model (MLLM) generates semantically rich tokens encoding audio-visual-language context; (2) a target-consistent semantic alignment loss enforces representation consistency for the same entity across modalities and linguistic expressions; and (3) the aligned semantic tokens are injected into the Segment Anything Model (SAM) to enable frame-level precise segmentation. Evaluated on the Ref-AVS benchmark, our method significantly outperforms existing state-of-the-art approaches, demonstrating superior cross-modal semantic understanding and object-level localization capability.
📝 Abstract
Referring Audio-Visual Segmentation (Ref-AVS) aims to segment specific objects in videos based on natural language expressions involving audio, visual, and textual information. This task poses significant challenges in cross-modal reasoning and fine-grained object localization. In this paper, we propose a simple framework, SimToken, that integrates a multimodal large language model (MLLM) with the Segment Anything Model (SAM). The MLLM is guided to generate a special semantic token representing the referred object. This compact token, enriched with contextual information from all modalities, acts as a prompt to guide SAM to segment objects across video frames. To further improve semantic learning, we introduce a novel target-consistent semantic alignment loss that aligns token embeddings derived from different expressions that refer to the same object. Experiments on the Ref-AVS benchmark demonstrate that our approach achieves superior performance compared to existing methods. Code will be available at https://github.com/DianJin-HFUT/SimToken
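To make the target-consistent semantic alignment idea concrete, here is a minimal, dependency-free sketch: token embeddings produced for different expressions that refer to the same object are pulled together by penalizing their cosine dissimilarity. All function names and the loss form here are illustrative assumptions for exposition; the paper's actual loss may be defined differently (e.g., as a contrastive objective over a batch).

```python
def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def target_consistent_alignment_loss(token_embeddings):
    """Hypothetical alignment loss: average (1 - cosine similarity)
    over all pairs of semantic tokens that refer to the SAME object.
    A value near 0 means the expressions map to consistent tokens."""
    n = len(token_embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    return sum(
        1.0 - cosine_similarity(token_embeddings[i], token_embeddings[j])
        for i, j in pairs
    ) / len(pairs)

# Two expressions for the same object ("the dog that is barking",
# "the animal making sound") should yield near-identical tokens:
same_object = [[0.6, 0.8], [0.6, 0.8]]
print(target_consistent_alignment_loss(same_object))   # ~0.0, well aligned

# Orthogonal tokens incur the maximum pairwise penalty of 1.0:
different = [[1.0, 0.0], [0.0, 1.0]]
print(target_consistent_alignment_loss(different))
```

In a real training loop, this penalty would be computed on the MLLM's special-token embeddings and backpropagated alongside the segmentation loss, so that linguistically distinct references to one entity converge to a shared prompt for SAM.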