Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation

📅 2025-08-06
📈 Citations: 0
Influential: 0

🤖 AI Summary
Referring Audio-Visual Segmentation (Ref-AVS) aims to segment target objects in audible videos guided by natural language expressions. Existing methods rely heavily on pixel-level supervision and on opaque, end-to-end multimodal fusion, which limits interpretability. To address this, we propose TGS-Agent, a framework built on an explicit, three-stage, object-aware reasoning process (Think, Ground, Segment) that decomposes cross-modal understanding into traceable inference steps. A multimodal LLM, Ref-Thinker, is fine-tuned on an instruction-tuning dataset annotated with think-answer chains; the object description it infers then prompts Grounding-DINO and SAM2, which perform grounding and segmentation without pixel-level supervision. TGS-Agent achieves state-of-the-art results on Ref-AVSBench and on our newly introduced R²-AVSBench, a benchmark of linguistically diverse, reasoning-intensive referring expressions.

📝 Abstract
Referring Audio-Visual Segmentation (Ref-AVS) aims to segment target objects in audible videos based on given reference expressions. Prior works typically rely on learning latent embeddings via multimodal fusion to prompt a tunable SAM/SAM2 decoder for segmentation, which requires strong pixel-level supervision and lacks interpretability. From a novel perspective of explicit reference understanding, we propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process, mimicking the human reasoning procedure by first identifying the referred object through multimodal analysis, followed by coarse-grained grounding and precise segmentation. To this end, we first propose Ref-Thinker, a multimodal language model capable of reasoning over textual, visual, and auditory cues. We construct an instruction-tuning dataset with explicit object-aware think-answer chains for Ref-Thinker fine-tuning. The object description inferred by Ref-Thinker is used as an explicit prompt for Grounding-DINO and SAM2, which perform grounding and segmentation without relying on pixel-level supervision. Additionally, we introduce R²-AVSBench, a new benchmark with linguistically diverse and reasoning-intensive references for better evaluating model generalization. Our approach achieves state-of-the-art results on both standard Ref-AVSBench and proposed R²-AVSBench. Code will be available at https://github.com/jasongief/TGS-Agent.
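
The Think-Ground-Segment flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `think_ground_segment`, the `ThinkResult` and `TGSOutput` types, the box and mask conventions, and the three stub stages are all hypothetical names introduced here. The stubs only stand in for Ref-Thinker, Grounding-DINO, and SAM2 and do not call those models; the sketch also assumes one box and one mask per clip, which the source does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical conventions for this sketch only.
Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), end-exclusive
Mask = List[List[int]]            # binary mask as nested lists


@dataclass
class ThinkResult:
    reasoning: str            # free-text reasoning trace ("think")
    object_description: str   # short object phrase ("answer")


@dataclass
class TGSOutput:
    think: ThinkResult
    box: Optional[Box]
    mask: Optional[Mask]


def think_ground_segment(
    frames: Sequence,
    audio: Sequence[float],
    reference: str,
    thinker: Callable[[Sequence, Sequence[float], str], ThinkResult],
    grounder: Callable[[Sequence, str], Optional[Box]],
    segmenter: Callable[[Sequence, Box], Mask],
) -> TGSOutput:
    # Think: infer an explicit description of the referred object.
    think = thinker(frames, audio, reference)
    # Ground: turn the description into a coarse box (None if nothing is found).
    box = grounder(frames, think.object_description)
    if box is None:
        return TGSOutput(think, None, None)
    # Segment: refine the box into a mask.
    return TGSOutput(think, box, segmenter(frames, box))


# Stub stages, standing in for the real models (assumptions, not real APIs).
def stub_thinker(frames, audio, reference) -> ThinkResult:
    return ThinkResult("The sounding instrument in view is a guitar.", "guitar")


def stub_grounder(frames, description) -> Optional[Box]:
    return (1, 1, 3, 3) if description == "guitar" else None


def stub_segmenter(frames, box) -> Mask:
    x0, y0, x1, y1 = box
    # Fill the box region of a fixed 4x4 grid with ones.
    return [[1 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(4)]
            for y in range(4)]


result = think_ground_segment(
    ["frame0"], [0.0], "the instrument making sound",
    stub_thinker, stub_grounder, stub_segmenter,
)
```

Because each stage exchanges a readable intermediate (a reasoning trace, an object phrase, a box), a failure can be traced to the stage that produced it, which is the interpretability argument the abstract makes.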
Problem

Research questions and friction points this paper is trying to address.

Segment objects in videos using audio and text references
Improve interpretability and reduce reliance on pixel-level supervision
Enhance model generalization with diverse reasoning-intensive references
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal reasoning via Ref-Thinker language model
Think-Ground-Segment human-like reasoning process
Object-aware prompts for grounding and segmentation without pixel-level supervision