Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Referring Audio-Visual Segmentation (Ref-AVS) aims to segment audible target objects in videos guided by natural language expressions; however, existing methods rely heavily on pixel-level supervision and opaque, end-to-end multimodal fusion, which limits interpretability. To address this, we propose TGS-Agent, a framework built on an explicit, object-aware "Think → Ground → Segment" paradigm that decomposes cross-modal understanding into interpretable, traceable inference steps. TGS-Agent fine-tunes a multimodal LLM (Ref-Thinker) on an instruction-tuning dataset annotated with reasoning paths and pairs it with Grounding-DINO and SAM2, so that grounding and segmentation require no pixel-level supervision. Evaluated on Ref-AVSBench and the newly introduced R²-AVSBench benchmark, TGS-Agent achieves state-of-the-art performance, notably improving segmentation accuracy under complex referring expressions while enhancing model transparency and explainability.

📝 Abstract
Referring Audio-Visual Segmentation (Ref-AVS) aims to segment target objects in audible videos based on given reference expressions. Prior works typically rely on learning latent embeddings via multimodal fusion to prompt a tunable SAM/SAM2 decoder for segmentation, which requires strong pixel-level supervision and lacks interpretability. From a novel perspective of explicit reference understanding, we propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process, mimicking the human reasoning procedure by first identifying the referred object through multimodal analysis, followed by coarse-grained grounding and precise segmentation. To this end, we first propose Ref-Thinker, a multimodal language model capable of reasoning over textual, visual, and auditory cues. We construct an instruction-tuning dataset with explicit object-aware think-answer chains for Ref-Thinker fine-tuning. The object description inferred by Ref-Thinker is used as an explicit prompt for Grounding-DINO and SAM2, which perform grounding and segmentation without relying on pixel-level supervision. Additionally, we introduce R²-AVSBench, a new benchmark with linguistically diverse and reasoning-intensive references for better evaluating model generalization. Our approach achieves state-of-the-art results on both standard Ref-AVSBench and proposed R²-AVSBench. Code will be available at https://github.com/jasongief/TGS-Agent.
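The abstract outlines a three-stage Think-Ground-Segment pipeline: Ref-Thinker reasons over the video, audio, and reference text to produce an explicit object description, which then prompts Grounding-DINO for coarse boxes and SAM2 for masks. Below is a minimal Python sketch of that control flow; the function names and signatures (`ref_thinker`, `grounder`, `segmenter`, `tgs_agent`) are illustrative placeholders, not the authors' released API.

```python
import numpy as np
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Box:
    """A coarse bounding-box prompt (x1, y1, x2, y2) in pixel coordinates."""
    x1: float
    y1: float
    x2: float
    y2: float


def tgs_agent(
    frames: List[np.ndarray],                             # video frames
    audio: np.ndarray,                                    # audio waveform or features
    reference: str,                                       # referring expression
    ref_thinker: Callable[..., str],                      # multimodal LLM: cues -> object description
    grounder: Callable[[np.ndarray, str], List[Box]],     # e.g. Grounding-DINO, text-prompted detector
    segmenter: Callable[[np.ndarray, Box], np.ndarray],   # e.g. SAM2, box-prompted mask predictor
) -> List[Optional[np.ndarray]]:
    # 1. Think: reason jointly over textual, visual, and auditory cues to name
    #    the referred object as an explicit, human-readable description.
    object_description = ref_thinker(frames=frames, audio=audio, text=reference)

    # 2. Ground + 3. Segment: use the description as a text prompt for coarse
    #    localization, then promote the top box to a pixel-level mask per frame.
    masks: List[Optional[np.ndarray]] = []
    for frame in frames:
        boxes = grounder(frame, object_description)
        masks.append(segmenter(frame, boxes[0]) if boxes else None)
    return masks
```

Because the grounding and segmentation stages are driven by plain text and box prompts rather than learned latent embeddings, they need no pixel-level supervision in this scheme; only the Ref-Thinker stage is instruction-tuned.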
Problem

Research questions and friction points this paper is trying to address.

Segment objects in videos using audio and text references
Improve interpretability and reduce reliance on pixel-level supervision
Enhance model generalization with diverse reasoning-intensive references
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal reasoning via Ref-Thinker language model
Think-Ground-Segment human-like reasoning process
Object-aware prompts that drive grounding and segmentation without pixel-level supervision
Jinxing Zhou
Mohamed Bin Zayed University of Artificial Intelligence
Yanghao Zhou
National University of Singapore
Mingfei Han
MBZUAI; University of Technology Sydney; Bytedance Seed; MMLab, SIAT
Object Recognition, Video Understanding, Vision Language Models, Robotics
Tong Wang
Mohamed Bin Zayed University of Artificial Intelligence
Xiaojun Chang
Mohamed Bin Zayed University of Artificial Intelligence, University of Science and Technology of China
Hisham Cholakkal
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Computer Vision, Large Multimodal Models, LLM, Healthcare Foundation Model, Conversational Assistant
Rao Muhammad Anwer
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Computer Vision, Object Recognition