MAR3: Multi-Agent Recognition, Reasoning, and Reflection for Reference Audio-Visual Segmentation

📅 2026-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reference audio-visual segmentation methods struggle to explicitly identify expression difficulty and dominant modalities, rely heavily on high-quality instruction-tuning data, and lack a mechanism for reflecting on prediction outcomes, which often leads to erroneous masks. To address these limitations, this work proposes MAR3, a training-free framework that introduces, for the first time, a multi-agent paradigm for this task: Delphi-inspired consensus for multimodal recognition, adaptive collaborative reasoning guided by modality dominance and difficulty-aware rules, and reflective learning with iteratively refined prompts for segmentation. On Ref-AVSBench, MAR3 achieves a J&F score of 69.2%, surpassing the previous state of the art by an absolute 3.4%.
📝 Abstract
Reference Audio-Visual Segmentation (Ref-AVS) aims to segment objects in audible videos based on multimodal cues in reference expressions. Previous methods overlook the explicit recognition of expression difficulty and of the dominant modality among multimodal cues, over-rely on the quality of the instruction-tuning dataset for object reasoning, and lack reflective validation of segmentation results, leading to erroneous mask predictions. To address these issues, in this paper we propose a novel training-free Multi-Agent Recognition, Reasoning, and Reflection framework, termed MAR3, to achieve high-quality Reference Audio-Visual Segmentation. Incorporating the sociological Delphi method to achieve robust analysis, we propose a Consensus Multimodal Recognition mechanism that enables LLM agents to explicitly recognize the difficulty of reference expressions and the dominant modality of multimodal cues. Based on our modality-dominant difficulty rule, we propose an adaptive Collaborative Object Reasoning strategy to reliably reason about the referred object. To further ensure precise mask prediction, we develop a Reflective Learning Segmentation mechanism, in which a check agent examines intermediate segmentation results and iteratively corrects the object text prompt given to the segment agent. Experiments demonstrate that MAR3 achieves superior performance (69.2% J&F) on the Ref-AVSBench dataset, outperforming the previous SOTA by an absolute 3.4%.
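The two distinctive mechanisms in the abstract (Delphi-style consensus voting and the check-agent reflection loop) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the agent stubs, function names, vote format, and round limits are all assumptions, with real LLM calls replaced by placeholders.

```python
from collections import Counter

def delphi_consensus(agents, expression, max_rounds=3):
    """Consensus Multimodal Recognition (sketch): each agent votes on
    (difficulty, dominant modality); as in the Delphi method, the
    majority opinion is shared back and agents revote until unanimous."""
    feedback = None
    for _ in range(max_rounds):
        votes = [agent(expression, feedback) for agent in agents]
        top, count = Counter(votes).most_common(1)[0]
        if count == len(votes):          # unanimous consensus reached
            return top
        feedback = top                   # share majority view, revote
    return top                           # fall back to final majority

def reflective_segmentation(segment, check, prompt, max_iters=3):
    """Reflective Learning Segmentation (sketch): a check agent inspects
    the intermediate mask and, if unsatisfied, returns a corrected text
    prompt for the segment agent to retry with."""
    for _ in range(max_iters):
        mask = segment(prompt)
        ok, refined_prompt = check(mask, prompt)
        if ok:
            return mask
        prompt = refined_prompt
    return mask

# Illustrative stub agent (a real agent would query an LLM): once a
# majority opinion is fed back, a disagreeing agent adopts it.
def make_agent(initial_vote):
    def agent(expression, feedback):
        return feedback if feedback is not None else initial_vote
    return agent
```

The stubs make the control flow testable: three voters split 2-to-1 converge to the majority vote on the second Delphi round, and the reflection loop retries segmentation once the check agent supplies a refined prompt.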
Problem

Research questions and friction points this paper is trying to address.

Reference Audio-Visual Segmentation
multimodal cues
expression difficulty
dominant modality
segmentation validation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Agent Framework
Consensus Multimodal Recognition
Adaptive Collaborative Reasoning
Reflective Learning Segmentation
Reference Audio-Visual Segmentation