MAR3: Multi-Agent Recognition, Reasoning, and Reflection for Reference Audio-Visual Segmentation

📅 2026-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reference audio-visual segmentation methods struggle to explicitly identify expression difficulty and dominant modalities, rely heavily on high-quality instruction-tuning data, and lack a mechanism for reflecting on prediction outcomes, which often leads to erroneous masks. To address these limitations, this work proposes MAR3, a training-free framework that introduces, for the first time, a multi-agent paradigm for this task: Delphi-inspired consensus for multimodal recognition, adaptive collaborative reasoning guided by modality dominance and difficulty-aware rules, and reflective learning with iteratively refined prompts for segmentation. On Ref-AVSBench, MAR3 achieves a J&F score of 69.2%, surpassing the previous state of the art by an absolute 3.4%.
📝 Abstract
Reference Audio-Visual Segmentation (Ref-AVS) aims to segment objects in audible videos based on multimodal cues in reference expressions. Previous methods overlook the explicit recognition of expression difficulty and of the dominant modality among multimodal cues, over-rely on the quality of the instruction-tuning dataset for object reasoning, and lack reflective validation of segmentation results, leading to erroneous mask predictions. To address these issues, in this paper we propose a novel training-free Multi-Agent Recognition, Reasoning, and Reflection framework, termed MAR3, to achieve high-quality Reference Audio-Visual Segmentation. Incorporating the sociological Delphi method to achieve robust analysis, we propose a Consensus Multimodal Recognition mechanism that enables LLM agents to explicitly recognize the difficulty of reference expressions and the dominant modality of multimodal cues. Based on our modality-dominant difficulty rule, we propose an adaptive Collaborative Object Reasoning strategy to reliably reason about the referred object. To further ensure precise mask prediction, we develop a Reflective Learning Segmentation mechanism, in which a check agent examines intermediate segmentation results and iteratively corrects the object text prompt given to the segment agent. Experiments demonstrate that MAR3 achieves superior performance (69.2% J&F) on the Ref-AVSBench dataset, outperforming the previous SOTA by an absolute 3.4%.
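The two distinctive mechanisms in the abstract (Delphi-style consensus voting and the check-agent reflection loop) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the agent stubs, function names, vote format, and round limits are all assumptions, with real LLM calls replaced by placeholders.

```python
from collections import Counter

def delphi_consensus(agents, expression, max_rounds=3):
    """Consensus Multimodal Recognition (sketch): each agent votes on
    (difficulty, dominant modality); as in the Delphi method, the
    majority opinion is shared back and agents revote until unanimous."""
    feedback = None
    for _ in range(max_rounds):
        votes = [agent(expression, feedback) for agent in agents]
        top, count = Counter(votes).most_common(1)[0]
        if count == len(votes):          # unanimous consensus reached
            return top
        feedback = top                   # share majority view, revote
    return top                           # fall back to final majority

def reflective_segmentation(segment, check, prompt, max_iters=3):
    """Reflective Learning Segmentation (sketch): a check agent inspects
    the intermediate mask and, if unsatisfied, returns a corrected text
    prompt for the segment agent to retry with."""
    for _ in range(max_iters):
        mask = segment(prompt)
        ok, refined_prompt = check(mask, prompt)
        if ok:
            return mask
        prompt = refined_prompt
    return mask

# Illustrative stub agent (a real agent would query an LLM): once a
# majority opinion is fed back, a disagreeing agent adopts it.
def make_agent(initial_vote):
    def agent(expression, feedback):
        return feedback if feedback is not None else initial_vote
    return agent
```

The stubs make the control flow testable: three voters split 2-to-1 converge to the majority vote on the second Delphi round, and the reflection loop retries segmentation once the check agent supplies a refined prompt.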
Problem

Research questions and friction points this paper is trying to address.

Reference Audio-Visual Segmentation
multimodal cues
expression difficulty
dominant modality
segmentation validation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Agent Framework
Consensus Multimodal Recognition
Adaptive Collaborative Reasoning
Reflective Learning Segmentation
Reference Audio-Visual Segmentation