AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (e.g., GPT-4o, Qwen3-Omni) exhibit weak agent-level reasoning in multi-speaker dialogue scenarios: they struggle to track speaker identity, maintain role consistency, and temporally anchor events across modalities, which limits their effectiveness in applications such as conversational video assistants and meeting analytics. To address this, we introduce AMUSE, the first benchmark explicitly designed for agent-level reasoning in multi-speaker audio-visual understanding. We further propose RAFT, an agentic alignment framework that jointly leverages multimodal self-assessment rewards and selective parameter adaptation to achieve both data and parameter efficiency. Experiments demonstrate that RAFT yields up to a 39.52% relative improvement in accuracy over baselines across zero-shot, instruction-guided, and agent-oriented evaluation settings on AMUSE, substantially improving multimodal temporal coherence and robustness in speaker-role reasoning.

📝 Abstract
Recent multimodal large language models (MLLMs) such as GPT-4o and Qwen3-Omni show strong perception but struggle in multi-speaker, dialogue-centric settings that demand agentic reasoning: tracking who speaks, maintaining roles, and grounding events across time. These scenarios are central to multimodal audio-video understanding, where models must jointly reason over audio and visual streams in applications such as conversational video assistants and meeting analytics. We introduce AMUSE, a benchmark designed around tasks that are inherently agentic, requiring models to decompose complex audio-visual interactions into planning, grounding, and reflection steps. It evaluates MLLMs across three modes (zero-shot, guided, and agentic) and six task families, including spatio-temporal speaker grounding and multimodal dialogue summarization. Across all modes, current models exhibit weak multi-speaker reasoning and inconsistent behavior under both non-agentic and agentic evaluation. Motivated by the inherently agentic nature of these tasks and recent advances in LLM agents, we propose RAFT, a data-efficient agentic alignment framework that integrates reward optimization, using intrinsic multimodal self-evaluation as the reward signal, with selective parameter adaptation for data- and parameter-efficient updates. Using RAFT, we achieve up to a 39.52% relative improvement in accuracy on our benchmark. Together, AMUSE and RAFT provide a practical platform for examining agentic reasoning in multimodal models and improving their capabilities.
Problem

Research questions and friction points this paper is trying to address.

Addresses multi-speaker reasoning challenges in audio-visual MLLMs.
Improves agentic reasoning for dialogue-centric multimodal tasks.
Enhances multimodal alignment via data-efficient agentic frameworks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic alignment framework with reward optimization
Multimodal self-evaluation as intrinsic reward signal
Selective parameter adaptation for efficient updates (see the sketch below)
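To make these three components concrete, here is a minimal sketch of how a RAFT-style alignment step could be wired up. The paper does not spell out its reward design, adapter structure, or optimizer here, so everything below is an assumption for illustration: `TinyPolicy`, `self_evaluation_reward`, the residual adapter, and the REINFORCE-style update are hypothetical stand-ins, not the authors' implementation.

```python
# Hypothetical sketch of RAFT-style alignment: intrinsic self-evaluation as the
# reward signal plus selective parameter adaptation. Illustrative only.
import torch
import torch.nn as nn


class TinyPolicy(nn.Module):
    """Toy stand-in for a multimodal policy head; not the paper's model."""

    def __init__(self, dim=32, vocab=100):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)   # "pretrained" part, kept frozen
        self.adapter = nn.Linear(dim, dim)    # small module receiving the selective updates
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):
        h = self.backbone(x)
        h = h + self.adapter(h)               # residual adapter, LoRA-like in spirit
        return self.head(h)


def self_evaluation_reward(logits, answer_ids):
    """Intrinsic reward: the model's own (detached) confidence in its answer."""
    probs = torch.softmax(logits, dim=-1)
    return probs.gather(-1, answer_ids.unsqueeze(-1)).squeeze(-1)


policy = TinyPolicy()

# Selective parameter adaptation: freeze everything except the adapter.
for name, param in policy.named_parameters():
    param.requires_grad = "adapter" in name
optimizer = torch.optim.AdamW(
    [p for p in policy.parameters() if p.requires_grad], lr=1e-3
)

# One REINFORCE-style alignment step on a dummy batch of fused audio-visual features.
features = torch.randn(8, 32)                 # placeholder for audio+video embeddings
logits = policy(features)
answers = torch.distributions.Categorical(logits=logits).sample()
reward = self_evaluation_reward(logits.detach(), answers)   # no gradient through the reward
log_prob = torch.log_softmax(logits, dim=-1).gather(-1, answers.unsqueeze(-1)).squeeze(-1)
loss = -(reward * log_prob).mean()            # maximize self-evaluated reward
loss.backward()
optimizer.step()
```

The two design choices mirrored here are the ones named in the abstract: the reward comes from the model's own confidence rather than an external judge (standing in for "multimodal self-evaluation as intrinsic reward"), and only a small adapter receives gradient updates (standing in for "selective parameter adaptation"), which is what makes the scheme data- and parameter-efficient in principle.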
🔎 Similar Papers
2024-06-09 · Annual Meeting of the Association for Computational Linguistics · Citations: 13