AudioGenie: A Training-Free Multi-Agent Framework for Diverse Multimodality-to-Multiaudio Generation

📅 2025-05-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodality-to-Multiaudio (MM2MA) generation faces three core challenges: data scarcity, task heterogeneity, and difficulty in cross-modal alignment, which hinder the simultaneous achievement of audio diversity and contextual consistency. To address these, we propose the first training-free, two-tier multi-agent architecture: a *generation group* that performs task-driven fine-grained decomposition and adaptive Mixture-of-Experts (MoE) collaboration, refined iteratively via trial-and-error for enhanced fidelity; and a *supervision group* that enforces spatiotemporal consistency and cross-modal alignment validation for self-correction. We further introduce MA-Bench, the first dedicated MM2MA benchmark. Our method achieves state-of-the-art performance across all 8 tasks and 9 metrics. User studies confirm significant improvements in audio quality, semantic accuracy, cross-modal alignment, and perceptual aesthetics.

📝 Abstract
Multimodality-to-Multiaudio (MM2MA) generation faces significant challenges in synthesizing diverse and contextually aligned audio types (e.g., sound effects, speech, music, and songs) from multimodal inputs (e.g., video, text, images), owing to the scarcity of high-quality paired datasets and the lack of robust multi-task learning frameworks. Recently, multi-agent systems have shown great potential in tackling the above issues. However, directly applying them to the MM2MA task presents three critical challenges: (1) inadequate fine-grained understanding of multimodal inputs (especially video), (2) the inability of single models to handle diverse audio events, and (3) the absence of self-correction mechanisms for reliable outputs. To this end, we propose AudioGenie, a novel training-free multi-agent system featuring a dual-layer architecture with a generation team and a supervisor team. For the generation team, a fine-grained task decomposition module and an adaptive Mixture-of-Experts (MoE) collaborative entity are designed for dynamic model selection, and a trial-and-error iterative refinement module is designed for self-correction. The supervisor team ensures temporal-spatial consistency and verifies outputs through feedback loops. Moreover, we build MA-Bench, the first benchmark for MM2MA tasks, comprising 198 annotated videos with multi-type audios. Experiments demonstrate that AudioGenie outperforms state-of-the-art (SOTA) methods across 9 metrics in 8 tasks. A user study further validates the effectiveness of the proposed method in terms of quality, accuracy, alignment, and aesthetics. The anonymous project website with samples can be found at https://audiogenie.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Synthesizing diverse audio types from multimodal inputs
Addressing inadequate fine-grained understanding of multimodal inputs
Overcoming single models' inability to handle diverse audio events
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free multi-agent system with dual-layer architecture
Adaptive Mixture-of-Experts for dynamic model selection
Trial-and-error iterative refinement for self-correction
Yan Rong
The Hong Kong University of Science and Technology (Guangzhou)
Jinting Wang
Central University of Finance and Economics
Shan Yang
Tencent AI Lab
Guangzhi Lei
Tencent AI Lab
Li Liu
The Hong Kong University of Science and Technology (Guangzhou)