AudioMoG: Guiding Audio Generation with Mixture-of-Guidance

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing cross-modal audio generation methods (e.g., text-to-audio or video-to-audio) typically rely on a single guidance mechanism, such as condition alignment in classifier-free guidance or score accuracy in autoguidance, and therefore struggle to jointly optimize fidelity and diversity. This paper introduces AudioMoG, a diffusion-based mixture-of-guidance framework that systematically integrates complementary mechanisms: classifier-free guidance (CFG) and autoguidance (AG). It combines the two guiding principles through a weighted formulation and can reduce to either individual guidance mode during sampling, without increasing inference overhead. AudioMoG consistently outperforms single-guidance baselines across diverse tasks, including text-to-audio, video-to-audio, text-to-music, and image generation. Its core contribution is a multi-guidance synergy paradigm that enhances both generation quality and diversity while preserving computational efficiency.

📝 Abstract
Guidance methods have demonstrated significant improvements in cross-modal audio generation, including text-to-audio (T2A) and video-to-audio (V2A) generation. The popularly adopted method, classifier-free guidance (CFG), steers generation by emphasizing condition alignment, enhancing fidelity but often at the cost of diversity. Recently, autoguidance (AG) has been explored for audio generation, encouraging the sampling to faithfully reconstruct the target distribution and showing increased diversity. Despite these advances, such methods usually rely on a single guiding principle, e.g., condition alignment in CFG or score accuracy in AG, leaving the full potential of guidance for audio generation untapped. In this work, we explore enriching the composition of the guidance method and present a mixture-of-guidance framework, AudioMoG. Within the design space, AudioMoG can exploit the complementary advantages of distinctive guiding principles by fulfilling their cumulative benefits. With a reduced form, AudioMoG can consider parallel complements or recover a single guiding principle, without sacrificing generality. We experimentally show that, given the same inference speed, the AudioMoG approach consistently outperforms single guidance in T2A generation across sampling steps, concurrently showing advantages in V2A, text-to-music, and image generation. These results highlight a "free lunch" in current cross-modal audio generation systems: higher quality can be achieved through mixed guiding principles at the sampling stage without sacrificing inference efficiency. Demo samples are available at: https://audio-mog.github.io.
Problem

Research questions and friction points this paper is trying to address.

Improving audio generation quality through mixed guidance principles
Balancing fidelity and diversity in cross-modal audio synthesis
Enhancing text-to-audio and video-to-audio generation performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-guidance framework combines multiple principles
Exploits complementary advantages of different guidance methods
Maintains inference speed while improving generation quality
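To make the combination concrete, the sketch below shows one plausible way to mix the two guiding principles at sampling time. CFG extrapolates from an unconditional score toward a conditional one; autoguidance extrapolates from a weaker model's score toward the main model's. A mixture can sum both correction terms with separate weights, so that zeroing one weight recovers the other single-guidance mode. Note this is a simplified fixed-weight sketch for illustration, not the paper's exact formulation; all function and variable names here are hypothetical.

```python
import numpy as np

def mixture_of_guidance(s_cond, s_uncond, s_weak, w_cfg=2.0, w_ag=1.0):
    """Mix CFG and autoguidance corrections on a denoiser score (illustrative).

    s_cond   : score/noise prediction from the conditional model
    s_uncond : prediction from the unconditional model (CFG branch)
    s_weak   : prediction from a weaker/under-trained model (AG branch)

    Setting w_ag = 0 recovers plain CFG; setting w_cfg = 0 recovers
    plain autoguidance, mirroring the "reduced form" described above.
    """
    cfg_term = s_cond - s_uncond  # pushes samples toward the condition
    ag_term = s_cond - s_weak     # pushes samples toward the true score
    return s_cond + w_cfg * cfg_term + w_ag * ag_term
```

Since all three predictions are typically produced in one batched forward pass per sampling step, mixing them this way adds only element-wise arithmetic, which is consistent with the paper's claim of unchanged inference cost.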
Junyou Wang
Tsinghua University, University of Science and Technology of China
Zehua Chen
PostDoc at Tsinghua University | Ph.D. from Imperial College
Binjie Yuan
Tsinghua University, Beijing, China
Kaiwen Zheng
Tsinghua University, Beijing, China
Chang Li
Tsinghua University, Beijing, China
Yuxuan Jiang
Tsinghua University, Beijing, China
Jun Zhu
Tsinghua University, Beijing, China