UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE

📅 2025-10-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Auditory generation has long suffered from the artificial separation of speech and music modeling, leading to task conflicts and data imbalance that hinder the development of general-purpose audio synthesis models. To address this, we propose a dynamic-capacity Mixture-of-Experts (MoE) architecture that integrates domain-specific experts, shared experts, and skippable null experts, coupled with Top-P routing and a three-stage collaborative training strategy. This design enables cross-domain knowledge fusion and adaptive computational allocation. The method effectively mitigates the performance degradation that arises during joint training, achieving state-of-the-art results on major speech and music generation benchmarks and surpassing representative systems such as VALL-E X (speech) and MusicGen and AudioLDM (music). It significantly enhances cross-domain cooperative learning while maintaining scalability and computational efficiency. The proposed framework establishes a unified, extensible, and high-performance paradigm for general audio generation.

📝 Abstract
Recent advances in unified multimodal models indicate a clear trend towards comprehensive content generation. However, the auditory domain remains a significant challenge, with music and speech often developed in isolation, hindering progress towards universal audio synthesis. This separation stems from inherent task conflicts and severe data imbalances, which impede the development of a truly unified audio generation model. To address this challenge, we propose UniMoE-Audio, a unified speech and music generation model within a novel Dynamic-Capacity Mixture-of-Experts (MoE) framework. Architecturally, UniMoE-Audio introduces a Top-P routing strategy for dynamic expert number allocation, and a hybrid expert design comprising routed experts for domain-specific knowledge, shared experts for domain-agnostic features, and null experts for adaptive computation skipping. To tackle data imbalance, we introduce a three-stage training curriculum: 1) Independent Specialist Training leverages original datasets to instill domain-specific knowledge into each "proto-expert" without interference; 2) MoE Integration and Warmup incorporates these specialists into the UniMoE-Audio architecture, warming up the gate module and shared expert using a subset of the balanced dataset; and 3) Synergistic Joint Training trains the entire model end-to-end on the fully balanced dataset, fostering enhanced cross-domain synergy. Extensive experiments show that UniMoE-Audio not only achieves state-of-the-art performance on major speech and music generation benchmarks, but also demonstrates superior synergistic learning, mitigating the performance degradation typically seen in naive joint training. Our findings highlight the substantial potential of specialized MoE architectures and curated training strategies in advancing the field of universal audio generation. Homepage: https://mukioxun.github.io/Uni-MoE-site/home.html
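The Top-P routing described in the abstract can be sketched as follows: instead of picking a fixed Top-K, the router activates the smallest set of experts whose cumulative probability exceeds a threshold p, so easy tokens use fewer experts. This is an illustrative reconstruction, not the paper's implementation; the softmax, threshold value, and renormalization are assumptions.

```python
import numpy as np

def top_p_route(logits, p=0.7):
    """Select the smallest set of experts whose cumulative router
    probability exceeds p (illustrative sketch of Top-P routing)."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]        # experts, most probable first
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, p)) + 1   # smallest k with cum >= p
    chosen = order[:k]
    weights = probs[chosen] / probs[chosen].sum()  # renormalize gates
    return chosen, weights

# A confident router concentrates mass on few experts -> small k.
chosen, weights = top_p_route(np.array([2.0, 1.0, 0.1, 0.1]), p=0.7)
```

With these logits only two experts are activated; a flatter distribution would activate more, which is the dynamic-capacity behavior the abstract refers to.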
Problem

Research questions and friction points this paper is trying to address.

Unifying speech and music generation within a single audio synthesis model
Addressing task conflicts and data imbalances in universal audio generation
Developing dynamic-capacity MoE architecture for specialized audio domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic-Capacity MoE framework for unified audio generation
Top-P routing strategy with hybrid expert design
Three-stage training curriculum addressing data imbalance
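The hybrid expert design above can be sketched as a single MoE layer: a shared expert always runs (domain-agnostic features), routed experts run only when gated (domain-specific knowledge), and a null expert contributes nothing, letting a token skip routed computation. This is a minimal toy sketch with made-up linear experts, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D, E = 8, 4  # hidden size, number of routed experts (toy values)

# Placeholder experts: each is just a small linear map.
routed = [rng.standard_normal((D, D)) * 0.1 for _ in range(E)]
shared = rng.standard_normal((D, D)) * 0.1

def moe_forward(x, gates):
    """Hybrid MoE layer: gates has E+1 entries, the last one for the
    null expert. The shared expert always runs; routed experts run only
    when their gate is nonzero; the null expert adds nothing, so mass
    routed to it skips computation entirely."""
    out = x @ shared                       # domain-agnostic path
    for e, g in enumerate(gates[:-1]):
        if g > 0:                          # skip inactive experts
            out = out + g * (x @ routed[e])
    return out
```

A token with gates `[0, 0, 0, 0, 1.0]` is fully routed to the null expert and only pays for the shared path, which is how adaptive computation skipping falls out of the routing itself.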
Zhenyu Liu
Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
Yunxin Li
Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
Xuanyu Zhang
Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
Qixun Teng
Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
Shenyuan Jiang
Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
Xinyu Chen
Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
Haoyuan Shi
Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
Jinchao Li
Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
Qi Wang
Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
Haolan Chen
Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
Fanbo Meng
Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
Mingjun Zhao
Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
Yu Xu
University of Cambridge
Yancheng He
Alibaba Group
Baotian Hu
Harbin Institute of Technology (Shenzhen)
Min Zhang
Department of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China