UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities

📅 2025-09-29
🤖 AI Summary
Temporal alignment (TA) and non-temporal alignment (NTA) tasks in audio generation have long relied on disjoint modeling paradigms, lacking a unified, efficient, and multimodal-compatible general-purpose framework. Method: We propose the first flow-matching-based non-autoregressive unified audio generation model, featuring a dual-fusion mechanism—temporal alignment modeling coupled with cross-modal cross-attention—and a task-balanced sampling strategy. This enables joint modeling of diverse audio types (e.g., speech, music, sound effects) across TA/NTA tasks under text, audio, or video conditioning. Contribution/Results: Trained on <8K hours of public data with ≤1B parameters, our model achieves state-of-the-art performance on seven benchmark tasks. Remarkably, even a compact 200M-parameter variant remains highly competitive, demonstrating strong feasibility and scalability as a general-purpose audio foundation model.
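The dual-fusion idea described above can be sketched in a few lines: features that share the frame axis with the audio latents (TA) are fused by frame-wise addition, while features with no frame correspondence (NTA) are folded in through cross-attention. This is an illustrative NumPy sketch under our own assumptions, not the paper's implementation; the function and weight names are hypothetical, and a real block would use learned multi-head attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_fusion_block(latents, ta_feats, nta_feats, w_q, w_k, w_v):
    """One hypothetical fusion step: add time-aligned features frame-wise,
    then attend over non-time-aligned features via cross-attention."""
    # Temporal-alignment fusion: TA features share the frame axis [T, d]
    # with the audio latents, so they can simply be added per frame.
    h = latents + ta_feats                                   # [T, d]
    # Cross-attention fusion: NTA features [N, d] have no frame
    # correspondence, so each frame queries them via attention.
    q = h @ w_q                                              # [T, d]
    k = nta_feats @ w_k                                      # [N, d]
    v = nta_feats @ w_v                                      # [N, d]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # [T, N]
    return h + attn @ v                                      # [T, d]
```

Applying this per model block, as the summary describes, lets every layer see both kinds of conditioning without forcing NTA inputs into a fake temporal alignment.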

📝 Abstract
Audio generation, including speech, music, and sound effects, has advanced rapidly in recent years. These tasks can be divided into two categories: time-aligned (TA) tasks, where each input unit corresponds to a specific segment of the output audio (e.g., phonemes aligned with frames in speech synthesis); and non-time-aligned (NTA) tasks, where such alignment is not available. Since modeling paradigms for the two types are typically different, research on different audio generation tasks has traditionally followed separate trajectories. However, audio is not inherently divided into such categories, making a unified model a natural and necessary goal for general audio generation. Previous unified audio generation works have adopted autoregressive architectures, while unified non-autoregressive approaches remain largely unexplored. In this work, we propose UniFlow-Audio, a universal audio generation framework based on flow matching. We propose a dual-fusion mechanism that temporally aligns audio latents with TA features and integrates NTA features via cross-attention in each model block. Task-balanced data sampling is employed to maintain strong performance across both TA and NTA tasks. UniFlow-Audio supports omni-modalities, including text, audio, and video. By leveraging the advantage of multi-task learning and the generative modeling capabilities of flow matching, UniFlow-Audio achieves strong results across seven tasks using fewer than 8K hours of public training data and under 1B trainable parameters. Even the small variant with only ~200M trainable parameters shows competitive performance, highlighting UniFlow-Audio as a potential non-autoregressive foundation model for audio generation. Code and models will be available at https://wsntxxn.github.io/uniflow_audio.
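The flow-matching training objective the abstract builds on can be illustrated with the standard linear-interpolation formulation: a point is sampled on the straight path between noise and data, and the model regresses the constant velocity along that path. Whether UniFlow-Audio uses exactly this path is an assumption, and the helper below is a hypothetical sketch, not the paper's code.

```python
import numpy as np

def fm_training_pair(x1, rng):
    """Sample a (time, noisy point, target velocity) triple for flow
    matching with a linear path x_t = (1 - t) * x0 + t * x1.
    For this path the target velocity field is simply x1 - x0."""
    x0 = rng.standard_normal(x1.shape)   # noise endpoint
    t = rng.uniform()                    # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1           # point on the straight path
    v_target = x1 - x0                   # constant velocity along the path
    return t, xt, v_target
```

A non-autoregressive model trained this way generates all frames jointly by integrating the learned velocity field from noise to audio latents, rather than emitting tokens one by one.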
Problem

Research questions and friction points this paper is trying to address.

Unifying time-aligned and non-time-aligned audio generation tasks
Developing a non-autoregressive universal framework for audio generation
Integrating omni-modalities like text, audio, and video inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified flow matching framework for audio generation
Dual-fusion mechanism: frame-wise fusion of TA features plus cross-attention over NTA features
Task-balanced data sampling for multi-task performance
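Task-balanced data sampling can be sketched as reweighting tasks away from their natural data proportions so that small tasks are not drowned out during multi-task training. The page does not specify the paper's exact scheme, so the temperature-style `alpha` knob and the function names below are illustrative assumptions.

```python
import random

def task_balanced_weights(task_sizes, alpha=0.5):
    """Hypothetical sampling weights that flatten the task distribution.
    alpha=1 keeps the natural data proportions; alpha=0 samples every
    task uniformly regardless of its dataset size."""
    scaled = {task: n ** alpha for task, n in task_sizes.items()}
    total = sum(scaled.values())
    return {task: s / total for task, s in scaled.items()}

def sample_task(weights, rng):
    """Draw one task according to the balanced weights."""
    tasks, probs = zip(*weights.items())
    return rng.choices(tasks, weights=probs, k=1)[0]
```

With hours-scale imbalance between, say, speech synthesis and video-to-audio data, intermediate `alpha` values trade off coverage of rare tasks against fidelity on abundant ones.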
👥 Authors
Xuenan Xu, Shanghai Jiao Tong University (audio generation, audio understanding, speech synthesis)
Jiahao Mei, Shanghai Jiao Tong University
Zihao Zheng, Shanghai Artificial Intelligence Lab, Shanghai Jiao Tong University
Ye Tao, Shanghai Artificial Intelligence Lab, Shanghai Jiao Tong University
Zeyu Xie, Peking University
Yaoyun Zhang, Shanghai Jiao Tong University
Haohe Liu, Research Scientist at Meta AI (audio generation, audio classification, speech quality enhancement, music source separation)
Yuning Wu, Wayne State University (perceptions of crime & justice, police attitudes and behaviors, victimization, criminological theories, law and society)
Ming Yan, Alibaba Group
Wen Wu, Shanghai Artificial Intelligence Lab
Chao Zhang, Shanghai Artificial Intelligence Lab
Mengyue Wu, Shanghai Jiao Tong University (speech perception and production, affective computing, audio cognition)