UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions

📅 2026-04-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the fragmentation of generative audio modeling into isolated tasks, namely text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), which makes it difficult for a single model to handle both structured semantic content (speech, music) and unstructured acoustic textures (sound effects). The paper proposes UniSonate, a unified flow-matching framework driven by reference-free natural language instructions. A dynamic token injection mechanism projects unstructured environmental sounds into a structured temporal latent space, which enables precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). A multi-stage curriculum learning strategy mitigates cross-modal optimization conflicts. The authors report state-of-the-art results in instruction-based TTS (WER 1.47%) and TTM (SongEval coherence 3.18), with competitive fidelity in TTA. They also observe positive transfer: joint training on diverse audio data improves structural coherence and prosodic expressiveness over single-task baselines.
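For readers unfamiliar with the WER figure quoted above, the following is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance between a reference transcript and a recognized transcript, divided by the reference length. The function name `word_error_rate` is hypothetical, and the source does not state which ASR system or text normalization was used to obtain the 1.47% figure, so this illustrates the metric only, not the paper's evaluation pipeline.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, the ratio can exceed 1.0 when the hypothesis is much longer than the reference.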

📝 Abstract

Generative audio modeling has largely been fragmented into specialized tasks, text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow-matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference-free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). Coupled with a multi-stage curriculum learning strategy, this approach effectively mitigates cross-modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state-of-the-art performance in instruction-based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer, where joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single-task baselines. Audio samples are available at https://qiangchunyu.github.io/UniSonate/.
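The abstract names flow matching as the generative backbone but does not spell out the objective. As a rough illustration, the sketch below builds one training pair for the common linear-path (rectified-flow style) variant: a noise sample is interpolated toward a data latent, and the regression target is the constant velocity along that path. This is an assumption about the general technique, not the authors' implementation; `make_training_pair`, `velocity_mse`, and the four-element latent are hypothetical, and the real model would predict the velocity with the MM-DiT conditioned on the text instruction.

```python
import random


def make_training_pair(x1, rng):
    """Sample noise x0 and a time t, then return the interpolated point,
    the time, and the velocity target for a straight-line probability path."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]              # Gaussian noise sample
    t = rng.random()                                     # time in [0, 1)
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]  # point on the path
    target = [b - a for a, b in zip(x0, x1)]             # d(xt)/dt = x1 - x0
    return xt, t, target


def velocity_mse(predicted, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


rng = random.Random(0)
latent = [0.5, -1.0, 2.0, 0.0]  # hypothetical flattened audio latent
xt, t, target = make_training_pair(latent, rng)

# A network that predicted the target exactly would have zero loss.
print(velocity_mse(target, target))
```

At inference time, a model trained this way is typically sampled by integrating the predicted velocity from noise at t = 0 to data at t = 1 with an ODE solver; the source does not specify which solver or step count UniSonate uses.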

Problem

Research questions and friction points this paper is trying to address.

text-to-speech

text-to-music

text-to-audio

unified audio generation

multimodal generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

unified audio generation

dynamic token injection

flow-matching