🤖 AI Summary
Existing text-to-speech (TTS) and text-to-music (TTM) systems rely on reference audio or expert-annotated, heterogeneous conditioning signals and have long been modeled in isolation, lacking a unified, natural-language-driven framework for fine-grained, multi-attribute control. Method: We propose the first natural-language-instruction-based unified speech and music generation framework, enabling cross-modal control over timbre, emotion, style, language, instrumentation, tempo, and more. Our approach employs joint and single diffusion Transformer layers with a standardized instruction-phoneme input format, trained end-to-end on 50K hours of speech and 20K hours of music data to achieve cross-modal alignment and multi-task learning. Contribution/Results: The framework generates expressive bilingual (Chinese/English) speech, music, and spoken dialogue. Experiments demonstrate state-of-the-art performance across standard metrics, validating the effectiveness and generalizability of instruction-driven unified generation.
📝 Abstract
Text-to-speech (TTS) and text-to-music (TTM) models face significant limitations in instruction-based control. TTS systems usually depend on reference audio for timbre, offer only limited text-level attribute control, and rarely support dialogue generation. TTM systems are constrained by conditioning inputs that require expert annotations. The high heterogeneity of these control conditions makes them difficult to model jointly with speech synthesis. Despite sharing common acoustic modeling characteristics, these two tasks have long been developed independently, leaving open the challenge of achieving unified modeling through natural language instructions. We introduce InstructAudio, a unified framework that enables instruction-based control (via natural language descriptions) of acoustic attributes including timbre (gender, age), paralinguistic attributes (emotion, style, accent), and musical attributes (genre, instrument, rhythm, atmosphere). It supports expressive speech, music, and dialogue generation in English and Chinese. The model employs joint and single diffusion transformer layers with a standardized instruction-phoneme input format, trained on 50K hours of speech and 20K hours of music data, enabling multi-task learning and cross-modal alignment. Fig. 1 visualizes performance comparisons with mainstream TTS and TTM models, demonstrating that InstructAudio achieves the best results on most metrics. To the best of our knowledge, InstructAudio represents the first instruction-controlled framework unifying speech and music generation. Audio samples are available at: https://qiangchunyu.github.io/InstructAudio/
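The abstract describes joint (shared) and single (modality-specific) diffusion transformer layers driven by a standardized instruction-phoneme input, but gives no implementation details. The snippet below is therefore only a minimal PyTorch sketch of how such a layout could be wired: the class names (`InstructAudioSketch`, `DiTBlock`), the AdaLN-style timestep conditioning, and all dimensions are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """One diffusion-transformer block; the timestep embedding shifts each sub-layer (AdaLN-style, assumed)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.cond_proj = nn.Linear(dim, dim)  # projects the timestep/condition embedding

    def forward(self, x, cond):
        shift = self.cond_proj(cond).unsqueeze(1)                 # (B, 1, D), broadcast over the sequence
        h = self.norm1(x) + shift
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x) + shift)
        return x

class InstructAudioSketch(nn.Module):
    """Hypothetical layout: shared ('joint') blocks attend over the concatenated
    instruction + phoneme + noisy-latent sequence for both speech and music;
    modality-specific ('single') blocks then refine before the denoising output."""
    def __init__(self, vocab_size=512, phoneme_size=256, dim=256,
                 joint_layers=4, single_layers=2, latent_dim=64):
        super().__init__()
        self.instr_emb = nn.Embedding(vocab_size, dim)    # natural-language instruction tokens
        self.phon_emb = nn.Embedding(phoneme_size, dim)   # phoneme (or lyric) tokens
        self.latent_in = nn.Linear(latent_dim, dim)       # noisy audio-latent frames
        self.time_emb = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.joint = nn.ModuleList(DiTBlock(dim) for _ in range(joint_layers))
        self.speech_head = nn.ModuleList(DiTBlock(dim) for _ in range(single_layers))
        self.music_head = nn.ModuleList(DiTBlock(dim) for _ in range(single_layers))
        self.out = nn.Linear(dim, latent_dim)              # predicted noise / velocity per frame

    def forward(self, instr_ids, phon_ids, noisy_latent, t, task="speech"):
        cond = self.time_emb(t.unsqueeze(-1))              # (B, D) timestep embedding
        seq = torch.cat([self.instr_emb(instr_ids),
                         self.phon_emb(phon_ids),
                         self.latent_in(noisy_latent)], dim=1)
        for blk in self.joint:                             # shared cross-modal layers
            seq = blk(seq, cond)
        head = self.speech_head if task == "speech" else self.music_head
        for blk in head:                                   # modality-specific layers
            seq = blk(seq, cond)
        n_frames = noisy_latent.shape[1]
        return self.out(seq[:, -n_frames:])                # denoise only the audio-latent frames

# Toy usage: one "speech" denoising step on random data.
model = InstructAudioSketch()
instr = torch.randint(0, 512, (2, 16))    # tokenized instruction, e.g. "a calm elderly male voice"
phon = torch.randint(0, 256, (2, 32))     # phoneme sequence of the target text
latent = torch.randn(2, 100, 64)          # noisy audio latent, 100 frames
t = torch.rand(2)                         # diffusion timestep
pred = model(instr, phon, latent, t, task="speech")
print(pred.shape)                         # torch.Size([2, 100, 64])
```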