Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

📅 2026-05-27

🤖 AI Summary

Existing approaches struggle to generate unified audio that integrates speech and sound effects directly from free-form text; they rely on structured inputs or disjoint pipelines, which limits expressive flexibility. This work proposes PlanAudio, a framework built on an autoregressive large language model that dispenses with conventional text encoders and introduces a semantic latent chain-of-thought mechanism, an implicit planning step that enables end-to-end generation of composite audio from unstructured text. The study introduces this task formulation, constructs PlanAudio-Bench, a benchmark for evaluating composite audio, and trains with a continuous multi-scenario curriculum. Results show that PlanAudio generally outperforms existing pipeline and unified baselines across speech, sound, and composite scenarios while staying competitive with specialized single-scenario models, and the analysis finds semantic latent CoT superior to the other CoT mechanisms compared.

📝 Abstract

Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a challenge. Current methods either rely on disjoint pipelines, which fail to capture fine-grained interactions, or require structured inputs and external text rewriting, which limits the flexibility of free-form text prompts. In this paper, we introduce a new task: Free-Form-Text-Prompt-to-Unified-Audio generation, which aims to directly synthesize unified audio containing speech, sound, and their composites from unconstrained natural language. To address this task, we propose PlanAudio, a unified, autoregressive LLM-based framework. First, it simplifies the model architecture by leveraging intrinsic LLM reasoning capability instead of traditional text encoders. Second, it introduces a semantic latent chain-of-thought mechanism, an implicit planning mechanism that bridges high-level semantic understanding and low-level acoustic synthesis. Furthermore, we create PlanAudio-Bench, a specialized benchmark for evaluating composite audio scenarios. We perform evaluations in the scenarios of speech, sound, and their composites. The results demonstrate that PlanAudio generally outperforms the existing pipeline and unified baselines, while staying competitive with models designed for a single scenario. Our analysis further reveals the superiority of semantic latent CoT over other CoT mechanisms and highlights the importance of continuous multi-scenario training curricula.
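The abstract describes a single autoregressive model that plans before it synthesizes. The toy sketch below illustrates only that ordering: a planning span is decoded first, and the acoustic tokens are then decoded conditioned on both the prompt and the plan. Everything in it is an assumption made for illustration, not the authors' implementation: the special token ids, the length budgets, the `generate` and `toy_model` names, and above all the use of discrete token ids for the plan (the paper's "latent" plan may well be continuous hidden states).

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical special tokens and budgets; none of these come from the paper.
BOS_PLAN, EOS_PLAN, EOS_AUDIO = 1000, 1001, 1002
MAX_PLAN, MAX_AUDIO = 8, 32

NextToken = Callable[[Sequence[int]], int]


def generate(prompt_ids: List[int], next_token: NextToken) -> Dict[str, List[int]]:
    """Decode a plan span, then an acoustic span, in one left-to-right pass."""
    seq = list(prompt_ids) + [BOS_PLAN]

    # Stage 1: planning span, conditioned on the free-form prompt only.
    plan: List[int] = []
    while len(plan) < MAX_PLAN:
        tok = next_token(seq)
        seq.append(tok)
        if tok == EOS_PLAN:
            break
        plan.append(tok)
    else:
        # Budget exhausted without the model closing the plan: close it here.
        seq.append(EOS_PLAN)

    # Stage 2: acoustic span, conditioned on prompt + plan.
    audio: List[int] = []
    while len(audio) < MAX_AUDIO:
        tok = next_token(seq)
        seq.append(tok)
        if tok == EOS_AUDIO:
            break
        audio.append(tok)

    return {"plan": plan, "audio": audio, "sequence": seq}


def toy_model(seq: Sequence[int]) -> int:
    """Deterministic stand-in for an LLM forward pass plus sampling."""
    if EOS_PLAN in seq:
        n_audio = len(seq) - seq.index(EOS_PLAN) - 1
        return EOS_AUDIO if n_audio >= 5 else 2000 + n_audio
    n_plan = len(seq) - seq.index(BOS_PLAN) - 1
    return EOS_PLAN if n_plan >= 3 else 1500 + n_plan


if __name__ == "__main__":
    out = generate([1, 2, 3], toy_model)
    print(out["plan"], out["audio"])
```

In a real system the `next_token` callable would wrap the language model and an audio codec would turn the acoustic tokens into a waveform; both are left out here because the abstract does not specify them.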

Problem

Research questions and friction points this paper is trying to address.

unified audio synthesis

free-form text prompts

compositional speech and sound

audio generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

free-form text-to-audio

unified audio synthesis

semantic latent chain-of-thought

LLM-based audio generation