Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

📅 2026-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches struggle to generate unified audio that integrates speech and sound effects directly from free-form text: they rely either on disjoint pipelines, which miss fine-grained interactions between speech and sound, or on structured inputs and external text rewriting, which limit the flexibility of the prompt. This work proposes PlanAudio, a unified framework built on an autoregressive large language model. It replaces conventional text encoders with the LLM's own reasoning capability and introduces a semantic latent chain-of-thought mechanism, an implicit planning step that bridges high-level semantic understanding and low-level acoustic synthesis. The authors introduce the task of Free-Form-Text-Prompt-to-Unified-Audio generation and construct PlanAudio-Bench, a benchmark for evaluating composite audio scenarios. In evaluations on speech, sound, and composite scenarios, PlanAudio generally outperforms existing pipeline and unified baselines while remaining competitive with models designed for a single scenario. The analysis also indicates that semantic latent CoT outperforms other CoT mechanisms and that continuous multi-scenario training is important.
📝 Abstract
Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a challenge. Current methods either rely on disjoint pipelines, which fail to capture fine-grained interactions, or require structured inputs and external text rewriting, which limits the flexibility of free-form text prompts. In this paper, we introduce a new task: Free-Form-Text-Prompt-to-Unified-Audio generation, which aims to directly synthesize unified audio containing speech, sound, and their composites from unconstrained natural language. To address this task, we propose PlanAudio, a unified, autoregressive LLM-based framework. First, it simplifies the model architecture by leveraging intrinsic LLM reasoning capability instead of traditional text encoders. Second, it introduces a semantic latent chain-of-thought mechanism, an implicit planning mechanism that bridges high-level semantic understanding and low-level acoustic synthesis. Furthermore, we create PlanAudio-Bench, a specialized benchmark for evaluating composite audio scenarios. We perform evaluations in the scenarios of speech, sound, and their composites. The results demonstrate that PlanAudio generally outperforms the existing pipeline and unified baselines, while staying competitive with models designed for a single scenario. Our analysis further reveals the superiority of semantic latent CoT over other CoT mechanisms and highlights the importance of continuous multi-scenario training curricula.
Problem

Research questions and friction points this paper is trying to address.

unified audio synthesis
free-form text prompts
compositional speech and sound
audio generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

free-form text-to-audio
unified audio synthesis
semantic latent chain-of-thought
LLM-based audio generation
composite speech and sound