FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation

📅 2025-07-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-audio (T2A) methods struggle with temporally precise prompts (e.g., “an owl hoots from 2.4–5.2 seconds”) due to the scarcity of high-quality time-aligned training data and the need for model fine-tuning. This work introduces the first training-free, controllable long-duration T2A generation framework. It leverages a large language model to automatically decompose complex prompts into non-overlapping temporal windows and rewrite natural-language descriptions accordingly. Temporal alignment is achieved through attention decoupling and aggregation control, contextual latent variable composition, and reference-guided synthesis. Without any parameter updates or task-specific training, our method enables coherent audio generation spanning tens of seconds. Quantitative and qualitative evaluations demonstrate that it matches state-of-the-art supervised models—such as Stable Audio—in both temporal precision and audio fidelity, while significantly outperforming other training-free approaches.
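The first stage described above turns timing prompts such as "owl hooted at 2.4s-5.2s" into structured time windows. The paper does this with an LLM planner; as a rough illustration only, a deterministic stand-in for that step can be sketched with a regular expression (the `parse_timing_prompts` helper and its phrase format are assumptions for this sketch, not the paper's actual interface):

```python
import re

# Toy stand-in for the LLM planning stage: extract (event, start, end)
# triples from timing phrases such as "owl hooted at 2.4s-5.2s".
# A real planner would also resolve overlaps and recaption each window.
TIMING = re.compile(r"(.+?) at (\d+(?:\.\d+)?)s-(\d+(?:\.\d+)?)s")

def parse_timing_prompts(prompts):
    events = []
    for p in prompts:
        m = TIMING.fullmatch(p.strip())
        if m:
            events.append((m.group(1), float(m.group(2)), float(m.group(3))))
    # Sort by start time so downstream windowing sees events in order.
    return sorted(events, key=lambda e: e[1])
```

For example, the two prompts from the abstract would yield `("crickets chirping", 0.0, 24.0)` followed by `("owl hooted", 2.4, 5.2)`.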

📝 Abstract
Text-to-audio (T2A) generation has achieved promising results with the recent advances in generative models. However, because of the limited quality and quantity of temporally-aligned audio-text pairs, existing T2A methods struggle to handle the complex text prompts that contain precise timing control, e.g., "owl hooted at 2.4s-5.2s". Recent works have explored data augmentation techniques or introduced timing conditions as model inputs to enable timing-conditioned 10-second T2A generation, while their synthesis quality is still limited. In this work, we propose a novel training-free timing-controlled T2A framework, FreeAudio, making the first attempt to enable timing-controlled long-form T2A generation, e.g., "owl hooted at 2.4s-5.2s and crickets chirping at 0s-24s". Specifically, we first employ an LLM to plan non-overlapping time windows and recaption each with a refined natural language description, based on the input text and timing prompts. Then we introduce: 1) Decoupling and Aggregating Attention Control for precise timing control; 2) Contextual Latent Composition for local smoothness and Reference Guidance for global consistency. Extensive experiments show that: 1) FreeAudio achieves state-of-the-art timing-conditioned T2A synthesis quality among training-free methods and is comparable to leading training-based methods; 2) FreeAudio demonstrates comparable long-form generation quality with training-based Stable Audio and paves the way for timing-controlled long-form T2A synthesis. Demo samples are available at: https://freeaudio.github.io/FreeAudio/
Problem

Research questions and friction points this paper is trying to address.

Enables precise timing control in text-to-audio generation
Improves long-form audio synthesis without retraining
Ensures local smoothness and global consistency in output
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free timing-controlled T2A framework
Decoupling and Aggregating Attention Control
Contextual Latent Composition for smoothness
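The Contextual Latent Composition idea, stitching per-window latents together so that adjacent windows stay locally smooth, can be sketched roughly as a cross-fade over an overlapping region. The linear fade weighting, array shapes, and `compose_latents` name below are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def compose_latents(chunks, overlap):
    """Stitch per-window latent chunks into one long latent sequence.

    Adjacent chunks share `overlap` frames; the shared region is
    linearly cross-faded so each window boundary stays locally smooth.
    chunks: list of arrays of shape (frames, dim); overlap: int frames.
    """
    out = chunks[0].astype(float).copy()
    fade_in = np.linspace(0.0, 1.0, overlap)[:, None]   # ramps 0 -> 1
    fade_out = 1.0 - fade_in                             # ramps 1 -> 0
    for nxt in chunks[1:]:
        nxt = nxt.astype(float)
        # Blend the shared region, then append the rest of the next chunk.
        blended = out[-overlap:] * fade_out + nxt[:overlap] * fade_in
        out = np.concatenate([out[:-overlap], blended, nxt[overlap:]], axis=0)
    return out
```

A global mechanism such as the paper's Reference Guidance would still be needed on top of this, since cross-fading only enforces smoothness at seams, not consistency of timbre across the whole clip.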
Yuxuan Jiang
Tsinghua University
Zehua Chen
PostDoc at Tsinghua University | Ph.D. from Imperial College
Generative Models · Multi-modal Generation · Health Monitoring
Zeqian Ju
University of Science and Technology of China
Chang Li
Shengshu AI
Weibei Dou
Tsinghua University
Jun Zhu
Tsinghua University