ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address weak fine-grained controllability and the performance bottleneck caused by scarce high-quality annotated data in text-to-audio (TTA) generation, this paper proposes ControlAudio, a progressive diffusion-based framework that recasts controllable TTA as multi-task learning. The method builds on a diffusion Transformer (DiT) that jointly models text, phoneme, and precise timing signals as unified conditional inputs; a data construction pipeline combining human annotation with simulation enriches the control sequences, and a progressive guidance strategy at inference sequentially emphasizes finer-grained conditions, enabling semantically consistent, temporally precise, and intelligible speech synthesis. The key contribution is the first unified diffusion framework to jointly optimize three fine-grained control signals (semantic, phonemic, and temporal), significantly improving both controllability and naturalness. Experiments show state-of-the-art timing accuracy and speech clarity, with substantial gains over prior work on objective metrics (e.g., MCD, forced-alignment error) and subjective MOS scores.

📝 Abstract
Text-to-audio (TTA) generation with fine-grained control signals, e.g., precise timing control or intelligible speech content, has been explored in recent works. However, constrained by data scarcity, their generation performance at scale is still compromised. In this study, we recast controllable TTA generation as a multi-task learning problem and introduce a progressive diffusion modeling approach, ControlAudio. Our method adeptly fits distributions conditioned on more fine-grained information, including text, timing, and phoneme features, through a step-by-step strategy. First, we propose a data construction method spanning both annotation and simulation, augmenting condition information in the sequence of text, timing, and phoneme. Second, at the model training stage, we pretrain a diffusion transformer (DiT) on large-scale text-audio pairs, achieving scalable TTA generation, and then incrementally integrate the timing and phoneme features with unified semantic representations, expanding controllability. Finally, at the inference stage, we propose progressively guided generation, which sequentially emphasizes more fine-grained information, aligning inherently with the coarse-to-fine sampling nature of DiT. Extensive experiments show that ControlAudio achieves state-of-the-art performance in terms of temporal accuracy and speech clarity, significantly outperforming existing methods on both objective and subjective evaluations. Demo samples are available at: https://control-audio.github.io/Control-Audio.
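The progressively guided generation described in the abstract can be sketched as a sampling loop whose guidance weights for the three conditions ramp in from coarse to fine. This is an illustrative sketch only: the staged schedule, the thresholds (0.4, 0.7), and the function names are assumptions, not the paper's implementation.

```python
# Hypothetical sketch of progressively guided sampling (not the authors' code).
# Early (noisy, coarse) steps rely on the text condition alone; timing and
# phoneme guidance are switched on in later (finer) steps, mirroring the
# coarse-to-fine sampling nature of the DiT.

def guidance_schedule(step: int, total_steps: int) -> dict:
    """Per-condition guidance weights for one denoising step.

    The stage boundaries (0.4 and 0.7) are illustrative assumptions.
    """
    progress = step / total_steps  # 0.0 (noisiest) -> 1.0 (cleanest)
    return {
        "text": 1.0,                                # semantic signal, always on
        "timing": 1.0 if progress >= 0.4 else 0.0,  # mid-stage onward
        "phoneme": 1.0 if progress >= 0.7 else 0.0, # finest, final stage
    }

def guided_noise_estimate(uncond: float, conds: dict, weights: dict) -> float:
    """Classifier-free-guidance-style combination of condition branches."""
    est = uncond
    for name, cond in conds.items():
        est += weights[name] * (cond - uncond)  # standard CFG residual per branch
    return est

# Toy usage: scalars stand in for the model's per-branch noise estimates.
total = 10
for step in range(total):
    w = guidance_schedule(step, total)
    est = guided_noise_estimate(
        uncond=0.0,
        conds={"text": 0.5, "timing": 0.2, "phoneme": 0.1},
        weights=w,
    )
```

The design choice this illustrates is that each condition contributes a separate guidance residual, so emphasis can shift across denoising steps without retraining the model.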
Problem

Research questions and friction points this paper is trying to address.

Generating audio with precise timing control and intelligible speech content
Overcoming data scarcity in text-to-audio generation with fine-grained control
Developing progressive diffusion modeling for multi-condition audio synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive diffusion modeling for fine-grained audio control
Data construction with text, timing, and phoneme augmentation
Diffusion transformer pretraining with incremental feature integration
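The data-construction idea above, augmenting conditions in the sequence of text, timing, and phoneme, can be sketched as flattening the three signals into one control sequence. The tag format, field names, and helper below are hypothetical illustrations, not the paper's actual data pipeline.

```python
# Hypothetical sketch of composing a unified condition sequence from the three
# control signals the paper names (text, timing, phoneme). The tag syntax and
# dict schema are illustrative assumptions.

def build_condition_sequence(caption: str, events: list) -> str:
    """Flatten a caption plus per-event timing/phoneme annotations into one string.

    `events` holds dicts like
    {"start": 0.5, "end": 1.2, "label": "speech", "phonemes": "HH AH L OW"}.
    """
    parts = [f"<text> {caption}"]
    for ev in sorted(events, key=lambda e: e["start"]):
        # Timing tags give the precise onset/offset of each sound event.
        parts.append(f"<time {ev['start']:.2f}-{ev['end']:.2f}> {ev['label']}")
        # Phoneme tags make speech events intelligible to the model.
        if ev.get("phonemes"):
            parts.append(f"<phoneme> {ev['phonemes']}")
    return " ".join(parts)

example = build_condition_sequence(
    "a man says hello over rain sounds",
    [
        {"start": 0.0, "end": 4.0, "label": "rain"},
        {"start": 0.5, "end": 1.2, "label": "speech", "phonemes": "HH AH L OW"},
    ],
)
```

A sequence like this supports the incremental training described above: a model pretrained on text-only conditions can later consume the same sequence with timing and phoneme tags appended.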