AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions

📅 2024-09-19
🏛️ IEEE International Conference on Acoustics, Speech, and Signal Processing
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing text-to-audio (TTA) models rely on coarse-grained textual descriptions, limiting fine-grained control over both content and style; incorporating frame-level conditions or dedicated control networks adds architectural complexity and requires reference conditions that are hard to obtain. This paper proposes AudioComposer, a fine-grained TTA framework driven purely by natural language descriptions. First, the authors design an automated data simulation pipeline to synthesize text–audio pairs with precise, fine-grained semantic–acoustic alignment, alleviating data scarcity. Second, they employ a flow-based diffusion transformer whose cross-attention mechanism aligns linguistic semantics with temporal audio features, so that content and style cues in the text are considered jointly. Crucially, the method requires no frame-level annotations or auxiliary control modules, and despite a smaller model size and faster inference it achieves audio quality and fine-grained controllability superior to state-of-the-art approaches.
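
The summary does not spell out the training objective. Assuming the standard conditional flow-matching formulation commonly used with flow-based diffusion transformers, the model regresses a velocity field conditioned on the description; here x_0 is noise, x_1 the target audio latent, c the text condition, and v_θ the transformer. These symbols are illustrative, not the paper's notation:

```latex
% Rectified-flow interpolation between noise x_0 and audio latent x_1
x_t = (1 - t)\, x_0 + t\, x_1, \qquad t \sim \mathcal{U}[0, 1]

% Conditional flow-matching loss: the model v_\theta predicts the velocity
% (x_1 - x_0), conditioned on the natural-language description c
\mathcal{L}_{\mathrm{CFM}}
  = \mathbb{E}_{t,\, x_0,\, x_1}
    \bigl\| v_\theta(x_t, t, c) - (x_1 - x_0) \bigr\|_2^2
```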

📝 Abstract
Current text-to-audio (TTA) models mainly use coarse text descriptions as inputs, which hinders fine-grained control over the content and style of the generated audio. Some studies try to improve granularity by incorporating additional frame-level conditions or control networks, but this usually leads to complex system designs and practical difficulties, since reference frame-level conditions must be provided. To address these challenges, we propose AudioComposer, a novel TTA generation framework that relies solely on natural language descriptions (NLDs) to provide both content specification and style control. To further enhance audio generative modeling, we employ flow-based diffusion transformers with a cross-attention mechanism to incorporate text descriptions effectively into the audio generation process, which not only considers content and style information in the text inputs simultaneously, but also accelerates generation compared to other architectures. Furthermore, we propose a novel and comprehensive automatic data simulation pipeline to construct data with fine-grained text descriptions, significantly alleviating the data scarcity problem in this area. Experiments demonstrate the effectiveness of our framework using solely NLDs as inputs for content specification and style control: generation quality and controllability surpass state-of-the-art TTA models, even with a smaller model size.
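
The abstract names cross-attention as the mechanism that injects the text description into the diffusion transformer but gives no detail. A minimal PyTorch sketch of one such block, assuming hypothetical dimensions (dim, text_dim) and a pre-computed text-encoder output text_emb, might look like this; it illustrates the technique, not the paper's implementation:

```python
import torch
import torch.nn as nn

class CrossAttnDiTBlock(nn.Module):
    """One transformer block that conditions audio latents on text via
    cross-attention (a sketch, not the paper's exact architecture)."""

    def __init__(self, dim: int = 512, n_heads: int = 8, text_dim: int = 768):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads,
                                                kdim=text_dim, vdim=text_dim,
                                                batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Self-attention over the audio latent frames
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        # Cross-attention: audio frames query the text tokens, so content and
        # style cues in the description can steer each time step
        h = self.norm2(x)
        x = x + self.cross_attn(h, text_emb, text_emb)[0]
        # Position-wise MLP
        return x + self.mlp(self.norm3(x))

# Usage: a batch of 200 latent audio frames attending to 32 text tokens
block = CrossAttnDiTBlock()
audio = torch.randn(2, 200, 512)   # (batch, frames, dim)
text = torch.randn(2, 32, 768)     # (batch, tokens, text_dim)
out = block(audio, text)           # -> (2, 200, 512)
```

Stacking blocks like this lets every latent audio frame attend to all description tokens, which is what allows a single sentence to carry both content ("a dog barks") and style ("loudly, near the end") information without frame-level labels.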
Problem

Research questions and friction points this paper is trying to address.

Enables fine-grained audio generation using natural language descriptions
Improves text-to-audio models without complex control networks
Addresses data scarcity with automated fine-grained text-audio pairing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Natural language descriptions for fine-grained audio control
Flow-based diffusion transformers with cross-attention mechanism
Automatic data simulation pipeline for text-audio pairs (see the sketch below)
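
The bullet only names the pipeline. A plausible minimal sketch of such simulation, mixing labeled clips at sampled onsets and gains and then rendering the placement as a natural-language description, is shown below; the sample rate, gain thresholds, and sentence template are assumptions rather than details from the paper:

```python
import random
import numpy as np

SR = 16000  # assumed sample rate

def simulate_pair(events, total_sec=10.0, rng=random):
    """Place labeled event clips on a silent canvas and render a matching
    fine-grained natural-language description (illustrative sketch)."""
    canvas = np.zeros(int(total_sec * SR), dtype=np.float32)
    phrases = []
    for label, clip in events:                 # clip: 1-D float32 array
        onset = rng.uniform(0.0, total_sec - len(clip) / SR)
        gain = rng.uniform(0.3, 1.0)           # crude "style" control: loudness
        start = int(onset * SR)
        canvas[start:start + len(clip)] += gain * clip
        loudness = "loud" if gain > 0.65 else "quiet"
        phrases.append(f"a {loudness} {label} starting around "
                       f"{onset:.1f} seconds")
    description = "An audio clip containing " + ", then ".join(phrases) + "."
    return canvas, description

# Usage with two hypothetical labeled clips (noise stands in for real audio)
dog = np.random.randn(SR).astype(np.float32) * 0.1
horn = np.random.randn(2 * SR).astype(np.float32) * 0.1
audio, text = simulate_pair([("dog bark", dog), ("car horn", horn)])
print(text)  # e.g. "An audio clip containing a quiet dog bark starting around ..."
```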
Yuanyuan Wang
The Chinese University of Hong Kong, Hong Kong SAR, China

Hangting Chen
Tencent Hunyuan
signal processing · speech separation · DCASE

Dongchao Yang
Chinese University of Hong Kong
TTS · TTA · Audio Codec · Multi-modal Audio Foundation Models

Zhiyong Wu
The Chinese University of Hong Kong, Hong Kong SAR, China; Shenzhen International Graduate School, Tsinghua University, Shenzhen, China

Helen M. Meng
The Chinese University of Hong Kong, Hong Kong SAR, China

Xixin Wu
The Chinese University of Hong Kong