DreamAudio: Customized Text-to-Audio Generation with Diffusion Models

📅 2025-09-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-audio (T2A) models struggle to achieve precise, fine-grained control over acoustic attributes, which hinders personalized audio generation. To address this, we propose the first framework tailored for customized T2A synthesis, leveraging reference-audio-guided diffusion modeling integrated with large language models and dual-modality alignment training, enabling joint modeling of semantic fidelity and target acoustic characteristics (e.g., timbre, rhythm, event structure). Our method extracts and faithfully reproduces event- or style-specific acoustic features from only a few reference samples. We introduce two dedicated datasets for training and evaluation, including a human-annotated benchmark built from real-world customized-generation cases. Experiments demonstrate that our model significantly outperforms state-of-the-art methods on customized generation tasks, achieving superior audio quality, semantic alignment, and acoustic-feature fidelity, while maintaining competitive performance on general-purpose T2A benchmarks.
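The summary describes the core mechanism only at a high level. As a rough illustration of what "reference-audio-guided diffusion" with dual conditioning could look like, the sketch below shows a toy denoiser conditioned jointly on a text embedding and on features pooled from a few reference clips. Every name, dimension, and schedule here is a simplified assumption for illustration, not the paper's actual architecture.

```python
# Minimal sketch of reference-audio-guided diffusion sampling, assuming a
# toy denoiser conditioned on both a text embedding and pooled features
# from a few reference clips. All names, dimensions, and the update rule
# are illustrative placeholders, not DreamAudio's actual design.
import torch
import torch.nn as nn

class DualConditionDenoiser(nn.Module):
    """Toy noise predictor conditioned on text and reference embeddings."""
    def __init__(self, latent_dim: int = 64, cond_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 2 * cond_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, z_t, t, text_emb, ref_emb):
        # Fuse the noisy latent, timestep, and both conditioning signals.
        return self.net(torch.cat([z_t, t, text_emb, ref_emb], dim=-1))

@torch.no_grad()
def sample(denoiser, text_emb, ref_clip_embs, steps=50, latent_dim=64):
    """Naive denoising loop; a real sampler uses a proper noise schedule."""
    ref_emb = ref_clip_embs.mean(dim=0, keepdim=True)  # pool few-shot refs
    z = torch.randn(1, latent_dim)
    for i in reversed(range(steps)):
        t = torch.full((1, 1), i / steps)
        eps = denoiser(z, t, text_emb, ref_emb)
        z = z - eps / steps  # placeholder Euler-style update
    return z  # a (not shown) decoder would map this latent to a waveform
```

In practice, `text_emb` would come from a text encoder and `ref_clip_embs` from an audio encoder applied to each user-provided reference clip.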

📝 Abstract
With the development of large-scale diffusion-based and language-modeling-based generative models, impressive progress has been achieved in text-to-audio generation. Despite producing high-quality outputs, existing text-to-audio models mainly aim to generate semantically aligned sound and fall short of precisely controlling the fine-grained acoustic characteristics of specific sounds. As a result, users who need specific sound content may find it challenging to generate the desired audio clips. In this paper, we present DreamAudio for customized text-to-audio generation (CTTA). Specifically, we introduce a new framework designed to enable the model to identify auditory information from user-provided reference concepts for audio generation. Given a few reference audio samples containing personalized audio events, our system can generate new audio samples that include these specific events. In addition, two types of datasets are developed for training and testing the customized systems. Experiments show that the proposed model, DreamAudio, generates audio samples that are highly consistent with the customized audio features and well aligned with the input text prompts. Furthermore, DreamAudio offers comparable performance on general text-to-audio tasks. We also provide a human-involved dataset containing audio events from real-world CTTA cases as a benchmark for customized generation tasks.
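The abstract does not spell out how the text and reference-audio conditions are balanced at sampling time. One common recipe for this kind of dual conditioning is multi-condition classifier-free guidance, sketched below using the toy denoiser interface from the earlier example; the guidance weights `w_text` and `w_ref` are assumptions for illustration, not values or a scheme taken from the paper.

```python
# Generic multi-condition classifier-free guidance, assuming the toy
# DualConditionDenoiser interface sketched above. The paper does not
# specify its guidance scheme; this shows one standard way to trade off
# prompt adherence against fidelity to the reference clips.
import torch

def guided_eps(denoiser, z_t, t, text_emb, ref_emb,
               w_text: float = 3.0, w_ref: float = 2.0):
    """Blend unconditional, text-only, and reference-only predictions."""
    null_text = torch.zeros_like(text_emb)  # stand-in "empty" condition
    null_ref = torch.zeros_like(ref_emb)
    eps_uncond = denoiser(z_t, t, null_text, null_ref)
    eps_text = denoiser(z_t, t, text_emb, null_ref)
    eps_ref = denoiser(z_t, t, null_text, ref_emb)
    # Push the prediction toward each condition by its guidance weight.
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_ref * (eps_ref - eps_uncond))
```

Under this scheme, raising `w_ref` would favor reproducing the acoustic character of the reference events, while raising `w_text` would favor the semantics of the prompt.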
Problem

Research questions and friction points this paper is trying to address.

Precise control of fine-grained acoustic characteristics in audio generation
Generating customized audio events from user-provided reference samples
Semantic alignment between generated audio and input text prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Customized text-to-audio generation with diffusion models
Framework using reference audio samples for personalized events
Generates audio consistent with customized features and text prompts
Yi Yuan
NetEase Fuxi AI Lab
deep learning, computer vision
Xubo Liu
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK
Haohe Liu
Research Scientist at Meta AI
Audio Generation, Audio Classification, Speech Quality Enhancement, Music Source Separation
Xiyuan Kang
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK
Zhuo Chen
Seed Group, ByteDance Inc.
Yuxuan Wang
Seed Group, ByteDance Inc.
Mark D. Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK
Wenwu Wang
Professor, University of Surrey, UK
signal processing, machine learning, machine listening, audio/speech/audio-visual, multimodal fusion