DreamAudio: Customized Text-to-Audio Generation with Diffusion Models

📅 2025-09-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-audio (T2A) models struggle to achieve precise, fine-grained control over acoustic attributes, which hinders personalized audio generation. To address this, we propose the first framework tailored for customized T2A synthesis, leveraging reference-audio-guided diffusion modeling integrated with large language models and dual-modality alignment training, enabling joint modeling of semantic fidelity and target acoustic characteristics (e.g., timbre, rhythm, event structure). Our method extracts and faithfully reproduces event- or style-specific acoustic features from only a few reference samples. We introduce two dedicated datasets for training and evaluation, including a human-annotated benchmark built from real-world customized-generation cases. Experiments demonstrate that our model significantly outperforms state-of-the-art methods on customized generation tasks, achieving superior audio quality, semantic alignment, and acoustic-feature fidelity, while maintaining competitive performance on general-purpose T2A benchmarks.
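The summary describes the core mechanism only at a high level. As a rough illustration of what "reference-audio-guided diffusion" with dual conditioning could look like, the sketch below shows a toy denoiser conditioned jointly on a text embedding and on features pooled from a few reference clips. Every name, dimension, and schedule here is a simplified assumption for illustration, not the paper's actual architecture.

```python
# Minimal sketch of reference-audio-guided diffusion sampling, assuming a
# toy denoiser conditioned on both a text embedding and pooled features
# from a few reference clips. All names, dimensions, and the update rule
# are illustrative placeholders, not DreamAudio's actual design.
import torch
import torch.nn as nn

class DualConditionDenoiser(nn.Module):
    """Toy noise predictor conditioned on text and reference embeddings."""
    def __init__(self, latent_dim: int = 64, cond_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 2 * cond_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, z_t, t, text_emb, ref_emb):
        # Fuse the noisy latent, timestep, and both conditioning signals.
        return self.net(torch.cat([z_t, t, text_emb, ref_emb], dim=-1))

@torch.no_grad()
def sample(denoiser, text_emb, ref_clip_embs, steps=50, latent_dim=64):
    """Naive denoising loop; a real sampler uses a proper noise schedule."""
    ref_emb = ref_clip_embs.mean(dim=0, keepdim=True)  # pool few-shot refs
    z = torch.randn(1, latent_dim)
    for i in reversed(range(steps)):
        t = torch.full((1, 1), i / steps)
        eps = denoiser(z, t, text_emb, ref_emb)
        z = z - eps / steps  # placeholder Euler-style update
    return z  # a (not shown) decoder would map this latent to a waveform
```

In practice, `text_emb` would come from a text encoder and `ref_clip_embs` from an audio encoder applied to each user-provided reference clip.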

📝 Abstract
With the development of large-scale diffusion-based and language-modeling-based generative models, impressive progress has been achieved in text-to-audio generation. Despite producing high-quality outputs, existing text-to-audio models mainly aim to generate semantically aligned sound and fall short of precisely controlling the fine-grained acoustic characteristics of specific sounds. As a result, users who need specific sound content may find it challenging to generate the desired audio clips. In this paper, we present DreamAudio for customized text-to-audio generation (CTTA). Specifically, we introduce a new framework designed to enable the model to identify auditory information from user-provided reference concepts for audio generation. Given a few reference audio samples containing personalized audio events, our system can generate new audio samples that include these specific events. In addition, two types of datasets are developed for training and testing the customized systems. Experiments show that the proposed model, DreamAudio, generates audio samples that are highly consistent with the customized audio features and well aligned with the input text prompts. Furthermore, DreamAudio offers comparable performance on general text-to-audio tasks. We also provide a human-involved dataset containing audio events from real-world CTTA cases as a benchmark for customized generation tasks.
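The abstract does not spell out how the text and reference-audio conditions are balanced at sampling time. One common recipe for this kind of dual conditioning is multi-condition classifier-free guidance, sketched below using the toy denoiser interface from the earlier example; the guidance weights `w_text` and `w_ref` are assumptions for illustration, not values or a scheme taken from the paper.

```python
# Generic multi-condition classifier-free guidance, assuming the toy
# DualConditionDenoiser interface sketched above. The paper does not
# specify its guidance scheme; this shows one standard way to trade off
# prompt adherence against fidelity to the reference clips.
import torch

def guided_eps(denoiser, z_t, t, text_emb, ref_emb,
               w_text: float = 3.0, w_ref: float = 2.0):
    """Blend unconditional, text-only, and reference-only predictions."""
    null_text = torch.zeros_like(text_emb)  # stand-in "empty" condition
    null_ref = torch.zeros_like(ref_emb)
    eps_uncond = denoiser(z_t, t, null_text, null_ref)
    eps_text = denoiser(z_t, t, text_emb, null_ref)
    eps_ref = denoiser(z_t, t, null_text, ref_emb)
    # Push the prediction toward each condition by its guidance weight.
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_ref * (eps_ref - eps_uncond))
```

Under this scheme, raising `w_ref` would favor reproducing the acoustic character of the reference events, while raising `w_text` would favor the semantics of the prompt.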
Problem

Research questions and friction points this paper is trying to address.

Precise control of fine-grained acoustic characteristics in audio generation
Generating customized audio events from user-provided reference samples
Semantic alignment between generated audio and input text prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Customized text-to-audio generation with diffusion models
Framework using reference audio samples for personalized events
Generates audio consistent with customized features and text prompts
Yi Yuan
NetEase Fuxi AI Lab
deep learning, computer vision
Xubo Liu
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK
Haohe Liu
Research Scientist at Meta AI
Audio Generation, Audio Classification, Speech Quality Enhancement, Music Source Separation
Xiyuan Kang
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK
Zhuo Chen
Seed Group, ByteDance Inc.
Yuxuan Wang
Seed Group, ByteDance Inc.
Mark D. Plumbley
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK
Wenwu Wang
Professor, University of Surrey, UK
signal processing, machine learning, machine listening, audio/speech/audio-visual, multimodal fusion