Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions

📅 2024-07-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-audio (T2A) models suffer from insufficient fidelity and controllability under complex prompts, primarily because high-quality paired data rich in temporal, scene-level, and environmental information is scarce. To address this, we introduce Sound-VECaps, the first million-scale (1.66M) audio-visual-enhanced text dataset, built with an LLM-driven, vision-guided caption enhancement pipeline that fuses visual descriptions, audio semantics, and fine-grained labels into detailed, temporally and spatially explicit annotations. We further design a multi-source semantic alignment synthesis pipeline and establish an end-to-end diffusion-based T2A training and evaluation framework. Experiments show consistent improvements: BLEU-4 scores increase by at least 3.2 on AudioCaps and Clotho, generation quality under complex prompts is substantially enhanced, and cross-modal retrieval and understanding tasks achieve state-of-the-art generalization.
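
The core step this summary describes is an LLM that fuses a clip's visual caption, audio caption, and tag labels into one enriched caption. Below is a minimal sketch of that fusion step; the prompt wording, the model name, and the OpenAI-style chat client are illustrative placeholders, not the authors' actual setup.

```python
# Sketch of the caption-enhancement step: fuse per-clip visual, audio, and tag
# information into one detailed caption via an LLM. The prompt, model name, and
# client choice are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def enhance_caption(visual_caption: str, audio_caption: str, tags: list[str]) -> str:
    """Return one enriched caption combining visual, audio, and tag cues."""
    prompt = (
        "Combine the information below into one detailed audio caption. "
        "Describe the order of sound events, the place, and the environment. "
        "Do not mention anything that cannot be heard.\n"
        f"Visual caption: {visual_caption}\n"
        f"Audio caption: {audio_caption}\n"
        f"Tags: {', '.join(tags)}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the paper's LLM may differ
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return response.choices[0].message.content.strip()


if __name__ == "__main__":
    print(enhance_caption(
        visual_caption="A man stands on a rainy street as a bus passes by.",
        audio_caption="A vehicle engine rumbles while rain falls.",
        tags=["bus", "rain", "speech"],
    ))
```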

📝 Abstract
Generative models have shown significant achievements in audio generation tasks. However, existing models struggle with complex and detailed prompts, leading to potential performance degradation. We hypothesize that this problem stems from the simplicity and scarcity of the training data. This work aims to create a large-scale audio dataset with rich captions for improving audio generation models. We first develop an automated pipeline to generate detailed captions by transforming predicted visual captions, audio captions, and tagging labels into comprehensive descriptions using a Large Language Model (LLM). The resulting dataset, Sound-VECaps, comprises 1.66M high-quality audio-caption pairs with enriched details, including the order of audio events, the places where they occur, and environmental information. We then demonstrate that training text-to-audio generation models with Sound-VECaps significantly improves performance on complex prompts. Furthermore, we conduct ablation studies of the models on several downstream audio-language tasks, showing the potential of Sound-VECaps in advancing audio-text representation learning. Our dataset and models are available online at https://yyua8222.github.io/Sound-VECaps-demo/.
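
Training a T2A model on Sound-VECaps, as the abstract describes, ultimately means iterating over (audio, enriched caption) pairs. The sketch below shows a minimal loader for such pairs, assuming a hypothetical CSV manifest with file_name and caption columns; the released dataset's actual format may differ.

```python
# Minimal sketch of feeding Sound-VECaps-style audio-caption pairs into a
# text-to-audio training loop. Manifest filename, column names, and audio
# layout are hypothetical.
import csv
from pathlib import Path

import torchaudio
from torch.utils.data import Dataset


class AudioCaptionDataset(Dataset):
    """Yields (mono waveform, enriched caption) pairs from a CSV manifest."""

    def __init__(self, manifest_csv: str, audio_dir: str, sample_rate: int = 16_000):
        self.audio_dir = Path(audio_dir)
        self.sample_rate = sample_rate
        with open(manifest_csv, newline="", encoding="utf-8") as f:
            self.rows = list(csv.DictReader(f))  # expects columns: file_name, caption

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, idx: int):
        row = self.rows[idx]
        waveform, sr = torchaudio.load(str(self.audio_dir / row["file_name"]))
        if sr != self.sample_rate:
            waveform = torchaudio.functional.resample(waveform, sr, self.sample_rate)
        return waveform.mean(dim=0), row["caption"]  # mono audio, text prompt


# Usage: the caption conditions the text encoder of a diffusion T2A model,
# while the waveform is encoded into the latent target.
# dataset = AudioCaptionDataset("sound_vecaps_train.csv", "audio/")
# waveform, caption = dataset[0]
```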
Problem

Research questions and friction points this paper is trying to address.

Audio Generation
Complex Instructions
Dataset Complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sound-VECaps
Automated Dataset Generation
Complex Instruction Understanding
Yi Yuan
NetEase Fuxi AI Lab
deep learning, computer vision
Dongya Jia
ByteDance Seed
Generative Model, LLM, Audio Generation
Xiaobin Zhuang
ByteDance
Audio Generation
Yuanzhe Chen
ByteDance
Zhengxi Liu
ByteDance
Zhuo Chen
ByteDance
Yuping Wang
ByteDance
Yuxuan Wang
ByteDance
Xubo Liu
CVSSP, University of Surrey
Mark D. Plumbley
CVSSP, University of Surrey
Wenwu Wang
Professor, University of Surrey, UK
signal processing, machine learning, machine listening, audio/speech/audio-visual, multimodal fusion