Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation

📅 2024-11-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address weak modeling of rare audio events and poor generalization in zero-shot and few-shot text-to-audio (TTA) generation, this paper proposes a retrieval-augmented generation (RAG) framework tailored for TTA. Built on the Audiobox flow-matching model, it conditions generation on acoustically relevant segments retrieved from an unlabeled audio corpus. The method combines cross-modal text-audio retrieval, unsupervised audio similarity matching, and conditional flow-matching generation, supplying acoustic priors without any audio annotations. It preserves in-domain semantic alignment while substantially improving zero-shot and few-shot generalization, achieving significant gains across multiple metrics on standard benchmarks and validating both the retrieval strategy and the choice of audio source. The core contribution is the first unsupervised RAG mechanism for TTA generation.
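The retrieval step described above can be sketched in miniature: embed the text query and the unlabeled audio corpus into a shared space, then pick the most cosine-similar clips as extra conditioning. The embeddings below are random stand-ins (a real system would use a joint text-audio encoder such as a CLAP-style model); only the top-k retrieval logic is illustrated.

```python
import numpy as np

def retrieve_top_k(text_emb: np.ndarray, audio_embs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k audio clips whose embeddings are most
    cosine-similar to the text query embedding."""
    text_n = text_emb / np.linalg.norm(text_emb)
    audio_n = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    sims = audio_n @ text_n            # cosine similarity per clip
    return np.argsort(-sims)[:k]       # indices of the top-k matches

# Stand-in embeddings; no labels are needed on the audio side, matching
# the paper's unlabeled-corpus setting.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((100, 512))              # 100 clips, 512-dim
query = corpus[42] + 0.05 * rng.standard_normal(512)  # query near clip 42

top = retrieve_top_k(query, corpus, k=3)
print(top)  # clip 42 ranks first
```

The retrieved clips (here, their indices) would then be encoded and appended to the text conditioning of the generator.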

📝 Abstract
Current leading Text-To-Audio (TTA) generation models suffer from degraded performance in zero-shot and few-shot settings. It is often challenging to generate high-quality audio for audio events that are unseen or uncommon in the training set. Inspired by the success of Retrieval-Augmented Generation (RAG) in Large Language Model (LLM)-based knowledge-intensive tasks, we extend the TTA process with additional conditioning contexts. We propose Audiobox TTA-RAG, a novel retrieval-augmented TTA approach based on Audiobox, a conditional flow-matching audio generation model. Unlike the vanilla Audiobox TTA solution, which generates audio conditioned on text alone, we augment the conditioning input with retrieved audio samples that provide additional acoustic information for generating the target audio. Our retrieval method does not require the external database to have labeled audio, offering more practical use cases. To evaluate our proposed method, we curated test sets in zero-shot and few-shot settings. Our empirical results show that the proposed model can effectively leverage the retrieved audio samples and significantly improve zero-shot and few-shot TTA performance, with large margins on multiple evaluation metrics, while maintaining the ability to generate semantically aligned audio in the in-domain setting. In addition, we investigate the effect of different retrieval methods and data sources.
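The conditional flow-matching objective mentioned in the abstract can be illustrated with a toy example: sample a point on the straight path between noise and data, and regress a model's predicted velocity toward the path's constant velocity, with the conditioning vector concatenating text features and retrieved-audio features. All shapes and the linear "model" below are hypothetical stand-ins for the actual Audiobox architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

def fm_loss(W: np.ndarray, x1: np.ndarray, cond: np.ndarray) -> float:
    """One-sample conditional flow-matching loss."""
    x0 = rng.standard_normal(x1.shape)       # noise endpoint of the path
    t = rng.uniform()                        # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1               # point on the straight path
    inp = np.concatenate([xt, cond, [t]])    # condition on text + audio + t
    v_pred = W @ inp                         # linear stand-in for a neural net
    v_target = x1 - x0                       # straight-path velocity
    return float(np.mean((v_pred - v_target) ** 2))

d_audio, d_cond = 16, 8
text_feat = rng.standard_normal(d_cond // 2)
retrieved_feat = rng.standard_normal(d_cond // 2)  # from retrieved clips
cond = np.concatenate([text_feat, retrieved_feat]) # augmented conditioning
W = np.zeros((d_audio, d_audio + d_cond + 1))
x1 = rng.standard_normal(d_audio)                  # target audio latent
loss = fm_loss(W, x1, cond)
print(loss >= 0.0)
```

In training, this loss would be minimized over a dataset so that integrating the learned velocity field from noise, under the augmented conditioning, yields the target audio.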
Problem

Research questions and friction points this paper is trying to address.

Enhancing zero-shot and few-shot Text-To-Audio generation
Using retrieval-augmented conditioning with text and audio
Improving performance without requiring labeled external audio data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Retrieval-augmented TTA with text and audio
Label-free external audio database retrieval
Improved zero-shot and few-shot audio generation