AI Summary
Text-to-music retrieval suffers from a many-to-many semantic mapping caused by linguistic ambiguity, and existing contrastive models offer little flexibility or user control. To address this, we propose the first diffusion-based generative retrieval framework: a pre-trained diffusion model, supervised by a contrastive teacher model, synthesizes semantically aligned query embeddings in a frozen, non-jointly trained audio latent space; DDIM inversion and negative prompting then enable post-hoc retrieval control and fine-grained interactive intervention. Crucially, our method avoids end-to-end audio generation and operates entirely in the latent space for efficiency. Experiments on multiple benchmarks show substantial gains over contrastive baselines, achieving higher retrieval accuracy while enabling explicit, interpretable user guidance during retrieval.
Abstract
Multimodal contrastive models have achieved strong performance in text-audio retrieval and zero-shot settings, and improving their joint embedding spaces remains an active research area. Far less attention has gone to making these systems controllable and interactive for users. In text-music retrieval, the ambiguity of free-form language creates a many-to-many mapping between queries and targets, often yielding inflexible or unsatisfying results.
We introduce Generative Diffusion Retriever (GDR), a novel framework that leverages diffusion models to generate queries in a retrieval-optimized latent space. This enables controllability through generative tools such as negative prompting and denoising diffusion implicit model (DDIM) inversion, opening a new direction in retrieval control. GDR improves retrieval performance over its contrastive teacher models and supports retrieval in audio-only latent spaces produced by non-jointly trained encoders. Finally, we demonstrate that GDR enables effective post-hoc manipulation of retrieval behavior, enhancing interactive control for text-music retrieval tasks.
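The control mechanism described above, deterministic DDIM sampling of a query latent steered by a negative prompt, followed by nearest-neighbour retrieval, can be sketched in miniature. Everything below is an illustrative toy, not the paper's implementation: `toy_denoiser`, the linear alpha schedule, the 8-dimensional "audio latents", and the two-item retrieval bank are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
T = 50
# Toy cumulative-alpha (alpha_bar) schedule: index 0 is nearly clean, T-1 nearly pure noise.
alphas = np.linspace(0.999, 0.01, T)

# Illustrative condition embeddings (stand-ins for text-encoder outputs).
concept_pos = np.ones(DIM)    # what the retrieved audio SHOULD match
concept_neg = -np.ones(DIM)   # what it should NOT match (negative prompt)

def toy_denoiser(x, t, cond):
    """Toy noise predictor: pretends the clean latent equals `cond`,
    so the implied noise is (x - sqrt(a)*cond) / sqrt(1 - a)."""
    a = alphas[t]
    return (x - np.sqrt(a) * cond) / np.sqrt(1.0 - a)

def ddim_sample(cond, neg, w=3.0):
    """Deterministic DDIM sampling (eta = 0) with negative prompting:
    eps = eps_neg + w * (eps_cond - eps_neg), i.e. the classifier-free-guidance
    form with the unconditional branch replaced by the negative prompt."""
    x = rng.standard_normal(DIM)                   # start from pure noise
    for t in range(T - 1, 0, -1):                  # denoise from noisy to clean
        a_t, a_prev = alphas[t], alphas[t - 1]
        eps_c = toy_denoiser(x, t, cond)
        eps_n = toy_denoiser(x, t, neg)
        eps = eps_n + w * (eps_c - eps_n)          # guided noise estimate
        x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)  # predicted clean latent
        x = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps
    return x

# Retrieval: cosine similarity against a frozen bank of "audio latents".
query = ddim_sample(concept_pos, concept_neg, w=3.0)
bank = np.stack([concept_pos + 0.1 * rng.standard_normal(DIM),   # a matching item
                 concept_neg + 0.1 * rng.standard_normal(DIM)])  # a negative item
sims = bank @ query / (np.linalg.norm(bank, axis=1) * np.linalg.norm(query))
```

With `w > 1` the generated query latent is pushed toward `concept_pos` and away from `concept_neg`, so `sims.argmax()` selects the matching item; setting `w = 1` recovers plain conditional DDIM sampling with no negative-prompt influence.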