AI Summary
Text-to-music retrieval suffers from a many-to-many semantic mapping caused by linguistic ambiguity, and existing contrastive models offer little flexibility or user control. To address this, we propose the first diffusion-based generative retrieval framework: a pre-trained diffusion model, supervised by a contrastive teacher model, synthesizes semantically aligned query embeddings in a frozen, non-jointly trained audio latent space; DDIM inversion and negative prompting then enable post-hoc retrieval control and fine-grained interactive intervention. Crucially, our method avoids end-to-end audio generation and operates entirely in the latent space for efficiency. Experiments on multiple benchmarks show substantial gains over contrastive baselines, achieving higher retrieval accuracy while enabling explicit, interpretable user guidance during retrieval.
Abstract
Multimodal contrastive models have achieved strong performance in text-audio retrieval and zero-shot settings, and improving their joint embedding spaces remains an active research area. Far less attention has gone to making these systems controllable and interactive for users. In text-music retrieval, the ambiguity of free-form language creates a many-to-many mapping between queries and targets, often yielding inflexible or unsatisfying results.
We introduce Generative Diffusion Retriever (GDR), a novel framework that leverages diffusion models to generate queries in a retrieval-optimized latent space. This enables controllability through generative tools such as negative prompting and denoising diffusion implicit model (DDIM) inversion, opening a new direction in retrieval control. GDR improves retrieval performance over its contrastive teacher models and supports retrieval in audio-only latent spaces produced by non-jointly trained encoders. Finally, we demonstrate that GDR enables effective post-hoc manipulation of retrieval behavior, enhancing interactive control for text-music retrieval tasks.
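The control mechanism described above, deterministic DDIM sampling of a query latent steered by a negative prompt, followed by nearest-neighbour retrieval, can be sketched in miniature. Everything below is an illustrative toy, not the paper's implementation: `toy_denoiser`, the linear alpha schedule, the 8-dimensional "audio latents", and the two-item retrieval bank are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8
T = 50
# Toy cumulative-alpha (alpha_bar) schedule: index 0 is nearly clean, T-1 nearly pure noise.
alphas = np.linspace(0.999, 0.01, T)

# Illustrative condition embeddings (stand-ins for text-encoder outputs).
concept_pos = np.ones(DIM)    # what the retrieved audio SHOULD match
concept_neg = -np.ones(DIM)   # what it should NOT match (negative prompt)

def toy_denoiser(x, t, cond):
    """Toy noise predictor: pretends the clean latent equals `cond`,
    so the implied noise is (x - sqrt(a)*cond) / sqrt(1 - a)."""
    a = alphas[t]
    return (x - np.sqrt(a) * cond) / np.sqrt(1.0 - a)

def ddim_sample(cond, neg, w=3.0):
    """Deterministic DDIM sampling (eta = 0) with negative prompting:
    eps = eps_neg + w * (eps_cond - eps_neg), i.e. the classifier-free-guidance
    form with the unconditional branch replaced by the negative prompt."""
    x = rng.standard_normal(DIM)                   # start from pure noise
    for t in range(T - 1, 0, -1):                  # denoise from noisy to clean
        a_t, a_prev = alphas[t], alphas[t - 1]
        eps_c = toy_denoiser(x, t, cond)
        eps_n = toy_denoiser(x, t, neg)
        eps = eps_n + w * (eps_c - eps_n)          # guided noise estimate
        x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)  # predicted clean latent
        x = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps
    return x

# Retrieval: cosine similarity against a frozen bank of "audio latents".
query = ddim_sample(concept_pos, concept_neg, w=3.0)
bank = np.stack([concept_pos + 0.1 * rng.standard_normal(DIM),   # a matching item
                 concept_neg + 0.1 * rng.standard_normal(DIM)])  # a negative item
sims = bank @ query / (np.linalg.norm(bank, axis=1) * np.linalg.norm(query))
```

With `w > 1` the generated query latent is pushed toward `concept_pos` and away from `concept_neg`, so `sims.argmax()` selects the matching item; setting `w = 1` recovers plain conditional DDIM sampling with no negative-prompt influence.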