Scaling Audio-Text Retrieval with Multimodal Large Language Models

📅 2026-02-20
🤖 AI Summary
This work addresses the limitations of existing audio–text retrieval methods, which rely on small-scale encoders and struggle with complex queries requiring reasoning or world knowledge. The authors propose AuroLA, a novel framework that, for the first time, uses a multimodal large language model (MLLM) as a unified backbone for retrieval. AuroLA integrates a scalable data pipeline, multi-granularity annotations, prompt-driven embedding extraction, and a hybrid contrastive loss (Hybrid-NCE) to achieve effective cross-modal alignment. The framework further incorporates MLLM-driven bidirectional re-ranking and a hard negative re-weighting strategy. Remarkably, AuroLA outperforms the current state-of-the-art method, PE-AV, using only approximately 1% of its training data, while also demonstrating positive scaling trends with respect to both data and model size.
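The prompt-driven embedding extraction mentioned above can be illustrated with a minimal sketch: the MLLM is prompted to summarize its audio or text input, and the hidden state aligned with a designated special token is read out as the embedding. The token id, prompt layout, and toy hidden states below are illustrative assumptions, not the paper's actual implementation.

```python
from typing import List

SPECIAL_TOKEN_ID = 32000  # hypothetical id of the summary/embedding token


def extract_embedding(token_ids: List[int],
                      hidden_states: List[List[float]]) -> List[float]:
    """Return the last-layer hidden state aligned with the special token."""
    if len(token_ids) != len(hidden_states):
        raise ValueError("expected one hidden state per token")
    # Use the last occurrence, matching a prompt ending "...summary: <EMB>".
    pos = len(token_ids) - 1 - token_ids[::-1].index(SPECIAL_TOKEN_ID)
    return hidden_states[pos]


# Toy example: 4 tokens with 3-dim hidden states; the special token is last.
ids = [17, 42, 99, SPECIAL_TOKEN_ID]
states = [[0.1, 0.2, 0.3],
          [0.0, 1.0, 0.0],
          [0.5, 0.5, 0.5],
          [0.9, 0.1, 0.4]]
emb = extract_embedding(ids, states)  # the row aligned with SPECIAL_TOKEN_ID
```

In practice the hidden states would come from the MLLM's final transformer layer, and the resulting vector would be L2-normalized before computing cosine similarities for retrieval.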

📝 Abstract
Audio-text retrieval is crucial for bridging acoustic signals and natural language. While contrastive dual-encoder architectures like CLAP have shown promise, they are fundamentally limited by the capacity of small-scale encoders; in particular, their text encoders struggle to understand complex queries that require reasoning or world knowledge. In this paper, we propose AuroLA, a novel contrastive language-audio pre-training framework that re-purposes Multimodal Large Language Models (MLLMs) as a unified backbone for retrieval. We make three contributions: (i) we construct a scalable data pipeline that curates diverse audio from multiple sources and generates multi-granular captions, ranging from long descriptions to structured tags, via automated annotation; (ii) we adapt an MLLM for retrieval by prompting it to summarize the audio/text input and using the hidden state of a special token as the audio/text embedding. For model training, we devise a novel Hybrid-NCE loss, which employs multi-granular supervision and hard-negative reweighting to robustly align audio with diverse textual supervision; and (iii) we design an MLLM-based bidirectional re-ranking module that refines retrieval candidates through deep cross-modal interaction. Extensive experiments demonstrate that AuroLA consistently outperforms state-of-the-art models, including the recent PE-AV, while utilizing only approximately 1% of PE-AV's training data. Lastly, we observe clear scaling trends regarding dataset size and model capacity, validating the effectiveness of MLLMs as a unified backbone for audio-text retrieval. Code is available at https://github.com/Jazzcharles/AuroLA.
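The hard-negative reweighting idea behind the Hybrid-NCE loss can be sketched with a plain InfoNCE term in which each negative's contribution to the denominator is scaled by an exponential of its similarity, so that harder (more similar) negatives are penalized more. The function below is a minimal single-anchor sketch, assuming a cosine-similarity row, a temperature, and a hardness coefficient `beta`; it is not the paper's exact formulation.

```python
import math
from typing import List


def infonce_hard_neg(sim_row: List[float], pos_idx: int,
                     beta: float = 1.0, temperature: float = 0.07) -> float:
    """InfoNCE loss for one anchor with hardness-reweighted negatives.

    Each negative j is up-weighted by exp(beta * sim_row[j]), so negatives
    that are more similar to the anchor dominate the denominator.
    """
    logits = [s / temperature for s in sim_row]
    pos = math.exp(logits[pos_idx])
    neg = sum(math.exp(beta * sim_row[j]) * math.exp(logits[j])
              for j in range(len(sim_row)) if j != pos_idx)
    return -math.log(pos / (pos + neg))


# A harder negative (similarity 0.8 vs 0.1) yields a larger loss.
easy = infonce_hard_neg([0.9, 0.1, 0.1], pos_idx=0)
hard = infonce_hard_neg([0.9, 0.8, 0.1], pos_idx=0)
```

In a full batch, this per-anchor term would be averaged over both retrieval directions (audio-to-text and text-to-audio), with the multi-granular captions supplying additional positive pairs per audio clip.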
Problem

Research questions and friction points this paper is trying to address.

audio-text retrieval
multimodal large language models
complex query understanding
encoder capacity limitation
cross-modal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models
Audio-Text Retrieval
Hybrid-NCE Loss
Multi-granular Captioning
Bidirectional Re-ranking