ATIR: Towards Audio-Text Interleaved Contextual Retrieval

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited support for audio-text interleaved queries in existing multimodal information retrieval research, which has predominantly focused on images. We introduce Audio-Text Interleaved Retrieval (ATIR), a new task in which a contextual retrieval query can alternate between audio and text inputs. To support it, we construct a unified benchmark that integrates automatic speech recognition, question answering, and retrieval datasets, and we train a dedicated retriever based on a multimodal large language model. A key contribution is an audio token compression mechanism that is orthogonal to existing compression approaches; it mitigates the problem of excessively long audio sequences while audio and text representations are jointly aligned within the retrieval framework. Experimental results show that the ATIR model substantially outperforms multiple strong baselines on the proposed benchmark, supporting its effectiveness for semantic audio retrieval.

📝 Abstract

Audio carries richer information than text, including emotion, speaker traits, and environmental context, while also enabling lower-latency processing compared to speech-to-text pipelines. However, recent multimodal information retrieval research has predominantly focused on images, largely overlooking audio, especially in the setting of interleaved audio-text contextual retrieval. In this work, we introduce the Audio-Text Interleaved contextual Retrieval (ATIR) task, where queries can alternate between audio and text modalities. We construct an ATIR benchmark by integrating several Automatic Speech Recognition (ASR), QA, and retrieval datasets, ultimately unifying four types of contextual retrieval tasks. This benchmark substantially addresses the limitations of existing audio retrieval datasets in semantic retrieval. To study this task, we evaluate several off-the-shelf retrievers and train our ATIR model based on a Multimodal Large Language Model (MLLM). We further introduce a novel token compression mechanism that is orthogonal to existing compression methods, thereby alleviating the issue of excessive audio tokens in MLLM-based ATIR models. Experimental results demonstrate that our ATIR model achieves substantial improvements over strong baselines.
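
To make the idea of an interleaved query and the audio token problem more concrete, here is a minimal sketch in Python. It is an illustration only, not the paper's method: the abstract does not describe how ATIR compresses tokens, so the sketch substitutes plain mean pooling over fixed windows. The names `TextSegment`, `AudioSegment`, `mean_pool`, and `flatten_query`, and the `window` parameter, are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union

Vector = List[float]


@dataclass
class TextSegment:
    tokens: List[Vector]  # one embedding per text token (hypothetical layout)


@dataclass
class AudioSegment:
    frames: List[Vector]  # one embedding per audio token (hypothetical layout)


Segment = Union[TextSegment, AudioSegment]


def mean_pool(frames: List[Vector], window: int) -> List[Vector]:
    """Average every `window` consecutive frames into a single vector.

    This is a stand-in compression step, assumed for illustration;
    it is NOT the mechanism proposed in the paper.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    pooled: List[Vector] = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(v[i] for v in chunk) / len(chunk) for i in range(dim)])
    return pooled


def flatten_query(segments: List[Segment], window: int) -> List[Vector]:
    """Turn an interleaved query into one embedding sequence.

    Text tokens are kept as-is; audio frames are compressed first,
    so long audio spans contribute fewer positions to the sequence.
    """
    sequence: List[Vector] = []
    for seg in segments:
        if isinstance(seg, AudioSegment):
            sequence.extend(mean_pool(seg.frames, window))
        else:
            sequence.extend(seg.tokens)
    return sequence


# A toy query: text, then audio, then text again.
query = [
    TextSegment(tokens=[[1.0, 0.0]]),
    AudioSegment(frames=[[0.0, 2.0], [0.0, 4.0], [2.0, 2.0]]),
    TextSegment(tokens=[[3.0, 3.0]]),
]
compressed = flatten_query(query, window=2)
```

In this toy query the three audio frames shrink to two positions while the text tokens pass through unchanged, which is the general effect a compression step is meant to have on sequence length. A real retriever would then encode the resulting sequence with the MLLM and compare it against candidate embeddings.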

Problem

Research questions and friction points this paper is trying to address.

audio-text retrieval

interleaved retrieval

multimodal information retrieval

contextual retrieval

audio retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio-Text Interleaved Retrieval

Multimodal Large Language Model

Token Compression