🤖 AI Summary
This work addresses the lack of evaluation for advanced reasoning capabilities in existing text–audio retrieval benchmarks by introducing ReasonAudio, the first reasoning-oriented benchmark in this domain. ReasonAudio encompasses five task types—negation, temporal ordering, concurrency, duration, and mixed temporal reasoning—and includes 1,000 structured textual queries paired with 10,000 composite audio clips. Leveraging human-curated data and a contrastive learning framework, the study systematically evaluates ten state-of-the-art models, revealing significant performance gaps in reasoning-intensive scenarios, particularly in handling negation and duration discrimination. Notably, embedding strategies based on multimodal large language models fail to preserve the reasoning capacities of their underlying architectures. This work pioneers the integration of complex semantic reasoning into text–audio retrieval and exposes critical limitations of current approaches.
📝 Abstract
As multimodal content continues to expand at a rapid pace, audio retrieval has emerged as a key enabling technology for media search, content organization, and intelligent assistants. However, most existing benchmarks concentrate on semantic matching and fail to capture the fact that real-world queries often demand advanced reasoning abilities, including negation understanding, temporal ordering, concurrent event recognition, and duration discrimination. To address this gap, we introduce ReasonAudio, the first reasoning-intensive benchmark for Text-Audio Retrieval, comprising 1,000 queries and 10,000 composite audio clips across five fundamental reasoning tasks: Negation, Order, Overlap, Duration, and Mix. Although these tasks are intuitive for humans and straightforward to construct, they pose significant challenges to current models. Our evaluation of ten state-of-the-art models yields two main findings. First, all models struggle with reasoning-intensive audio retrieval, performing particularly poorly on Negation and Duration while showing relatively better results on Overlap and Order. Second, Multimodal Large Language Model-based embedding models fail to inherit the reasoning capabilities of their backbones through contrastive fine-tuning, suggesting that current training paradigms are insufficient to preserve reasoning capacity in retrieval settings.