A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks evaluate multimodal large language models (MLLMs) only on single audio tasks, making it difficult to assess their cross-category audio reasoning capabilities. To address this limitation, this work proposes the Audio Reasoning Tasks (ART) benchmark, which establishes the first evaluation framework specifically designed for joint reasoning across diverse audio categories. ART introduces composite test tasks that integrate multiple dimensions of audio understanding and reasoning, thereby overcoming the constraints of traditional single-task evaluations. Experimental results demonstrate that ART effectively differentiates models’ reasoning performance in complex audio scenarios, offering a new standard for evaluating audio comprehension in multimodal large language models.
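The composite tasks described above can be illustrated with a small sketch. The data structures and the sample question below are purely hypothetical (the paper's actual task format is not specified here); the point is only to show how answering one question can force a model to combine two single-task skills, here speaker diarization and gender identification.

```python
# Hypothetical sketch of a composite audio-reasoning item in the spirit of
# ART-style cross-category questions. All field names are illustrative
# assumptions, not the benchmark's actual schema.

from dataclasses import dataclass

@dataclass
class AudioAnnotation:
    speaker_turns: list   # (speaker_id, start_s, end_s) — diarization output
    speaker_gender: dict  # speaker_id -> "male" / "female" — gender-ID output

def answer_composite_question(ann: AudioAnnotation) -> str:
    """Example composite question: 'What is the gender of the speaker who
    talks the longest?' Answering requires combining diarization (who
    speaks when) with gender identification (a per-speaker attribute)."""
    totals: dict = {}
    for spk, start, end in ann.speaker_turns:
        totals[spk] = totals.get(spk, 0.0) + (end - start)
    longest = max(totals, key=totals.get)
    return ann.speaker_gender[longest]

ann = AudioAnnotation(
    speaker_turns=[("A", 0.0, 4.0), ("B", 4.0, 12.0), ("A", 12.0, 14.0)],
    speaker_gender={"A": "female", "B": "male"},
)
print(answer_composite_question(ann))  # → male (B speaks 8 s vs. A's 6 s)
```

A benchmark built only from isolated diarization or gender-ID items could not detect a model that performs each task well but fails to chain them, which is the gap the composite format targets.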

📝 Abstract
Existing benchmarks for the audio modality of multimodal large language models evaluate individual audio tasks, such as speaker diarization or gender identification, in isolation. They therefore cannot verify whether a multimodal model can answer questions that require reasoning across audio tasks of different categories. To address this issue, we propose Audio Reasoning Tasks (ART), a new benchmark for assessing the ability of multimodal models to solve problems that require reasoning over audio signals.
Problem

Research questions and friction points this paper is trying to address.

audio reasoning
multimodal large language models
benchmark
audio tasks
reasoning capability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio Reasoning
Multimodal Large Language Models
Benchmark
Cross-task Reasoning
Audio Understanding