AI Summary
This work addresses the inefficiency and opacity of audio agents built on strong large audio-language models (LALMs), where indiscriminate external tool invocation introduces redundancy or becomes a bottleneck. The authors propose Audio-Mind, an auditable and pluggable audio understanding framework that assesses whether the initial evidence is sufficient and triggers planner-guided tool calls only when an evidence gap is detected, thereby avoiding unnecessary agentic decomposition. By combining a strong audio frontend with this conditional evidence acquisition mechanism, the approach produces traceable reasoning chains that balance computational efficiency with interpretability. On MMAR and MSU-Bench, the method reaches accuracies of 80.4% and 82.8%, respectively, outperforming prior audio-agent baselines while producing auditable, explainable reasoning traces.
Abstract
Audio agents extend large audio-language models (LALMs) by decomposing audio questions into tool calls, intermediate evidence, and iterative reasoning steps. However, as LALMs become stronger, the key challenge shifts from enabling tool use to determining when agentic evidence acquisition genuinely benefits audio understanding. We propose Audio-Mind, an auditable and pluggable framework for conditional evidence acquisition in audio understanding. Audio-Mind dynamically combines a strong frontend with planner-guided tool use, preserving frontend judgment when initial evidence is sufficient while acquiring bounded external evidence for questions with unresolved evidence gaps. Experiments on MMAR and MSU-Bench show that Audio-Mind outperforms prior audio-agent baselines, reaching 80.4% accuracy on MMAR and 82.8% accuracy on MSU-Bench. A matched-backbone comparison highlights why this design matters: under strong audio frontends, agentic decomposition can become an orchestration bottleneck when the workflow does not preserve the frontend's holistic audio-grounded judgment. Beyond accuracy, Audio-Mind produces higher-quality, auditable reasoning traces that expose uncertainty, tool evidence, and answer rationales, offering a potential basis for more reliable audio-QA annotation and error analysis.
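The conditional evidence acquisition idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `FrontendOutput`, `Trace`, `answer_question`, the `confidence` score, the `threshold` value, and the `max_tool_calls` bound are all hypothetical, chosen only to show the control flow (keep the frontend answer when evidence looks sufficient, otherwise make a bounded number of planner-selected tool calls and record each step in a trace).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FrontendOutput:
    """Hypothetical output of the audio frontend (not the paper's schema)."""
    answer: str
    confidence: float                               # assumed sufficiency score in [0, 1]
    gaps: List[str] = field(default_factory=list)   # unresolved evidence gaps


@dataclass
class Trace:
    """Auditable record of how the final answer was reached."""
    answer: str
    used_tools: bool
    steps: List[str] = field(default_factory=list)


def answer_question(
    question: str,
    frontend: Callable[[str], FrontendOutput],
    planner: Callable[[str, List[str]], List[str]],
    tools: Dict[str, Callable[[str], str]],
    revise: Callable[[FrontendOutput, Dict[str, str]], str],
    threshold: float = 0.7,      # assumed value, for illustration only
    max_tool_calls: int = 3,     # bound on external evidence acquisition
) -> Trace:
    first = frontend(question)
    trace = Trace(answer=first.answer, used_tools=False)
    trace.steps.append(
        f"frontend: answer={first.answer!r} confidence={first.confidence:.2f}"
    )

    # Evidence is judged sufficient: preserve the frontend's answer.
    if first.confidence >= threshold and not first.gaps:
        trace.steps.append("evidence sufficient: keeping frontend answer")
        return trace

    # Otherwise ask the planner which tools to call, capped at max_tool_calls.
    evidence: Dict[str, str] = {}
    for name in planner(question, first.gaps)[:max_tool_calls]:
        if name not in tools:
            trace.steps.append(f"skipped unknown tool {name!r}")
            continue
        evidence[name] = tools[name](question)
        trace.steps.append(f"tool {name}: {evidence[name]}")

    trace.used_tools = bool(evidence)
    trace.answer = revise(first, evidence)
    trace.steps.append(f"final: {trace.answer!r}")
    return trace
```

In this sketch the `frontend`, `planner`, `tools`, and `revise` callables are pluggable, which loosely mirrors the "pluggable" framing in the abstract; how Audio-Mind actually scores sufficiency or selects tools is not specified here.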