AuTAgent: A Reinforcement Learning Framework for Tool-Augmented Audio Reasoning

📅 2026-02-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large audio language models (LALMs) struggle to make precise acoustic measurements in complex audio reasoning tasks, and existing tool-augmented approaches either overload the model with information or adapt poorly to context. The paper proposes AuTAgent, a reinforcement learning agent framework that learns when and which external audio analysis tools to invoke, trained with sparse feedback and a differential reward mechanism. This reduces redundant tool calls and improves reasoning accuracy. AuTAgent raises the accuracy of open-source and closed-source base models by 4.20% and 6.20%, respectively, on the MMAU Test-mini benchmark and by 9.80% and 8.00%, respectively, on the MMAR benchmark, and it transfers well across tasks.

📝 Abstract

Large Audio Language Models (LALMs) excel at perception but struggle with complex reasoning requiring precise acoustic measurements. While external tools can extract fine-grained features like exact tempo or pitch, effective integration remains challenging: naively using all tools causes information overload, while prompt-based selection fails to assess context-dependent utility. To address this, we propose AuTAgent (Audio Tool Agent), a reinforcement learning framework that learns when and which tools to invoke. By employing a sparse-feedback training strategy with a novel Differential Reward mechanism, the agent learns to filter out irrelevant tools and to invoke external assistance only when it yields a net performance gain over the base model. Experimental results confirm that AuTAgent compensates for the representation bottleneck of LALMs by providing verifiable acoustic evidence. On MMAU Test-mini, it improves accuracy by 4.20% for the open-source backbone and 6.20% for the closed-source backbone; on MMAR, the corresponding gains are 9.80% and 8.00%. Further experiments demonstrate strong transferability. We highlight the complementary role of external tools in augmenting audio model reasoning.
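
The abstract describes the Differential Reward only in words: tool use is rewarded when it yields a net gain over the base model. The sketch below is one plausible reading of that idea, not the paper's actual formulation. The function name `differential_reward`, its arguments, and the optional per-tool penalty `tool_cost` are all assumptions introduced for illustration.

```python
from typing import Sequence


def differential_reward(
    base_correct: bool,
    tool_correct: bool,
    tools_used: Sequence[str],
    tool_cost: float = 0.0,
) -> float:
    """Hypothetical reward: change in answer correctness relative to the
    tool-free base model, minus an assumed per-tool penalty.

    base_correct -- whether the base model answered correctly with no tools
    tool_correct -- whether it answered correctly given the selected tool outputs
    tools_used   -- names of the tools the agent chose to invoke
    tool_cost    -- assumed penalty per invoked tool (not taken from the paper)
    """
    # +1 if tools turned a wrong answer into a right one,
    #  0 if the outcome did not change,
    # -1 if tools turned a right answer into a wrong one.
    gain = float(tool_correct) - float(base_correct)
    return gain - tool_cost * len(tools_used)


# Tools fixed a wrong answer: positive reward.
helped = differential_reward(False, True, ["tempo"])
# Tools changed nothing: zero reward, so there is no incentive to call them.
neutral = differential_reward(True, True, ["tempo", "pitch"])
# Tools misled the model: negative reward.
hurt = differential_reward(True, False, ["pitch"])
```

Under this reading, a tool call that leaves the answer unchanged earns nothing, and a nonzero `tool_cost` would make it strictly worse than calling no tool, which is one way an agent could learn to filter out irrelevant tools.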

Problem

Research questions and friction points this paper is trying to address.

Audio Reasoning

Tool Augmentation

Large Audio Language Models

Acoustic Measurement

Reasoning Bottleneck

Innovation

Methods, ideas, or system contributions that make the work stand out.

Tool-Augmented Reasoning

Reinforcement Learning

Audio Language Models