🤖 AI Summary
This work addresses the challenge of training Tool-Integrated Reasoning (TIR) capabilities in interactive multimodal tool-use agents. Methodologically, we propose Turn-level Adjudicated Reinforcement Learning (TARL): (1) an RL sandbox environment supporting interleaved speech–text rollouts; (2) a large language model (LLM) serving as a fine-grained adjudicator for turn-level credit assignment; and (3) a hybrid task curriculum that encourages long-horizon exploration. Our key contribution is the first integration of LLM-based adjudication, multimodal foundation model fine-tuning, and speech–text co-generated rollouts, enabling end-to-end TIR training. Experiments show that our approach achieves a task success rate more than 6% higher than strong baselines on the text-only τ-bench. Moreover, it successfully endows multimodal models with speech-interactive tool-invocation capabilities.
📝 Abstract
Effective interactive tool use requires agents to master Tool-Integrated Reasoning (TIR): a complex process involving multi-turn planning and long-context dialogue management. To train agents for this dynamic process, particularly in multimodal contexts, we introduce a sandbox environment for reinforcement learning (RL) that supports interleaved speech-text rollouts. Our core strategy, Turn-level Adjudicated Reinforcement Learning (TARL), addresses the challenge of credit assignment in long-horizon tasks by employing a Large Language Model (LLM) as a judge that provides turn-level evaluation. To enhance exploration, we integrate a mixed-task training curriculum that includes mathematical reasoning problems. This unified approach raises the task pass rate on the text-based $τ$-bench by over 6% compared to strong RL baselines. Crucially, we demonstrate our framework's suitability for fine-tuning a multimodal foundation model for agentic tasks. By training a base multimodal LLM on interleaved speech-text rollouts, we equip it with tool-use abilities, paving the way for more natural, voice-driven interactive agents.
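To make the turn-level credit-assignment idea concrete, the sketch below shows how an adjudicator score per turn can replace a single sparse episode reward when computing returns for RL. This is a minimal illustration, not the paper's implementation: the `judge_turn` stub, the `Turn` structure, and the discount scheme are all assumptions, and a real system would prompt an LLM with the full dialogue context instead of the keyword check used here.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Turn:
    agent_msg: str    # agent utterance or tool call for this turn
    tool_result: str  # environment / tool feedback

def judge_turn(turn: Turn) -> float:
    """Stand-in for an LLM adjudicator: returns a score in [0, 1].
    Here we use a trivial heuristic purely for illustration."""
    return 1.0 if "error" not in turn.tool_result.lower() else 0.0

def turn_level_returns(trajectory: List[Turn],
                       judge: Callable[[Turn], float],
                       gamma: float = 0.95) -> List[float]:
    """Discounted return at each turn from per-turn judged rewards,
    rather than one sparse reward at the end of a long rollout."""
    rewards = [judge(t) for t in trajectory]
    returns, g = [0.0] * len(rewards), 0.0
    for i in reversed(range(len(rewards))):
        g = rewards[i] + gamma * g
        returns[i] = g
    return returns

traj = [
    Turn("search_flights(...)", "3 flights found"),
    Turn("book_flight(id=7)", "ERROR: seat unavailable"),
    Turn("book_flight(id=2)", "booking confirmed"),
]
print(turn_level_returns(traj, judge_turn))  # [1.9025, 0.95, 1.0]
```

Because each turn receives its own judged reward, the failed booking attempt in the middle of the trajectory is penalized directly instead of diluting a single end-of-episode signal across all turns.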