Process-Supervised Reinforcement Learning for Interactive Multimodal Tool-Use Agents

📅 2025-09-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of training Tool Integrated Reasoning (TIR) in interactive multimodal tool-use agents. Methodologically, the authors propose Turn-level Adjudicated Reinforcement Learning (TARL), which combines: (1) an RL sandbox environment supporting interleaved speech–text rollouts; (2) a large language model (LLM) acting as a fine-grained adjudicator for turn-level credit assignment; and (3) a mixed-task training curriculum that encourages long-horizon exploration. The key contribution is the first integration of LLM-based adjudication, multimodal foundation-model fine-tuning, and speech–text co-generated rollouts, enabling end-to-end TIR training. Experiments show a task pass rate over 6% higher than strong RL baselines on the text-only τ-bench, and the framework successfully endows a multimodal base model with speech-interactive tool-invocation capabilities.
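The summary above describes turn-level credit assignment via an LLM judge but gives no formula. A minimal sketch under stated assumptions: the judge returns a score per turn, which is blended with the final task reward and converted into discounted per-turn returns. The function name, the blending weight `beta`, and crediting the task reward at the last turn are all illustrative assumptions, not details from the paper.

```python
def turn_level_returns(judge_scores, task_reward, gamma=0.99, beta=0.5):
    """Blend per-turn judge scores with the final task reward,
    then compute discounted per-turn returns (hypothetical scheme)."""
    # Per-turn reward: judge signal, plus the task reward credited
    # at the final turn (an assumption for this sketch).
    rewards = [beta * s for s in judge_scores]
    rewards[-1] += (1.0 - beta) * task_reward

    # Standard discounted return: G_t = r_t + gamma * G_{t+1}.
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns
```

Each turn then gets its own return instead of a single trajectory-level reward, which is the essence of the process-supervised signal the paper advocates.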

📝 Abstract
Effective interactive tool use requires agents to master Tool Integrated Reasoning (TIR): a complex process involving multi-turn planning and long-context dialogue management. To train agents for this dynamic process, particularly in multi-modal contexts, we introduce a sandbox environment for reinforcement learning (RL) that supports interleaved speech-text rollouts. Our core strategy, Turn-level Adjudicated Reinforcement Learning (TARL), addresses the challenge of credit assignment in long-horizon tasks by employing a Large Language Model (LLM) as a judge to provide turn-level evaluation. To enhance exploration, we integrate a mixed-task training curriculum with mathematical reasoning problems. This unified approach boosts the task pass rate on the text-based $τ$-bench by over 6% compared to strong RL baselines. Crucially, we demonstrate our framework's suitability for fine-tuning a multi-modal foundation model for agentic tasks. By training a base multi-modal LLM on interleaved speech-text rollouts, we equip it with tool-use abilities, paving the way for more natural, voice-driven interactive agents.
Problem

Research questions and friction points this paper is trying to address.

Training agents for interactive multimodal tool use
Addressing credit assignment in long-horizon reinforcement learning tasks
Enhancing exploration through mixed-task training curriculum
Innovation

Methods, ideas, or system contributions that make the work stand out.

Process-supervised reinforcement learning for tool-use
LLM as judge for turn-level credit assignment
Multimodal foundation model fine-tuning with speech-text
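The mixed-task curriculum listed above can be sketched as a simple sampler that interleaves math-reasoning problems into tool-use training. The mixing ratio and task pools below are assumptions for illustration; the paper does not specify them.

```python
import random

def make_curriculum(tool_tasks, math_tasks, math_ratio=0.3, seed=0):
    """Yield an infinite task stream, drawing a math-reasoning
    problem with probability `math_ratio` and a tool-use task
    otherwise (ratio is a hypothetical choice)."""
    rng = random.Random(seed)
    while True:
        pool = math_tasks if rng.random() < math_ratio else tool_tasks
        yield rng.choice(pool)
```

A fixed ratio is the simplest design; an annealed or difficulty-aware schedule would be a natural variant.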