🤖 AI Summary
This work addresses the challenge of training Tool-Integrated Reasoning (TIR) capabilities in interactive multimodal tool-use agents. Methodologically, we propose Turn-level Adjudicated Reinforcement Learning (TARL): (1) an RL sandbox environment supporting interleaved speech–text rollouts; (2) a large language model (LLM) serving as a fine-grained adjudicator for turn-level credit assignment; and (3) a hybrid task curriculum that encourages long-horizon exploration. Our key contribution is the first integration of LLM-based adjudication, multimodal foundation model fine-tuning, and speech–text co-generated rollouts, enabling end-to-end TIR training. Experiments show that our approach achieves a task success rate more than 6% higher than strong baselines on the text-only τ-bench. Moreover, it successfully endows multimodal models with speech-interactive tool-invocation capabilities.
📝 Abstract
Effective interactive tool use requires agents to master Tool-Integrated Reasoning (TIR): a complex process involving multi-turn planning and long-context dialogue management. To train agents for this dynamic process, particularly in multimodal contexts, we introduce a sandbox environment for reinforcement learning (RL) that supports interleaved speech-text rollouts. Our core strategy, Turn-level Adjudicated Reinforcement Learning (TARL), addresses the challenge of credit assignment in long-horizon tasks by employing a Large Language Model (LLM) as a judge that provides turn-level evaluation. To enhance exploration, we integrate a mixed-task training curriculum that includes mathematical reasoning problems. This unified approach raises the task pass rate on the text-based $τ$-bench by over 6% compared to strong RL baselines. Crucially, we demonstrate our framework's suitability for fine-tuning a multimodal foundation model for agentic tasks. By training a base multimodal LLM on interleaved speech-text rollouts, we equip it with tool-use abilities, paving the way for more natural, voice-driven interactive agents.
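To make the turn-level credit-assignment idea concrete, the sketch below shows how an adjudicator score per turn can replace a single sparse episode reward when computing returns for RL. This is a minimal illustration, not the paper's implementation: the `judge_turn` stub, the `Turn` structure, and the discount scheme are all assumptions, and a real system would prompt an LLM with the full dialogue context instead of the keyword check used here.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Turn:
    agent_msg: str    # agent utterance or tool call for this turn
    tool_result: str  # environment / tool feedback

def judge_turn(turn: Turn) -> float:
    """Stand-in for an LLM adjudicator: returns a score in [0, 1].
    Here we use a trivial heuristic purely for illustration."""
    return 1.0 if "error" not in turn.tool_result.lower() else 0.0

def turn_level_returns(trajectory: List[Turn],
                       judge: Callable[[Turn], float],
                       gamma: float = 0.95) -> List[float]:
    """Discounted return at each turn from per-turn judged rewards,
    rather than one sparse reward at the end of a long rollout."""
    rewards = [judge(t) for t in trajectory]
    returns, g = [0.0] * len(rewards), 0.0
    for i in reversed(range(len(rewards))):
        g = rewards[i] + gamma * g
        returns[i] = g
    return returns

traj = [
    Turn("search_flights(...)", "3 flights found"),
    Turn("book_flight(id=7)", "ERROR: seat unavailable"),
    Turn("book_flight(id=2)", "booking confirmed"),
]
print(turn_level_returns(traj, judge_turn))  # [1.9025, 0.95, 1.0]
```

Because each turn receives its own judged reward, the failed booking attempt in the middle of the trajectory is penalized directly instead of diluting a single end-of-episode signal across all turns.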