🤖 AI Summary
This work addresses the challenges of post-training multi-turn interactive tool-using agents, which are hindered by the difficulty of scaling high-quality synthetic data and by the inefficiency caused by noisy user simulations in reinforcement learning. The authors propose a unified framework that integrates self-evolving synthetic data generation with validator-based reinforcement learning. Specifically, a multi-agent system produces tool-augmented dialogues equipped with executable verifiers, and a closed-loop self-evolution mechanism enhances data reliability. Training proceeds through staged trajectory-level group relative policy optimization (GRPO-style) with a novel verifiable reward mechanism and a dynamic filtering strategy, enabling efficient, annotation-free learning. Evaluated on tau^2-bench, the model achieves 73.0% and 98.3% pass^1 rates on the Airline and Telecom tasks, respectively, matching or surpassing current state-of-the-art methods.
📄 Abstract
Interactive tool-using agents must solve real-world tasks through multi-turn interaction with both humans and external environments, which requires dialogue state tracking and multi-step tool execution while following complex instructions. Post-training such agents is challenging: synthesizing high-quality multi-turn tool-use data is difficult to scale, and reinforcement learning (RL) can suffer from noisy signals introduced by user simulation, degrading training efficiency. We propose a unified framework that combines a self-evolving data agent with verifier-based RL. Our system, EigenData, is a hierarchical multi-agent engine that synthesizes tool-grounded dialogues together with executable per-instance checkers, and improves generation reliability via a closed-loop self-evolution process that updates prompts and workflows. Building on the synthetic data, we develop an RL recipe that first fine-tunes the user model and then applies GRPO-style training with trajectory-level group-relative advantages and dynamic filtering, yielding consistent improvements beyond SFT. Evaluated on tau^2-bench, our best model reaches 73.0% pass^1 on Airline and 98.3% pass^1 on Telecom, matching or exceeding frontier models. Overall, our results suggest a scalable pathway for bootstrapping complex tool-using behaviors without expensive human annotation.
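To make the RL recipe concrete, the following is a minimal sketch of trajectory-level group-relative advantage computation with dynamic filtering, in the spirit of the GRPO-style training described above. The function names, the normalization, and the specific filtering rule (dropping groups whose trajectory rewards are all identical, since they carry zero advantage) are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch (assumed details): trajectory-level group-relative advantages
# plus dynamic filtering of uninformative rollout groups.
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize one group of trajectory-level rewards to advantages
    by subtracting the group mean and dividing by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def filter_groups(groups):
    """Dynamic filtering (assumed rule): drop groups whose verifier
    rewards are all identical, since every advantage would be zero
    and the group contributes no gradient signal."""
    return [g for g in groups if max(g) != min(g)]

# Example: one informative group and one degenerate group
# (e.g. every rollout in the second group passed its checker).
groups = [[1.0, 0.0, 1.0, 0.0],
          [1.0, 1.0, 1.0, 1.0]]
kept = filter_groups(groups)
advs = [group_advantages(g) for g in kept]
```

Filtering before normalization matters here: a constant-reward group has zero standard deviation, so keeping it would either divide by `eps` or waste batch capacity on trajectories with no learning signal.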