🤖 AI Summary
This work addresses the scarcity of interactive training environments and the limited diversity and scalability of existing synthesis approaches, both of which hinder self-exploratory learning for general-purpose agents. The paper introduces, for the first time, a framework for constructing highly diverse, scalable, and verifiable interactive environments from scratch. It ensures environmental reliability through procedural testing, and guarantees task completeness and solvability by combining tool dependency graph expansion with executable action validation. The proposed method substantially improves agent performance on unseen multi-turn tool-use benchmarks such as τ²-Bench and VitaBench, demonstrating that the scale and diversity of training environments play a critical role in enhancing agent generalization.
📝 Abstract
Training generalist agents capable of adapting to diverse scenarios requires interactive environments for self-exploration. However, such environments remain critically scarce, and existing synthesis methods suffer from significant limitations in environmental diversity and scalability. To address these challenges, we introduce ScaleEnv, a framework that constructs fully interactive environments and verifiable tasks entirely from scratch. Specifically, ScaleEnv ensures environment reliability through procedural testing, and guarantees task completeness and solvability via tool dependency graph expansion and executable action verification. By enabling agents to learn through exploration within ScaleEnv, we demonstrate significant performance improvements on unseen, multi-turn tool-use benchmarks such as $\tau^2$-Bench and VitaBench, highlighting strong generalization capabilities. Furthermore, we investigate the relationship between the number of training domains and model generalization performance, providing empirical evidence that scaling environmental diversity is critical for robust agent learning.