🤖 AI Summary
Training search agents against real-world web APIs is prohibitively expensive, while static data snapshots often introduce reward distortion that undermines reinforcement learning stability. To address this, this work proposes SearchGym, the first simulation framework for search agents to achieve both low cost and high fidelity. SearchGym uses a generative pipeline to construct verifiable knowledge graphs and aligned document corpora, yielding reasoning tasks that are factually consistent and strictly solvable. Its companion curriculum learning strategy, SearchGym-RL, drives progressive policy optimization from simple interactions to complex planning. In experiments, an agent based on Qwen2.5-7B-Base outperforms the ASearcher baseline by an average relative margin of 10.6% across nine benchmarks, validating the framework's effectiveness and scalability in enhancing sim-to-real transfer for search agents.
📝 Abstract
Search agents have emerged as a pivotal paradigm for solving open-ended, knowledge-intensive reasoning tasks. However, training these agents via Reinforcement Learning (RL) faces a critical dilemma: interacting with live commercial Web APIs is prohibitively expensive, while relying on static data snapshots often introduces noise through data misalignment. This misalignment corrupts reward signals and destabilizes training by penalizing correct reasoning or rewarding hallucination. To address this, we propose SearchGym, a simulation environment designed to bootstrap robust search agents. SearchGym employs a rigorous generative pipeline to construct a verifiable knowledge graph and an aligned document corpus, ensuring that every reasoning task is factually grounded and strictly solvable. Building on this controllable environment, we introduce SearchGym-RL, a curriculum learning methodology that progressively optimizes agent policies through purified feedback, evolving from basic interactions to complex, long-horizon planning. Extensive experiments across the Llama and Qwen families demonstrate strong sim-to-real generalization. Notably, our Qwen2.5-7B-Base model trained within SearchGym surpasses the web-enhanced ASearcher baseline across nine diverse benchmarks by an average relative margin of 10.6%. Our results validate that high-fidelity simulation serves as a scalable and highly cost-effective methodology for developing capable search agents.