🤖 AI Summary
Large language models (LLMs) struggle with reliable, multi-step tool invocation and precise execution in real-world tasks. Method: This paper proposes Tool-R1, a tool-augmented reinforcement learning framework that generates executable Python code to orchestrate compositional tool calls, supporting user-defined tool integration and cross-step variable sharing. It introduces an outcome-based reward function grounded in LLM answer judgment and code execution success, together with a dynamic sample queue that caches and reuses high-quality trajectories, improving policy optimization efficiency and significantly reducing online sampling overhead. Contribution/Results: On the GAIA benchmark, Tool-R1 achieves roughly a 10% absolute accuracy improvement over strong baselines, with particularly large gains on complex, multi-step tasks, demonstrating improved robustness and generalization. It establishes an efficient, scalable paradigm for LLM-driven, tool-coordinated reasoning.
📝 Abstract
Large language models (LLMs) have demonstrated strong capabilities in language understanding and reasoning, yet they remain limited when tackling real-world tasks that require up-to-date knowledge, precise operations, or specialized tool use. To address this, we propose Tool-R1, a reinforcement learning framework that enables LLMs to perform general, compositional, and multi-step tool use by generating executable Python code. Tool-R1 supports integration of user-defined tools and standard libraries, with variable sharing across steps to construct coherent workflows. An outcome-based reward function, combining LLM-based answer judgment and code execution success, guides policy optimization. To improve training efficiency, we maintain a dynamic sample queue to cache and reuse high-quality trajectories, reducing the overhead of costly online sampling. Experiments on the GAIA benchmark show that Tool-R1 substantially improves both accuracy and robustness, achieving about a 10% gain over strong baselines, with larger improvements on complex multi-step tasks. These results highlight the potential of Tool-R1 for enabling reliable and efficient tool-augmented reasoning in real-world applications. Our code will be available at https://github.com/YBYBZhang/Tool-R1.
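The abstract describes an outcome-based reward that combines an LLM judge's verdict on the final answer with whether the generated tool-call code executed successfully. A minimal sketch of how such a reward might be composed is below; the 0.8/0.2 weighting and the function name are illustrative assumptions, not the paper's exact formulation.

```python
def outcome_reward(answer_correct: bool, code_executed: bool) -> float:
    """Sketch of an outcome-based reward (weights are assumed, not from the paper).

    answer_correct: verdict of an LLM judge comparing the model's final
                    answer against the reference answer.
    code_executed:  whether the generated Python tool-call code ran
                    without raising an exception.
    """
    judge_term = 1.0 if answer_correct else 0.0
    exec_term = 1.0 if code_executed else 0.0
    # Answer correctness dominates; execution success gives partial credit,
    # keeping the reward sparse but not all-or-nothing.
    return 0.8 * judge_term + 0.2 * exec_term
```

Under this sketch, a trajectory whose code runs but whose answer is judged wrong still earns a small positive reward, which can make policy optimization less brittle than a purely binary signal.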
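The dynamic sample queue mentioned above caches high-quality trajectories so they can be reused instead of re-sampled online. One plausible realization, assuming a fixed capacity with lowest-reward eviction and uniform resampling (details the abstract does not specify), is:

```python
import heapq
import itertools
import random

class TrajectoryQueue:
    """Bounded cache of high-reward trajectories (capacity and eviction
    policy are illustrative assumptions, not the paper's design)."""

    def __init__(self, capacity: int = 512):
        self.capacity = capacity
        self._heap = []                     # min-heap keyed on reward
        self._counter = itertools.count()   # tie-breaker for equal rewards

    def push(self, trajectory, reward: float) -> None:
        item = (reward, next(self._counter), trajectory)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif reward > self._heap[0][0]:
            # Queue is full: evict the current lowest-reward trajectory.
            heapq.heapreplace(self._heap, item)

    def sample(self, k: int):
        """Draw up to k cached trajectories to mix into a training batch."""
        picked = random.sample(self._heap, min(k, len(self._heap)))
        return [traj for _, _, traj in picked]
```

Mixing `sample()` output into each training batch alongside fresh rollouts is one way such a queue could amortize the cost of online sampling.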