Agent RL Scaling Law: Agent RL with Spontaneous Code Execution for Mathematical Problem Solving

πŸ“… 2025-05-12
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Large language models (LLMs) exhibit limited performance on mathematical reasoning tasks that require precise computation. To address this, the authors propose Zero-shot Tool-Integrated Reasoning (ZeroTIR), a framework that trains base LLMs via purely outcome-based reinforcement learning to autonomously generate and execute Python code for solving mathematical problems, without any supervised signals for tool usage. The key contribution is the discovery and quantification of scaling laws in agent-based RL: spontaneous code execution frequency, response length, and answer accuracy all co-evolve predictably with training steps, revealing a predictable emergence mechanism for tool-augmented reasoning. Experiments demonstrate that ZeroTIR significantly outperforms the tool-free ZeroRL baseline across multiple mathematical benchmarks. A secure sandbox environment ensures safe code execution, and the source code and evaluation benchmarks are publicly released.

πŸ“ Abstract
Large Language Models (LLMs) often struggle with mathematical reasoning tasks requiring precise, verifiable computation. While Reinforcement Learning (RL) from outcome-based rewards enhances text-based reasoning, understanding how agents autonomously learn to leverage external tools like code execution remains crucial. We investigate RL from outcome-based rewards for Tool-Integrated Reasoning, ZeroTIR, training base LLMs to spontaneously generate and execute Python code for mathematical problems without supervised tool-use examples. Our central contribution is demonstrating that, as RL training progresses, key metrics scale predictably. Specifically, we observe strong positive correlations where increased training steps lead to increases in the spontaneous code execution frequency, the average response length, and, critically, the final task accuracy. This suggests a quantifiable relationship between computational effort invested in training and the emergence of effective, tool-augmented reasoning strategies. We implement a robust framework featuring a decoupled code execution environment and validate our findings across standard RL algorithms and frameworks. Experiments show ZeroTIR significantly surpasses non-tool ZeroRL baselines on challenging math benchmarks. Our findings provide a foundational understanding of how autonomous tool use is acquired and scales within Agent RL, offering a reproducible benchmark for future studies. Code is released at https://github.com/Anonymize-Author/AgentRL.
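The outcome-based reward described in the abstract can be sketched as a verifier that scores only the final answer, with no reward signal for tool use itself. The extraction rule below (matching a `\boxed{}` answer) is a hypothetical simplification for illustration, not the paper's exact implementation.

```python
import re


def outcome_reward(response: str, ground_truth: str) -> float:
    """Binary outcome reward: 1.0 iff the final boxed answer matches.

    Hypothetical sketch: assumes answers appear as \\boxed{...}; the
    paper's actual answer-extraction and matching rules may differ.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        # No parseable answer: treated the same as a wrong answer.
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
```

Because the reward never references whether code was executed, any increase in code-execution frequency during training is emergent rather than supervised.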
Problem

Research questions and friction points this paper is trying to address.

LLMs struggle with mathematical tasks that require precise, verifiable computation
How agents autonomously learn to use external tools such as code execution remains poorly understood
Can outcome-based RL elicit tool-integrated reasoning without supervised tool-use examples?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pure outcome-reward RL trains base LLMs to spontaneously generate and execute Python code
Scaling laws link training steps to code execution frequency, response length, and accuracy
A decoupled, sandboxed execution environment ensures robust and safe tool integration
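A decoupled execution environment can be approximated with a minimal subprocess-based sandbox: model-generated code runs in a separate interpreter with a wall-clock timeout, isolating the trainer from crashes and infinite loops. The function name `run_in_sandbox` and the timeout policy are assumptions for illustration; a production sandbox would add resource limits plus filesystem and network isolation.

```python
import subprocess
import sys
import tempfile


def run_in_sandbox(code: str, timeout_s: float = 5.0) -> str:
    """Execute model-generated Python in a separate process with a timeout.

    Minimal sketch of a decoupled execution environment, not the
    paper's implementation: returns stdout on success, stderr on
    failure, and a timeout message if the limit is exceeded.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return proc.stdout if proc.returncode == 0 else proc.stderr
    except subprocess.TimeoutExpired:
        return "TimeoutError: execution exceeded limit"
```

The captured output would be appended to the model's context so the rollout can continue reasoning from the tool result.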
Authors
Xinji Mai (Fudan University)
Haotian Xu (Xiaohongshu)
W. Xing (Xiaohongshu)
Weinong Wang (Xi'an Jiaotong University)
Yingying Zhang (East China Normal University)
Wenqiang Zhang (Fudan University)