Demystifying Reinforcement Learning in Agentic Reasoning

📅 2025-10-13
🤖 AI Summary
Design principles for reinforcement learning (RL) in agentic reasoning remain poorly understood, limiting the effectiveness of tool use in small language models. Method: We systematically investigate three critical dimensions (data construction, algorithm design, and reasoning patterns), introducing (i) a high-diversity, model-aware dataset built from real end-to-end tool-use trajectories; (ii) an exploration-friendly training strategy; and (iii) evidence that sparse, deliberate tool invocation outperforms frequent tool calls or verbose self-reasoning. Our approach combines high-quality supervised fine-tuning (SFT) with RL enhanced by reward shaping, policy entropy regularization, and gradient clipping. Contribution/Results: Experiments demonstrate that our 4B-parameter agent significantly surpasses a 32B baseline on AIME2024/2025, GPQA-Diamond, and LiveCodeBench-v6, establishing an efficient and practical RL recipe for agentic reasoning.
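The exploration-friendly ingredients named above (a raised upper clipping bound, or "clip higher", plus entropy regularization) can be illustrated with a short PPO-style token-level surrogate. This is a minimal sketch, not the paper's implementation; `eps_low`, `eps_high`, and `entropy_coef` are placeholder values, not reported hyperparameters.

```python
import torch

def clipped_pg_loss(logp_new, logp_old, advantages, entropy,
                    eps_low=0.2, eps_high=0.28, entropy_coef=0.001):
    """PPO-style surrogate with asymmetric ("clip higher") clipping.

    Raising only the upper bound lets low-probability tokens gain mass
    faster, which helps keep policy entropy (exploration) from collapsing.
    """
    ratio = torch.exp(logp_new - logp_old)                       # per-token importance ratio
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)  # asymmetric clip range
    pg_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    return pg_loss - entropy_coef * entropy.mean()               # entropy bonus sustains exploration
```

The design intuition is that symmetric clipping suppresses upward updates on rare tokens just as hard as downward ones; widening only the upper bound biases training toward exploration without destabilizing it.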

📝 Abstract
Recently, the emergence of agentic RL has shown that RL can also effectively improve the agentic reasoning ability of LLMs, yet the key design principles and optimal practices remain unclear. In this work, we conduct a comprehensive and systematic investigation to demystify reinforcement learning in agentic reasoning from three key perspectives: data, algorithm, and reasoning mode. We highlight our key insights: (i) Replacing stitched synthetic trajectories with real end-to-end tool-use trajectories yields a far stronger SFT initialization; high-diversity, model-aware datasets sustain exploration and markedly improve RL performance. (ii) Exploration-friendly techniques, such as clip-higher, overlong reward shaping, and maintaining adequate policy entropy, are crucial for agentic RL and improve training efficiency. (iii) A deliberative strategy with fewer tool calls outperforms frequent tool calls or verbose self-reasoning, improving both tool efficiency and final accuracy. Together, these simple practices consistently enhance agentic reasoning and training efficiency, achieving strong results on challenging benchmarks with smaller models and establishing a practical baseline for future agentic RL research. Beyond these empirical insights, we further contribute a high-quality, real end-to-end agentic SFT dataset along with a high-quality RL dataset, and demonstrate the effectiveness of our insights in boosting the agentic reasoning ability of LLMs across four challenging benchmarks: AIME2024/AIME2025, GPQA-Diamond, and LiveCodeBench-v6. With our recipes, 4B-sized models can achieve agentic reasoning performance superior to that of 32B-sized models. Code and models: https://github.com/Gen-Verse/Open-AgentRL
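Overlong reward shaping, mentioned in insight (ii), is commonly implemented as a soft length penalty rather than an abrupt truncation penalty (a formulation popularized by DAPO). Below is a minimal sketch under assumed values: `max_len` and `buffer` are illustrative, not the paper's settings.

```python
def shape_overlong_reward(reward, length, max_len=8192, buffer=1024):
    """Soft length penalty: responses entering the buffer zone before
    max_len are penalized linearly instead of being punished abruptly."""
    if length <= max_len - buffer:
        return reward                                        # within budget: untouched
    if length >= max_len:
        return reward - 1.0                                  # at or over the cap: full penalty
    return reward - (length - (max_len - buffer)) / buffer   # linear ramp toward -1
```

The gradual ramp avoids destabilizing training with a sharp reward cliff while still discouraging verbose self-reasoning.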
Problem

Research questions and friction points this paper is trying to address.

Identifying key design principles for reinforcement learning in agentic reasoning
Optimizing data construction, algorithm design, and reasoning modes to enhance the agentic abilities of LLMs
Establishing effective practices that improve training efficiency and reasoning performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using real end-to-end tool-use trajectories for SFT initialization
Applying exploration-friendly techniques (clip-higher, overlong reward shaping, entropy maintenance) to improve training efficiency
Employing a deliberative strategy with fewer tool calls (see the sketch after this list)
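The deliberative strategy can be sketched as an agent loop that reasons before acting and operates under a small tool-call budget. The `llm` and `tools` interfaces here are hypothetical, and the budget of 3 is illustrative; this shows the control flow, not the paper's actual agent code.

```python
import json

def deliberative_answer(llm, tools, question, max_tool_calls=3):
    """Reason first, invoke tools sparingly, stop once an answer is reached."""
    history = [{"role": "user", "content": question}]
    for _ in range(max_tool_calls):
        reply = llm.generate(history)               # extended reasoning before acting
        if reply.tool_call is None:                 # model chose to answer directly
            return reply.content
        history.append({"role": "assistant", "content": reply.content})
        result = tools[reply.tool_call.name](**reply.tool_call.args)  # one deliberate call
        history.append({"role": "tool", "content": json.dumps(result)})
    return llm.generate(history).content            # budget exhausted: commit to an answer
```

The budget forces the policy to invoke tools only when reasoning alone is insufficient, matching the finding that sparse, deliberate invocation beats frequent calls.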