RLFactory: A Plug-and-Play Reinforcement Learning Post-Training Framework for LLM Multi-Turn Tool-Use

📅 2025-08-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit limited stability and adaptability in multi-turn reasoning tasks that require external tool invocation. To address this, we propose a plug-and-play reinforcement learning post-training framework featuring an asynchronous invoker and a decoupled tool-training architecture, formalizing the process as an observation-token-driven Markov Decision Process (MDP) with closed-loop control. Our method implements a "generate-parse-invoke-update" workflow and introduces a hybrid reward layer integrating rule-based validation, model-based judgment, and tool-verified feedback to support heterogeneous reward signals. This approach significantly lowers the deployment barrier for multi-turn tool orchestration: on Search-R1, Qwen3-4B achieves an NQ score of 0.486, surpassing larger models trained with similar techniques, while training throughput improves by 6.8×, demonstrating both efficiency and practicality.

📝 Abstract
Large language models excel at basic reasoning but struggle with tasks that require interaction with external tools. We present RLFactory, a plug-and-play reinforcement learning post-training framework for multi-round tool use. RLFactory tackles (i) tool-call stability and adaptability amid tool heterogeneity and interface issues via an asyncio-based asynchronous caller and a decoupled tool/training architecture, and (ii) diverse evaluation needs via a reward layer supporting rule-based, model-judgment, and tool-verification signals. It reconstructs the MDP by introducing observation markers from tool feedback, closing the loop among model, tools, and environment, and implements a generate-parse-invoke-update workflow for dynamic policy optimization. On Search-R1 with Qwen3-4B, RLFactory achieves a 0.486 test score on the Natural Questions (NQ) dataset, surpassing larger models trained with similar techniques (e.g., Qwen2.5-7B-Instruct-GRPO at 0.473), and increases training throughput by 6.8x. RLFactory provides a low-barrier, highly adaptable framework for strengthening multi-round tool use of LLMs in real-world scenarios. Code: https://github.com/Simple-Efficient/RL-Factory.
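The "generate-parse-invoke-update" workflow and the observation markers described above can be sketched as a simple rollout loop. This is an illustrative sketch only; the tag format, function names, and loop structure are assumptions, not RLFactory's actual API.

```python
import re

# Assumed tag format for tool calls emitted by the model, e.g.
# <tool_call>search("query")</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>(\w+)\((.*?)\)</tool_call>")

def parse_tool_calls(generation: str):
    """Extract (tool_name, argument) pairs from a model generation."""
    return TOOL_CALL_RE.findall(generation)

def rollout(generate, invoke_tool, prompt: str, max_turns: int = 4) -> str:
    """Run one multi-turn episode: generate text, parse tool calls,
    invoke tools, and feed results back as observation markers."""
    context = prompt
    for _ in range(max_turns):
        generation = generate(context)
        context += generation
        calls = parse_tool_calls(generation)
        if not calls:  # no tool call -> episode ends with a final answer
            break
        for name, arg in calls:
            result = invoke_tool(name, arg)
            # Tool feedback is wrapped in observation markers so that the
            # MDP treats it as environment state rather than model output.
            context += f"<observation>{result}</observation>"
    return context
```

The key point the abstract makes is the closed loop: each turn's tool output re-enters the context as an observation token, so policy optimization can condition on real environment feedback rather than on the model's own text alone.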
Problem

Research questions and friction points this paper is trying to address.

Improving tool-call stability and adaptability in LLMs
Addressing diverse evaluation needs for tool-use tasks
Enhancing multi-turn tool interaction with external environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asynchronous caller for tool stability
Reward layer with multiple evaluation signals
Generate-parse-invoke-update workflow for dynamic policy optimization