SkyRL-Agent: Efficient RL Training for Multi-turn LLM Agent

📅 2025-11-20
🤖 AI Summary
To address the low training efficiency and high computational cost of multi-turn, long-horizon reinforcement learning (RL) for LLM agents, this paper proposes SkyRL-Agent, a computationally efficient RL training framework that supports long-horizon reasoning. The method introduces three key components: (1) an asynchronous pipelined dispatcher achieving a 1.55× training speedup over naive asynchronous batching; (2) an AST-based code search tool integrated into a tool-augmented training recipe, significantly improving code navigation and sample efficiency; and (3) a lightweight, backend-agnostic tool integration architecture. Training Qwen3-32B end-to-end with pure RL yields SA-SWE-32B, which achieves 39.4% Pass@1 on SWE-Bench Verified, surpassing prior RL-based approaches while reducing training cost by more than 2×. Moreover, despite being trained only on software engineering tasks, SA-SWE-32B generalizes well to terminal operation, web browsing, and other complex real-world agentic tasks.

📝 Abstract
We introduce SkyRL-Agent, a framework for efficient, multi-turn, long-horizon agent training and evaluation. It provides efficient asynchronous dispatching, lightweight tool integration, and flexible backend interoperability, enabling seamless use with existing RL frameworks such as SkyRL-train, VeRL, and Tinker. Using SkyRL-Agent, we train SA-SWE-32B, a software engineering agent trained from Qwen3-32B (24.4% Pass@1) purely with reinforcement learning. We introduce two key components: an optimized asynchronous pipeline dispatcher that achieves a 1.55x speedup over naive asynchronous batching, and a tool-enhanced training recipe leveraging an AST-based search tool to facilitate code navigation, boost rollout Pass@K, and improve training efficiency. Together, these optimizations enable SA-SWE-32B to reach 39.4% Pass@1 on SWE-Bench Verified with more than 2x cost reduction compared to prior models reaching similar performance. Despite being trained solely on SWE tasks, SA-SWE-32B generalizes effectively to other agentic tasks, including Terminal-Bench, BrowseComp-Plus, and WebArena. We further demonstrate SkyRL-Agent's extensibility through case studies on deep research, computer use, and memory agents, each trained using a different training backend.
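The asynchronous pipeline dispatcher described above can be illustrated with a minimal sketch: instead of waiting for an entire batch of rollouts to finish (naive batching), completed trajectories are handed to the trainer as soon as they are ready, overlapping generation with training. This is an assumption-laden illustration of the general technique, not SkyRL-Agent's actual API; all names (`rollout`, `pipelined_dispatch`, `train_step`) are hypothetical.

```python
import asyncio

async def rollout(task_id: int) -> dict:
    # Stand-in for a multi-turn agent rollout; long-horizon tasks
    # finish at very different times, which is what pipelining exploits.
    await asyncio.sleep(0.01 * (task_id % 3))
    return {"task_id": task_id, "trajectory": f"traj-{task_id}"}

async def pipelined_dispatch(num_tasks: int, train_step) -> None:
    # Launch all rollouts concurrently, then consume each one the
    # moment it completes rather than waiting for the full batch.
    pending = {asyncio.create_task(rollout(i)) for i in range(num_tasks)}
    while pending:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            train_step(task.result())

def main() -> list:
    collected = []
    asyncio.run(pipelined_dispatch(6, collected.append))
    return collected
```

With uneven rollout lengths, the trainer is never idle waiting on the slowest trajectory in a batch, which is the intuition behind the reported 1.55× speedup over naive asynchronous batching.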
Problem

Research questions and friction points this paper is trying to address.

Efficient reinforcement learning training for multi-turn LLM agents
Optimizing software engineering agent performance on code tasks
Enabling flexible agent training across different specialized backends
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient asynchronous dispatching for RL training
Lightweight tool integration with AST-based search
Flexible backend interoperability with existing frameworks
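The AST-based search idea can be sketched in a few lines: rather than grepping raw text, the tool parses source files into a syntax tree and locates definitions structurally, so the agent can jump to the right function or class without scanning irrelevant matches. This is a hedged illustration using Python's standard `ast` module, not the paper's implementation; `find_definitions` is a hypothetical helper.

```python
import ast

def find_definitions(source: str, name: str) -> list:
    """Return (kind, lineno) for every function/class named `name`."""
    matches = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if node.name == name:
                matches.append(("function", node.lineno))
        elif isinstance(node, ast.ClassDef) and node.name == name:
            matches.append(("class", node.lineno))
    matches.sort(key=lambda m: m[1])  # report in source order
    return matches

code = """
class Repo:
    def resolve(self):
        pass

def resolve():
    return 42
"""

print(find_definitions(code, "resolve"))
# → [('function', 3), ('function', 6)]
```

A structural search like this distinguishes a method from a same-named module-level function, which plain text search cannot, and plausibly explains the boost in code-navigation accuracy and rollout Pass@K the paper reports.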