SkyRL-Agent: Efficient RL Training for Multi-turn LLM Agent

📅 2025-11-20
🤖 AI Summary
To address the low training efficiency and high computational cost of multi-turn, long-horizon reinforcement learning (RL) for LLM agents, this paper proposes SkyRL-Agent, a computationally efficient RL training framework that supports long-horizon reasoning. The method introduces three key components: (1) an asynchronous pipelined dispatcher achieving a 1.55× training speedup over naive asynchronous batching; (2) an AST-based code search tool integrated into a tool-augmented training recipe, significantly improving code navigation and sample efficiency; and (3) a lightweight, backend-agnostic tool integration architecture. Training Qwen3-32B end-to-end with pure RL yields SA-SWE-32B, which achieves 39.4% Pass@1 on SWE-Bench Verified, surpassing prior RL-based approaches while reducing training cost by more than 2×. Moreover, despite being trained only on software engineering tasks, SA-SWE-32B generalizes well to terminal operation, web browsing, and other complex real-world agentic tasks.

📝 Abstract
We introduce SkyRL-Agent, a framework for efficient, multi-turn, long-horizon agent training and evaluation. It provides efficient asynchronous dispatching, lightweight tool integration, and flexible backend interoperability, enabling seamless use with existing RL frameworks such as SkyRL-train, VeRL, and Tinker. Using SkyRL-Agent, we train SA-SWE-32B, a software engineering agent trained from Qwen3-32B (24.4% Pass@1) purely with reinforcement learning. We introduce two key components: an optimized asynchronous pipeline dispatcher that achieves a 1.55x speedup over naive asynchronous batching, and a tool-enhanced training recipe leveraging an AST-based search tool to facilitate code navigation, boost rollout Pass@K, and improve training efficiency. Together, these optimizations enable SA-SWE-32B to reach 39.4% Pass@1 on SWE-Bench Verified with more than 2x cost reduction compared to prior models reaching similar performance. Despite being trained solely on SWE tasks, SA-SWE-32B generalizes effectively to other agentic tasks, including Terminal-Bench, BrowseComp-Plus, and WebArena. We further demonstrate SkyRL-Agent's extensibility through case studies on deep research, computer use, and memory agents, each trained using a different training backend.
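The asynchronous pipeline dispatcher described above can be illustrated with a minimal sketch: instead of waiting for an entire batch of rollouts to finish (naive batching), completed trajectories are handed to the trainer as soon as they are ready, overlapping generation with training. This is an assumption-laden illustration of the general technique, not SkyRL-Agent's actual API; all names (`rollout`, `pipelined_dispatch`, `train_step`) are hypothetical.

```python
import asyncio

async def rollout(task_id: int) -> dict:
    # Stand-in for a multi-turn agent rollout; long-horizon tasks
    # finish at very different times, which is what pipelining exploits.
    await asyncio.sleep(0.01 * (task_id % 3))
    return {"task_id": task_id, "trajectory": f"traj-{task_id}"}

async def pipelined_dispatch(num_tasks: int, train_step) -> None:
    # Launch all rollouts concurrently, then consume each one the
    # moment it completes rather than waiting for the full batch.
    pending = {asyncio.create_task(rollout(i)) for i in range(num_tasks)}
    while pending:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        for task in done:
            train_step(task.result())

def main() -> list:
    collected = []
    asyncio.run(pipelined_dispatch(6, collected.append))
    return collected
```

With uneven rollout lengths, the trainer is never idle waiting on the slowest trajectory in a batch, which is the intuition behind the reported 1.55× speedup over naive asynchronous batching.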
Problem

Research questions and friction points this paper is trying to address.

Efficient reinforcement learning training for multi-turn LLM agents
Optimizing software engineering agent performance on code tasks
Enabling flexible agent training across different specialized backends
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient asynchronous dispatching for RL training
Lightweight tool integration with AST-based search
Flexible backend interoperability with existing frameworks
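The AST-based search idea can be sketched in a few lines: rather than grepping raw text, the tool parses source files into a syntax tree and locates definitions structurally, so the agent can jump to the right function or class without scanning irrelevant matches. This is a hedged illustration using Python's standard `ast` module, not the paper's implementation; `find_definitions` is a hypothetical helper.

```python
import ast

def find_definitions(source: str, name: str) -> list:
    """Return (kind, lineno) for every function/class named `name`."""
    matches = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if node.name == name:
                matches.append(("function", node.lineno))
        elif isinstance(node, ast.ClassDef) and node.name == name:
            matches.append(("class", node.lineno))
    matches.sort(key=lambda m: m[1])  # report in source order
    return matches

code = """
class Repo:
    def resolve(self):
        pass

def resolve():
    return 42
"""

print(find_definitions(code, "resolve"))
# → [('function', 3), ('function', 6)]
```

A structural search like this distinguishes a method from a same-named module-level function, which plain text search cannot, and plausibly explains the boost in code-navigation accuracy and rollout Pass@K the paper reports.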