🤖 AI Summary
To address out-of-memory (OOM) errors and cross-device communication bottlenecks arising from long contexts in agent-based reinforcement learning (RL) with large language models (LLMs), this paper proposes an efficient, scalable training system. Methodologically, it introduces (1) a dynamic parallelism selector that adaptively schedules model parallelism, data parallelism, and sequence chunking based on input sequence length and real-time hardware state; and (2) a layout-aware decentralized data distributor that eliminates centralized intermediate tensor transfers, thereby reducing communication overhead. The approach lifts the fixed-context-length constraint inherent in conventional systems, enabling multi-turn interaction and tool use without compromising training stability. Empirically, it significantly mitigates OOM failures, improves training throughput, and enhances system scalability, facilitating robust, large-scale training of intelligent LLM agents.
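The paper does not publish the selector's exact policy, but the idea of choosing a parallelism layout from sequence length and free memory can be sketched with a simple heuristic. Everything here is an illustrative assumption: the function name, the per-token activation estimate, and the power-of-two search are not from the paper.

```python
def select_parallelism(seq_len, free_mem, world_size=8, bytes_per_token=16384):
    """Hypothetical heuristic: pick tensor-parallel (TP) degree, data-parallel
    (DP) degree, and sequence-chunk count so the estimated activation memory
    for one sequence fits in per-device free memory.

    bytes_per_token is a rough activation-footprint estimate (an assumption,
    not a measured value)."""
    need = seq_len * bytes_per_token  # estimated activation bytes for the sequence

    # Grow TP (splitting activations across devices) until the per-device
    # share fits, or we run out of devices.
    tp = 1
    while tp < world_size and need / tp > free_mem:
        tp *= 2

    dp = world_size // tp  # remaining devices replicate as data parallelism

    # If even full TP is not enough, process the sequence in chunks.
    chunks = 1
    while need / (tp * chunks) > free_mem:
        chunks *= 2

    return {"tp": tp, "dp": dp, "seq_chunks": chunks}
```

For short sequences this picks pure data parallelism; as the context grows during multi-turn rollouts, the same call shifts devices toward tensor parallelism and finally falls back to sequence chunking, which is the adaptive behavior the summary describes.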
📝 Abstract
Reinforcement learning (RL) has become a pivotal component of large language model (LLM) post-training, and agentic RL extends this paradigm to LLMs that operate as agents through multi-turn interaction and tool use. Scaling such systems exposes two practical bottlenecks: (1) context length grows rapidly during training, inflating memory usage and latency and triggering out-of-memory (OOM) failures; and (2) intermediate tensors accumulate with context length, making cross-device data movement a major system bottleneck.
We present EARL, a scalable system for efficient agentic RL. EARL introduces a parallelism selector that dynamically adapts model and training parallelism across RL stages based on sequence length and system load, and a data dispatcher that performs layout-aware, decentralized exchange of intermediate data batches. Together, these components increase throughput, reduce long-context failures, and enable stable large-scale training of agentic LLMs without imposing hard limits or penalties on context length.
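The abstract's "layout-aware, decentralized exchange" contrasts with routing intermediate tensors through a central node. A minimal sketch of the planning step, under assumed names and data shapes (the paper's actual interfaces are not given): each worker consults a globally known layout mapping shard IDs to next-stage owners, and builds point-to-point send lists directly, so no rank has to gather and redistribute everything.

```python
def plan_exchange(my_rank, my_shards, layout):
    """Hypothetical decentralized dispatch planner.

    my_shards: dict mapping shard_id -> tensor held locally on this rank.
    layout:    dict mapping shard_id -> owner rank in the next RL stage
               (shared by all ranks, so every peer plans consistently).

    Returns dict mapping destination rank -> list of (shard_id, tensor)
    to send point-to-point; shards already owned locally are kept in place.
    """
    sends = {}
    for shard_id, tensor in my_shards.items():
        dst = layout[shard_id]
        if dst != my_rank:  # only move shards whose next-stage owner differs
            sends.setdefault(dst, []).append((shard_id, tensor))
    return sends
```

Because every rank derives the same plan from the shared layout, the exchange needs only direct peer-to-peer transfers of the shards that actually change owners, which is the communication saving attributed to the dispatcher.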