🤖 AI Summary
To address out-of-memory (OOM) errors and cross-device communication bottlenecks arising from long contexts in agent-based reinforcement learning (RL) with large language models (LLMs), this paper proposes an efficient, scalable training system. Methodologically, it introduces (1) a dynamic parallelism selector that adaptively schedules model parallelism, data parallelism, and sequence chunking based on input sequence length and real-time hardware state; and (2) a layout-aware decentralized data distributor that eliminates centralized intermediate tensor transfers, thereby reducing communication overhead. The approach lifts the fixed-context-length constraint inherent in conventional systems, enabling multi-turn interaction and tool use without compromising training stability. Empirically, it significantly mitigates OOM failures, improves training throughput, and enhances system scalability, facilitating robust, large-scale training of intelligent LLM agents.
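The paper does not publish the selector's exact policy, but the idea of choosing a parallelism layout from sequence length and free memory can be sketched with a simple heuristic. Everything here is an illustrative assumption: the function name, the per-token activation estimate, and the power-of-two search are not from the paper.

```python
def select_parallelism(seq_len, free_mem, world_size=8, bytes_per_token=16384):
    """Hypothetical heuristic: pick tensor-parallel (TP) degree, data-parallel
    (DP) degree, and sequence-chunk count so the estimated activation memory
    for one sequence fits in per-device free memory.

    bytes_per_token is a rough activation-footprint estimate (an assumption,
    not a measured value)."""
    need = seq_len * bytes_per_token  # estimated activation bytes for the sequence

    # Grow TP (splitting activations across devices) until the per-device
    # share fits, or we run out of devices.
    tp = 1
    while tp < world_size and need / tp > free_mem:
        tp *= 2

    dp = world_size // tp  # remaining devices replicate as data parallelism

    # If even full TP is not enough, process the sequence in chunks.
    chunks = 1
    while need / (tp * chunks) > free_mem:
        chunks *= 2

    return {"tp": tp, "dp": dp, "seq_chunks": chunks}
```

For short sequences this picks pure data parallelism; as the context grows during multi-turn rollouts, the same call shifts devices toward tensor parallelism and finally falls back to sequence chunking, which is the adaptive behavior the summary describes.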
📝 Abstract
Reinforcement learning (RL) has become a pivotal component of large language model (LLM) post-training, and agentic RL extends this paradigm to LLMs that operate as agents through multi-turn interaction and tool use. Scaling such systems exposes two practical bottlenecks: (1) context length grows rapidly during training, inflating memory usage and latency and triggering out-of-memory (OOM) failures; and (2) intermediate tensors accumulate with context length, making cross-device data movement a major system bottleneck.
We present EARL, a scalable system for efficient agentic RL. EARL introduces a parallelism selector that dynamically adapts model and training parallelism across RL stages based on sequence length and system load, and a data dispatcher that performs layout-aware, decentralized exchange of intermediate data batches. Together, these components increase throughput, reduce long-context failures, and enable stable large-scale training of agentic LLMs without imposing hard limits or penalties on context length.
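The abstract's "layout-aware, decentralized exchange" contrasts with routing intermediate tensors through a central node. A minimal sketch of the planning step, under assumed names and data shapes (the paper's actual interfaces are not given): each worker consults a globally known layout mapping shard IDs to next-stage owners, and builds point-to-point send lists directly, so no rank has to gather and redistribute everything.

```python
def plan_exchange(my_rank, my_shards, layout):
    """Hypothetical decentralized dispatch planner.

    my_shards: dict mapping shard_id -> tensor held locally on this rank.
    layout:    dict mapping shard_id -> owner rank in the next RL stage
               (shared by all ranks, so every peer plans consistently).

    Returns dict mapping destination rank -> list of (shard_id, tensor)
    to send point-to-point; shards already owned locally are kept in place.
    """
    sends = {}
    for shard_id, tensor in my_shards.items():
        dst = layout[shard_id]
        if dst != my_rank:  # only move shards whose next-stage owner differs
            sends.setdefault(dst, []).append((shard_id, tensor))
    return sends
```

Because every rank derives the same plan from the shared layout, the exchange needs only direct peer-to-peer transfers of the shards that actually change owners, which is the communication saving attributed to the dispatcher.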