🤖 AI Summary
In on-policy reinforcement learning, tight coupling between inference and training—executed synchronously—induces severe computational bottlenecks and limits throughput.
Method: This paper proposes a periodic asynchronous architecture that decouples inference and training via separate deployment, periodic asynchronous data loading, a unified three-model training framework, and shared prompt attention masks.
Contribution/Results: The design fully preserves algorithmic accuracy, matching synchronous baselines exactly, while eliminating redundant computation and reducing inter-component communication overhead. Evaluated on an NPU platform, it achieves over 3× higher end-to-end training throughput, the first such performance gain reported at equivalent accuracy. By enabling elastic scaling of individual components without sacrificing precision, the approach establishes a scalable, high-fidelity asynchronous paradigm for industrial-strength reinforcement learning systems.
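The periodic asynchronous data loading described above can be sketched as a producer/consumer loop: a separately deployed inference engine streams rollouts into a queue, and the trainer blocks until a full period's worth of data is available before taking a step, so each period trains only on data generated by the current policy. This is a minimal illustration, not the paper's implementation; the period size, rollout format, and function names below are assumptions.

```python
import queue
import threading

# Assumed period size for illustration; not taken from the paper.
ROLLOUTS_PER_PERIOD = 4

rollout_queue = queue.Queue()

def inference_worker(n_rollouts):
    """Stand-in for a separately deployed inference engine that
    continuously generates rollouts with the current policy weights."""
    for i in range(n_rollouts):
        rollout_queue.put({"prompt_id": i, "tokens": [i, i + 1]})

def train_one_period():
    """Block until one full period of rollouts has arrived, then train.

    Draining exactly one period at a time keeps the data on-policy with
    respect to the weights used to generate it, matching the synchronous
    baseline while letting inference run concurrently."""
    batch = [rollout_queue.get() for _ in range(ROLLOUTS_PER_PERIOD)]
    # ... run a training step on `batch`, then push the updated weights
    # back to the inference deployment before the next period starts ...
    return batch

producer = threading.Thread(target=inference_worker, args=(ROLLOUTS_PER_PERIOD,))
producer.start()
batch = train_one_period()
producer.join()
print(len(batch))  # 4
```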
📝 Abstract
Since the introduction of the GRPO algorithm, reinforcement learning (RL) has attracted increasing attention, with growing efforts to reproduce and apply it. However, training efficiency remains a critical challenge. In mainstream RL frameworks, inference and training are typically deployed on the same devices. While this approach reduces costs through resource consolidation, its synchronous execution couples the two computations and prevents inference and training from running concurrently. In this study, we return to the strategy of deploying inference and training separately. By improving the data loader, we transform the conventional synchronous architecture into a periodically asynchronous framework that allows demand-driven, independent, and elastic scaling of each component, while algorithmic accuracy remains exactly equivalent to the synchronous method; both are on-policy. Notably, we apply a unified tri-model architecture in the training phase and propose a shared-prompt attention mask to reduce repeated computation. In practice, these techniques achieve at least a threefold overall performance improvement for RL training on NPU platforms, indicating their potential for widespread application.
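One plausible reading of the shared-prompt attention mask is this: GRPO samples several responses per prompt, so the responses can be packed into one sequence behind a single copy of the prompt, with a mask that lets each response attend to the shared prompt and causally to itself, but not to sibling responses. The sketch below illustrates that mask construction; the function name and layout are assumptions, not the paper's code.

```python
import numpy as np

def shared_prompt_mask(prompt_len, resp_lens):
    """Boolean attention mask (True = may attend) for one prompt packed
    with several responses, so prompt activations are computed once
    instead of once per response."""
    total = prompt_len + sum(resp_lens)
    mask = np.zeros((total, total), dtype=bool)
    # Causal attention within the shared prompt itself.
    mask[:prompt_len, :prompt_len] = np.tril(
        np.ones((prompt_len, prompt_len), dtype=bool))
    offset = prompt_len
    for rl in resp_lens:
        # Every response token sees the full shared prompt...
        mask[offset:offset + rl, :prompt_len] = True
        # ...and attends causally within its own response only,
        # never to tokens of sibling responses.
        mask[offset:offset + rl, offset:offset + rl] = np.tril(
            np.ones((rl, rl), dtype=bool))
        offset += rl
    return mask

# Prompt tokens 0-2, response 1 tokens 3-4, response 2 tokens 5-6.
m = shared_prompt_mask(3, [2, 2])
print(m[4, 5], m[5, 2])  # False True
```

Token 4 (in response 1) cannot attend to token 5 (in response 2), while token 5 can attend to prompt token 2, which is the isolation the packing requires.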