Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

📅 2026-04-13

🤖 AI Summary

This work addresses three key challenges in reinforcement learning post-training of large multimodal models: heterogeneous data flows, operational robustness at scale, and the trade-off between policy staleness and throughput. It proposes Relax, an open-source asynchronous reinforcement learning training engine built on three co-designed layers: a natively multimodal full-stack architecture supporting text, image, audio, and video inputs; role-level fault isolation, in which each RL role runs as an independent service; and asynchronous, decoupled training over the TransferQueue data bus, with support for R3 (Rollout Routing Replay) on MoE models. Experiments show a 1.20× end-to-end speedup over veRL for on-policy training on Qwen3-4B, fully asynchronous speedups over colocated execution of 1.76× on Qwen3-4B and 2.00× on Qwen3-Omni-30B, only 1.9% overhead with R3 enabled on MoE models, and stable convergence across multimodal tasks.

📝 Abstract

Reinforcement learning (RL) post-training has proven effective at unlocking reasoning, self-reflection, and tool-use capabilities in large language models. As models extend to omni-modal inputs and agentic multi-turn workflows, RL training systems face three interdependent challenges: heterogeneous data flows, operational robustness at scale, and the staleness-throughput trade-off. We present Relax (Reinforcement Engine Leveraging Agentic X-modality), an open-source RL training engine that addresses these challenges through three co-designed architectural layers. First, an omni-native architecture builds multimodal support into the full stack, from data preprocessing and modality-aware parallelism to inference generation, rather than retrofitting it onto a text-centric pipeline. Second, each RL role runs as an independent, fault-isolated service that can be scaled, recovered, and upgraded without global coordination. Third, service-level decoupling enables asynchronous training via the TransferQueue data bus, where a single staleness parameter smoothly interpolates among on-policy, near-on-policy, and fully asynchronous execution. Relax achieves a 1.20× end-to-end speedup over veRL on Qwen3-4B on-policy training. Its fully async mode delivers a 1.76× speedup over colocated execution on Qwen3-4B and a 2.00× speedup on Qwen3-Omni-30B, while all modes converge to the same reward level. Relax supports R3 (Rollout Routing Replay) [ma2025r3] for MoE models with only 1.9% overhead, compared to 32% degradation in veRL under the same configuration. It further demonstrates stable omni-modal RL convergence on Qwen3-Omni across image, text, and audio, sustaining over 2,000 steps on video without degradation. Relax is available at https://github.com/rednote-ai/Relax.
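
To make the role of the single staleness parameter concrete, here is a minimal Python sketch of a staleness-bounded sample buffer. It is an illustration only and not Relax's TransferQueue API: the names `Sample`, `StalenessBoundedBuffer`, `max_staleness`, and `get_batch` are hypothetical, and the policy of silently dropping over-stale samples is an assumption (a real system might throttle rollout workers instead). The abstract does not specify how the bound is enforced.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sample:
    policy_version: int  # version of the weights that generated this rollout
    payload: str


class StalenessBoundedBuffer:
    """Hypothetical buffer between rollout and trainer roles (illustrative only)."""

    def __init__(self, max_staleness: int):
        if max_staleness < 0:
            raise ValueError("max_staleness must be non-negative")
        # 0 admits only on-policy samples; small values give near-on-policy
        # training; a very large value approximates fully asynchronous execution.
        self.max_staleness = max_staleness
        self._queue = deque()

    def put(self, sample: Sample) -> None:
        """Called by the rollout role whenever a rollout finishes."""
        self._queue.append(sample)

    def get_batch(self, trainer_version: int, batch_size: int) -> list:
        """Return up to batch_size samples whose version lag is within the bound.

        Assumption: samples that are too stale are discarded, not re-queued.
        """
        batch = []
        while self._queue and len(batch) < batch_size:
            sample = self._queue.popleft()
            if trainer_version - sample.policy_version <= self.max_staleness:
                batch.append(sample)
        return batch
```

With `max_staleness=0` the trainer at version 2 accepts only rollouts generated by version 2 (on-policy); with `max_staleness=1` it also accepts version 1 (near-on-policy); a very large bound accepts everything, which corresponds to the fully asynchronous end of the range described above.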

Problem

Research questions and friction points this paper is trying to address.

reinforcement learning

omni-modal

asynchronous training

scalability

data heterogeneity

Innovation

Methods, ideas, or system contributions that make the work stand out.

asynchronous reinforcement learning

omni-modal training

fault-isolated microservices