GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero

2026-05-14
Citations: 0
Influential: 0
AI Summary
Reinforcement-learning-based post-training is costly, and in-domain training generalizes poorly, which limits its broad applicability. This work proposes GRLO, a lightweight reinforcement learning from human feedback (RLHF) approach that trains the Qwen3-4B-Base model from scratch using only 5K prompts in an open-ended environment. It shows that small-scale RLHF can generalize to complex downstream tasks such as mathematical reasoning and code generation. With only 22.7 GPU hours, the method raises average cross-domain performance from 24.1 to 63.1 while using about 46× less data and 68× less compute than a strong in-domain baseline trained with reinforcement learning from verifiable rewards (RLVR). The resulting model is competitive with Qwen's released post-trained models, which required a much larger training cost.
Abstract
Post-training has become a crucial step for unlocking the capabilities of large language models, with reinforcement learning (RL) emerging as a critical paradigm. Recent RL-based post-training has increasingly split into two paradigms: reinforcement learning from human feedback (RLHF), which optimizes models using human preference signals in target domains, and reinforcement learning from verifiable rewards (RLVR), which operates in verifier-backed environments. The latter has dominated recent reasoning-oriented post-training because it delivers stronger gains and higher efficiency on domain-specific tasks (e.g., reasoning). However, although in-domain RL training achieves promising performance, it still requires a substantial amount of GPU compute, which remains a major barrier to broad adoption. In this work, we introduce GRLO, which studies the generalization ability of RLHF learned from scratch from a small set of interactions in open-ended environments and investigates whether the conversational abilities it explicitly acquires can implicitly transfer to downstream tasks such as mathematical reasoning and code generation. Specifically, on the Qwen3-4B-Base backbone, GRLO improves the average performance across all domains from 24.1 to 63.1 with only 5K prompts and 22.7 GPU hours, requiring about 46× less data and 68× less compute than a strong in-domain RLVR baseline. The resulting model is even competitive with Qwen's released post-trained models, which required a much larger training cost. Notably, a subsequent in-domain RLVR stage brings only selective gains, mainly on harder competition-math benchmarks. We hope GRLO offers a simple and efficient recipe for building broadly capable post-trained models. Our code and data will be available at: https://github.com/SJY8460/GRLO.
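
The efficiency figures in the abstract can be turned into a quick back-of-the-envelope check. The sketch below takes only the numbers quoted above (5K prompts, 22.7 GPU hours, about 46× less data, about 68× less compute, 24.1 to 63.1 average score) and derives what they imply for the in-domain RLVR baseline. The abstract does not state the baseline's budget, so the derived values are approximations, and the function and variable names are illustrative rather than taken from the GRLO code.

```python
# Figures quoted in the abstract; everything derived below is an estimate.
GRLO_PROMPTS = 5_000      # "only 5K prompts"
GRLO_GPU_HOURS = 22.7     # "22.7 GPU hours"
DATA_FACTOR = 46          # "about 46x less data" than the RLVR baseline
COMPUTE_FACTOR = 68       # "about 68x less compute" than the RLVR baseline
SCORE_BEFORE = 24.1       # average score of the base model
SCORE_AFTER = 63.1        # average score after GRLO


def implied_baseline_cost(prompts, gpu_hours, data_factor, compute_factor):
    """Scale the GRLO budget by the quoted factors to estimate the baseline's."""
    return prompts * data_factor, gpu_hours * compute_factor


baseline_prompts, baseline_gpu_hours = implied_baseline_cost(
    GRLO_PROMPTS, GRLO_GPU_HOURS, DATA_FACTOR, COMPUTE_FACTOR
)
absolute_gain = SCORE_AFTER - SCORE_BEFORE   # gain in average-score points
relative_gain = SCORE_AFTER / SCORE_BEFORE   # ratio of after to before

print(f"Implied baseline prompts:   ~{baseline_prompts:,}")
print(f"Implied baseline GPU hours: ~{baseline_gpu_hours:,.1f}")
print(f"Absolute gain: {absolute_gain:.1f} points")
print(f"Relative gain: {relative_gain:.2f}x")
```

Under these assumptions the baseline would have used roughly 230K prompts and about 1,540 GPU hours, and the reported improvement is 39.0 points, or about 2.6 times the base model's average score.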
Problem

Research questions and friction points this paper is trying to address.

Generalizable Reinforcement Learning
Open-Ended Environments
Zero-Shot Transfer
Post-training Efficiency
Downstream Task Generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

GRLO
generalizable reinforcement learning
open-ended environments
zero-shot transfer
efficient post-training