RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments

📅 2025-11-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Reinforcement learning for large language models loses learning signal when training data follows a static distribution, because problems end up either too easy or too hard for the current policy. This paper proposes RLVE, an approach built on verifiable environments that procedurally generate problems and provide algorithmically verifiable rewards, with each environment adapting its problem difficulty distribution to the policy's capabilities as training progresses. The authors also build RLVE-Gym, a suite of 400 manually engineered verifiable environments that supports joint training across environments. Starting from one of the strongest 1.5B reasoning LMs, joint training across all 400 environments yields a 3.37% absolute average improvement across six reasoning benchmarks. By comparison, continuing that model's original RL training yields only a 0.49% average absolute gain despite using over 3x more compute.

📝 Abstract

We introduce Reinforcement Learning (RL) with Adaptive Verifiable Environments (RLVE), an approach using verifiable environments that procedurally generate problems and provide algorithmically verifiable rewards, to scale up RL for language models (LMs). RLVE enables each verifiable environment to dynamically adapt its problem difficulty distribution to the policy model's capabilities as training progresses. In contrast, static data distributions often lead to vanishing learning signals when problems are either too easy or too hard for the policy. To implement RLVE, we create RLVE-Gym, a large-scale suite of 400 verifiable environments carefully developed through manual environment engineering. Using RLVE-Gym, we show that environment scaling, i.e., expanding the collection of training environments, consistently improves generalizable reasoning capabilities. RLVE with joint training across all 400 environments in RLVE-Gym yields a 3.37% absolute average improvement across six reasoning benchmarks, starting from one of the strongest 1.5B reasoning LMs. By comparison, continuing this LM's original RL training yields only a 0.49% average absolute gain despite using over 3x more compute. We release our code publicly.

Problem

Research questions and friction points this paper is trying to address.

Scaling reinforcement learning for language models with adaptive environments

Addressing vanishing learning signals from static data distributions

Improving generalizable reasoning through verifiable environment scaling
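
To make the vanishing-signal problem concrete, here is a minimal sketch that assumes a group-normalized advantage of the kind used in GRPO-style training. The summary above does not name the RL algorithm, so the estimator and the function name `group_advantages` are illustrative assumptions, not the paper's method. When every rollout for a problem receives the same reward, every advantage is zero and the problem contributes no gradient.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumed estimator for illustration only; not taken from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Problem far too easy: all 8 rollouts are correct, so all advantages are 0.
too_easy = group_advantages([1.0] * 8)

# Problem far too hard: all 8 rollouts fail, so all advantages are 0 again.
too_hard = group_advantages([0.0] * 8)

# Difficulty matched to the policy: mixed outcomes give nonzero advantages.
matched = group_advantages([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
```

Only the `matched` group carries a learning signal, which is the motivation for keeping problem difficulty close to what the policy can currently solve.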

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive verifiable environments scale RL training

Procedurally generated problems with algorithmic rewards

Dynamic difficulty adjustment based on model capabilities
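
The three ideas above can be sketched together in a toy form. Everything in this block is hypothetical: the `SortingEnv` task, the `AdaptiveDifficulty` controller, its threshold and window parameters, and the `toy_policy` stand-in for a language model are illustrative choices, not the RLVE-Gym API or the paper's exact update rule. The sketch shows a generator parameterized by difficulty, a programmatic verifier, and a difficulty ceiling that rises only once the policy is reliably solving problems at the current ceiling.

```python
import random


class SortingEnv:
    """Hypothetical verifiable environment: sort a list of integers."""

    def generate(self, difficulty, rng):
        # Difficulty controls problem size: longer lists are harder.
        return [rng.randint(0, 99) for _ in range(difficulty + 2)]

    def verify(self, problem, answer):
        # Algorithmic check: reward 1.0 only for the exact sorted list.
        return 1.0 if answer == sorted(problem) else 0.0


class AdaptiveDifficulty:
    """Samples from a window [low, high]; raises `high` when accuracy
    at `high` reaches `threshold` over `min_samples` attempts."""

    def __init__(self, threshold=0.9, min_samples=16, window=4):
        self.high = 0
        self.threshold = threshold
        self.min_samples = min_samples
        self.window = window
        self.correct = 0
        self.total = 0

    @property
    def low(self):
        return max(0, self.high - self.window + 1)

    def sample(self, rng):
        return rng.randint(self.low, self.high)

    def update(self, difficulty, reward):
        if difficulty != self.high:
            return  # only attempts at the ceiling decide whether to raise it
        self.total += 1
        self.correct += int(reward == 1.0)
        if self.total >= self.min_samples:
            if self.correct / self.total >= self.threshold:
                self.high += 1
            self.correct = self.total = 0  # start a fresh measurement


def toy_policy(problem, skill=6):
    # Stand-in for an LM: solves lists of up to `skill` items, else gives up.
    return sorted(problem) if len(problem) <= skill else list(problem)


rng = random.Random(0)
env, schedule = SortingEnv(), AdaptiveDifficulty()
for _ in range(2000):
    d = schedule.sample(rng)
    problem = env.generate(d, rng)
    reward = env.verify(problem, toy_policy(problem))
    schedule.update(d, reward)
# The ceiling stalls at the first difficulty the toy policy cannot solve.
```

In a real training loop the policy would improve over time, so the ceiling would keep moving instead of stalling. The sliding window keeps sampled problems near the edge of what the policy can do.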
