Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis

📅 2026-05-14
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses a limitation of conventional self-improving language models: they rely on data generation without structured difficulty control and therefore struggle to keep improving their reasoning. The authors propose EvoEnv, a self-evolving reinforcement learning paradigm built on environment synthesis, in which the model itself constructs executable, verifiable Python training environments. By replacing data generation with environment construction, EvoEnv maintains a stable asymmetry between solving and verifying, so reward signals stay informative throughout training. Environments are admitted only after staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks. On Qwen3-4B-Thinking, EvoEnv raises the average score from 72.4 to 74.8 (+2.4 points, a 3.3% relative gain), whereas RLVR baselines trained on fixed public data or fixed hand-crafted environments lower the average.
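One way to read the difficulty-calibration step is as an admission gate on the solver's empirical pass rate: an environment that the current policy always solves, or never solves, yields no useful reward contrast. The sketch below illustrates that reading only. The function names `pass_rate` and `admit` and the thresholds `low` and `high` are hypothetical; the summary does not state how EvoEnv calibrates difficulty.

```python
from typing import Sequence


def pass_rate(scores: Sequence[float]) -> float:
    """Fraction of rollouts that earned full reward (score >= 1.0)."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s >= 1.0) / len(scores)


def admit(scores: Sequence[float], low: float = 0.1, high: float = 0.9) -> bool:
    """Admit an environment only if the solver's pass rate falls inside [low, high].

    The band is an assumed placeholder, not a value taken from the paper.
    """
    return low <= pass_rate(scores) <= high
```

Under these assumed thresholds, `admit([1.0, 0.0, 1.0, 0.0])` accepts an environment solved half the time, while environments that are always or never solved are rejected.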
📝 Abstract
We pursue a vision for self-improving language models in which the model does not merely generate problems or traces to imitate, but constructs the environments that train it. In zero-data reasoning RL, this reframes self-improvement from a data-generation loop into an environment-construction loop, where each artifact is a reusable executable object that samples instances, computes references, and scores responses. Whether this vision sustains improvement hinges on a single property: the environments must exhibit stable solve-verify asymmetry. That is, the model must be able to write an oracle once that it cannot reliably execute in natural language on fresh instances. This asymmetry takes two complementary forms. Some tasks are algorithmically hard to reason through but trivial as code: a dynamic program or graph traversal, compiled once, yields unboundedly many calibrated instances. Others are intrinsically hard to solve but easy to verify, like planted subset-sum or constraint satisfaction. Both create a durable gap between proposing and solving that the policy cannot close by gaming the verifier, and it is this gap that keeps reward informative as the learner improves. We instantiate this view in EvoEnv, a single-policy generator-solver method that synthesizes Python environments from ten seeds and admits them only after staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks. The strongest evidence comes from the already-strong regime: on Qwen3-4B-Thinking, fixed public-data RLVR and fixed hand-crafted environment RLVR reduce the average, while EvoEnv improves it from 72.4 to 74.8, a relative gain of 3.3%. Stable self-improvement, we suggest, depends not on producing more synthetic data, but on models learning to construct worlds whose difficulty stays structurally beyond their own reach.
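The abstract describes each environment as an executable object that samples instances, computes references, and scores responses, and it names planted subset-sum as a task that is hard to solve but easy to verify. The following is a minimal sketch of such an object under stated assumptions: the class `PlantedSubsetSumEnv`, its methods `sample` and `score`, and the parameters `n` and `max_value` are all hypothetical illustrations, not the interface used by EvoEnv.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    numbers: tuple[int, ...]  # values shown to the solver
    target: int               # sum of the hidden planted subset


class PlantedSubsetSumEnv:
    """Hypothetical environment: sample an instance, keep a reference, score answers."""

    def __init__(self, n: int = 12, max_value: int = 10_000):
        self.n = n
        self.max_value = max_value

    def sample(self, seed: int) -> tuple[Instance, tuple[int, ...]]:
        """Return an instance plus the planted index set as the reference solution."""
        rng = random.Random(seed)  # seeded, so instances are reproducible
        numbers = tuple(rng.randint(1, self.max_value) for _ in range(self.n))
        k = rng.randint(2, self.n - 1)
        planted = tuple(sorted(rng.sample(range(self.n), k)))
        target = sum(numbers[i] for i in planted)
        return Instance(numbers, target), planted

    def score(self, instance: Instance, answer: list[int]) -> float:
        """Verify in linear time: valid, distinct indices whose values sum to the target."""
        if any(not isinstance(i, int) or not 0 <= i < len(instance.numbers) for i in answer):
            return 0.0
        if len(set(answer)) != len(answer):
            return 0.0
        total = sum(instance.numbers[i] for i in answer)
        return 1.0 if total == instance.target else 0.0
```

Planting the subset makes the reference free to compute at sampling time, and `score` accepts any index set that reaches the target, not only the planted one. Finding such a set generally requires search, whereas checking one is a single pass, which is the asymmetry the abstract relies on.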
Problem

Research questions and friction points this paper is trying to address.

self-improving language models
reasoning reinforcement learning
environment synthesis
solve-verify asymmetry
verifiable environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Environment Synthesis
Solve-Verify Asymmetry
Self-Evolving RL
Verifiable Oracles
Reasoning Reinforcement Learning