RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System

📅 2026-02-02
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the limitations of traditional reinforcement learning, where static decoupling among the environment, policy, and reward model hinders adaptability to dynamic tasks—particularly in large language model (LLM) agent settings, where weak learning signals and poor generalization are prevalent. To overcome these challenges, the authors propose the first fully dynamic co-adaptive reinforcement learning framework, enabling closed-loop joint optimization of the environment, policy, and reward model. The approach integrates step-level and outcome-level feedback for policy training, employs consistency constraints to refine the reward model, and introduces a theory-driven mechanism for automatic environment adaptation. Extensive experiments demonstrate substantial performance gains: Qwen3-VL-8B-Thinking improves by 9.1% on OSWorld, while Qwen2.5-7B-Instruct gains 18.7% on AlfWorld and 11.9% on LiveBench, validating the effectiveness and composability of the framework's components.

📝 Abstract
We propose RLAnything, a reinforcement learning framework that dynamically forges environment, policy, and reward models through closed-loop optimization, amplifying learning signals and strengthening the overall RL system for any LLM or agentic scenario. Specifically, the policy is trained with integrated feedback from step-wise and outcome signals, while the reward model is jointly optimized via consistency feedback, which in turn further improves policy training. Moreover, our theory-motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience. Empirically, each added component consistently improves the overall system, and RLAnything yields substantial gains across various representative LLM and agentic tasks, boosting Qwen3-VL-8B-Thinking by 9.1% on OSWorld and Qwen2.5-7B-Instruct by 18.7% and 11.9% on AlfWorld and LiveBench, respectively. We also show that optimized reward-model signals outperform outcomes that rely on human labels. Code: https://github.com/Gen-Verse/Open-AgentRL
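The closed loop described in the abstract—policy updated from combined step-wise and outcome feedback, reward model refined via a consistency signal, environment adapted using critic feedback—can be sketched as a toy numeric loop. This is a minimal illustrative sketch only: all variable names, update rules, and coefficients below are invented stand-ins, not the actual RLAnything implementation (see the linked repository for that).

```python
import random

def run_closed_loop(steps=50, seed=0):
    """Toy sketch of policy / reward-model / environment co-adaptation.

    All quantities are scalar stand-ins for the real parameterized models;
    the update rules are illustrative, not the paper's actual objectives.
    """
    rng = random.Random(seed)
    policy_skill = 0.1      # stand-in for policy parameters
    reward_scale = 1.0      # stand-in for reward-model parameters
    env_difficulty = 0.5    # stand-in for an adaptable environment

    for _ in range(steps):
        # Rollout: success probability rises with skill, falls with difficulty.
        success = rng.random() < policy_skill / (policy_skill + env_difficulty)
        outcome_signal = 1.0 if success else 0.0
        step_signal = policy_skill * reward_scale  # dense step-level proxy

        # Policy update from integrated step-level + outcome-level feedback.
        policy_skill += 0.05 * (0.5 * step_signal + 0.5 * outcome_signal)

        # Reward-model update via a consistency penalty between the
        # dense step signal and the sparse outcome signal.
        consistency_gap = step_signal - outcome_signal
        reward_scale -= 0.02 * consistency_gap

        # Environment adaptation: difficulty drifts up when the policy
        # succeeds often (critic-style feedback), keeping training informative.
        env_difficulty += 0.05 * (outcome_signal - 0.5)
        env_difficulty = max(0.1, env_difficulty)

    return policy_skill, reward_scale, env_difficulty
```

The point of the sketch is the coupling: each of the three components is updated inside the same loop using signals produced by the others, rather than being frozen in advance.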
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
dynamic environment
reward model
policy optimization
LLM agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic reinforcement learning
closed-loop optimization
reward model co-training
automatic environment adaptation
LLM-based agents