Reward-Conditioned Reinforcement Learning

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional reinforcement learning relies on a fixed reward function, which limits how well an agent adapts to changing preferences and generalizes to new objectives. This work proposes Reward-Conditioned Reinforcement Learning (RCRL), a framework that conditions the policy on reward parameters so that policies for an entire family of reward objectives can be learned off-policy from experience collected under a single nominal objective. Because all objectives share one replay buffer, a single policy can represent multiple reward-specific behaviors, combining the simplicity of single-task training with the flexibility of multi-task adaptation. Across single-task, multi-task, and vision-based benchmarks, the authors report that RCRL improves performance under the nominal reward and enables efficient adaptation to new reward parameters.

📝 Abstract
RL agents are typically trained under a single, fixed reward function, which makes them brittle to reward misspecification and limits their ability to adapt to changing task preferences. We introduce Reward-Conditioned Reinforcement Learning (RCRL), a framework that trains a single agent to optimize a family of reward specifications while collecting experience under only one nominal objective. RCRL conditions the agent on reward parameterizations and learns multiple reward objectives from shared replay data entirely off-policy, enabling a single policy to represent reward-specific behaviors. Across single-task, multi-task, and vision-based benchmarks, we show that RCRL not only improves performance under the nominal reward parameterization, but also enables efficient adaptation to new parameterizations. Our results demonstrate that RCRL provides a scalable mechanism for learning robust, steerable policies without sacrificing the simplicity of single-task training.
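The core idea in the abstract (condition on reward parameters, collect data under one nominal objective, relabel shared replay data off-policy) can be illustrated with a minimal tabular sketch. This is not the authors' implementation. The chain environment, the linear reward family `r_w = w · phi`, the two-element parameter set, and all names (`step`, `train`, `REWARD_PARAMS`) are assumptions made for illustration, and plain Q-learning stands in for whatever off-policy learner the paper actually uses.

```python
import numpy as np

# Hypothetical toy setup (not from the paper): a 5-state chain with two
# actions (0 = left, 1 = right) and a reward family r_w = w . phi, where
# phi = [reached_right_end, reached_left_end].
N_STATES, N_ACTIONS = 5, 2
REWARD_PARAMS = np.array([[1.0, 0.0],   # nominal: prefer the right end
                          [0.0, 1.0]])  # alternative: prefer the left end
NOMINAL = 0
GAMMA = 0.9

def step(s, a):
    """Deterministic chain dynamics; returns next state, features, done."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    phi = np.array([float(s2 == N_STATES - 1), float(s2 == 0)])
    return s2, phi, bool(phi.any())

def train(n_episodes=300, n_updates=20, seed=0):
    rng = np.random.default_rng(seed)
    # Q is indexed by (reward-parameter index, state, action): a single
    # table conditioned on the reward parameterization.
    q = np.zeros((len(REWARD_PARAMS), N_STATES, N_ACTIONS))
    buffer = []  # shared replay: stores features, not a fixed scalar reward
    for _ in range(n_episodes):
        s = N_STATES // 2
        for _ in range(20):
            # Experience is collected under the nominal objective only
            # (epsilon-greedy with respect to the nominal Q-values).
            if rng.random() < 0.3:
                a = int(rng.integers(N_ACTIONS))
            else:
                a = int(np.argmax(q[NOMINAL, s]))
            s2, phi, done = step(s, a)
            buffer.append((s, a, phi, s2, done))
            s = s2
            if done:
                break
        for _ in range(n_updates):
            s_, a_, phi_, s2_, done_ = buffer[rng.integers(len(buffer))]
            # Sample a reward parameterization and relabel the stored
            # transition with the reward it would have produced.
            k = int(rng.integers(len(REWARD_PARAMS)))
            r = REWARD_PARAMS[k] @ phi_
            target = r + (0.0 if done_ else GAMMA * q[k, s2_].max())
            q[k, s_, a_] += 0.5 * (target - q[k, s_, a_])
    return q
```

After training, `np.argmax(q[k, s])` gives the greedy action in state `s` for parameterization `k`, so the same table can be steered toward either end of the chain even though the behavior policy only ever followed the nominal objective. Storing features rather than a scalar reward is the design choice that makes relabeling possible in this sketch; it only works when the reward for any parameterization can be recomputed from what the buffer holds.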
Problem

Research questions and friction points this paper is trying to address.

reward misspecification
task preference adaptation
reinforcement learning
policy robustness
reward conditioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reward-Conditioned Reinforcement Learning
off-policy learning
multi-objective optimization
policy conditioning
reward misspecification robustness