🤖 AI Summary
Traditional reinforcement learning relies on a fixed reward function, which limits adaptability to changing preferences and hampers generalization. This work proposes Reward-Conditioned Reinforcement Learning (RCRL), a framework that conditions the policy on reward parameters, enabling off-policy learning of optimal policies for an entire family of reward objectives from experience collected under a single nominal target. By sharing a unified replay buffer, RCRL achieves, for the first time, efficient support for multiple reward goals within a single policy, combining the simplicity of single-task training with the flexibility of multi-task adaptation. Experiments show that RCRL consistently outperforms existing baselines across single-task, multi-task, and visual benchmarks, improving performance under the nominal reward while also enabling rapid generalization to new reward parameters.
📝 Abstract
Reinforcement learning (RL) agents are typically trained under a single, fixed reward function, which makes them brittle to reward misspecification and limits their ability to adapt to changing task preferences. We introduce Reward-Conditioned Reinforcement Learning (RCRL), a framework that trains a single agent to optimize a family of reward specifications while collecting experience under only one nominal objective. RCRL conditions the agent on reward parameterizations and learns multiple reward objectives entirely off-policy from shared replay data, enabling a single policy to represent reward-specific behaviors. Across single-task, multi-task, and vision-based benchmarks, we show that RCRL not only improves performance under the nominal reward parameterization, but also enables efficient adaptation to new parameterizations. Our results demonstrate that RCRL provides a scalable mechanism for learning robust, steerable policies without sacrificing the simplicity of single-task training.
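The core idea described above can be illustrated with a minimal sketch: experience is collected once under a single behavior policy, stored without rewards, and each stored transition is relabeled on the fly with `r(s', w)` for every reward parameter `w` of interest, so one shared buffer trains the whole family off-policy. The chain MDP, the linear reward family `r(s', w) = w[s']`, and the per-parameter tabular Q (standing in for a single reward-conditioned function approximator) are all illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny chain MDP (illustrative assumption): states 0..4, actions 0=left, 1=right.
N_STATES, N_ACTIONS = 5, 2

def step(s, a):
    return max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)

# Parameterized reward family (assumed linear): r(s', w) = w[s'].
def reward(s_next, w):
    return w[s_next]

# 1) Collect experience ONCE under a random behavior policy into a single
#    shared replay buffer of (state, action, next_state). No rewards are
#    stored, so the same data serves every reward parameterization.
buffer = []
s = 0
for _ in range(2000):
    a = int(rng.integers(N_ACTIONS))
    s_next = step(s, a)
    buffer.append((s, a, s_next))
    s = s_next

# 2) Off-policy Q-learning for a family of reward parameters, relabeling
#    each transition with r(s', w) on the fly. For simplicity we keep one
#    tabular Q per parameter instead of a single conditioned network.
reward_params = {"go_right": np.eye(N_STATES)[-1],  # reward 1 at state 4
                 "go_left": np.eye(N_STATES)[0]}    # reward 1 at state 0
gamma, alpha = 0.9, 0.5
Q = {name: np.zeros((N_STATES, N_ACTIONS)) for name in reward_params}

for _ in range(30):  # sweeps over the shared buffer
    for (s, a, s_next) in buffer:
        for name, w in reward_params.items():
            td_target = reward(s_next, w) + gamma * Q[name][s_next].max()
            Q[name][s, a] += alpha * (td_target - Q[name][s, a])

# Greedy policies specialize per reward parameter despite the shared data.
print(Q["go_right"][2].argmax())  # -> 1 (move right, toward state 4)
print(Q["go_left"][2].argmax())   # -> 0 (move left, toward state 0)
```

Because only reward-free transitions are stored, switching or adding a reward parameterization requires no new environment interaction, which is the efficiency argument the abstract makes.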