🤖 AI Summary
In preference-based reinforcement learning (PbRL), reward modeling suffers from slow convergence, and pretraining a reward model with one loss and fine-tuning it with another can make optimization unreliable. To address these issues, this paper proposes the Residual Reward Model (RRM), which decomposes the true reward into a prior reward available before training (e.g., a user's "best guess" reward function or a reward learned via inverse reinforcement learning) and a learnable residual term trained from preferences. This decomposition accommodates diverse prior specifications while avoiding the objective mismatch between pretraining and preference fine-tuning. The paper introduces state-based and image-based versions of RRM, supporting end-to-end policy optimization. Empirical evaluation on the Meta-World suite and on Franka Panda robotic manipulation tasks shows that RRM substantially improves convergence speed and stability over a common PbRL baseline, succeeding with fewer environment interactions. Improvements hold across prior-reward types, including proxy rewards, IRL-derived rewards, and even a negated proxy reward, supporting RRM's generalizability and practical efficacy for real-world robotic control.
📝 Abstract
Preference-based Reinforcement Learning (PbRL) provides a way to learn high-performance policies in environments where the reward signal is hard to specify, avoiding heuristic and time-consuming reward design. However, PbRL can suffer from slow convergence since it requires training a reward model. Prior work has proposed learning a reward model from demonstrations and fine-tuning it using preferences. However, when the model is a neural network, using different loss functions for pre-training and fine-tuning can pose challenges to reliable optimization. In this paper, we propose a method to effectively leverage prior knowledge with a Residual Reward Model (RRM). An RRM assumes that the true reward of the environment can be split into a sum of two parts: a prior reward and a learned reward. The prior reward is a term available before training, for example, a user's "best guess" reward function or a reward function learned from inverse reinforcement learning (IRL), and the learned reward is trained with preferences. We introduce state-based and image-based versions of RRM and evaluate them on several tasks in the Meta-World environment suite. Experimental results show that our method substantially improves the performance of a common PbRL method. Our method achieves performance improvements for a variety of different types of prior rewards, including proxy rewards, a reward obtained from IRL, and even a negated version of the proxy reward. We also conduct experiments with a Franka Panda to show that our method leads to superior performance on a real robot, significantly accelerating policy learning across tasks and achieving success in fewer steps than the baseline. Videos are available at https://sunlighted.github.io/RRM-web/.
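The core idea above can be sketched in a few lines. The following is a minimal, hypothetical illustration (not the authors' code): the predicted reward is the sum of a fixed prior reward and a learned residual, and only the residual is updated, using a Bradley-Terry (logistic) preference loss over pairs of trajectory segments — the standard objective in PbRL. The prior here is an assumed toy "best guess" (negative distance to a goal state), and the residual is linear for simplicity.

```python
import numpy as np

def prior_reward(s):
    # Hypothetical prior: a user's "best guess", e.g. negative distance to goal 1.0.
    return -abs(s - 1.0)

class ResidualRewardModel:
    """Predicted reward r(s) = r_prior(s) + r_residual(s); only the residual is learned."""

    def __init__(self, dim=1, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=0.1, size=dim)  # linear residual parameters
        self.lr = lr

    def residual(self, s):
        return float(np.dot(np.atleast_1d(s), self.w))

    def reward(self, s):
        # The RRM decomposition: fixed prior plus learned residual.
        return prior_reward(s) + self.residual(s)

    def update(self, seg_a, seg_b, pref_a):
        # Bradley-Terry model: P(a preferred) = sigmoid(R(seg_a) - R(seg_b)),
        # where R is the summed reward over a segment. One gradient step on
        # the residual parameters only; the prior stays fixed.
        ra = sum(self.reward(s) for s in seg_a)
        rb = sum(self.reward(s) for s in seg_b)
        p = 1.0 / (1.0 + np.exp(-(ra - rb)))
        # d/dw of (ra - rb) is the difference of summed states (linear residual).
        grad = (p - pref_a) * (sum(np.atleast_1d(s) for s in seg_a)
                               - sum(np.atleast_1d(s) for s in seg_b))
        self.w -= self.lr * grad

# Fit the residual on synthetic preferences favoring segments near the goal.
model = ResidualRewardModel(seed=0)
pairs = [([0.9, 1.0], [0.0, 0.2], 1.0),   # segment a preferred (pref_a = 1)
         ([1.1, 0.95], [0.5, 0.4], 1.0)]
for _ in range(50):
    for seg_a, seg_b, pref_a in pairs:
        model.update(seg_a, seg_b, pref_a)
```

Because the prior already points roughly toward the goal, the residual only has to correct it, which is what the paper credits for faster convergence than learning a reward model from scratch.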