SR-Reward: Taking The Path More Traveled

📅 2025-01-04
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the challenge of acquiring a reward function in offline reinforcement learning (RL). To avoid the need for ground-truth rewards or the adversarial training used in inverse RL, the authors propose SR-Reward, a framework that learns a reward function directly from offline expert trajectories, decoupling reward modeling from policy optimization. Methodologically, it introduces the successor representation (SR) into reward learning: the SR encodes the expected distribution of future state visitations, and because it satisfies a Bellman equation, the reward can be learned alongside standard RL algorithms without changing their training pipelines. A negative sampling mechanism suppresses reward overestimation on out-of-distribution (OOD) states, inducing a conservative bias that improves robustness. On the D4RL benchmark, SR-Reward achieves performance competitive with offline RL methods that have access to true rewards, as well as with behavior cloning. Ablation studies on dataset size and quality reveal both the strengths and the limitations of SR-Reward as a proxy for true rewards.

๐Ÿ“ Abstract
In this paper, we propose a novel method for learning reward functions directly from offline demonstrations. Unlike traditional inverse reinforcement learning (IRL), our approach decouples the reward function from the learner's policy, eliminating the adversarial interaction typically required between the two. This results in a more stable and efficient training process. Our reward function, called *SR-Reward*, leverages successor representation (SR) to encode a state based on expected future states' visitation under the demonstration policy and transition dynamics. By utilizing the Bellman equation, SR-Reward can be learned concurrently with most reinforcement learning (RL) algorithms without altering the existing training pipeline. We also introduce a negative sampling strategy to mitigate overestimation errors by reducing rewards for out-of-distribution data, thereby enhancing robustness. This strategy inherently introduces a conservative bias into RL algorithms that employ the learned reward. We evaluate our method on the D4RL benchmark, achieving competitive results compared to offline RL algorithms with access to true rewards and imitation learning (IL) techniques like behavioral cloning. Moreover, our ablation studies on data size and quality reveal the advantages and limitations of SR-Reward as a proxy for true rewards.
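The core ideas in the abstract — learning the SR from expert transitions via its Bellman equation, deriving a reward from it, and penalizing out-of-distribution states via negative sampling — can be illustrated with a toy tabular sketch. This is not the paper's implementation (which uses function approximation); the choice of the SR-vector norm as the reward score and the fixed OOD penalty are illustrative assumptions.

```python
import numpy as np

# Toy chain MDP: expert demonstration visits states 0 -> 1 -> 2 -> 3;
# states 4 and 5 are never visited (out-of-distribution).
n_states, gamma, lr = 6, 0.9, 0.1
demo = [(0, 1), (1, 2), (2, 3)]

# Successor representation M[s] ~ expected discounted future state
# visitation under the demonstration policy.
M = np.zeros((n_states, n_states))
for _ in range(500):
    for s, s_next in demo:
        # TD update from the SR Bellman equation: M(s) = 1_s + gamma * M(s')
        target = np.eye(n_states)[s] + gamma * M[s_next]
        M[s] += lr * (target - M[s])

# Illustrative reward proxy: the norm of a state's SR vector, so states
# the expert passes through early score highest (the paper's reward head
# is learned, not this fixed formula).
reward = np.linalg.norm(M, axis=1)

# Negative sampling stand-in: push down the reward of states outside the
# demonstration distribution, inducing a conservative bias for OOD states.
visited = {s for pair in demo for s in pair}
for s in range(n_states):
    if s not in visited:
        reward[s] -= 0.5
```

After training, demonstration states receive non-negative rewards that decrease along the trajectory's remaining horizon, while the unvisited states 4 and 5 end up with negative reward, so a downstream RL algorithm is steered back toward the expert's state distribution.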
Problem

Research questions and friction points this paper is trying to address.

Automatic Reward Learning
Pre-recorded Videos
Stable and Fast Machine Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

SR-Reward
Prospective Impact Assessment
Conservative Strategy
🔎 Similar Papers
No similar papers found.