🤖 AI Summary
This work addresses the challenge of learning from static datasets in reinforcement learning (RL) without an explicit reward function or online environment interaction. We propose Sim-OPRL, the first algorithm to enable fully offline, preference-driven policy learning. Methodologically, Sim-OPRL learns a world model to generate simulated trajectories and actively queries human preference feedback on them, without any real-world interaction, combining a pessimistic treatment of out-of-distribution data for robustness with an optimistic experimental design for efficient preference acquisition. Theoretically, we establish the first sample-complexity upper bound for offline preference-based RL, expressed in terms of how well the offline data covers the optimal policy. Empirically, Sim-OPRL achieves significant improvements over state-of-the-art offline RL baselines across multiple simulation domains, using only limited offline data and sparse human preferences. Our approach bridges a critical gap between offline RL and preference-based RL.
📝 Abstract
Applying reinforcement learning (RL) to real-world problems is often made challenging by the inability to interact with the environment and the difficulty of designing reward functions. Offline RL addresses the first challenge by assuming access to an offline dataset of environment interactions labeled with rewards. Preference-based RL, in contrast, does not assume access to the reward function and instead learns it from preferences, but it typically requires online interaction with the environment. We bridge the gap between these frameworks by exploring efficient methods for acquiring preference feedback in a fully offline setup. We propose Sim-OPRL, an offline preference-based reinforcement learning algorithm that leverages a learned environment model to elicit preference feedback on simulated rollouts. Drawing on insights from both the offline RL and the preference-based RL literature, our algorithm employs a pessimistic approach for out-of-distribution data and an optimistic approach for acquiring informative preferences about the optimal policy. We provide theoretical guarantees on the sample complexity of our approach, dependent on how well the offline data covers the optimal policy. Finally, we demonstrate the empirical performance of Sim-OPRL in various environments.
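To give a concrete (and heavily simplified) feel for the core idea of learning a reward model from preferences over simulated rollouts, the toy sketch below fits a linear reward from pairwise comparisons under a Bradley-Terry preference model. This is not the paper's algorithm: the world model is replaced by a random rollout generator, the human annotator by a synthetic oracle, and the pessimism/optimism mechanisms are omitted; all names (`simulate_rollout`, `preference_oracle`, `true_w`) are hypothetical illustration-only constructs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a "trajectory" is a 3-dim feature vector,
# the true reward is linear in those features, and a simulated annotator
# prefers the higher-return trajectory via a Bradley-Terry model.
true_w = np.array([1.0, -0.5, 2.0])

def simulate_rollout():
    """Stand-in for a rollout from a learned world model."""
    return rng.normal(size=3)

def preference_oracle(traj_a, traj_b):
    """Returns 1 if traj_a is preferred, sampled from Bradley-Terry."""
    p_a = 1.0 / (1.0 + np.exp(-(true_w @ traj_a - true_w @ traj_b)))
    return int(rng.random() < p_a)

# Fit reward weights from preference labels by stochastic gradient
# ascent on the Bradley-Terry log-likelihood.
w = np.zeros(3)
lr = 0.5
for _ in range(2000):
    a, b = simulate_rollout(), simulate_rollout()
    label = preference_oracle(a, b)           # 1 means a was preferred
    diff = a - b
    p_a = 1.0 / (1.0 + np.exp(-(w @ diff)))   # model's preference prob
    w += lr * (label - p_a) * diff            # log-likelihood gradient

# The learned reward direction should align with the true one.
cos = (w @ true_w) / (np.linalg.norm(w) * np.linalg.norm(true_w))
print(round(cos, 2))
```

In Sim-OPRL, the rollout pairs would instead come from a pessimistic learned environment model and be chosen optimistically to be maximally informative about the optimal policy, rather than sampled at random as here.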