Preference Elicitation for Offline Reinforcement Learning

📅 2024-06-26
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge in offline reinforcement learning (RL) of learning from static datasets without explicit reward functions or online environment interaction. We propose Sim-OPRL, the first algorithm enabling fully offline, preference-driven policy learning. Methodologically, Sim-OPRL learns a world model to generate simulated trajectories and actively queries human preference feedback on them, requiring no real-world interaction; it combines a pessimistic treatment of out-of-distribution data for robustness with an optimistic experimental design for efficient preference acquisition. Theoretically, we establish the first sample-complexity upper bound for offline preference-based RL, expressed in terms of optimal-policy coverage. Empirically, Sim-OPRL achieves significant improvements over state-of-the-art offline RL baselines across multiple simulation domains, using only limited offline data and sparse human preferences. Our approach effectively bridges a critical gap between offline RL and preference-based RL.
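The loop described above (simulate rollouts in a learned model, then query the preference pair the learner is most uncertain about) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `simulate`, `disagreement`, and `select_query` are hypothetical names, and ensemble disagreement is used here as a stand-in for the paper's optimistic experimental design.

```python
# Hypothetical sketch of the simulated-rollout / active-query loop.
# A "reward ensemble" stands in for the learner's uncertainty over rewards.

def simulate(policy, model, horizon=10):
    """Roll out a policy in the learned world model (no real interaction)."""
    state, traj = 0.0, []
    for _ in range(horizon):
        action = policy(state)
        state = model(state, action)  # learned dynamics, not the real env
        traj.append((state, action))
    return traj

def disagreement(traj_a, traj_b, reward_ensemble):
    """How much the reward ensemble disagrees about which trajectory is better."""
    def ret(traj, r):
        return sum(r(s, a) for s, a in traj)
    gaps = [ret(traj_a, r) - ret(traj_b, r) for r in reward_ensemble]
    return max(gaps) - min(gaps)

def select_query(candidate_trajs, reward_ensemble):
    """Optimistic query selection: ask the human about the most contested pair."""
    pairs = [(a, b) for i, a in enumerate(candidate_trajs)
                    for b in candidate_trajs[i + 1:]]
    return max(pairs, key=lambda p: disagreement(p[0], p[1], reward_ensemble))
```

The human's answer to each selected query then updates the reward ensemble, shrinking disagreement over time; pessimism would additionally penalize rollouts that leave the support of the offline data.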

📝 Abstract
Applying reinforcement learning (RL) to real-world problems is often made challenging by the inability to interact with the environment and the difficulty of designing reward functions. Offline RL addresses the first challenge by considering access to an offline dataset of environment interactions labeled by the reward function. In contrast, Preference-based RL does not assume access to the reward function and learns it from preferences, but typically requires an online interaction with the environment. We bridge the gap between these frameworks by exploring efficient methods for acquiring preference feedback in a fully offline setup. We propose Sim-OPRL, an offline preference-based reinforcement learning algorithm, which leverages a learned environment model to elicit preference feedback on simulated rollouts. Drawing on insights from both the offline RL and the preference-based RL literature, our algorithm employs a pessimistic approach for out-of-distribution data, and an optimistic approach for acquiring informative preferences about the optimal policy. We provide theoretical guarantees regarding the sample complexity of our approach, dependent on how well the offline data covers the optimal policy. Finally, we demonstrate the empirical performance of Sim-OPRL in various environments.
Problem

Research questions and friction points this paper is trying to address.

Bridging offline RL and preference-based RL frameworks
Efficiently acquiring preference feedback without online interaction
Learning a policy from preference feedback in a fully offline setting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sim-OPRL: offline preference-based RL algorithm
Learned environment model for preference feedback
Pessimistic and optimistic data handling strategies
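To make "learning a reward from preferences" concrete, here is the standard Bradley-Terry model commonly used in preference-based RL, fit with a toy one-parameter gradient ascent. This is an assumption-laden sketch for illustration only: the paper's actual objective and parameterization may differ, and all names here are invented.

```python
import math

def pref_prob(ret_a, ret_b):
    """Bradley-Terry probability that trajectory A is preferred over B."""
    return 1.0 / (1.0 + math.exp(ret_b - ret_a))

def fit_reward_weight(preferences, features, lr=0.5, steps=200):
    """Toy logistic fit for reward(traj) = w * feature(traj).

    `preferences` lists (i, j) pairs meaning trajectory i was preferred
    over trajectory j; `features` maps each trajectory index to a scalar.
    """
    w = 0.0
    for _ in range(steps):
        grad = 0.0
        for i, j in preferences:
            p = pref_prob(w * features[i], w * features[j])
            grad += (1.0 - p) * (features[i] - features[j])  # d log p / dw
        w += lr * grad / len(preferences)
    return w
```

Maximizing the log-likelihood of observed preferences pushes the learned reward to rank preferred trajectories above rejected ones, which is the supervision signal that replaces an explicit reward function in the offline setting.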