🤖 AI Summary
In model-based offline reinforcement learning, distributional shift (DS) causes value overestimation and policy degradation due to the coupled effects of model bias and policy mismatch. This work is the first to systematically decouple these two sources, proposing Shift-Aware Rewards (SAR) grounded in unified probabilistic inference: SAR is approximated via a trainable classifier, enabling end-to-end learning; theoretically, it jointly calibrates value estimation and policy optimization, establishing the first model-based offline RL framework with a unified theoretical foundation. Our method integrates uncertainty-aware value correction with distributionally robust training. Empirically, it achieves state-of-the-art or competitive performance across multiple standard offline benchmarks, significantly mitigating DS-induced estimation bias and policy degradation.
📝 Abstract
Model-based offline reinforcement learning trains policies using pre-collected datasets and learned environment models, eliminating the need for direct real-world environment interaction. However, this paradigm is inherently challenged by distribution shift (DS). Existing methods address this issue by leveraging off-policy mechanisms and estimating model uncertainty, but they often result in inconsistent objectives and lack a unified theoretical foundation. This paper offers a comprehensive analysis that disentangles the problem into two fundamental components: model bias and policy shift. Our theoretical and empirical investigations reveal how these factors distort value estimation and restrict policy optimization. To tackle these challenges, we derive a novel Shifts-aware Reward (SAR) through a unified probabilistic inference framework, which modifies the vanilla reward to refine value learning and facilitate policy training. Building on this, we introduce Shifts-aware Model-based Offline Reinforcement Learning (SAMBO-RL), a practical framework that efficiently trains classifiers to approximate SAR for policy optimization. Empirical experiments show that SAR effectively mitigates DS, and SAMBO-RL achieves superior or comparable performance across various benchmarks, underscoring its effectiveness and validating our theoretical analysis.