🤖 AI Summary
Offline reinforcement learning (RL) commonly relies on conservatism, e.g., action penalization or myopic planning, to mitigate out-of-distribution risk, yet this paradigm inherently restricts long-horizon planning and generalization. This work questions the universality of conservative design and introduces a Bayesian offline RL framework that dispenses with explicit conservatism. The method employs a Bayesian world model that captures posterior uncertainty over environment dynamics, a policy network conditioned on the history of past states, and an adaptive long-horizon planning mechanism; epistemic uncertainty is modeled explicitly during training. This formulation mitigates compounding error and value overestimation, enabling robust planning over hundreds of steps. On the D4RL and NeoRL benchmarks, the approach generally matches or surpasses leading conservative algorithms and sets a new state of the art on seven datasets. The results demonstrate the efficacy and robustness of the non-conservative Bayesian paradigm, particularly on low-quality datasets and complex tasks.
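As a rough illustration of the pipeline the summary describes, a posterior over world models is often approximated in practice with a bootstrapped ensemble, and planning samples one model per imagined rollout while the policy conditions on the state history. The sketch below is hypothetical: the 1-D linear dynamics, ensemble size, and toy policy are illustrative stand-ins, not the actual Neubay architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-D offline dataset of (s, a, s') transitions.
s = rng.normal(size=500)
a = rng.normal(size=500)
s_next = 0.9 * s + 0.5 * a + 0.05 * rng.normal(size=500)

# Crude stand-in for a Bayesian world model: a bootstrapped ensemble of
# linear models s' ~ w0*s + w1*a; member disagreement plays the role of
# epistemic (posterior) uncertainty over the dynamics.
ensemble = []
for _ in range(5):
    idx = rng.integers(0, 500, size=500)          # bootstrap resample
    X = np.stack([s[idx], a[idx]], axis=1)
    w, *_ = np.linalg.lstsq(X, s_next[idx], rcond=None)
    ensemble.append(w)

def imagined_rollout(s0, policy, horizon=10):
    """Roll out under one posterior sample with a history-conditioned policy."""
    w = ensemble[rng.integers(len(ensemble))]     # sample a world model
    state, history = s0, [s0]
    for _ in range(horizon):
        act = policy(history)                     # policy sees the full history
        state = w[0] * state + w[1] * act
        history.append(state)
    return state

# Toy history-dependent policy: act against the running mean of past states.
final = imagined_rollout(0.5, lambda h: -np.mean(h))
```

Training the agent on rollouts drawn from many posterior samples, rather than a single point-estimate model, is what lets it hedge against model error without an explicit pessimism penalty.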
📝 Abstract
Popular offline reinforcement learning (RL) methods rely on conservatism, either by penalizing out-of-dataset actions or by restricting planning horizons. In this work, we question the universality of this principle and instead revisit a complementary one: a Bayesian perspective. Rather than enforcing conservatism, the Bayesian approach tackles epistemic uncertainty in offline data by modeling a posterior distribution over plausible world models and training a history-dependent agent to maximize expected rewards, enabling test-time generalization. We first illustrate, in a bandit setting, that Bayesianism excels on low-quality datasets where conservatism fails. We then scale the principle to realistic tasks, identifying key design choices, such as layer normalization in the world model and adaptive long-horizon planning, that mitigate compounding error and value overestimation. These yield our practical algorithm, Neubay, grounded in the neutral Bayesian principle. On D4RL and NeoRL benchmarks, Neubay generally matches or surpasses leading conservative algorithms, achieving new state-of-the-art on 7 datasets. Notably, it succeeds with planning horizons of several hundred steps, challenging common belief. Finally, we characterize when Neubay is preferable to conservatism, laying the foundation for a new direction in offline and model-based RL.
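The bandit intuition above can be made concrete with a minimal sketch. In a two-armed offline bandit where the better arm is badly under-sampled (a "low-quality" dataset), a conservative lower-confidence-bound rule penalizes the scarce data and refuses the good arm, while a Bayesian agent that maximizes expected reward under its posterior still selects it. The counts, prior, and pessimism bonus here are all illustrative choices, not taken from the paper.

```python
import numpy as np

# Hypothetical offline data: arm 0 pulled 100 times (52 successes),
# arm 1 pulled only 3 times (2 successes) -- the better arm is under-sampled.
pulls = np.array([100, 3])
successes = np.array([52, 2])
means = successes / pulls

# Conservative rule: pessimistic lower confidence bound (illustrative bonus).
lcb = means - 1.0 / np.sqrt(pulls)
conservative_choice = int(np.argmax(lcb))     # shuns the scarce-data arm

# Bayesian rule: Beta(1,1) prior + Bernoulli likelihood -> Beta posterior;
# act to maximize posterior expected reward, with no pessimism term.
alpha = 1 + successes
beta = 1 + pulls - successes
posterior_mean = alpha / (alpha + beta)
bayesian_choice = int(np.argmax(posterior_mean))  # picks the promising arm

print(conservative_choice, bayesian_choice)   # -> 0 1
```

Here pessimism turns scarce evidence into a reason to avoid an arm, whereas the posterior simply reflects that evidence and lets expected value decide, which is the behavior the abstract attributes to the Bayesian principle on low-quality datasets.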