🤖 AI Summary
Offline reinforcement learning (RL) commonly relies on conservatism, e.g., action penalization or myopic planning, to mitigate out-of-distribution risk, yet this paradigm inherently restricts long-horizon planning and generalization. This work questions the universality of conservative design and introduces a Bayesian offline RL framework that dispenses with explicit conservatism. The method employs a Bayesian world model that captures posterior uncertainty over environment dynamics, a policy network conditioned on the history of past states, and an adaptive long-horizon planning mechanism; epistemic uncertainty is modeled explicitly during training. This formulation mitigates compounding error and value overestimation, enabling robust planning over hundreds of steps. On the D4RL and NeoRL benchmarks, the approach generally matches or surpasses leading conservative algorithms and sets a new state of the art on seven datasets. The results demonstrate the efficacy and robustness of the non-conservative Bayesian paradigm, particularly on low-quality datasets and complex tasks.
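As a rough illustration of the pipeline the summary describes, a posterior over world models is often approximated in practice with a bootstrapped ensemble, and planning samples one model per imagined rollout while the policy conditions on the state history. The sketch below is hypothetical: the 1-D linear dynamics, ensemble size, and toy policy are illustrative stand-ins, not the actual Neubay architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-D offline dataset of (s, a, s') transitions.
s = rng.normal(size=500)
a = rng.normal(size=500)
s_next = 0.9 * s + 0.5 * a + 0.05 * rng.normal(size=500)

# Crude stand-in for a Bayesian world model: a bootstrapped ensemble of
# linear models s' ~ w0*s + w1*a; member disagreement plays the role of
# epistemic (posterior) uncertainty over the dynamics.
ensemble = []
for _ in range(5):
    idx = rng.integers(0, 500, size=500)          # bootstrap resample
    X = np.stack([s[idx], a[idx]], axis=1)
    w, *_ = np.linalg.lstsq(X, s_next[idx], rcond=None)
    ensemble.append(w)

def imagined_rollout(s0, policy, horizon=10):
    """Roll out under one posterior sample with a history-conditioned policy."""
    w = ensemble[rng.integers(len(ensemble))]     # sample a world model
    state, history = s0, [s0]
    for _ in range(horizon):
        act = policy(history)                     # policy sees the full history
        state = w[0] * state + w[1] * act
        history.append(state)
    return state

# Toy history-dependent policy: act against the running mean of past states.
final = imagined_rollout(0.5, lambda h: -np.mean(h))
```

Training the agent on rollouts drawn from many posterior samples, rather than a single point-estimate model, is what lets it hedge against model error without an explicit pessimism penalty.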
📝 Abstract
Popular offline reinforcement learning (RL) methods rely on conservatism, either by penalizing out-of-dataset actions or by restricting planning horizons. In this work, we question the universality of this principle and instead revisit a complementary one: a Bayesian perspective. Rather than enforcing conservatism, the Bayesian approach tackles epistemic uncertainty in offline data by modeling a posterior distribution over plausible world models and training a history-dependent agent to maximize expected rewards, enabling test-time generalization. We first illustrate, in a bandit setting, that Bayesianism excels on low-quality datasets where conservatism fails. We then scale the principle to realistic tasks, identifying key design choices, such as layer normalization in the world model and adaptive long-horizon planning, that mitigate compounding error and value overestimation. These yield our practical algorithm, Neubay, grounded in the neutral Bayesian principle. On D4RL and NeoRL benchmarks, Neubay generally matches or surpasses leading conservative algorithms, achieving new state-of-the-art on 7 datasets. Notably, it succeeds with planning horizons of several hundred steps, challenging common belief. Finally, we characterize when Neubay is preferable to conservatism, laying the foundation for a new direction in offline and model-based RL.
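The bandit intuition above can be made concrete with a minimal sketch. In a two-armed offline bandit where the better arm is badly under-sampled (a "low-quality" dataset), a conservative lower-confidence-bound rule penalizes the scarce data and refuses the good arm, while a Bayesian agent that maximizes expected reward under its posterior still selects it. The counts, prior, and pessimism bonus here are all illustrative choices, not taken from the paper.

```python
import numpy as np

# Hypothetical offline data: arm 0 pulled 100 times (52 successes),
# arm 1 pulled only 3 times (2 successes) -- the better arm is under-sampled.
pulls = np.array([100, 3])
successes = np.array([52, 2])
means = successes / pulls

# Conservative rule: pessimistic lower confidence bound (illustrative bonus).
lcb = means - 1.0 / np.sqrt(pulls)
conservative_choice = int(np.argmax(lcb))     # shuns the scarce-data arm

# Bayesian rule: Beta(1,1) prior + Bernoulli likelihood -> Beta posterior;
# act to maximize posterior expected reward, with no pessimism term.
alpha = 1 + successes
beta = 1 + pulls - successes
posterior_mean = alpha / (alpha + beta)
bayesian_choice = int(np.argmax(posterior_mean))  # picks the promising arm

print(conservative_choice, bayesian_choice)   # -> 0 1
```

Here pessimism turns scarce evidence into a reason to avoid an arm, whereas the posterior simply reflects that evidence and lets expected value decide, which is the behavior the abstract attributes to the Bayesian principle on low-quality datasets.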