🤖 AI Summary
In offline reinforcement learning, value functions often violate Bellman consistency due to distributional shift and model misspecification. To address this, we propose Iterated Bellman Calibration (IBC), the first method to incorporate calibration principles into offline RL value estimation, without requiring Bellman completeness or realizability assumptions. IBC constructs a one-dimensional fitted value iteration scheme via histogram calibration, isotonic regression, and doubly robust pseudo-labeling, enabling post-hoc refinement of arbitrary value estimators. Theoretically, we derive finite-sample upper bounds linking calibration error to prediction error. Empirically, IBC significantly improves both robustness and accuracy across diverse offline policy evaluation and selection algorithms.
📝 Abstract
We introduce Iterated Bellman Calibration, a simple, model-agnostic, post-hoc procedure for calibrating off-policy value predictions in infinite-horizon Markov decision processes. Bellman calibration requires that states with similar predicted long-term returns exhibit one-step returns consistent with the Bellman equation under the target policy. We adapt classical histogram and isotonic calibration to the dynamic, counterfactual setting by repeatedly regressing fitted Bellman targets onto a model's predictions, using a doubly robust pseudo-outcome to handle off-policy data. This yields a one-dimensional fitted value iteration scheme that can be applied to any value estimator. Our analysis provides finite-sample guarantees for both calibration and prediction under weak assumptions and, critically, without requiring Bellman completeness or realizability.
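The core loop described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: it uses scikit-learn's `IsotonicRegression` as the one-dimensional calibrator, regresses fitted Bellman targets onto the base model's predictions at each iteration, and, for simplicity, omits the doubly robust off-policy correction (it assumes transitions are drawn approximately from the target policy). The function name and arguments are invented for the example.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def iterated_bellman_calibration(v, rewards, v_next, gamma=0.99, n_iters=50):
    """Post-hoc calibration of value predictions via one-dimensional
    fitted value iteration with isotonic regression.

    v       : base-model value predictions V(s_i) at sampled states
    rewards : observed one-step rewards r_i
    v_next  : base-model predictions V(s'_i) at next states
    Note: hedged sketch -- the doubly robust pseudo-outcome used by
    the paper for off-policy data is not included here.
    """
    v_next_cal = v_next.copy()
    for _ in range(n_iters):
        # Fitted Bellman targets built from the current calibrated values.
        targets = rewards + gamma * v_next_cal
        # Regress the targets onto the (fixed) raw predictions: the raw
        # prediction is the single feature, so this is 1-D regression.
        iso = IsotonicRegression(out_of_bounds="clip")
        iso.fit(v, targets)
        # Calibrated values at current and next states for the next sweep.
        v_cal = iso.predict(v)
        v_next_cal = iso.predict(v_next)
    return v_cal
```

Because the calibrator only relearns a monotone one-dimensional map from predicted return to Bellman-consistent return, it can refine any black-box value estimator without access to its features or parameters, which is what makes the procedure post-hoc and model-agnostic.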