🤖 AI Summary
In offline reinforcement learning, value functions often violate Bellman consistency due to distributional shift and model misspecification. To address this, we propose Iterated Bellman Calibration (IBC), the first method to incorporate calibration principles into offline RL value estimation, without requiring Bellman completeness or realizability assumptions. IBC constructs a one-dimensional fitted value iteration scheme via histogram calibration, isotonic regression, and doubly robust pseudo-labeling, enabling post-hoc refinement of arbitrary value estimators. Theoretically, we derive finite-sample upper bounds linking calibration error to prediction error. Empirically, IBC significantly improves both robustness and accuracy across diverse offline policy evaluation and selection algorithms.
📝 Abstract
We introduce Iterated Bellman Calibration, a simple, model-agnostic, post-hoc procedure for calibrating off-policy value predictions in infinite-horizon Markov decision processes. Bellman calibration requires that states with similar predicted long-term returns exhibit one-step returns consistent with the Bellman equation under the target policy. We adapt classical histogram and isotonic calibration to the dynamic, counterfactual setting by repeatedly regressing fitted Bellman targets onto a model's predictions, using a doubly robust pseudo-outcome to handle off-policy data. This yields a one-dimensional fitted value iteration scheme that can be applied to any value estimator. Our analysis provides finite-sample guarantees for both calibration and prediction under weak assumptions and, critically, without requiring Bellman completeness or realizability.
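The core loop described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: it uses scikit-learn's `IsotonicRegression` as the one-dimensional calibrator, regresses fitted Bellman targets onto the base model's predictions at each iteration, and, for simplicity, omits the doubly robust off-policy correction (it assumes transitions are drawn approximately from the target policy). The function name and arguments are invented for the example.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def iterated_bellman_calibration(v, rewards, v_next, gamma=0.99, n_iters=50):
    """Post-hoc calibration of value predictions via one-dimensional
    fitted value iteration with isotonic regression.

    v       : base-model value predictions V(s_i) at sampled states
    rewards : observed one-step rewards r_i
    v_next  : base-model predictions V(s'_i) at next states
    Note: hedged sketch -- the doubly robust pseudo-outcome used by
    the paper for off-policy data is not included here.
    """
    v_next_cal = v_next.copy()
    for _ in range(n_iters):
        # Fitted Bellman targets built from the current calibrated values.
        targets = rewards + gamma * v_next_cal
        # Regress the targets onto the (fixed) raw predictions: the raw
        # prediction is the single feature, so this is 1-D regression.
        iso = IsotonicRegression(out_of_bounds="clip")
        iso.fit(v, targets)
        # Calibrated values at current and next states for the next sweep.
        v_cal = iso.predict(v)
        v_next_cal = iso.predict(v_next)
    return v_cal
```

Because the calibrator only relearns a monotone one-dimensional map from predicted return to Bellman-consistent return, it can refine any black-box value estimator without access to its features or parameters, which is what makes the procedure post-hoc and model-agnostic.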