🤖 AI Summary
This paper addresses **return distribution estimation** in off-policy evaluation (OPE), i.e., accurately estimating the full return distribution of a target policy from offline data generated by a behavior policy. To this end, the paper proposes the **Energy Bellman Residual Minimizer (EBRM)**, which replaces the hard-to-estimate supremum-extended statistical distances with computationally tractable **expectation-extended statistical distances**, and provides a rigorous theoretical justification of their validity for learning the return distribution. Notably, the guarantees of EBRM do not require the completeness assumption. The paper further introduces a **multi-step extension**, which tightens the error bound in non-realizable settings. The theoretical analysis establishes finite-sample error bounds for EBRM in both realizable and non-realizable regimes, balancing statistical efficiency against modeling robustness. Collectively, this work delivers a more practical and robust theoretical and algorithmic framework for distributional OPE.
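For intuition, a supremum-extended distance measures the worst-case discrepancy between conditional return distributions over state–action pairs, whereas an expectation-extended distance averages that discrepancy under a reference distribution. The notation below is an illustrative sketch of this contrast, not the paper's exact definitions:

```latex
% Illustrative notation (not verbatim from the paper):
% supremum-extended vs. expectation-extended distance between two
% conditional return distributions \eta_1, \eta_2, built from a base
% distance d between distributions (e.g., Wasserstein or energy distance).
d_{\sup}(\eta_1, \eta_2) = \sup_{(s,a)} d\big(\eta_1(\cdot \mid s,a),\, \eta_2(\cdot \mid s,a)\big),
\qquad
d_{\mathbb{E}}(\eta_1, \eta_2) = \mathbb{E}_{(s,a)\sim\mu}\Big[ d\big(\eta_1(\cdot \mid s,a),\, \eta_2(\cdot \mid s,a)\big) \Big].
% A Bellman-residual objective in this style would minimize
% d_{\mathbb{E}}(\eta, \mathcal{T}^{\pi}\eta), where \mathcal{T}^{\pi} is the
% distributional Bellman operator of the target policy \pi and \mu is a
% reference state–action distribution (e.g., induced by the behavior policy).
```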
📝 Abstract
We study distributional off-policy evaluation (OPE), whose goal is to learn the distribution of the return of a target policy using offline data generated by a different policy. The theoretical foundation of many existing works relies on supremum-extended statistical distances, such as the supremum-Wasserstein distance, which are hard to estimate. In contrast, we study the more manageable expectation-extended statistical distances and provide a novel theoretical justification of their validity for learning the return distribution. Based on this attractive property, we propose a new method called the Energy Bellman Residual Minimizer (EBRM) for distributional OPE, and provide corresponding in-depth theoretical analyses. We establish a finite-sample error bound for the EBRM estimator under the realizability assumption. Furthermore, we introduce a variant of our method based on a multi-step extension, which improves the error bound for non-realizable settings. Notably, unlike prior distributional OPE methods, the theoretical guarantees of our method do not require the completeness assumption.
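The abstract does not spell out implementation details, but a minimal sketch of how an expectation-extended (energy-distance) Bellman residual could be minimized from offline transitions might look as follows. The generator-style parameterization, the detached bootstrap target, and all names (`ReturnSampler`, `energy_distance`, `ebrm_style_loss`, `target_policy`) are illustrative assumptions, not the paper's algorithm.

```python
# Hypothetical sketch of an energy-distance Bellman residual objective for
# distributional OPE, in the spirit of EBRM. Names and network shapes are
# assumptions made for illustration only.
import torch
import torch.nn as nn

class ReturnSampler(nn.Module):
    """Maps (state, action, noise) to a sample of the return distribution."""
    def __init__(self, state_dim, action_dim, noise_dim=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + noise_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )
        self.noise_dim = noise_dim

    def forward(self, s, a, n_samples):
        # Draw n_samples return samples per (s, a) pair via reparameterized noise.
        batch = s.shape[0]
        sa = torch.cat([s, a], dim=-1)                     # (batch, s_dim + a_dim)
        sa = sa.unsqueeze(1).expand(-1, n_samples, -1)     # (batch, n, s_dim + a_dim)
        z = torch.randn(batch, n_samples, self.noise_dim)  # (batch, n, noise_dim)
        return self.net(torch.cat([sa, z], dim=-1)).squeeze(-1)  # (batch, n)

def energy_distance(x, y):
    """Sample-based energy distance between two sets of 1-D return samples,
    each of shape (batch, n). Averaging over the batch makes this an
    *expectation-extended* discrepancy rather than a supremum."""
    d_xy = (x.unsqueeze(-1) - y.unsqueeze(-2)).abs().mean(dim=(-1, -2))
    d_xx = (x.unsqueeze(-1) - x.unsqueeze(-2)).abs().mean(dim=(-1, -2))
    d_yy = (y.unsqueeze(-1) - y.unsqueeze(-2)).abs().mean(dim=(-1, -2))
    return (2 * d_xy - d_xx - d_yy).mean()

def ebrm_style_loss(model, batch, target_policy, gamma=0.99, n_samples=32):
    """One-step distributional Bellman residual under the energy distance.
    `batch` holds offline transitions (s, a, r, s') from the behavior policy;
    `target_policy(s')` returns actions of the target policy (assumed)."""
    s, a, r, s_next = batch
    a_next = target_policy(s_next)
    pred = model(s, a, n_samples)                          # samples of Z(s, a)
    with torch.no_grad():                                  # detached bootstrap target
        target = r.unsqueeze(-1) + gamma * model(s_next, a_next, n_samples)
    return energy_distance(pred, target)
```

In practice one would iterate over minibatches of offline transitions and update `model` with a standard optimizer; detaching the bootstrap target is a design choice of this sketch, not a detail stated in the abstract.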