🤖 AI Summary
This work addresses off-policy evaluation in finite-horizon Markov decision processes under general function approximation. It proposes Q-MMR, a framework that learns one scalar weight per data point so that the reweighted rewards approximate the expected return under the target policy. The weights are learned inductively, in a top-down manner, by moment matching against a value-function discriminator class. The framework also connects to existing techniques such as importance sampling and linear fitted Q-evaluation (FQE). Assuming only that the target policy's Q-function is realizable in the chosen function class, the authors establish a data-dependent finite-sample guarantee with a dimension-free bound, meaning the error does not depend on the statistical complexity of the function class. Further analysis sheds new light on the nature of coverage, a central concept in offline reinforcement learning.
📝 Abstract
We present a novel theoretical framework, Q-MMR, for off-policy evaluation in finite-horizon MDPs. Q-MMR learns a set of scalar weights, one for each data point, such that the reweighted rewards approximate the expected return under the target policy. The weights are learned inductively in a top-down manner via a moment matching objective against a value-function discriminator class. Notably, and perhaps surprisingly, a data-dependent finite-sample guarantee for general function approximation can be established under only the realizability of $Q^\pi$, with a dimension-free bound -- that is, the error does not depend on the statistical complexity of the function class. We also establish connections to several existing methods, such as importance sampling and linear FQE. Further theoretical analyses shed new light on the nature of coverage, a concept of fundamental importance to offline RL.
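To make the idea of reweighting by moment matching concrete, the following is a minimal sketch for the simplest possible case: horizon 1 (a contextual bandit) with a linear discriminator class. It is not the paper's Q-MMR procedure, whose multi-step, top-down construction is not specified in the abstract. The feature map `features`, the uniform behavior policy, the deterministic target policy, and all helper names are assumptions made for illustration. In this special case, the minimum-norm weights that match the target policy's feature mean yield the same estimate as fitting a linear reward model and averaging its predictions under the target policy, which gives some intuition for the stated connection to linear FQE.

```python
import numpy as np


def features(s, a):
    """Hypothetical feature map: [1, s] placed in the block of the chosen action."""
    s = np.asarray(s, dtype=float)
    a = np.asarray(a)
    base = np.stack([np.ones_like(s), s], axis=1)  # shape (n, 2)
    phi = np.zeros((len(s), 4))
    phi[a == 0, :2] = base[a == 0]
    phi[a == 1, 2:] = base[a == 1]
    return phi


def moment_matching_weights(phi_data, phi_target_mean):
    """Minimum-norm w satisfying (1/n) * phi_data.T @ w == phi_target_mean."""
    n = phi_data.shape[0]
    gram_inv = np.linalg.pinv(phi_data.T @ phi_data)
    return n * phi_data @ gram_inv @ phi_target_mean


def reweighted_value(weights, rewards):
    """Value estimate as the average of reweighted rewards."""
    return float(np.mean(weights * rewards))


# Synthetic logged data (assumption): uniform behavior policy over two actions.
rng = np.random.default_rng(0)
n = 500
s = rng.uniform(-1.0, 1.0, size=n)
a = rng.integers(0, 2, size=n)
theta_true = np.array([0.5, 1.0, 2.0, -1.0])
phi_data = features(s, a)
rewards = phi_data @ theta_true + 0.1 * rng.normal(size=n)

# Target policy (assumption): always choose action 1.
phi_target_mean = features(s, np.ones(n, dtype=int)).mean(axis=0)

# Reweighting estimate from moment matching against linear discriminators.
weights = moment_matching_weights(phi_data, phi_target_mean)
estimate = reweighted_value(weights, rewards)

# Regression-based estimate: fit a linear reward model, average under the target.
theta_ols = np.linalg.lstsq(phi_data, rewards, rcond=None)[0]
direct = float(phi_target_mean @ theta_ols)

# Plain importance sampling for comparison: pi(a|s) / mu(a|s) = 1[a == 1] / 0.5.
is_weights = np.where(a == 1, 2.0, 0.0)
is_estimate = reweighted_value(is_weights, rewards)

truth = float(phi_target_mean @ theta_true)
```

Here `estimate` and `direct` agree up to floating-point error, because the weights are a linear function of the data features. The importance sampling weights depend only on the policy ratio and match the feature moments only in expectation, so `is_estimate` is generally a noisier estimate of `truth` on this data.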