AI Summary
This work addresses the convergence challenges that arise in offline policy learning when the policy class exceeds Donsker complexity. To overcome this limitation, the authors propose a debiased framework based on cross-fitting that decomposes regret into a policy estimation error component and an environmental perturbation component. By integrating semiparametric inference with functional-space analysis, the method achieves, for the first time, a $\sqrt N$ regret bound under non-Donsker policy classes. The theoretical analysis shows that when the product-form nuisance remainder is of order $O(N^{-1/2})$, the approach balances the complexity of the policy class against that of the environment dynamics, thereby substantially improving the performance of offline reinforcement learning in high-dimensional or otherwise complex policy settings.
Abstract
We propose a cross-fitted debiasing device for policy learning from offline data. A key consequence of the resulting learning principle is $\sqrt N$ regret even for policy classes with complexity greater than Donsker, provided a product-of-errors nuisance remainder is $O(N^{-1/2})$. The regret bound factors into a plug-in policy error factor governed by policy-class complexity and an environment nuisance factor governed by the complexity of the environment dynamics, making explicit how one may be traded against the other.
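The cross-fitting idea above can be made concrete in the simplest offline setting. Below is a minimal sketch of cross-fitted doubly robust (AIPW-style) policy-value estimation for a binary-action offline bandit: nuisances (an outcome model and a propensity model) are fit on $K-1$ folds and evaluated only on the held-out fold, so the value estimate never uses a nuisance fit on its own data. This is an illustration of the general debiasing device, not the paper's algorithm; all function names (`cross_fitted_policy_value`, `fit_outcome`, `fit_propensity`) and the synthetic data are assumptions of this sketch.

```python
import numpy as np


def cross_fitted_policy_value(X, A, Y, policy, fit_outcome, fit_propensity, K=5, seed=0):
    """Cross-fitted doubly robust (AIPW) estimate of the value of `policy`.

    Illustrative sketch: nuisances are fit on K-1 folds and evaluated on the
    held-out fold, so the policy-value estimate never evaluates a nuisance on
    the data it was trained on (the cross-fitting debiasing device).
    """
    N = len(Y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(N) % K  # balanced random fold assignment
    psi = np.empty(N)
    for k in range(K):
        train, test = folds != k, folds == k
        mu = fit_outcome(X[train], A[train], Y[train])   # mu(x, a) ~ E[Y | X=x, A=a]
        e = fit_propensity(X[train], A[train])           # e(x) ~ P(A=1 | X=x)
        pi = policy(X[test])                             # action the target policy takes
        prop = np.where(pi == 1, e(X[test]), 1.0 - e(X[test]))
        match = (A[test] == pi).astype(float)            # did logging match the policy?
        # AIPW score: outcome-model prediction plus an inverse-propensity
        # correction on the matched actions.
        psi[test] = mu(X[test], pi) + match / np.clip(prop, 1e-3, None) * (
            Y[test] - mu(X[test], A[test])
        )
    return psi.mean()


# Synthetic demo: uniformly logged actions, reward A * 1{x1 > 0} + noise.
rng = np.random.default_rng(1)
N = 2000
X = rng.normal(size=(N, 2))
A = rng.integers(0, 2, size=N)                      # true propensity is 0.5
Y = A * (X[:, 0] > 0) + rng.normal(scale=0.1, size=N)


def fit_outcome(Xt, At, Yt):
    # Deliberately crude outcome model: per-action sample mean. The AIPW
    # correction with a well-estimated propensity keeps the estimate unbiased.
    means = {a: Yt[At == a].mean() for a in (0, 1)}
    return lambda Xe, Ae: np.array([means[int(a)] for a in Ae])


def fit_propensity(Xt, At):
    p1 = At.mean()  # constant propensity estimate, correct here by design
    return lambda Xe: np.full(len(Xe), p1)


policy = lambda Xe: (Xe[:, 0] > 0).astype(int)      # treat when x1 > 0

value = cross_fitted_policy_value(X, A, Y, policy, fit_outcome, fit_propensity)
print(value)  # should be close to the true policy value P(x1 > 0) = 0.5
```

Even with an intentionally crude outcome model, the estimate stays close to the truth because the error enters only through the product of the two nuisance errors, mirroring the product-of-errors remainder condition in the abstract.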