Functional Natural Policy Gradients

📅 2026-03-30
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the convergence challenges in offline policy learning that arise when the policy class exceeds Donsker complexity. To overcome this limitation, the authors propose a debiased framework based on cross-fitting that decomposes the regret into a policy estimation error component and an environment nuisance component. By integrating semiparametric inference with functional-space analysis, the method achieves, for the first time, a √N regret bound for non-Donsker policy classes. The theoretical analysis shows that when the product-form remainder term is of order O(N⁻¹/²), the approach balances the complexity of the policy class against the complexity of the environment dynamics, substantially improving offline reinforcement learning in high-dimensional or otherwise complex policy settings.
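As a concrete illustration of the cross-fitting device the summary describes, below is a minimal sketch of cross-fitted, doubly robust off-policy value estimation in a contextual-bandit setting. Everything here (the function name, the gradient-boosting nuisance models, the binary-action setup) is an assumption chosen for illustration; it is not the paper's estimator.

```python
# Minimal sketch of cross-fitted, debiased (doubly robust) off-policy value
# estimation with binary logged actions. Illustrative only: the estimator,
# model choices, and all names here are assumptions, not the paper's method.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor


def cross_fitted_policy_value(X, A, Y, policy, n_folds=5, seed=0):
    """Estimate V(policy) = E[Y(policy(X))] with K-fold cross-fitting.

    The nuisances (outcome regression mu and propensity e) are fit on K-1
    folds and evaluated on the held-out fold, so the debiased score is
    never evaluated on data that trained its own nuisance estimates.
    """
    N = len(Y)
    folds = np.random.default_rng(seed).integers(0, n_folds, size=N)
    scores = np.empty(N)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        # Outcome regression mu_hat(x, a), trained on the other folds only.
        mu = GradientBoostingRegressor().fit(
            np.column_stack([X[train], A[train]]), Y[train])
        # Propensity model e_hat(a = 1 | x) for the logging policy.
        e = GradientBoostingClassifier().fit(X[train], A[train])

        pi_a = policy(X[test])                     # target policy's actions
        mu_pi = mu.predict(np.column_stack([X[test], pi_a]))
        mu_log = mu.predict(np.column_stack([X[test], A[test]]))
        p1 = e.predict_proba(X[test])[:, 1]
        prop = np.where(pi_a == 1, p1, 1.0 - p1)   # e_hat(pi(x) | x)
        match = (A[test] == pi_a).astype(float)
        # Doubly robust (AIPW) score: plug-in value plus importance-weighted
        # correction, with the propensity clipped away from zero.
        scores[test] = mu_pi + match * (Y[test] - mu_log) / np.clip(prop, 1e-3, None)
    return scores.mean()
```

With logged arrays `X, A, Y` and, say, `policy = lambda X: (X[:, 0] > 0).astype(int)`, the returned mean is the cross-fitted debiased value estimate; a learning step in the spirit of the paper would then maximize such an objective over the policy class.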
📝 Abstract
We propose a cross-fitted debiasing device for policy learning from offline data. A key consequence of the resulting learning principle is $\sqrt N$ regret even for policy classes with complexity greater than Donsker, provided a product-of-errors nuisance remainder is $O(N^{-1/2})$. The regret bound factors into a plug-in policy error factor governed by policy-class complexity and an environment nuisance factor governed by the complexity of the environment dynamics, making explicit how one may be traded against the other.
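Read literally, the bound in the abstract has a two-factor structure that can be sketched as follows. The display below is a schematic reconstruction, with symbols ($\Pi$, $\psi$, $\mu_0$, $e_0$) chosen for illustration rather than taken from the paper.

```latex
% Schematic regret decomposition, reconstructed from the abstract's
% description; the notation is illustrative shorthand, not the paper's.
% \hat\pi: learned policy; \pi^*: best policy in the class \Pi;
% \eta_0 = (\mu_0, e_0): true outcome/propensity nuisances, with
% estimates (\hat\mu, \hat e); \psi: the debiased (orthogonal) score.
\[
  V(\pi^*) - V(\hat\pi)
  \;\lesssim\;
  \underbrace{\sup_{\pi \in \Pi}\bigl|(\mathbb{P}_N - \mathbb{P})\,\psi(\pi;\eta_0)\bigr|}_{\text{plug-in policy error (policy-class complexity)}}
  +
  \underbrace{\lVert \hat\mu - \mu_0 \rVert_2 \, \lVert \hat e - e_0 \rVert_2}_{\text{product-of-errors nuisance remainder}}
\]
% The $\sqrt N$ regret claim corresponds to the second term being
% $O(N^{-1/2})$, e.g. each nuisance converging at rate $o(N^{-1/4})$.
```

On this reading, cross-fitting is what lets the first factor stay $\sqrt N$-small even for non-Donsker $\Pi$, while the $O(N^{-1/2})$ remainder condition controls the second.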
Problem

Research questions and friction points this paper is trying to address.

offline policy learning
regret bound
policy class complexity
nuisance estimation
debiasing
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-fitted debiasing
offline policy learning
sqrt-N regret
nuisance remainder
policy class complexity