🤖 AI Summary
Nonparametric reinforcement learning (RL) policy optimization in reproducing kernel Hilbert spaces (RKHS) has been limited to first-order methods, because the infinite-dimensional nature of the Hessian operator makes direct application of second-order optimization intractable.
Method: This paper introduces the first second-order optimization framework tailored to RKHS policy representations. The proposed algorithm sidesteps explicit computation and inversion of the infinite-dimensional Hessian by adding cubic regularization to an auxiliary objective and invoking the representer theorem to reduce the optimization to a tractable finite-dimensional problem.
Contribution/Results: The authors establish theoretical guarantees of local quadratic convergence. Empirical evaluation on a financial portfolio allocation task confirms the predicted convergence behavior, while standard RL benchmarks show substantial improvements over existing RKHS-based first-order methods and parametric second-order approaches, with faster convergence and higher cumulative rewards.
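Schematically, the cubic-regularized auxiliary step described above can be written as follows. The notation here is ours (σ denotes the cubic regularization weight, J the expected return, k the kernel); the paper's exact formulation may differ:

```latex
% Cubic-regularized Newton step over the policy's RKHS \mathcal{H}:
\Delta^{*} \;=\; \arg\min_{\Delta \in \mathcal{H}}
  \big\langle \nabla J(\pi), \Delta \big\rangle_{\mathcal{H}}
  \;+\; \tfrac{1}{2} \big\langle \Delta,\, \nabla^{2} J(\pi)\,\Delta \big\rangle_{\mathcal{H}}
  \;+\; \tfrac{\sigma}{3}\, \lVert \Delta \rVert_{\mathcal{H}}^{3}

% By the representer theorem, a minimizer lies in the span of kernel
% sections at the n sampled trajectory states x_1, ..., x_n, so the
% search reduces to coefficients \alpha \in \mathbb{R}^{n}:
\Delta^{*} \;=\; \sum_{i=1}^{n} \alpha_i\, k(x_i, \cdot)
```

The cubic term is what lets the update avoid inverting the Hessian operator: the regularized model is minimized directly, and its minimizer is well-defined even where the Hessian is indefinite.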
📝 Abstract
Reinforcement learning (RL) policies represented in Reproducing Kernel Hilbert Spaces (RKHS) offer powerful representational capabilities. While second-order optimization methods like Newton's method demonstrate faster convergence than first-order approaches, current RKHS-based policy optimization remains constrained to first-order techniques. This limitation stems primarily from the intractability of explicitly computing and inverting the infinite-dimensional Hessian operator in RKHS. We introduce Policy Newton in RKHS, the first second-order optimization framework specifically designed for RL policies represented in RKHS. Our approach circumvents direct computation of the inverse Hessian operator by optimizing a cubic regularized auxiliary objective function. Crucially, we leverage the Representer Theorem to transform this infinite-dimensional optimization into an equivalent, computationally tractable finite-dimensional problem whose dimensionality scales with the trajectory data volume. We establish theoretical guarantees proving convergence to a local optimum with a local quadratic convergence rate. Empirical evaluations on a toy financial asset allocation problem validate these theoretical properties, while experiments on standard RL benchmarks demonstrate that Policy Newton in RKHS achieves superior convergence speed and higher episodic rewards compared to established first-order RKHS approaches and parametric second-order methods. Our work bridges a critical gap between non-parametric policy representations and second-order optimization methods in reinforcement learning.
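To make the finite-dimensional reduction concrete, here is a minimal sketch of minimizing a cubic-regularized local model over representer coefficients. The function name, the toy values, and the use of plain gradient descent are illustrative assumptions, not the paper's actual solver:

```python
import numpy as np

def solve_cubic_subproblem(g, H, sigma, lr=0.01, steps=2000):
    """Approximately minimize the finite-dimensional model
        m(a) = g @ a + 0.5 * a @ H @ a + (sigma / 3) * ||a||^3
    over representer coefficients a, via gradient descent
    (an illustrative stand-in for a dedicated subproblem solver)."""
    a = np.zeros_like(g)
    for _ in range(steps):
        # Gradient of m: g + H a + sigma * ||a|| * a
        grad = g + H @ a + sigma * np.linalg.norm(a) * a
        a = a - lr * grad
    return a

# Toy instance: gradient and Hessian expressed in the finite-dimensional
# span given by the representer theorem (values made up for illustration).
g = np.array([1.0, -2.0])
H = np.array([[2.0, 0.0], [0.0, 1.0]])
a = solve_cubic_subproblem(g, H, sigma=1.0)
model_value = g @ a + 0.5 * a @ H @ a + np.linalg.norm(a) ** 3 / 3
# The computed step should improve on a = 0, where the model value is 0.
assert model_value < 0
```

In the actual algorithm the coefficient dimension n grows with the volume of sampled trajectory data, which is what makes the otherwise infinite-dimensional Newton step computationally tractable.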