Sampling Complexity of TD and PPO in RKHS

📅 2025-09-29

📈 Citations: 0

✨ Influential: 0

career value

197K/year

🤖 AI Summary

This paper addresses the sampling efficiency and theoretical guarantees of policy optimization in continuous state-action spaces. Methodologically, it decouples policy evaluation and improvement: a kernelized temporal-difference (TD) critic is designed for high-accuracy value function estimation, while policy updates integrate RKHS natural gradients with KL regularization, yielding an explicit sampling rule that achieves optimal convergence rates. The key contribution is the first unified RKHS-based formulation of both proximal policy optimization (PPO) and TD learning, enabling globally convergent policy improvement with non-asymptotic, instance-adaptive theoretical guarantees. Experiments on standard continuous-control benchmarks demonstrate significantly improved training stability and sample efficiency over baselines; notably, the kernelized critic achieves higher throughput than the generalized advantage estimation (GAE) baseline.

Technology Category

Application Category

📝 Abstract

We revisit Proximal Policy Optimization (PPO) from a function-space perspective. Our analysis decouples policy evaluation and improvement in a reproducing kernel Hilbert space (RKHS): (i) A kernelized temporal-difference (TD) critic performs efficient RKHS-gradient updates using only one-step state-action transition samples; (ii) a KL-regularized, natural-gradient policy step exponentiates the evaluated action-value, recovering a PPO/TRPO-style proximal update in continuous state-action spaces. We provide non-asymptotic, instance-adaptive guarantees whose rates depend on RKHS entropy, unifying tabular, linear, Sobolev, Gaussian, and Neural Tangent Kernel (NTK) regimes, and we derive a sampling rule for the proximal update that ensures the optimal $k^{-1/2}$ convergence rate for stochastic optimization. Empirically, the theory-aligned schedule improves stability and sample efficiency on common control tasks (e.g., CartPole, Acrobot), while our TD-based critic attains favorable throughput versus a GAE baseline. Altogether, our results place PPO on a firmer theoretical footing beyond finite-dimensional assumptions and clarify when RKHS-proximal updates with kernel-TD critics yield global policy improvement with practical efficiency.

Problem

Research questions and friction points this paper is trying to address.

Analyzing PPO's sampling complexity in reproducing kernel Hilbert spaces

Providing non-asymptotic convergence guarantees for kernel TD critics

Establishing theoretical foundations for RKHS-based policy optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Kernelized TD critic for efficient RKHS-gradient updates

KL-regularized natural-gradient policy step for proximal updates

RKHS entropy-dependent sampling rule ensuring optimal convergence

🔎 Similar Papers

Reusing Historical Trajectories in Natural Policy Gradient via Importance Sampling: Convergence and Convergence Rate