🤖 AI Summary
This work proposes the Sobolev-prox fitted $q$-learning algorithm for model-free reinforcement learning in continuous-time Markov diffusion processes that are observed and controlled only at discrete times. Leveraging the ellipticity of the underlying diffusion, the analysis establishes positive definiteness and boundedness of the Bellman operators in a suitable Hilbert space, which enables direct estimation of value and advantage functions from data via iterative least-squares regression. The study identifies ellipticity as a key structural property that makes continuous-time reinforcement learning with function approximation no harder than supervised learning. A rigorous theoretical framework built on Sobolev-space regularization, localized complexity control, and numerical discretization yields an oracle inequality that decomposes the estimation error into approximation, localized-complexity, optimization, and discretization components, thereby providing convergence and generalization guarantees for the algorithm.
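The algorithm itself is not spelled out in the summary; as a rough illustration of the fitted least-squares template it builds on, here is a minimal NumPy sketch of fitted $q$-iteration, with a ridge penalty standing in for the paper's Sobolev-norm regularization. The feature map, synthetic Euler-type dynamics, discount factor, and all parameters below are assumptions for illustration, not the paper's construction.

```python
import numpy as np

# Hypothetical sketch: fitted q-iteration via ridge-regularized least
# squares on a linear feature class. The ridge penalty is a crude
# stand-in for Sobolev-norm regularization; everything here is a toy
# assumption, not the paper's algorithm.

rng = np.random.default_rng(0)

def features(s, a):
    """Toy polynomial feature map phi(s, a) for 1-d state and action."""
    return np.stack([np.ones_like(s), s, a, s * a, s**2, a**2], axis=-1)

# Synthetic transitions (s, a, r, s') from a discretely observed process.
n = 2000
s = rng.normal(size=n)
a = rng.uniform(-1.0, 1.0, size=n)
r = -(s**2) - 0.1 * a**2 + 0.1 * rng.normal(size=n)
s_next = s + 0.1 * a + 0.1 * rng.normal(size=n)   # Euler-type step

actions_grid = np.linspace(-1.0, 1.0, 21)          # for the max over actions
gamma, ridge = 0.95, 1e-2
theta = np.zeros(features(s[:1], a[:1]).shape[-1])

for _ in range(50):                                # fitted iterations
    # Bellman targets: r + gamma * max_a' Q_theta(s', a')
    q_next = np.stack(
        [features(s_next, np.full(n, ag)) @ theta for ag in actions_grid]
    )
    y = r + gamma * q_next.max(axis=0)
    # Regularized least-squares regression of the targets onto phi(s, a)
    X = features(s, a)
    theta = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ y)

print("fitted Q(0, a) over the action grid:",
      np.round(features(np.zeros(21), actions_grid) @ theta, 3))
```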
📝 Abstract
We study off-policy reinforcement learning for controlling continuous-time Markov diffusion processes with discrete-time observations and actions. We consider model-free algorithms with function approximation that learn value and advantage functions directly from data, without unrealistic structural assumptions on the dynamics. Leveraging the ellipticity of the diffusions, we establish a new class of Hilbert-space positive definiteness and boundedness properties for the Bellman operators. Based on these properties, we propose the Sobolev-prox fitted $q$-learning algorithm, which learns value and advantage functions by iteratively solving least-squares regression problems. We derive oracle inequalities for the estimation error, governed by (i) the best approximation error of the function classes, (ii) their localized complexity, (iii) exponentially decaying optimization error, and (iv) numerical discretization error. These results identify ellipticity as a key structural property that renders reinforcement learning with function approximation for Markov diffusions no harder than supervised learning.
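To make the four-part decomposition in (i)-(iv) concrete, a schematic form of such an oracle inequality is sketched below; the notation ($\widehat{q}_K$ for the estimate after $K$ iterations, $\mathcal{F}$ for the function class, $\rho$ for the contraction factor) and the exact norms, constants, and rates are placeholders assumed for illustration, not the paper's statement.

```latex
% Schematic oracle inequality (assumes amsmath/amssymb); placeholder notation.
\[
\bigl\| \widehat{q}_K - q^{\star} \bigr\|^2
\;\lesssim\;
\underbrace{\inf_{f \in \mathcal{F}} \| f - q^{\star} \|^2}_{\text{(i) approximation}}
\;+\;
\underbrace{\frac{\mathrm{comp}(\mathcal{F})}{n}}_{\text{(ii) localized complexity}}
\;+\;
\underbrace{\rho^{K}}_{\text{(iii) optimization},\ \rho<1}
\;+\;
\underbrace{\varepsilon_{\mathrm{disc}}}_{\text{(iv) discretization}}
\]
```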