🤖 AI Summary
Discrete-time reinforcement learning (RL) algorithms are sensitive to time discretization when applied to continuous-time environments, which often leads to poor stability and slow convergence. To address these limitations, this paper proposes CT-DDPG, a deterministic policy gradient algorithm for continuous-time RL. Methodologically, the authors derive a continuous-time policy gradient formula based on an analogue of the advantage function and characterize that function via martingale theory under a continuous-time Markov process model, which provides the theoretical foundation for the algorithm. Numerical experiments across a range of control tasks indicate that CT-DDPG is more stable and converges faster than existing discrete-time and continuous-time methods under varying time discretizations and noise levels.
📝 Abstract
The theory of discrete-time reinforcement learning (RL) has advanced rapidly over the past decades. Although these methods were designed primarily for discrete-time environments, many real-world RL applications are inherently continuous and complex. A major challenge in extending discrete-time algorithms to continuous-time settings is their sensitivity to time discretization, which often leads to poor stability and slow convergence. In this paper, we investigate deterministic policy gradient methods for continuous-time RL. We derive a continuous-time policy gradient formula based on an analogue of the advantage function and establish its martingale characterization. This theoretical foundation leads to our proposed algorithm, CT-DDPG, which enables stable learning with deterministic policies in continuous-time environments. Numerical experiments show that CT-DDPG offers improved stability and faster convergence than existing discrete-time and continuous-time methods across a wide range of control tasks with varying time discretizations and noise levels.
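
The sensitivity to time discretization mentioned above can be illustrated with a small, self-contained sketch. It is not the paper's algorithm or experimental setup. It assumes a hypothetical scalar control problem (dynamics dx/dt = a, reward rate -(x^2 + a^2), discount rate `RHO`), for which the continuous-time value function is V(x) = -p x^2 in closed form. The one-step quantity `one_step_q` bootstraps from that continuous-time value as a stand-in for a discrete-time Q-function, and `advantage_rate` is one common continuous-time analogue of the advantage, r + V'(x) a - rho V(x); whether it matches the paper's exact definition is an assumption. The sketch shows that the gap Q - V shrinks in proportion to the step size, while the gap divided by the step size approaches a well-defined rate.

```python
import math

RHO = 0.5  # discount rate (illustrative value, not from the paper)

# For V(x) = -P * x^2, the HJB equation
#   RHO * V(x) = max_a [ -(x^2 + a^2) + V'(x) * a ]
# reduces to P^2 + RHO * P - 1 = 0; take the positive root.
P = (-RHO + math.sqrt(RHO**2 + 4.0)) / 2.0


def value(x):
    """Closed-form continuous-time value function of the toy problem."""
    return -P * x * x


def reward_rate(x, a):
    """Instantaneous reward per unit time."""
    return -(x * x + a * a)


def advantage_rate(x, a):
    """r(x, a) + V'(x) * a - RHO * V(x); equals -(a + P * x)^2 here."""
    return reward_rate(x, a) + (-2.0 * P * x) * a - RHO * value(x)


def one_step_q(x, a, dt):
    """Hold action a for one step of length dt, then bootstrap from V."""
    return reward_rate(x, a) * dt + math.exp(-RHO * dt) * value(x + a * dt)


def action_gap(x, a, dt):
    """Gap Q_dt(x, a) - V(x); it is of order dt, so it vanishes as dt -> 0."""
    return one_step_q(x, a, dt) - value(x)


if __name__ == "__main__":
    x, a = 1.0, 0.3
    print(f"advantage rate: {advantage_rate(x, a):.6f}")
    for dt in (1e-1, 1e-2, 1e-3):
        gap = action_gap(x, a, dt)
        print(f"dt={dt:g}  gap={gap:.6e}  gap/dt={gap / dt:.6f}")
```

In this toy setting, a critic that learns Q directly has to resolve differences between actions that scale with the step size, which is one intuition for why fine discretizations can destabilize discrete-time methods. A rate-like quantity such as `advantage_rate` stays on the same scale as the step size shrinks.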