Bridging Discrete and Continuous RL: Stable Deterministic Policy Gradient with Martingale Characterization

📅 2025-09-28
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Discrete-time reinforcement learning (RL) algorithms suffer from poor stability, slow convergence, and high sensitivity to the time-discretization step when applied to continuous-time environments. To address these limitations, this paper proposes the Continuous-Time Deep Deterministic Policy Gradient (CT-DDPG) algorithm. Methodologically, CT-DDPG incorporates an analogue of the advantage function into a continuous-time deterministic policy gradient framework and rigorously characterizes that advantage function via martingale theory under a continuous-time Markov process model, giving deterministic policy gradients in continuous time a sound theoretical foundation. This formulation bridges the stability gap between discrete- and continuous-time RL. Empirical evaluations across diverse control tasks show that CT-DDPG converges faster and more robustly than existing methods while remaining largely insensitive to the choice of time step and to environmental noise.
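
The martingale characterization itself is not reproduced on this page. As a rough illustration of the kind of condition involved, here is a sketch in the style of martingale characterizations used in continuous-time RL; the notation, infinite-horizon discounting, and exact statement below are assumptions, not the paper's own formulation:

```latex
% Sketch of a martingale characterization for an advantage-rate
% function q, in the style of continuous-time RL; notation and
% discounting are assumed, not taken from the paper.
% Setup: state process (X_u) under actions (a_u), reward rate r,
% discount rate \beta > 0, value function V of a deterministic
% policy \pi.
\[
  M_s \;=\; e^{-\beta s}\, V(X_s)
      \;+\; \int_t^s e^{-\beta u}
            \bigl[\, r(X_u, a_u) - q(X_u, a_u) \,\bigr]\, \mathrm{d}u .
\]
% Characterization: (V, q) is the value / advantage-rate pair of
% \pi iff M is a martingale for every admissible action process
% and q(X_u, \pi(X_u)) = 0 along the policy's own actions.
```

Because such a condition is stated at the level of the continuous-time process rather than at a fixed step size, driving a discretized version of the residual to zero is, heuristically, what makes the resulting updates insensitive to the discretization.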

📝 Abstract
The theory of discrete-time reinforcement learning (RL) has advanced rapidly over the past decades. Although this theory is primarily designed for discrete environments, many real-world RL applications are inherently continuous and complex. A major challenge in extending discrete-time algorithms to continuous-time settings is their sensitivity to time discretization, often leading to poor stability and slow convergence. In this paper, we investigate deterministic policy gradient methods for continuous-time RL. We derive a continuous-time policy gradient formula based on an analogue of the advantage function and establish its martingale characterization. This theoretical foundation leads to our proposed algorithm, CT-DDPG, which enables stable learning with deterministic policies in continuous-time environments. Numerical experiments show that the proposed CT-DDPG algorithm offers improved stability and faster convergence compared to existing discrete-time and continuous-time methods, across a wide range of control tasks with varying time discretizations and noise levels.
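
The policy gradient formula the abstract refers to is not shown on this page. A plausible shape for it, by analogy with the discrete-time deterministic policy gradient theorem and using the advantage-rate function q from the sketch above, would be the following; the notation and discounting are assumptions, not the paper's statement:

```latex
% Assumed continuous-time analogue of the deterministic policy
% gradient theorem, written with the advantage-rate function q;
% not the paper's exact formula.
\[
  \nabla_\theta J(\theta)
  \;=\; \mathbb{E}\!\left[ \int_0^\infty e^{-\beta u}\,
        \nabla_\theta \pi_\theta(X_u)\,
        \nabla_a q(X_u, a)\big|_{a = \pi_\theta(X_u)}\,
        \mathrm{d}u \right].
\]
% That is, the actor ascends the advantage rate at its own action,
% which is what a sampled DDPG-style update would implement.
```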
Problem

Research questions and friction points this paper is trying to address.

Extending discrete RL algorithms to continuous-time settings
Addressing sensitivity to time discretization in RL
Enabling stable deterministic policy learning in continuous environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous-time policy gradient with advantage function analogue
Martingale characterization for stable deterministic learning
CT-DDPG algorithm enabling improved stability and faster convergence (see the sketch below)
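
To make the recipe concrete, here is a minimal, hypothetical PyTorch sketch of what a CT-DDPG-style update could look like. The network shapes, names, and the dt-scaled residual are assumptions illustrating the general idea (a critic fitting a value / advantage-rate pair by driving a discretized martingale residual to zero, and a deterministic actor ascending the advantage rate), not the authors' implementation:

```python
# Hypothetical CT-DDPG-style update; an illustrative sketch, not the
# authors' code. All dimensions and hyperparameters are assumed.
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2   # assumed environment dimensions
dt, beta = 0.01, 0.1           # time step and discount rate (assumed)

# V(x): value function of the current deterministic policy
value = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))
# q(x, a): advantage-rate function
adv = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
                    nn.Linear(64, 1))
# pi(x): deterministic policy
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                      nn.Linear(64, action_dim))

critic_opt = torch.optim.Adam(
    [*value.parameters(), *adv.parameters()], lr=1e-3)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)


def update(x, a, r, x_next):
    """One update from a batch of transitions (x, a, r, x_next) at step dt."""
    # Critic: drive the squared one-step martingale residual
    #   V(x') - V(x) + [r - beta * V(x) - q(x, a)] * dt   toward zero.
    residual = (value(x_next) - value(x)
                + (r - beta * value(x) - adv(torch.cat([x, a], -1))) * dt)
    critic_loss = residual.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: maximize the advantage rate at the policy's own action,
    # the sampled form of the deterministic policy gradient.
    actor_loss = -adv(torch.cat([x, actor(x)], -1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()


# Usage with a dummy batch of 32 transitions.
x = torch.randn(32, state_dim)
a = torch.randn(32, action_dim)
r = torch.randn(32, 1)
x_next = x + 0.01 * torch.randn_like(x)
print(update(x, a, r, x_next))
```

One design point worth noting in this sketch: the residual is scaled by dt rather than relying on a step-dependent discount factor inside a Bellman backup, so the loss has a well-defined limit as dt shrinks; heuristically, this is the kind of structure that would make learning insensitive to the time step.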
👥 Authors
Ziheng Cheng, UC Berkeley (Machine Learning · Optimization · Statistics)
Xin Guo, University of California, Berkeley
Yufei Zhang, Imperial College London