Bridging Discrete and Continuous RL: Stable Deterministic Policy Gradient with Martingale Characterization

📅 2025-09-28
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Discrete-time reinforcement learning (RL) algorithms suffer from poor stability, slow convergence, and high sensitivity to the time-discretization step when applied to continuous-time environments. To address these limitations, this paper proposes the Continuous-Time Deep Deterministic Policy Gradient (CT-DDPG) algorithm. Methodologically, CT-DDPG incorporates an analogue of the advantage function into a continuous-time deterministic policy gradient framework and rigorously characterizes that advantage function via martingale theory under a continuous-time Markov process model, giving deterministic policy gradients in continuous time a sound theoretical foundation. This formulation bridges the stability gap between discrete- and continuous-time RL. Empirical evaluations across diverse control tasks show that CT-DDPG converges faster and more robustly than existing methods while remaining largely insensitive to the choice of time step and to environmental noise.
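
The martingale characterization itself is not reproduced on this page. As a rough illustration of the kind of condition involved, here is a sketch in the style of martingale characterizations used in continuous-time RL; the notation, infinite-horizon discounting, and exact statement below are assumptions, not the paper's own formulation:

```latex
% Sketch of a martingale characterization for an advantage-rate
% function q, in the style of continuous-time RL; notation and
% discounting are assumed, not taken from the paper.
% Setup: state process (X_u) under actions (a_u), reward rate r,
% discount rate \beta > 0, value function V of a deterministic
% policy \pi.
\[
  M_s \;=\; e^{-\beta s}\, V(X_s)
      \;+\; \int_t^s e^{-\beta u}
            \bigl[\, r(X_u, a_u) - q(X_u, a_u) \,\bigr]\, \mathrm{d}u .
\]
% Characterization: (V, q) is the value / advantage-rate pair of
% \pi iff M is a martingale for every admissible action process
% and q(X_u, \pi(X_u)) = 0 along the policy's own actions.
```

Because such a condition is stated at the level of the continuous-time process rather than at a fixed step size, driving a discretized version of the residual to zero is, heuristically, what makes the resulting updates insensitive to the discretization.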

📝 Abstract
The theory of discrete-time reinforcement learning (RL) has advanced rapidly over the past decades. Although this theory is primarily designed for discrete environments, many real-world RL applications are inherently continuous and complex. A major challenge in extending discrete-time algorithms to continuous-time settings is their sensitivity to time discretization, often leading to poor stability and slow convergence. In this paper, we investigate deterministic policy gradient methods for continuous-time RL. We derive a continuous-time policy gradient formula based on an analogue of the advantage function and establish its martingale characterization. This theoretical foundation leads to our proposed algorithm, CT-DDPG, which enables stable learning with deterministic policies in continuous-time environments. Numerical experiments show that the proposed CT-DDPG algorithm offers improved stability and faster convergence compared to existing discrete-time and continuous-time methods, across a wide range of control tasks with varying time discretizations and noise levels.
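
The policy gradient formula the abstract refers to is not shown on this page. A plausible shape for it, by analogy with the discrete-time deterministic policy gradient theorem and using the advantage-rate function q from the sketch above, would be the following; the notation and discounting are assumptions, not the paper's statement:

```latex
% Assumed continuous-time analogue of the deterministic policy
% gradient theorem, written with the advantage-rate function q;
% not the paper's exact formula.
\[
  \nabla_\theta J(\theta)
  \;=\; \mathbb{E}\!\left[ \int_0^\infty e^{-\beta u}\,
        \nabla_\theta \pi_\theta(X_u)\,
        \nabla_a q(X_u, a)\big|_{a = \pi_\theta(X_u)}\,
        \mathrm{d}u \right].
\]
% That is, the actor ascends the advantage rate at its own action,
% which is what a sampled DDPG-style update would implement.
```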
Problem

Research questions and friction points this paper is trying to address.

Extending discrete RL algorithms to continuous-time settings
Addressing sensitivity to time discretization in RL
Enabling stable deterministic policy learning in continuous environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous-time policy gradient with advantage function analogue
Martingale characterization for stable deterministic learning
CT-DDPG algorithm enabling improved stability and faster convergence (see the sketch below)
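
To make the recipe concrete, here is a minimal, hypothetical PyTorch sketch of what a CT-DDPG-style update could look like. The network shapes, names, and the dt-scaled residual are assumptions illustrating the general idea (a critic fitting a value / advantage-rate pair by driving a discretized martingale residual to zero, and a deterministic actor ascending the advantage rate), not the authors' implementation:

```python
# Hypothetical CT-DDPG-style update; an illustrative sketch, not the
# authors' code. All dimensions and hyperparameters are assumed.
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2   # assumed environment dimensions
dt, beta = 0.01, 0.1           # time step and discount rate (assumed)

# V(x): value function of the current deterministic policy
value = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))
# q(x, a): advantage-rate function
adv = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
                    nn.Linear(64, 1))
# pi(x): deterministic policy
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                      nn.Linear(64, action_dim))

critic_opt = torch.optim.Adam(
    [*value.parameters(), *adv.parameters()], lr=1e-3)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)


def update(x, a, r, x_next):
    """One update from a batch of transitions (x, a, r, x_next) at step dt."""
    # Critic: drive the squared one-step martingale residual
    #   V(x') - V(x) + [r - beta * V(x) - q(x, a)] * dt   toward zero.
    residual = (value(x_next) - value(x)
                + (r - beta * value(x) - adv(torch.cat([x, a], -1))) * dt)
    critic_loss = residual.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: maximize the advantage rate at the policy's own action,
    # the sampled form of the deterministic policy gradient.
    actor_loss = -adv(torch.cat([x, actor(x)], -1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()


# Usage with a dummy batch of 32 transitions.
x = torch.randn(32, state_dim)
a = torch.randn(32, action_dim)
r = torch.randn(32, 1)
x_next = x + 0.01 * torch.randn_like(x)
print(update(x, a, r, x_next))
```

One design point worth noting in this sketch: the residual is scaled by dt rather than relying on a step-dependent discount factor inside a Bellman backup, so the loss has a well-defined limit as dt shrinks; heuristically, this is the kind of structure that would make learning insensitive to the time step.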
👥 Authors
Ziheng Cheng, UC Berkeley (Machine Learning · Optimization · Statistics)
Xin Guo, University of California, Berkeley
Yufei Zhang, Imperial College London