🤖 AI Summary
To address low sample efficiency and physically inconsistent trajectories in reinforcement learning (RL) over continuous state-action spaces in scientific computing, this paper introduces differential RL, a paradigm grounded in continuous-time control theory. It establishes a differential dual formulation and Hamiltonian-structured modeling that enable physics-constrained trajectory generation and optimization. The authors propose Differential Policy Optimization (DPO), a pointwise, stage-wise algorithm that carries a pointwise convergence guarantee and achieves a competitive $O(K^{5/6})$ regret bound. By adaptively updating policies via local trajectory motion operators, DPO significantly improves data efficiency and physical fidelity. Empirically, it outperforms standard RL baselines across diverse scientific tasks, including surface modeling, grid control, and molecular dynamics, under stringent physical constraints and limited training data.
📝 Abstract
Reinforcement learning (RL) in continuous state-action spaces remains challenging in scientific computing due to poor sample efficiency and lack of pathwise physical consistency. We introduce Differential Reinforcement Learning (Differential RL), a novel framework that reformulates RL from a continuous-time control perspective via a differential dual formulation. This induces a Hamiltonian structure that embeds physics priors and ensures consistent trajectories without requiring explicit constraints. To implement Differential RL, we develop Differential Policy Optimization (DPO), a pointwise, stage-wise algorithm that refines local movement operators along the trajectory for improved sample efficiency and dynamic alignment. We establish pointwise convergence guarantees, a property not available in standard RL, and derive a competitive theoretical regret bound of $O(K^{5/6})$. Empirically, DPO outperforms standard RL baselines on representative scientific computing tasks, including surface modeling, grid control, and molecular dynamics, under low-data and physics-constrained conditions.
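The abstract's claim that a Hamiltonian structure "ensures consistent trajectories without requiring explicit constraints" can be illustrated with a toy example outside the paper itself. The sketch below (an assumption for illustration, not the paper's DPO algorithm) rolls out a harmonic oscillator with a naive explicit Euler integrator versus a symplectic Euler integrator that respects the Hamiltonian structure: the naive rollout drifts in energy, while the structure-preserving one stays near the true energy level without any added constraint term.

```python
def hamiltonian(q, p):
    # Toy Hamiltonian: unit-mass harmonic oscillator, H = p^2/2 + q^2/2
    return 0.5 * (p * p + q * q)

def explicit_euler_step(q, p, dt):
    # Naive integrator: ignores the Hamiltonian structure, energy drifts
    return q + dt * p, p - dt * q

def symplectic_euler_step(q, p, dt):
    # Structure-preserving integrator: update momentum first, then position
    p_new = p - dt * q
    return q + dt * p_new, p_new

def rollout(step, q0, p0, dt, n_steps):
    q, p = q0, p0
    for _ in range(n_steps):
        q, p = step(q, p, dt)
    return q, p

q0, p0, dt, n_steps = 1.0, 0.0, 0.05, 2000
e0 = hamiltonian(q0, p0)
qe, pe = rollout(explicit_euler_step, q0, p0, dt, n_steps)
qs, ps = rollout(symplectic_euler_step, q0, p0, dt, n_steps)
print(abs(hamiltonian(qe, pe) - e0))  # large energy drift
print(abs(hamiltonian(qs, ps) - e0))  # stays close to zero
```

The same intuition motivates the paper's setup: baking the conservation structure into the trajectory model, rather than penalizing violations after the fact, keeps rollouts physically plausible even over long horizons.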