First-order Sobolev Reinforcement Learning

📅 2025-11-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Temporal-difference (TD) learning traditionally enforces value-function matching only at the Bellman target’s scalar values, neglecting its local geometric structure—particularly first-order derivatives with respect to state and action. Method: We propose the *first-order Bellman consistency constraint*, requiring the value function to match not only the Bellman target’s value but also its gradients in state and action. Leveraging differentiable environment dynamics, we analytically derive gradient targets and introduce a Sobolev-type loss that jointly optimizes both value and derivative terms. The method is seamlessly integrated into standard Actor-Critic frameworks (e.g., DDPG, SAC) without architectural modification. Contribution/Results: This is the first explicit incorporation of first-order Bellman consistency into TD learning. It significantly enhances the critic’s ability to capture local geometry of the target function, accelerates critic convergence, improves policy gradient stability, and maintains full compatibility with existing algorithms—demonstrating strong practical utility.

Technology Category

Application Category

📝 Abstract
We propose a refinement of temporal-difference learning that enforces first-order Bellman consistency: the learned value function is trained to match not only the Bellman targets in value but also their derivatives with respect to states and actions. By differentiating the Bellman backup through differentiable dynamics, we obtain analytically consistent gradient targets. Incorporating these into the critic objective using a Sobolev-type loss encourages the critic to align with both the value and local geometry of the target function. This first-order TD matching principle can be seamlessly integrated into existing algorithms, such as Q-learning or actor-critic methods (e.g., DDPG, SAC), potentially leading to faster critic convergence and more stable policy gradients without altering their overall structure.
Problem

Research questions and friction points this paper is trying to address.

Enforcing first-order Bellman consistency in value functions
Matching both value targets and their derivatives analytically
Improving critic convergence and policy gradient stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enforcing first-order Bellman consistency in value function
Using Sobolev-type loss to align value and derivatives
Seamlessly integrating gradient targets into existing algorithms
🔎 Similar Papers
No similar papers found.