First-order Sobolev Reinforcement Learning

📅 2025-11-24

📈 Citations: 0

✨ Influential: 0

career value

197K/year

🤖 AI Summary

Temporal-difference (TD) learning traditionally enforces value-function matching only at the Bellman target’s scalar values, neglecting its local geometric structure—particularly first-order derivatives with respect to state and action. Method: We propose the *first-order Bellman consistency constraint*, requiring the value function to match not only the Bellman target’s value but also its gradients in state and action. Leveraging differentiable environment dynamics, we analytically derive gradient targets and introduce a Sobolev-type loss that jointly optimizes both value and derivative terms. The method is seamlessly integrated into standard Actor-Critic frameworks (e.g., DDPG, SAC) without architectural modification. Contribution/Results: This is the first explicit incorporation of first-order Bellman consistency into TD learning. It significantly enhances the critic’s ability to capture local geometry of the target function, accelerates critic convergence, improves policy gradient stability, and maintains full compatibility with existing algorithms—demonstrating strong practical utility.

Technology Category

Application Category

📝 Abstract

We propose a refinement of temporal-difference learning that enforces first-order Bellman consistency: the learned value function is trained to match not only the Bellman targets in value but also their derivatives with respect to states and actions. By differentiating the Bellman backup through differentiable dynamics, we obtain analytically consistent gradient targets. Incorporating these into the critic objective using a Sobolev-type loss encourages the critic to align with both the value and local geometry of the target function. This first-order TD matching principle can be seamlessly integrated into existing algorithms, such as Q-learning or actor-critic methods (e.g., DDPG, SAC), potentially leading to faster critic convergence and more stable policy gradients without altering their overall structure.

Problem

Research questions and friction points this paper is trying to address.

Enforcing first-order Bellman consistency in value functions

Matching both value targets and their derivatives analytically

Improving critic convergence and policy gradient stability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Enforcing first-order Bellman consistency in value function

Using Sobolev-type loss to align value and derivatives

Seamlessly integrating gradient targets into existing algorithms

🔎 Similar Papers

No similar papers found.