🤖 AI Summary
To address the poor generalization and fragile local consistency of goal-conditioned reinforcement learning (GCRL) in complex dynamic environments, this paper proposes an Eikonal-equation-constrained quasimetric value-learning framework that operates in continuous time. It introduces the Eikonal partial differential equation (PDE) to RL for the first time, enabling trajectory-free continuous-time modeling. A hierarchical Eik-HiQRL architecture decouples long-horizon goal planning from low-level dynamical control. The method integrates Eikonal PDE-constrained optimization, a quasimetric neural network representation, and a hierarchical offline training paradigm. On offline goal-conditioned navigation tasks, the approach achieves state-of-the-art (SOTA) performance; on robotic manipulation tasks, it significantly outperforms QRL baselines while matching the stability and accuracy of temporal-difference methods.
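As a rough illustration of the trajectory-free idea (a sketch, not the paper's code): an Eikonal constraint asks that the goal-conditioned distance satisfy a local gradient-norm condition at *sampled* states, with no rollouts required. The unit-speed form `||∇_s d(s, g)|| = 1` used below is an assumption for illustration; the paper's constraint may include a state-dependent cost term. The exact Euclidean distance satisfies this PDE, which we verify by finite differences:

```python
import numpy as np

def euclidean_distance(s, g):
    # Ground-truth distance-to-goal; satisfies the unit-speed Eikonal PDE.
    return np.linalg.norm(s - g)

def eikonal_residual(d_fn, s, g, eps=1e-5):
    """| ||grad_s d(s, g)|| - 1 |, with the gradient taken by
    central finite differences in each state dimension."""
    grad = np.zeros_like(s)
    for i in range(s.size):
        e = np.zeros_like(s)
        e[i] = eps
        grad[i] = (d_fn(s + e, g) - d_fn(s - e, g)) / (2 * eps)
    return abs(np.linalg.norm(grad) - 1.0)

# Trajectory-free check: only independently sampled (state, goal) pairs.
rng = np.random.default_rng(0)
residuals = [eikonal_residual(euclidean_distance,
                              rng.normal(size=3), rng.normal(size=3))
             for _ in range(10)]
print(max(residuals))  # near zero: the exact distance satisfies the PDE
```

In Eik-QRL terms, such a residual would be penalized during training so the learned quasimetric behaves like a distance field everywhere, not just along observed trajectories.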
📝 Abstract
Goal-Conditioned Reinforcement Learning (GCRL) mitigates the difficulty of reward design by framing tasks as goal reaching rather than maximizing hand-crafted reward signals. In this setting, the optimal goal-conditioned value function naturally forms a quasimetric, motivating Quasimetric RL (QRL), which constrains value learning to quasimetric mappings and enforces local consistency through discrete, trajectory-based constraints. We propose Eikonal-Constrained Quasimetric RL (Eik-QRL), a continuous-time reformulation of QRL based on the Eikonal Partial Differential Equation (PDE). This PDE-based structure makes Eik-QRL trajectory-free, requiring only sampled states and goals, while improving out-of-distribution generalization. We provide theoretical guarantees for Eik-QRL and identify limitations that arise under complex dynamics. To address these challenges, we introduce Eik-Hierarchical QRL (Eik-HiQRL), which integrates Eik-QRL into a hierarchical decomposition. Empirically, Eik-HiQRL achieves state-of-the-art performance in offline goal-conditioned navigation and yields consistent gains over QRL in manipulation tasks, matching temporal-difference methods.
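To make the quasimetric structure concrete (a hypothetical example, not from the paper): a quasimetric satisfies `d(x, x) = 0` and the triangle inequality, but need not be symmetric, which is exactly why asymmetric reachability costs in RL fit this class. A one-way travel cost on a slope, where going uphill is more expensive than coming down, is a minimal instance:

```python
import itertools
import random

def slope_cost(x, y, uphill=3.0, downhill=1.0):
    """Asymmetric 1-D travel cost: moving up the slope costs more
    per unit distance than moving down."""
    return (y - x) * uphill if y > x else (x - y) * downhill

# Asymmetry: cost depends on direction of travel.
assert slope_cost(0.0, 2.0) != slope_cost(2.0, 0.0)

# Identity and triangle inequality hold, so this is a valid quasimetric.
random.seed(0)
pts = [random.uniform(-5.0, 5.0) for _ in range(20)]
for x, y, z in itertools.product(pts, repeat=3):
    assert slope_cost(x, x) == 0.0
    assert slope_cost(x, z) <= slope_cost(x, y) + slope_cost(y, z) + 1e-9
```

QRL-style methods constrain the learned value function to mappings with these properties by construction; Eik-QRL keeps that constraint while replacing discrete, trajectory-based consistency checks with the continuous-time Eikonal PDE.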