🤖 AI Summary
Offline reinforcement learning struggles to recover the directed temporal structure of a controlled Markov process from static datasets; prior representations often yield symmetric distances or violate the triangle inequality. This work proposes an operator-theoretic framework based on hitting times that learns a displacement geometry in a Hilbert space, where expected hitting times are realized as linear functionals of latent displacements. The representation is shown to exist under latent linear closure and to be uniquely identifiable up to a bounded linear isomorphism. The resulting algorithm, Isomorphic Embedding Learning (IEL), anchors a HILP-style consistency objective with explicit hitting-time regression to learn goal-agnostic foundation policies. It supports graph-based multi-stage planning for long-horizon navigation, improves on the state of the art on offline maze locomotion data, and comes with finite-sample guarantees.
📝 Abstract
We present a new operator-theoretic representation learning framework for offline reinforcement learning that recovers the directed temporal geometry of a controlled Markov process from hitting-time observations. Whereas prior methods often produce symmetric distances or distances that fail to satisfy the triangle inequality, our framework learns a Hilbert-space displacement geometry in which expected hitting times are realized as linear functionals of latent displacements. We prove that this representation exists under latent linear closure and is uniquely identifiable up to a bounded linear isomorphism. For finite-dimensional implementations, we show that the global hitting-time error is bounded by the one-step transition error amplified by the environment's transient spectral radius. Furthermore, we provide finite-sample guarantees that account for approximation error, statistical complexity, and trajectory-label mismatch. From this theory we derive Isomorphic Embedding Learning (IEL), a new goal-agnostic foundation policy learning algorithm that anchors a HILP-style consistency objective with explicit hitting-time regression, so that the learned geometry reflects actual decision-time progress. This asymmetric and compositional structure enables robust graph-based multi-stage planning for long-horizon navigation. Our experiments demonstrate that IEL improves on the state of the art in learning foundation policies from offline maze locomotion data. Our code can be found at https://github.com/MagnusBoock/IEL
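To make the notion of a directed hitting-time geometry concrete, the sketch below computes exact expected hitting times for a small Markov chain under a fixed policy. It is not the IEL algorithm and not taken from the linked repository: the 3-state transition matrix `P` and the helper names `expected_hitting_times` and `transient_spectral_radius` are hypothetical, chosen only for illustration. It uses the standard identity that the hitting times to a goal solve (I - Q) h = 1, where Q is the transition matrix restricted to non-goal states.

```python
import numpy as np


def expected_hitting_times(P: np.ndarray, goal: int) -> np.ndarray:
    """Expected number of steps to first reach `goal` from every state.

    Solves (I - Q) h = 1, where Q is P restricted to the non-goal states.
    The hitting time from the goal to itself is defined as 0.
    """
    n = P.shape[0]
    transient = [s for s in range(n) if s != goal]
    Q = P[np.ix_(transient, transient)]
    h = np.zeros(n)
    h[transient] = np.linalg.solve(np.eye(n - 1) - Q, np.ones(n - 1))
    return h


def transient_spectral_radius(P: np.ndarray, goal: int) -> float:
    """Largest absolute eigenvalue of P restricted to the non-goal states."""
    n = P.shape[0]
    transient = [s for s in range(n) if s != goal]
    Q = P[np.ix_(transient, transient)]
    return float(np.max(np.abs(np.linalg.eigvals(Q))))


# Hypothetical 3-state chain that drifts from state 0 toward state 2.
P = np.array([
    [0.2, 0.8, 0.0],
    [0.2, 0.0, 0.8],
    [0.0, 0.2, 0.8],
])

# H[s, g] is the expected hitting time from state s to goal g.
H = np.stack([expected_hitting_times(P, g) for g in range(3)], axis=1)
rho = transient_spectral_radius(P, goal=2)

print("hitting time 0 -> 2:", H[0, 2])
print("hitting time 2 -> 0:", H[2, 0])
print("transient spectral radius for goal 2:", rho)
```

In this toy chain, moving with the drift (0 to 2) takes about 2.8 steps in expectation, while moving against it (2 to 0) takes 30, so a symmetric distance cannot represent both directions. The restricted matrix Q also gives some intuition for the amplification factor mentioned in the abstract: as its spectral radius approaches 1, the inverse of I - Q grows and small one-step errors have a larger effect on hitting times. This is an informal reading, not the paper's bound.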