Transitive RL: Value Learning via Divide and Conquer

📅 2025-10-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
In offline goal-conditioned reinforcement learning (GCRL), efficiently learning shortest-path policies between arbitrary state pairs remains a core challenge. This paper proposes Transitive Reinforcement Learning (TRL), a divide-and-conquer value-learning algorithm grounded in the triangle inequality, applying divide-and-conquer principles directly to value-function updates. TRL recursively decomposes long-horizon trajectories of length $T$ into hierarchical subproblems, enabling value propagation in only $O(\log T)$ recursions and substantially mitigating temporal-difference (TD) bias accumulation. The method combines the structural guarantees of dynamic programming with shorter backup chains than TD learning and lower variance than Monte Carlo estimation. Evaluated on challenging, long-horizon offline GCRL benchmarks, including AntMaze, Adroit, and FrankaKitchen, TRL consistently outperforms prior state-of-the-art methods across all domains.

📝 Abstract
In this work, we present Transitive Reinforcement Learning (TRL), a new value learning algorithm based on a divide-and-conquer paradigm. TRL is designed for offline goal-conditioned reinforcement learning (GCRL) problems, where the aim is to find a policy that can reach any state from any other state in the smallest number of steps. TRL converts a triangle inequality structure present in GCRL into a practical divide-and-conquer value update rule. This has several advantages compared to alternative value learning paradigms. Compared to temporal difference (TD) methods, TRL suffers less from bias accumulation, as in principle it only requires $O(\log T)$ recursions (as opposed to $O(T)$ in TD learning) to handle a length-$T$ trajectory. Unlike Monte Carlo methods, TRL suffers less from high variance as it performs dynamic programming. Experimentally, we show that TRL achieves the best performance in highly challenging, long-horizon benchmark tasks compared to previous offline GCRL algorithms.
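The divide-and-conquer update described above can be sketched on a single trajectory. The following toy code is an illustrative assumption, not the paper's implementation: the function name `dc_update` and the dictionary-of-distances representation are invented here. It tightens a step-count estimate $d(i, j)$ by recursively splitting the span at its midpoint, using the triangle inequality $d(i, j) \le d(i, m) + d(m, j)$:

```python
# Toy sketch of divide-and-conquer value propagation on one length-T
# trajectory (NOT the paper's implementation). d[(i, j)] estimates the
# number of steps from state i to state j along the trajectory.

def dc_update(d, i, j):
    """Tighten d[(i, j)] by splitting [i, j] at its midpoint.

    Recursion depth is O(log(j - i)), versus the O(j - i) one-step
    backups TD learning needs to propagate value across the same span.
    """
    if j - i == 1:
        return d[(i, j)]  # adjacent states: cost is known directly
    m = (i + j) // 2
    left = dc_update(d, i, m)
    right = dc_update(d, m, j)
    # Triangle-inequality backup: route the estimate through midpoint m.
    d[(i, j)] = min(d.get((i, j), float("inf")), left + right)
    return d[(i, j)]

# Usage: a length-8 trajectory where each transition costs one step.
T = 8
d = {(i, i + 1): 1 for i in range(T)}  # only adjacent costs are known
print(dc_update(d, 0, T))  # -> 8, reached in log2(8) = 3 recursion levels
```

In a learned-value setting the dictionary lookups would be replaced by a goal-conditioned value network, but the halving structure, and hence the logarithmic backup depth, is the same.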
Problem

Research questions and friction points this paper is trying to address.

Developing an offline goal-conditioned reinforcement learning algorithm that reaches arbitrary goals in the fewest steps
Reducing bias accumulation in the value-learning process
Improving performance on long-horizon benchmark tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

A divide-and-conquer paradigm for value learning
Conversion of the triangle-inequality structure in GCRL into a practical value update rule
Dynamic programming that reduces both bias accumulation and variance
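The bias-accumulation point above can be made concrete with a small counting experiment. This is an illustrative sketch under assumed chain dynamics, not the paper's algorithm: on a chain of `T` states with unit step cost, one-step TD-style backups need O(T) sweeps to carry the goal's value back to the start, while synchronous transitive (triangle-inequality) backups double the reachable horizon each sweep and need only O(log T):

```python
# Illustrative sketch (not the paper's algorithm): count backup sweeps
# needed on a chain of T states with unit step cost and goal at state T.
INF = float("inf")

def td_sweeps(T):
    """One-step backups V[i] <- min(V[i], 1 + V[i+1]): O(T) sweeps."""
    V = [INF] * T + [0.0]  # the goal state T has value 0
    sweeps = 0
    while V[0] == INF:
        sweeps += 1
        for i in range(T):  # ascending i reads the pre-sweep V[i+1]
            V[i] = min(V[i], 1.0 + V[i + 1])
    return sweeps

def transitive_sweeps(T):
    """Triangle-inequality backups D[i][j] <- min_k D[i][k] + D[k][j]:
    the finite horizon doubles each sweep, so O(log T) sweeps suffice."""
    D = [[INF] * (T + 1) for _ in range(T + 1)]
    for i in range(T + 1):
        D[i][i] = 0.0
    for i in range(T):
        D[i][i + 1] = 1.0  # only adjacent costs are known initially
    sweeps = 0
    while D[0][T] == INF:
        sweeps += 1
        new = [row[:] for row in D]  # synchronous (Jacobi-style) update
        for i in range(T + 1):
            for j in range(i + 1, T + 1):
                best = min(D[i][k] + D[k][j] for k in range(i, j + 1))
                new[i][j] = min(new[i][j], best)
        D = new
    return sweeps

print(td_sweeps(64), transitive_sweeps(64))  # -> 64 6
```

The gap (64 sweeps vs. log2(64) = 6) is the mechanism behind the abstract's O(T) vs. O(log T) comparison: each transitive sweep composes two already-propagated halves rather than advancing one step.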