Option-aware Temporally Abstracted Value for Offline Goal-Conditioned Reinforcement Learning

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Offline goal-conditioned reinforcement learning (GCRL) faces two key bottlenecks in long-horizon tasks: (1) high-level policies struggle to generate semantically meaningful subgoals, and (2) long-term advantage signals suffer from sign ambiguity due to temporal credit assignment. This paper proposes Option-aware Temporally Abstracted value learning (OTA), a novel approach that integrates option-based temporal abstraction directly into TD updates, enabling both temporal compression of the value function's effective horizon and correction of advantage-signal polarity. Built upon the HIQL framework, the method unifies option modeling, temporally abstracted TD learning, and offline policy extraction. Evaluated on the OGBench benchmark, including maze navigation and vision-based robotic manipulation environments, the method significantly outperforms state-of-the-art baselines such as HIQL in high-level policy performance, demonstrating superior generalization, training stability, and subgoal generation for long-horizon tasks.

📝 Abstract
Offline goal-conditioned reinforcement learning (GCRL) offers a practical learning paradigm where goal-reaching policies are trained from abundant unlabeled (reward-free) datasets without additional environment interaction. However, offline GCRL still struggles with long-horizon tasks, even with recent advances that employ hierarchical policy structures, such as HIQL. By identifying the root cause of this challenge, we observe the following insights: First, performance bottlenecks mainly stem from the high-level policy's inability to generate appropriate subgoals. Second, when learning the high-level policy in the long-horizon regime, the sign of the advantage signal frequently becomes incorrect. Thus, we argue that improving the value function to produce a clear advantage signal for learning the high-level policy is essential. In this paper, we propose a simple yet effective solution: Option-aware Temporally Abstracted value learning, dubbed OTA, which incorporates temporal abstraction into the temporal-difference learning process. By modifying the value update to be option-aware, the proposed learning scheme contracts the effective horizon length, enabling better advantage estimates even in long-horizon regimes. We experimentally show that the high-level policy extracted using the OTA value function achieves strong performance on complex tasks from OGBench, a recently proposed offline GCRL benchmark, including maze navigation and visual robotic manipulation environments.
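The abstract's second observation, that the sign of the advantage signal frequently becomes incorrect in the long-horizon regime, can be illustrated with a small sketch. The value model below (per-step reward of -1 under discount gamma, giving V(s, g) ≈ -(1 - gamma^d)/(1 - gamma) for a goal d steps away) is an illustrative assumption, not the paper's exact formulation:

```python
# Illustrative sketch (assumed sparse goal-reaching setup, not the
# paper's exact value function): why the advantage sign is fragile
# far from the goal.

def value(d, gamma=0.99):
    """Discounted value of being d steps from the goal, reward -1/step."""
    return -(1.0 - gamma**d) / (1.0 - gamma)

def advantage(d_with_subgoal, d_without, gamma=0.99):
    """Advantage of a subgoal that shortens the remaining distance."""
    return value(d_with_subgoal, gamma) - value(d_without, gamma)

# Near the goal the gap is large, so its sign survives estimation noise:
short_gap = advantage(5, 10)     # clearly positive
# Far from the goal, gamma^d saturates and the gap nearly vanishes, so
# value-estimation noise of comparable size can flip the advantage sign:
long_gap = advantage(500, 505)   # barely above zero
```

This is the failure mode OTA targets: contracting the effective horizon keeps the value differences between candidate subgoals large enough for the advantage sign to stay reliable.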
Problem

Research questions and friction points this paper is trying to address.

Improving high-level policy subgoal generation in offline GCRL
Correcting advantage signal signs for long-horizon task learning
Enhancing value function clarity for better advantage estimates
Innovation

Methods, ideas, or system contributions that make the work stand out.

Option-aware value learning for better advantage estimates
Temporal abstraction in temporal-difference learning process
Modified value update for effective horizon contraction
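The core idea above, replacing the one-step TD backup with an option-level backup, can be sketched as follows. The option length k, discount, and reward convention are assumptions for illustration; the paper's actual option-aware update may differ in its details:

```python
# Hedged sketch of a temporally abstracted TD target (assumed k-step
# form; the paper's exact OTA update is not reproduced on this page).

def ota_td_target(values, rewards, t, k, gamma=0.99):
    """Option-level TD target for state s_t.

    Bootstraps from s_{t+k} instead of s_{t+1}, contracting the
    effective horizon by a factor of k, which is the intuition behind
    temporally abstracted value learning.
    """
    k = min(k, len(values) - 1 - t)  # clip at the trajectory end
    ret = sum(gamma**i * rewards[t + i] for i in range(k))
    return ret + gamma**k * values[t + k]

def one_step_td_target(values, rewards, t, gamma=0.99):
    """Standard one-step TD target, for comparison."""
    return rewards[t] + gamma * values[t + 1]
```

With sparse goal-reaching rewards, the option-level target propagates value from the goal back to distant states in far fewer updates than the one-step target, which is what yields clearer advantage estimates for the high-level policy.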
Hongjoon Ahn
Seoul National University
artificial intelligence
Heewoong Choi
Department of Electrical and Computer Engineering (ECE), Seoul National University
Jisu Han
PhD student at SNU
Embodied AI, Robotics
Taesup Moon
Department of ECE / IPAI / ASRI / INMC, Seoul National University