🤖 AI Summary
This work addresses the challenge of effectively decomposing complex decision-making in offline goal-conditioned reinforcement learning, particularly for long-horizon tasks, where existing hierarchical methods typically generate only a single subgoal and thus lack multi-step coordination. Inspired by chain-of-thought reasoning, the paper introduces autoregressive sequence modeling into hierarchical policy design for the first time, proposing a unified architecture that sequentially generates a series of latent subgoals followed by the final action. Built upon an MLP-Mixer backbone, the approach enables structured interaction and cross-token communication among states, goals, subgoals, and actions, effectively capturing long-range dependencies. The method achieves significant performance gains over strong offline reinforcement learning baselines across multiple navigation and manipulation benchmarks, with especially pronounced improvements in long-horizon scenarios.
📝 Abstract
Offline goal-conditioned reinforcement learning remains challenging for long-horizon tasks. While hierarchical approaches mitigate this issue by decomposing tasks, most existing methods rely on separate high- and low-level networks and generate only a single intermediate subgoal, making them inadequate for complex tasks that require coordinating multiple intermediate decisions. To address this limitation, we draw inspiration from the chain-of-thought paradigm and propose the Chain-of-Goals Hierarchical Policy (CoGHP), a novel framework that reformulates hierarchical decision-making as autoregressive sequence modeling within a unified architecture. Given a state and a final goal, CoGHP autoregressively generates a sequence of latent subgoals followed by the primitive action, where each latent subgoal acts as a reasoning step that conditions subsequent predictions. To implement this efficiently, we pioneer the use of an MLP-Mixer backbone, which supports cross-token communication and captures structural relationships among state, goal, latent subgoals, and action. Across challenging navigation and manipulation benchmarks, CoGHP consistently outperforms strong offline baselines, demonstrating improved performance on long-horizon tasks.
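To make the chain-of-goals idea concrete, the following is a minimal NumPy sketch of the generation loop the abstract describes: state and goal tokens are fed to a single MLP-Mixer block, latent subgoal slots are filled in autoregressively (each generated subgoal conditions the next prediction), and the final token is read out as the primitive action. All dimensions, the single-block depth, and the random weights are illustrative assumptions, not the paper's actual architecture or parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8          # token embedding dimension (assumed)
K = 3          # number of latent subgoals (assumed)
T = 2 + K + 1  # token sequence: state, goal, K subgoals, action slot

def mlp(x, W1, b1, W2, b2):
    # Two-layer MLP with ReLU, as used in both Mixer sub-layers.
    h = np.maximum(0.0, x @ W1 + b1)
    return h @ W2 + b2

# Randomly initialized Mixer-block weights (illustrative only).
# Token mixing acts across the token axis; channel mixing across channels.
Wt1, bt1 = 0.1 * rng.normal(size=(T, T)), np.zeros(T)
Wt2, bt2 = 0.1 * rng.normal(size=(T, T)), np.zeros(T)
Wc1, bc1 = 0.1 * rng.normal(size=(D, D)), np.zeros(D)
Wc2, bc2 = 0.1 * rng.normal(size=(D, D)), np.zeros(D)

def mixer_block(tokens):
    # tokens: (T, D). Residual token-mixing then residual channel-mixing,
    # which is what allows cross-token communication among state, goal,
    # subgoal, and action representations.
    tokens = tokens + mlp(tokens.T, Wt1, bt1, Wt2, bt2).T
    tokens = tokens + mlp(tokens, Wc1, bc1, Wc2, bc2)
    return tokens

def chain_of_goals(state, goal):
    # Zero-initialize subgoal/action slots, then fill them left to right:
    # each pass conditions on everything generated so far (autoregressive).
    tokens = np.zeros((T, D))
    tokens[0], tokens[1] = state, goal
    for k in range(K):
        tokens[2 + k] = mixer_block(tokens)[2 + k]  # k-th latent subgoal
    return mixer_block(tokens)[-1]                  # primitive action

action = chain_of_goals(rng.normal(size=D), rng.normal(size=D))
print(action.shape)  # (8,)
```

In a trained model the mixer weights would be learned from offline data and the loop depth K tuned per task; the sketch only shows how a unified network can emit a chain of latent subgoals before committing to an action.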