🤖 AI Summary
This work addresses the misalignment between high-level task planners (e.g., 2D point flows) and low-level robot action policies. We propose HinFlow ("Hindsight Flow-conditioned Online Imitation"), a framework that requires no high-quality robot demonstration data. Instead, it collects trajectories via online interaction, infers implicit goals through hindsight goal relabeling, and aligns video-pretrained planners with embodied policies via goal-conditioned behavioral cloning and cross-modal policy distillation. Its core innovation lies in directly modeling planner outputs (specifically, point flows) as differentiable, goal-conditioned signals optimized end-to-end within closed-loop interaction. Experiments demonstrate that HinFlow achieves over 2× performance gains relative to prior methods on both simulated and real-robot manipulation tasks. Moreover, it enables zero-shot policy transfer across diverse video-based planners, without requiring task-specific fine-tuning or expert demonstrations.
📝 Abstract
Recent advances in hierarchical robot systems leverage a high-level planner to propose task plans and a low-level policy to generate robot actions. This design allows training the planner on action-free or even non-robot data sources (e.g., videos), providing transferable high-level guidance. Nevertheless, grounding these high-level plans into executable actions remains challenging, especially given the limited availability of high-quality robot data. To this end, we propose to improve the low-level policy through online interaction. Specifically, our approach collects online rollouts, retrospectively annotates the corresponding high-level goals from achieved outcomes, and aggregates these hindsight-relabeled experiences to update a goal-conditioned imitation policy. Our method, Hindsight Flow-conditioned Online Imitation (HinFlow), instantiates this idea with 2D point flows as the high-level planner. Across diverse manipulation tasks in both simulation and the physical world, our method achieves more than $2\times$ performance improvement over the base policy, significantly outperforming existing methods. Moreover, our framework enables policy acquisition from planners trained on cross-embodiment video data, demonstrating its potential for scalable and transferable robot learning.
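The core loop described in the abstract (collect rollouts, relabel goals in hindsight, update a goal-conditioned imitation policy) can be sketched in miniature. The toy 1-D environment, the tabular policy, and all function names below are illustrative assumptions for exposition, not the paper's actual flow-based implementation:

```python
import random

# Toy sketch of hindsight-relabeled online imitation. In HinFlow the
# "goal" would be a 2D point flow from a video-pretrained planner; here
# we use the final achieved state of a 1-D chain as a stand-in goal.

def rollout(policy, start=0, horizon=5):
    """Collect one online trajectory as (state, action) pairs."""
    state, traj = start, []
    for _ in range(horizon):
        action = policy(state)
        traj.append((state, action))
        state += action
    return traj, state  # trajectory and final achieved outcome

def relabel(traj, achieved):
    """Hindsight relabeling: treat the achieved outcome as the goal."""
    return [(s, achieved, a) for s, a in traj]

def bc_policy(dataset):
    """Goal-conditioned behavioral cloning via a simple lookup table."""
    table = {(s, g): a for s, g, a in dataset}
    def policy(s, g):
        # Fall back to moving toward the goal when (s, g) is unseen.
        return table.get((s, g), 1 if g > s else -1)
    return policy

random.seed(0)
dataset = []
for _ in range(20):
    # Explore with a random base policy, then aggregate relabeled data.
    traj, achieved = rollout(lambda s: random.choice([-1, 1]))
    dataset += relabel(traj, achieved)
policy = bc_policy(dataset)
```

The key property illustrated is that every rollout becomes a valid supervised example for *some* goal, so no expert demonstrations are needed; the real method swaps the lookup table for a neural goal-conditioned policy and the scalar goal for planner-produced point flows.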