🤖 AI Summary
This work addresses the lack of explicit subgoal modeling in existing imitation learning approaches, which leaves robot decision-making opaque in long-horizon tasks. The authors propose a subgoal-aware diffusion policy that uses foundation models to automatically generate demonstration data annotated with subgoals. The policy conditions action generation on both task and subgoal descriptions, while a lightweight auxiliary head predicts subgoal completion status. Because the subgoal supervision derived from foundation models is built directly into policy training, the method is intrinsically interpretable rather than dependent on post-hoc explanations. Experiments in RLBench simulation and on a real UR5e robot show that the approach achieves higher task success rates than task-conditioned diffusion baselines while providing subgoal-level execution signals for progress monitoring and failure diagnosis.
📝 Abstract
Explainable robots require not only successful task execution but also the ability to expose their internal decision-making process in a user-friendly manner. However, most imitation learning methods are trained solely on task-level demonstrations, without explicitly modeling subgoal structure or execution progress. This limitation is further exacerbated by the scarcity of subgoal-level supervision in standard robot learning datasets, which restricts the development of robots that can convey the subtasks they are executing during long-horizon manipulation. To address this issue, this paper proposes Subgoal-Aware Diffusion Policy (SADP), a framework that leverages foundation models to autonomously generate subgoal-annotated demonstrations and trains diffusion policies on the resulting datasets. SADP structures policy execution around human-interpretable subgoals by conditioning action generation on both task-level and subgoal-level descriptions. A lightweight auxiliary head further predicts subgoal completion states, allowing the robot to expose its current execution stage and monitor subgoal progression. Experiments in RLBench simulations and real-world evaluations on a UR5e robot demonstrate that SADP achieves higher task success rates than strong task-conditioned diffusion baselines, while providing subgoal-level execution signals for monitoring progress and diagnosing failures. These results highlight that built-in, rather than post-hoc, interpretability can coexist with high task performance.
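To make the monitoring idea concrete, the following is a minimal sketch of how a completion signal from an auxiliary head could drive a subgoal-level status readout. It is not the SADP implementation: the `SubgoalMonitor` class, the 0.5 threshold, the per-subgoal step budget, and the example subgoal strings are all hypothetical choices made for illustration, and the completion probabilities are supplied by hand rather than predicted by a policy.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SubgoalMonitor:
    """Track the active subgoal from per-step completion probabilities.

    Hypothetical helper for illustration only; the threshold and the
    step budget are assumptions, not values taken from the paper.
    """

    subgoals: list[str]
    threshold: float = 0.5   # assumed cutoff on the completion probability
    max_steps: int = 50      # assumed step budget before flagging a stall
    index: int = 0
    steps_in_subgoal: int = 0
    log: list[str] = field(default_factory=list)

    @property
    def done(self) -> bool:
        # True once every subgoal has been marked complete.
        return self.index >= len(self.subgoals)

    def update(self, completion_prob: float) -> str:
        """Consume one completion probability and return a status string."""
        if self.done:
            return "task complete"
        current = self.subgoals[self.index]
        self.steps_in_subgoal += 1
        if completion_prob >= self.threshold:
            # Mark the current subgoal complete and move to the next one.
            self.log.append(f"completed: {current}")
            self.index += 1
            self.steps_in_subgoal = 0
            if self.done:
                return "task complete"
            return f"executing: {self.subgoals[self.index]}"
        if self.steps_in_subgoal > self.max_steps:
            # Budget exceeded without completion: report which subgoal stalled.
            return f"stalled: {current}"
        return f"executing: {current}"


if __name__ == "__main__":
    # Hypothetical subgoal descriptions and hand-written probabilities.
    monitor = SubgoalMonitor(["open drawer", "pick block"], max_steps=2)
    for prob in [0.1, 0.9, 0.2, 0.2, 0.2, 0.8]:
        print(monitor.update(prob))
```

In this sketch a stall report names the subgoal that exceeded its budget, which is one simple way a subgoal-level signal could localize a failure instead of reporting only that the overall task failed.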