🤖 AI Summary
Hierarchical reinforcement learning (HRL) suffers from unstable joint training of high- and low-level policies in long-horizon tasks, primarily because high-level subgoals rapidly become obsolete as the low-level policy continuously updates, inducing non-stationarity for the higher-level policy. To address this, the authors propose CRISP, which uses a primitive informed parsing (PIP) mechanism to mitigate this non-stationarity by periodically relabeling a small set of expert demonstrations with the current lower-level primitive. This yields a subgoal curriculum grounded in the evolving capabilities of the low-level primitive, so subgoal difficulty progresses in step with low-level skill acquisition. The method integrates goal-conditioned policy learning, imitation learning, and dynamic demonstration relabeling, and requires only a handful of expert demonstrations rather than large demonstration datasets. Evaluated on simulated maze navigation and robotic manipulation tasks, it achieves significantly improved sample efficiency, and real-world robotic experiments demonstrate strong generalization to complex physical manipulation tasks.
📝 Abstract
Hierarchical reinforcement learning (HRL) is a promising approach that uses temporal abstraction to solve complex long-horizon problems. However, simultaneously learning a hierarchy of policies is unstable, as it is challenging to train the higher-level policy when the lower-level primitive is non-stationary. In this paper, we present CRISP, a novel HRL algorithm that effectively generates a curriculum of achievable subgoals for evolving lower-level primitives using reinforcement learning and imitation learning. CRISP uses the lower-level primitive to periodically relabel a handful of expert demonstrations via a novel primitive informed parsing (PIP) approach, thereby mitigating non-stationarity. Since our approach assumes access to only a handful of expert demonstrations, it is suitable for most robotic control tasks. Experimental evaluations on complex robotic maze navigation and robotic manipulation tasks demonstrate that inducing hierarchical curriculum learning significantly improves sample efficiency and yields efficient goal-conditioned policies for solving temporally extended tasks. Additionally, real-world robotic experiments on complex manipulation tasks show that CRISP generalizes impressively to physical scenarios.
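To make the relabeling idea concrete, here is a minimal, hypothetical sketch of how primitive-informed parsing of a demonstration might look: walk along a demonstrated state sequence and emit, as the next subgoal, the furthest demonstration state the *current* lower-level primitive can reach. The function name, the `low_level_reaches` predicate, and the `horizon` parameter are all illustrative assumptions, not the paper's actual interface.

```python
def primitive_informed_parse(demo_states, low_level_reaches, horizon=5):
    """Hypothetical sketch of primitive-informed demonstration parsing.

    demo_states      -- ordered states from one expert demonstration
    low_level_reaches -- assumed predicate: can the current lower-level
                         primitive reach state `g` from state `s`?
    horizon          -- how far ahead along the demo to look for a subgoal

    Returns a list of subgoal states whose difficulty tracks the
    current capability of the lower-level primitive; re-running this
    as the primitive improves yields progressively sparser subgoals.
    """
    subgoals = []
    i = 0
    while i < len(demo_states) - 1:
        # Among the next `horizon` demo states, keep the furthest one
        # the current primitive can still reach from state i.
        furthest = i + 1
        j = i + 1
        while j < len(demo_states) and j - i <= horizon:
            if low_level_reaches(demo_states[i], demo_states[j]):
                furthest = j
            j += 1
        subgoals.append(demo_states[furthest])
        i = furthest
    return subgoals
```

For instance, with states `0..9` and a toy primitive that can only cover a gap of 3, the parse yields subgoals `[3, 6, 9]`; a stronger primitive covering a gap of 5 would yield the coarser curriculum `[5, 9]`.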