Iterative On-Policy Refinement of Hierarchical Diffusion Policies for Language-Conditioned Manipulation

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
In language-conditioned robotic manipulation tasks, high-level planners often generate infeasible subgoals by overlooking the capabilities of low-level controllers, leading to failures in hierarchical policies. To address this issue, this work proposes the HD-ExpIt framework, which iteratively fine-tunes a hierarchical diffusion policy online through environmental feedback, establishing a self-improving training loop. The approach autonomously discovers successful behaviors via diffusion-based planning and distills them back into the policy, enabling implicit co-optimization of planning and control without requiring explicit surrogate models or static offline datasets. Evaluated on the CALVIN long-horizon benchmark, HD-ExpIt substantially outperforms purely offline training schemes and achieves state-of-the-art performance among methods trained from scratch.

📝 Abstract
Hierarchical policies for language-conditioned manipulation decompose tasks into subgoals, where a high-level planner guides a low-level controller. However, these hierarchical agents often fail because the planner generates subgoals without considering the actual limitations of the controller. Existing solutions attempt to bridge this gap via intermediate modules or shared representations, but they remain limited by their reliance on fixed offline datasets. We propose HD-ExpIt, a framework for iterative fine-tuning of hierarchical diffusion policies via environment feedback. HD-ExpIt organizes training into a self-reinforcing cycle: it utilizes diffusion-based planning to autonomously discover successful behaviors, which are then distilled back into the hierarchical policy. This loop enables both components to improve while implicitly grounding the planner in the controller's actual capabilities without requiring explicit proxy models. Empirically, HD-ExpIt significantly improves hierarchical policies trained solely on offline data, achieving state-of-the-art performance on the long-horizon CALVIN benchmark among methods trained from scratch.
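The self-reinforcing cycle the abstract describes — sample candidate subgoals, let the low-level controller attempt them, keep the behaviors that succeed, and distill them back into the planner — can be illustrated with a deliberately tiny toy model. Everything below is a hypothetical sketch, not the paper's implementation: the "planner" is a noisy scalar sampler standing in for diffusion-based subgoal generation, the "controller" is a step-limited 1D tracker, and distillation is a simple pull of the planner's bias toward the subgoals that produced the lowest final error.

```python
import random

random.seed(0)
GOAL, NOISE = 1.0, 0.3  # toy task: reach position 1.0; planner sampling noise

def rollout(bias, max_step=0.3, steps=5):
    """One episode: planner proposes a subgoal, controller chases it.

    The noisy draw stands in for diffusion-based subgoal sampling; the
    step-size limit stands in for the low-level controller's capabilities.
    """
    subgoal = GOAL + bias + random.gauss(0.0, NOISE)
    pos = 0.0
    for _ in range(steps):
        pos += max(-max_step, min(max_step, subgoal - pos))  # clipped step
    return subgoal, abs(pos - GOAL)  # final error is the environment feedback

def refine(bias=1.0, rounds=15, batch=40, top_k=5, lr=0.5):
    """Iterative refinement loop (hypothetical sketch of the cycle).

    The planner starts with an infeasibility bias of +1.0, i.e. it proposes
    subgoals the step-limited controller cannot reach in time. Each round it
    discovers which sampled subgoals actually worked and distills them back,
    implicitly grounding the planner in the controller's capabilities.
    """
    for _ in range(rounds):
        samples = sorted((rollout(bias) for _ in range(batch)),
                         key=lambda s: s[1])
        best = [sg for sg, _ in samples[:top_k]]  # self-discovered successes
        # Distill: move the planner toward subgoals the controller could reach.
        bias += lr * (sum(best) / len(best) - GOAL - bias)
    return bias
```

Running `refine()` drives the planner's bias from 1.0 toward roughly zero, and the mean rollout error drops accordingly, without ever fitting an explicit surrogate model of the controller — the grounding emerges from the discover-and-distill loop itself, which is the mechanism the abstract attributes to HD-ExpIt.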
Problem

Research questions and friction points this paper is trying to address.

hierarchical policies
language-conditioned manipulation
subgoal planning
controller limitations
planner-controller mismatch
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical diffusion policies
iterative on-policy refinement
language-conditioned manipulation
diffusion-based planning
self-reinforcing training loop