🤖 AI Summary
To address the limited precision and slow execution of visuomotor imitation learning on complex manipulation tasks, this paper proposes Hybrid-Diffusion: a framework that tightly integrates learnable open-loop control routines with a visuomotor diffusion policy. It introduces Teleoperation Augmentation Primitives (TAPs), which allow action primitives to be embedded seamlessly in demonstrations and triggered autonomously at inference time. The framework models the end-to-end visuomotor policy with diffusion, TAPs enrich the representation of action sequences, and training is performed via imitation learning. Evaluated on real-world vial aspiration, open-container liquid transfer, and container unscrewing tasks, Hybrid-Diffusion achieves significant improvements (+23.5% in task success rate and a 1.8× average speedup), demonstrating high precision, fast response, and strong cross-task generalization.
📝 Abstract
Although visuomotor policies obtained via imitation learning demonstrate good performance in complex manipulation tasks, they usually struggle to match the accuracy and speed of traditional control-based methods. In this work, we introduce Hybrid-Diffusion, a model that combines open-loop routines with visuomotor diffusion policies. We develop Teleoperation Augmentation Primitives (TAPs) that allow the operator to seamlessly perform predefined routines during demonstrations, such as locking specific axes, moving to perching waypoints, or triggering task-specific routines. Our Hybrid-Diffusion method learns to trigger such TAPs during inference. We validate the method on challenging real-world tasks: Vial Aspiration, Open-Container Liquid Transfer, and Container Unscrewing. All experimental videos are available on the project's website: https://hybriddiffusion.github.io/