🤖 AI Summary
In vision-based motor policy learning, diffusion models capture multimodal behavior well but suffer from slow sampling, making it hard to balance real-time inference with behavioral diversity. This paper proposes a Hybrid Consistency Policy: actions are generated via a short stochastic prefix up to an adaptive switch time, followed by a single-step consistency jump, with time-varying consistency distillation decoupling multimodal preservation from inference acceleration. Training distills a diffusion teacher sampled via a stochastic differential equation (SDE) into a consistency model, combining a trajectory-level consistency objective with a denoising-matching objective. Experiments on both simulation and real-world robotic platforms show that the approach matches an 80-step DDPM teacher in accuracy and mode coverage using only 25 SDE steps plus one consistency jump, significantly reducing latency. To the authors' knowledge, this is the first method to achieve efficient, high-coverage multimodal policy generation with a single-step jump while preserving trajectory coherence, local fidelity, and robust mode diversity.
📝 Abstract
In visuomotor policy learning, diffusion-based imitation learning is widely adopted for its ability to capture diverse behaviors. However, approaches built on ordinary or stochastic differential equation (ODE/SDE) denoising processes struggle to jointly achieve fast sampling and strong multi-modality. To address this, we propose the Hybrid Consistency Policy (HCP). HCP runs a short stochastic prefix up to an adaptive switch time, then applies a one-step consistency jump to produce the final action. To align this one-jump generation, HCP performs time-varying consistency distillation that combines a trajectory-consistency objective, which keeps neighboring predictions coherent, with a denoising-matching objective that improves local fidelity. In both simulation and on a real robot, HCP with 25 SDE steps plus one jump approaches the 80-step DDPM teacher in accuracy and mode coverage while significantly reducing latency. These results show that multi-modality does not require slow inference: the switch time decouples mode retention from speed, yielding a practical accuracy-efficiency trade-off for robot policies.
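The two-phase sampling described above can be sketched in code. This is a minimal illustration, not the paper's implementation: `eps_model` and `consistency_model` are hypothetical stand-ins for the learned noise predictor and distilled consistency network, and the SDE discretization, noise schedule, and switch time `t_switch` are toy assumptions chosen for readability.

```python
import numpy as np

# Hypothetical stand-ins for the learned networks (not the paper's models):
# eps_model predicts the noise in a noisy action; consistency_model maps a
# noisy action at time t directly to a clean action estimate in one step.
def eps_model(x, t):
    return 0.1 * x  # toy noise predictor

def consistency_model(x, t):
    return (1.0 - t) * x  # toy one-step denoiser

def hcp_sample(x_T, n_prefix_steps=25, t_switch=0.3, seed=0):
    """Hybrid sampling sketch: a stochastic (SDE/DDPM-style) prefix from
    t=1 down to the switch time, then a single consistency jump to t=0."""
    rng = np.random.default_rng(seed)
    ts = np.linspace(1.0, t_switch, n_prefix_steps + 1)
    x = x_T
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dt = t_cur - t_next
        x = x - dt * eps_model(x, t_cur)          # deterministic drift step
        x = x + 0.1 * np.sqrt(dt) * rng.standard_normal(x.shape)  # noise term
    # One-step consistency jump from t_switch straight to a clean action.
    return consistency_model(x, t_switch)

action = hcp_sample(np.random.default_rng(1).standard_normal(4))
print(action.shape)
```

The key structural point the sketch preserves is that stochasticity (and hence mode diversity) lives entirely in the prefix, while the final jump is a single deterministic network call, which is where the latency savings come from.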