Hybrid Consistency Policy: Decoupling Multi-Modal Diversity and Real-Time Efficiency in Robotic Manipulation

📅 2025-10-30
🤖 AI Summary
In vision-based motor policy learning, diffusion models offer multimodal behavior modeling but suffer from slow sampling, making it difficult to balance real-time inference with behavioral diversity. This paper proposes a hybrid consistency policy: actions are generated via a short stochastic prefix followed by a single-step consistency jump, integrated with an adaptive switching mechanism and time-varying consistency distillation to decouple multimodal preservation from inference acceleration. Training combines a diffusion teacher, a stochastic differential equation (SDE) sampler, and a consistency model distilled with a trajectory-level consistency objective and a denoising-matching objective. Experiments on both simulated and real-world robotic platforms show that the approach matches the accuracy and mode coverage of an 80-step DDPM teacher using only 25 SDE steps plus one consistency jump, significantly reducing latency. To the authors' knowledge, this is the first method enabling efficient, high-coverage multimodal policy generation with single-step jumps while ensuring trajectory coherence, local fidelity, and robust mode diversity.
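The two-phase inference described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: `denoiser(a, obs, t)` is assumed to return the model-predicted drift at time `t`, and `consistency_fn(a, obs, t)` is assumed to map an intermediate noisy action directly to a clean action (the one-step jump). The switch time and step counts mirror the numbers reported in the summary.

```python
import numpy as np

def hybrid_sample(denoiser, consistency_fn, obs, action_dim,
                  n_sde_steps=25, switch_t=0.3, rng=None):
    """Sketch of HCP-style inference: run a short stochastic denoising
    prefix from t=1 down to a switch time, then apply one consistency
    jump to t=0. Model interfaces here are assumptions for this sketch.
    """
    rng = rng or np.random.default_rng(0)
    a = rng.standard_normal(action_dim)            # start from pure noise
    ts = np.linspace(1.0, switch_t, n_sde_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        dt = t - t_next                            # positive step size
        drift = denoiser(a, obs, t)                # model-predicted drift
        noise = rng.standard_normal(action_dim)    # stochastic injection
        a = a + drift * dt + np.sqrt(dt) * noise   # Euler-Maruyama update
    return consistency_fn(a, obs, ts[-1])          # one-step jump to t=0
```

The stochastic prefix preserves mode diversity (different noise draws can land in different modes), while the final jump replaces the remaining denoising steps with a single network call, which is where the latency saving comes from.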

📝 Abstract
In visuomotor policy learning, diffusion-based imitation learning has become widely adopted for its ability to capture diverse behaviors. However, approaches built on ordinary and stochastic denoising processes struggle to jointly achieve fast sampling and strong multi-modality. To address these challenges, we propose the Hybrid Consistency Policy (HCP). HCP runs a short stochastic prefix up to an adaptive switch time, and then applies a one-step consistency jump to produce the final action. To align this one-jump generation, HCP performs time-varying consistency distillation that combines a trajectory-consistency objective to keep neighboring predictions coherent and a denoising-matching objective to improve local fidelity. In both simulation and on a real robot, HCP with 25 SDE steps plus one jump approaches the 80-step DDPM teacher in accuracy and mode coverage while significantly reducing latency. These results show that multi-modality does not require slow inference, and that an adaptive switch time decouples mode retention from speed, yielding a practical accuracy-efficiency trade-off for robot policies.
Problem

Research questions and friction points this paper is trying to address.

- Achieving fast sampling and strong multi-modality in visuomotor policies
- Decoupling mode retention from speed in robotic manipulation
- Balancing the accuracy-efficiency trade-off in robot policies
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Hybrid Consistency Policy decouples multi-modal diversity from real-time efficiency
- Combines a stochastic prefix with a one-step consistency jump for action generation
- Uses time-varying consistency distillation to align trajectory-consistency and denoising-matching objectives
Qianyou Zhao
Shanghai Jiao Tong University
Yuliang Shen
Shanghai Jiao Tong University
Xuanran Zhai
National University of Singapore
Ce Hao
National University of Singapore
Duidi Wu
Shanghai Jiao Tong University
Jin Qi
Shanghai Jiao Tong University
Jie Hu
Shanghai Jiao Tong University
Qiaojun Yu
Shanghai Jiao Tong University, Shanghai AI Lab
robotic learning, 3D vision, VLA