🤖 AI Summary
In vision-based motor policy learning, diffusion models capture multimodal behavior well but suffer from slow sampling, making it hard to balance real-time inference with behavioral diversity. This paper proposes a Hybrid Consistency Policy: actions are generated via a short stochastic prefix up to an adaptive switch time, followed by a single-step consistency jump, with time-varying consistency distillation decoupling multimodal preservation from inference acceleration. Training distills a diffusion teacher sampled via a stochastic differential equation (SDE) into a consistency model, combining a trajectory-level consistency objective with a denoising-matching objective. Experiments on both simulation and real-world robotic platforms show that the approach matches an 80-step DDPM teacher in accuracy and mode coverage using only 25 SDE steps plus one consistency jump, significantly reducing latency. To the authors' knowledge, this is the first method to achieve efficient, high-coverage multimodal policy generation with a single-step jump while preserving trajectory coherence, local fidelity, and robust mode diversity.
📝 Abstract
In visuomotor policy learning, diffusion-based imitation learning is widely adopted for its ability to capture diverse behaviors. However, approaches built on ordinary or stochastic differential equation (ODE/SDE) denoising processes struggle to jointly achieve fast sampling and strong multi-modality. To address this, we propose the Hybrid Consistency Policy (HCP). HCP runs a short stochastic prefix up to an adaptive switch time, then applies a one-step consistency jump to produce the final action. To align this one-jump generation, HCP performs time-varying consistency distillation that combines a trajectory-consistency objective, which keeps neighboring predictions coherent, with a denoising-matching objective that improves local fidelity. In both simulation and on a real robot, HCP with 25 SDE steps plus one jump approaches the 80-step DDPM teacher in accuracy and mode coverage while significantly reducing latency. These results show that multi-modality does not require slow inference: the switch time decouples mode retention from speed, yielding a practical accuracy-efficiency trade-off for robot policies.
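The two-phase sampling described above can be sketched in code. This is a minimal illustration, not the paper's implementation: `eps_model` and `consistency_model` are hypothetical stand-ins for the learned noise predictor and distilled consistency network, and the SDE discretization, noise schedule, and switch time `t_switch` are toy assumptions chosen for readability.

```python
import numpy as np

# Hypothetical stand-ins for the learned networks (not the paper's models):
# eps_model predicts the noise in a noisy action; consistency_model maps a
# noisy action at time t directly to a clean action estimate in one step.
def eps_model(x, t):
    return 0.1 * x  # toy noise predictor

def consistency_model(x, t):
    return (1.0 - t) * x  # toy one-step denoiser

def hcp_sample(x_T, n_prefix_steps=25, t_switch=0.3, seed=0):
    """Hybrid sampling sketch: a stochastic (SDE/DDPM-style) prefix from
    t=1 down to the switch time, then a single consistency jump to t=0."""
    rng = np.random.default_rng(seed)
    ts = np.linspace(1.0, t_switch, n_prefix_steps + 1)
    x = x_T
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dt = t_cur - t_next
        x = x - dt * eps_model(x, t_cur)          # deterministic drift step
        x = x + 0.1 * np.sqrt(dt) * rng.standard_normal(x.shape)  # noise term
    # One-step consistency jump from t_switch straight to a clean action.
    return consistency_model(x, t_switch)

action = hcp_sample(np.random.default_rng(1).standard_normal(4))
print(action.shape)
```

The key structural point the sketch preserves is that stochasticity (and hence mode diversity) lives entirely in the prefix, while the final jump is a single deterministic network call, which is where the latency savings come from.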