AI Summary
This work addresses the high latency of diffusion- and flow-matching-based generative policies, which rely on iterative ODE solvers and are therefore ill-suited to high-frequency closed-loop control; existing single-step acceleration methods reduce this overhead but often suffer from distribution collapse and loss of multimodal behavior. To overcome these limitations, we propose a distribution-distillation framework based on Implicit Maximum Likelihood Estimation (IMLE), which distills a conditional flow matching (CFM) teacher model into a single-step student model. A bidirectional Chamfer distance provides a set-level objective that ensures both coverage and fidelity of multimodal action distributions in a single forward pass. Combined with a geometry-aware encoder that fuses multimodal perception (RGB, depth, point clouds, and proprioception), the method enables real-time replanning at high control frequencies and remains robust under dynamic perturbations, effectively mitigating distribution collapse in single-step generation.
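The nearest-neighbor matching at the heart of IMLE can be illustrated with a toy NumPy sketch. This is not the paper's implementation: the linear "student", the array shapes, and all names here are illustrative assumptions. The key idea it demonstrates is that every teacher sample is matched to its nearest student-generated sample, so every teacher mode exerts a pull on the student, which is what counteracts distribution collapse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative, not the paper's architecture): the "student"
# maps a latent z linearly through W; the "teacher" distribution is a
# fixed set of demonstrated action samples.
teacher_actions = rng.normal(size=(16, 4))   # 16 samples of 4-D actions
W = rng.normal(size=(4, 4)) * 0.1            # student parameters

def imle_step(W, teacher_actions, n_latents=32, lr=0.05):
    """One IMLE update: for each teacher sample, find the nearest
    student-generated sample and pull it toward that teacher sample."""
    z = rng.normal(size=(n_latents, 4))
    gen = z @ W                                        # student samples
    d2 = ((teacher_actions[:, None] - gen[None]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)                        # match per teacher sample
    err = gen[nearest] - teacher_actions               # (16, 4) residuals
    grad = z[nearest].T @ err / len(teacher_actions)   # squared-error gradient
    return W - lr * grad, float((err ** 2).mean())

losses = []
for _ in range(300):
    W, loss = imle_step(W, teacher_actions)
    losses.append(loss)
```

Because each teacher sample selects its own nearest candidate, no mode of the teacher distribution can be ignored by the student, unlike objectives that average over samples.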
Abstract
Generative policies based on diffusion and flow matching achieve strong performance in robotic manipulation by modeling multi-modal human demonstrations. However, their reliance on iterative Ordinary Differential Equation (ODE) integration introduces substantial latency, limiting high-frequency closed-loop control. Recent single-step acceleration methods alleviate this overhead but often exhibit distributional collapse, producing averaged trajectories that fail to execute coherent manipulation strategies. We propose a framework that distills a Conditional Flow Matching (CFM) expert into a fast single-step student via Implicit Maximum Likelihood Estimation (IMLE). A bi-directional Chamfer distance provides a set-level objective that promotes both mode coverage and fidelity, enabling preservation of the teacher's multi-modal action distribution in a single forward pass. A unified perception encoder further integrates multi-view RGB, depth, point clouds, and proprioception into a geometry-aware representation. The resulting high-frequency control supports real-time receding-horizon re-planning and improved robustness under dynamic disturbances.
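The set-level objective can be sketched as a bidirectional Chamfer distance between two sets of sampled action trajectories. The function name and input shapes below are our assumptions, and the paper's exact formulation (e.g. weighting of the two terms) may differ; the sketch only shows why the objective promotes both coverage and fidelity.

```python
import numpy as np

def chamfer_distance(set_a, set_b):
    """Bidirectional Chamfer distance between two sample sets.

    set_a: (N, D) array, e.g. teacher action samples (flattened trajectories).
    set_b: (M, D) array, e.g. student action samples.
    """
    # Pairwise squared Euclidean distances, shape (N, M).
    diff = set_a[:, None, :] - set_b[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    # Coverage: every teacher sample should have a nearby student sample.
    a_to_b = d2.min(axis=1).mean()
    # Fidelity: every student sample should lie near some teacher sample.
    b_to_a = d2.min(axis=0).mean()
    return a_to_b + b_to_a
```

The first term penalizes mode dropping (a teacher mode with no student sample nearby), while the second penalizes spurious student samples far from the teacher distribution; minimizing their sum encourages the student set to match the teacher set as a whole.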