AI Summary
This work addresses the high latency of diffusion- and flow-matching-based generative policies, which rely on iterative ODE solvers and are therefore ill-suited to high-frequency closed-loop control; existing single-step acceleration methods reduce this overhead but often suffer from distribution collapse and loss of multimodal behavior. To overcome these limitations, we propose a distribution-distillation framework based on Implicit Maximum Likelihood Estimation (IMLE), which distills a conditional flow matching (CFM) teacher model into a single-step student model. A bidirectional Chamfer distance provides a set-level objective that ensures both coverage and fidelity of multimodal action distributions in a single forward pass. Combined with a geometry-aware encoder that fuses multimodal perception (RGB, depth, point clouds, and proprioception), the method enables real-time replanning at high control frequencies and remains robust under dynamic perturbations, effectively mitigating distribution collapse in single-step generation.
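The nearest-neighbor matching at the heart of IMLE can be illustrated with a toy NumPy sketch. This is not the paper's implementation: the linear "student", the array shapes, and all names here are illustrative assumptions. The key idea it demonstrates is that every teacher sample is matched to its nearest student-generated sample, so every teacher mode exerts a pull on the student, which is what counteracts distribution collapse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative, not the paper's architecture): the "student"
# maps a latent z linearly through W; the "teacher" distribution is a
# fixed set of demonstrated action samples.
teacher_actions = rng.normal(size=(16, 4))   # 16 samples of 4-D actions
W = rng.normal(size=(4, 4)) * 0.1            # student parameters

def imle_step(W, teacher_actions, n_latents=32, lr=0.05):
    """One IMLE update: for each teacher sample, find the nearest
    student-generated sample and pull it toward that teacher sample."""
    z = rng.normal(size=(n_latents, 4))
    gen = z @ W                                        # student samples
    d2 = ((teacher_actions[:, None] - gen[None]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)                        # match per teacher sample
    err = gen[nearest] - teacher_actions               # (16, 4) residuals
    grad = z[nearest].T @ err / len(teacher_actions)   # squared-error gradient
    return W - lr * grad, float((err ** 2).mean())

losses = []
for _ in range(300):
    W, loss = imle_step(W, teacher_actions)
    losses.append(loss)
```

Because each teacher sample selects its own nearest candidate, no mode of the teacher distribution can be ignored by the student, unlike objectives that average over samples.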
Abstract
Generative policies based on diffusion and flow matching achieve strong performance in robotic manipulation by modeling multi-modal human demonstrations. However, their reliance on iterative Ordinary Differential Equation (ODE) integration introduces substantial latency, limiting high-frequency closed-loop control. Recent single-step acceleration methods alleviate this overhead but often exhibit distributional collapse, producing averaged trajectories that fail to execute coherent manipulation strategies. We propose a framework that distills a Conditional Flow Matching (CFM) expert into a fast single-step student via Implicit Maximum Likelihood Estimation (IMLE). A bi-directional Chamfer distance provides a set-level objective that promotes both mode coverage and fidelity, enabling preservation of the teacher's multi-modal action distribution in a single forward pass. A unified perception encoder further integrates multi-view RGB, depth, point clouds, and proprioception into a geometry-aware representation. The resulting high-frequency control supports real-time receding-horizon re-planning and improved robustness under dynamic disturbances.
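The set-level objective can be sketched as a bidirectional Chamfer distance between two sets of sampled action trajectories. The function name and input shapes below are our assumptions, and the paper's exact formulation (e.g. weighting of the two terms) may differ; the sketch only shows why the objective promotes both coverage and fidelity.

```python
import numpy as np

def chamfer_distance(set_a, set_b):
    """Bidirectional Chamfer distance between two sample sets.

    set_a: (N, D) array, e.g. teacher action samples (flattened trajectories).
    set_b: (M, D) array, e.g. student action samples.
    """
    # Pairwise squared Euclidean distances, shape (N, M).
    diff = set_a[:, None, :] - set_b[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    # Coverage: every teacher sample should have a nearby student sample.
    a_to_b = d2.min(axis=1).mean()
    # Fidelity: every student sample should lie near some teacher sample.
    b_to_a = d2.min(axis=0).mean()
    return a_to_b + b_to_a
```

The first term penalizes mode dropping (a teacher mode with no student sample nearby), while the second penalizes spurious student samples far from the teacher distribution; minimizing their sum encourages the student set to match the teacher set as a whole.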