From Flow to One Step: Real-Time Multi-Modal Trajectory Policies via Implicit Maximum Likelihood Estimation-based Distribution Distillation

๐Ÿ“… 2026-03-10
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
This work addresses the high latency of existing diffusion- and flow-matching-based generative policies, which rely on iterative ODE solvers and are therefore ill-suited for high-frequency closed-loop control; existing single-step acceleration methods, meanwhile, often suffer from distribution collapse and loss of multi-modal behavior. To overcome these limitations, the authors propose a distribution distillation framework based on Implicit Maximum Likelihood Estimation (IMLE), which distills a Conditional Flow Matching (CFM) teacher model into a single-step student model. A bidirectional Chamfer distance provides a set-level objective, ensuring both coverage and fidelity of the multi-modal action distribution in a single forward pass. Combined with a geometry-aware encoder that fuses multi-modal perception (RGB, depth, point clouds, and proprioception), the method enables real-time replanning at high control frequencies and demonstrates enhanced robustness under dynamic perturbations, effectively mitigating distribution collapse in single-step generation.
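The IMLE objective the summary refers to can be sketched in a few lines: for each teacher action sample, the student draws several candidate outputs from random latents and is pulled toward the target only through its *nearest* candidate, so no mode is averaged away. The names, shapes, and candidate count below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def imle_loss(teacher_samples, student_generator, z_dim, m=16, rng=None):
    """Nearest-sample IMLE objective (illustrative sketch).

    teacher_samples:   (n, action_dim) array of teacher action samples.
    student_generator: maps latents of shape (m, z_dim) to (m, action_dim).
    For each teacher sample, only the closest of m student candidates
    receives gradient pressure, which preserves multi-modality.
    """
    rng = np.random.default_rng(rng)
    losses = []
    for target in teacher_samples:
        z = rng.standard_normal((m, z_dim))
        candidates = student_generator(z)                    # (m, action_dim)
        dists = np.linalg.norm(candidates - target, axis=1)  # distance to target
        losses.append(dists.min() ** 2)                      # nearest candidate only
    return float(np.mean(losses))
```

In a real training loop the squared distance of the selected candidate would be backpropagated through the student network; this NumPy version only illustrates the selection rule.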

๐Ÿ“ Abstract
Generative policies based on diffusion and flow matching achieve strong performance in robotic manipulation by modeling multi-modal human demonstrations. However, their reliance on iterative Ordinary Differential Equation (ODE) integration introduces substantial latency, limiting high-frequency closed-loop control. Recent single-step acceleration methods alleviate this overhead but often exhibit distributional collapse, producing averaged trajectories that fail to execute coherent manipulation strategies. We propose a framework that distills a Conditional Flow Matching (CFM) expert into a fast single-step student via Implicit Maximum Likelihood Estimation (IMLE). A bidirectional Chamfer distance provides a set-level objective that promotes both mode coverage and fidelity, enabling preservation of the teacher's multi-modal action distribution in a single forward pass. A unified perception encoder further integrates multi-view RGB, depth, point clouds, and proprioception into a geometry-aware representation. The resulting high-frequency control supports real-time receding-horizon re-planning and improved robustness under dynamic disturbances.
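The bidirectional Chamfer distance named in the abstract compares two *sets* of samples: one direction measures coverage (every teacher mode has a nearby student sample), the other fidelity (every student sample lies near some teacher mode). A minimal sketch over flattened trajectory vectors, assuming simple NumPy arrays rather than the paper's actual implementation:

```python
import numpy as np

def bidirectional_chamfer(student, teacher):
    """Set-level bidirectional Chamfer distance (illustrative sketch).

    student, teacher: (n, d) and (k, d) arrays of flattened trajectories.
    First term (fidelity):  mean distance from each student sample to its
    nearest teacher sample. Second term (coverage): mean distance from each
    teacher sample to its nearest student sample.
    """
    # Pairwise Euclidean distances via broadcasting: shape (n, k)
    d = np.linalg.norm(student[:, None, :] - teacher[None, :, :], axis=-1)
    fidelity = d.min(axis=1).mean()   # student -> nearest teacher
    coverage = d.min(axis=0).mean()   # teacher -> nearest student
    return float(fidelity + coverage)
```

If the student collapsed to a single averaged trajectory, the coverage term would stay large for every uncovered teacher mode, which is why the set-level objective discourages the averaging failure mode described above.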
Problem

Research questions and friction points this paper is trying to address.

multi-modal trajectory
real-time control
distribution collapse
generative policy
high-frequency planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Implicit Maximum Likelihood Estimation
Conditional Flow Matching
Distribution Distillation
Multi-Modal Trajectory Policies
Real-Time Control
Ju Dong
TAMS (Technical Aspects of Multimodal Systems), Department of Informatics, University of Hamburg, Hamburg, Germany.
Liding Zhang
Technical University of Munich, Germany.
Lei Zhang
University of Hamburg, Agile Robots SE
Dexterous Manipulation · Multi-modal AI · Embodied AI
Yu Fu
Technical University of Munich, Germany.
Kaixin Bai
TAMS (Technical Aspects of Multimodal Systems), Department of Informatics, University of Hamburg, Hamburg, Germany.
Zoltรกn-Csaba Marton
Agile Robots SE, Munich, Germany.
Zhenshan Bing
Nanjing University / Technical University of Munich
Robotics
Zhaopeng Chen
Agile Robots SE, Munich, Germany.
Alois Christian Knoll
Technical University of Munich, Germany.
Jianwei Zhang
TAMS (Technical Aspects of Multimodal Systems), Department of Informatics, University of Hamburg, Hamburg, Germany.