ManiFlow: A General Robot Manipulation Policy via Consistency Flow Training

📅 2025-09-01

🤖 AI Summary

Mapping multimodal inputs (vision, language, and proprioception) to high-dimensional dexterous actions is a central challenge in general-purpose robot manipulation. This paper proposes ManiFlow, an end-to-end manipulation policy that combines flow matching with consistency training and a diffusion transformer architecture, DiT-X. DiT-X uses adaptive cross-attention and AdaLN-Zero conditioning to enable fine-grained interaction between action tokens and multimodal observation features, and the policy generates high-quality actions in just 1–2 inference steps. ManiFlow shows consistent improvements across simulation benchmarks and nearly doubles success rates on real-world tasks across single-arm, bimanual, and humanoid setups. It is also robust to novel objects and background changes, and it scales well with larger datasets.

📝 Abstract

This paper introduces ManiFlow, a visuomotor imitation learning policy for general robot manipulation that generates precise, high-dimensional actions conditioned on diverse visual, language and proprioceptive inputs. We leverage flow matching with consistency training to enable high-quality dexterous action generation in just 1-2 inference steps. To handle diverse input modalities efficiently, we propose DiT-X, a diffusion transformer architecture with adaptive cross-attention and AdaLN-Zero conditioning that enables fine-grained feature interactions between action tokens and multi-modal observations. ManiFlow demonstrates consistent improvements across diverse simulation benchmarks and nearly doubles success rates on real-world tasks across single-arm, bimanual, and humanoid robot setups with increasing dexterity. The extensive evaluation further demonstrates the strong robustness and generalizability of ManiFlow to novel objects and background changes, and highlights its strong scaling capability with larger-scale datasets. Our website: maniflow-policy.github.io.
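To make the "1-2 inference steps" claim concrete, here is a minimal sketch of few-step sampling from a flow-matching policy using Euler integration. It is not ManiFlow's implementation: `sample_actions`, `toy_velocity`, the `obs["target"]` field, and the convention that t = 0 is noise and t = 1 is the action are all assumptions made for illustration. The toy velocity field stands in for a trained network and describes perfectly straight paths, which is the idealized case where one or two Euler steps are enough.

```python
import numpy as np


def sample_actions(velocity_fn, obs, action_shape, num_steps=2, rng=None):
    """Integrate dx/dt = v(x, t, obs) from t=0 (noise) to t=1 (action) with Euler steps."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(action_shape)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1.0, so the toy field below stays finite
        x = x + dt * velocity_fn(x, t, obs)
    return x


def toy_velocity(x, t, obs):
    """Hypothetical stand-in for a learned network.

    For the straight path x_t = (1 - t) * x_0 + t * x_1, the velocity that
    carries the current point to x_1 is (x_1 - x_t) / (1 - t).
    """
    return (obs["target"] - x) / (1.0 - t)


# Usage: a chunk of 16 actions with 7 dimensions each (sizes chosen arbitrarily).
target = np.full((16, 7), 0.5)
actions = sample_actions(toy_velocity, {"target": target}, (16, 7), num_steps=2,
                         rng=np.random.default_rng(0))
```

A trained velocity network is only approximately straight, which is why few-step sampling generally needs extra training, such as the consistency training the abstract describes, to keep action quality high.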

Problem

Research questions and friction points this paper is trying to address.

Develops a robot manipulation policy for precise high-dimensional actions

Handles diverse visual, language and proprioceptive input modalities

Enables dexterous action generation with minimal inference steps

Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow matching with consistency training

Diffusion transformer with adaptive cross-attention

Multi-modal conditioning for dexterous action generation
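As a rough illustration of the conditioning idea listed above, the following sketch shows the general AdaLN-Zero pattern known from diffusion transformers: a conditioning vector is projected to a per-channel shift, scale, and gate, and the projection is initialized to zero so the block starts out as the identity. This summary does not describe DiT-X's actual layers, so the class name `AdaLNZeroBlock`, the tanh feed-forward stand-in for the sublayer, and all shapes are assumptions. Adaptive cross-attention is not modeled here.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Layer normalization over the last axis, without learned affine parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class AdaLNZeroBlock:
    """Hypothetical residual block modulated by a conditioning vector."""

    def __init__(self, dim, cond_dim, rng):
        # Zero-initialized modulation projection: shift = scale = gate = 0 at init.
        self.w_mod = np.zeros((cond_dim, 3 * dim))
        self.b_mod = np.zeros(3 * dim)
        # Stand-in sublayer weights (a real block would use attention and an MLP).
        self.w_ff = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def __call__(self, tokens, cond):
        # tokens: (batch, num_tokens, dim); cond: (batch, cond_dim)
        mod = cond @ self.w_mod + self.b_mod
        shift, scale, gate = np.split(mod, 3, axis=-1)
        h = layer_norm(tokens) * (1.0 + scale[:, None, :]) + shift[:, None, :]
        h = np.tanh(h @ self.w_ff)
        return tokens + gate[:, None, :] * h  # gated residual update


# Usage: 2 samples, 16 action tokens of width 32, conditioned on an 8-dim vector.
rng = np.random.default_rng(0)
block = AdaLNZeroBlock(dim=32, cond_dim=8, rng=rng)
tokens = rng.standard_normal((2, 16, 32))
cond = rng.standard_normal((2, 8))
out = block(tokens, cond)
```

Because the gate is zero at initialization, `out` equals `tokens` before any training; the conditioning signal only starts to influence the action tokens as the modulation weights are learned.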

🔎 Similar Papers

What Foundation Models can Bring for Robot Learning in Manipulation : A Survey

2024-04-28 · arXiv.org · Citations: 15