🤖 AI Summary
This work addresses the trade-off between inference latency and motion accuracy that limits real-time, high-precision robotic manipulation. We propose a three-stage generation framework that combines MeanFlow, ReNoise, and ReFlow (the first combination of MeanFlow and ReFlow) and produces high-fidelity action sequences with only two function evaluations (2-NFE). In real-world robotic experiments, the method reduces inference time from 152 ms to 19 ms (an 8× speedup, about 52 Hz) and improves success rates by 15–25% over a 16-step Diffusion Policy. It achieves 70.0% success on unseen-color grasping and 66.3% on deformable object folding, which indicates that it remains effective in complex, real-time interactive scenarios.
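The speedup and control-rate figures follow directly from the two reported latencies. The short check below uses only those two numbers; the variable names are illustrative, and it assumes the control rate is simply the inverse of the per-inference latency, with no other overhead in the loop.

```python
# Reported per-inference latencies, in milliseconds.
baseline_ms = 152.0    # 16-step Diffusion Policy
hybridflow_ms = 19.0   # HybridFlow with 2 function evaluations

# Speedup is the ratio of the two latencies.
speedup = baseline_ms / hybridflow_ms

# Assumption: one inference per control step and no other overhead,
# so the achievable rate is the inverse of the latency.
rate_hz = 1000.0 / hybridflow_ms

print(f"speedup: {speedup:.1f}x")   # speedup: 8.0x
print(f"rate: {rate_hz:.1f} Hz")    # rate: 52.6 Hz
```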
📝 Abstract
Existing robot manipulation policies are limited by inference latency and therefore cannot interact with the environment in real time. Although faster generative methods such as flow matching are gradually replacing diffusion, interactive robot control calls for even faster generation. MeanFlow, a one-step variant of flow matching, has shown strong potential in image generation, but its precision in action generation does not meet the stringent requirements of robotic manipulation. We therefore propose \textbf{HybridFlow}, a \textbf{3-stage method} with \textbf{2-NFE}: Global Jump in MeanFlow mode, ReNoise for distribution alignment, and Local Refine in ReFlow mode. HybridFlow balances inference speed and generation quality: it exploits the speed of MeanFlow's one-step generation while preserving action precision with a minimal number of generation steps. In real-world experiments, HybridFlow outperforms the 16-step Diffusion Policy by \textbf{15--25\%} in success rate while reducing inference time from 152ms to 19ms (\textbf{8$\times$ speedup}, \textbf{$\sim$52Hz}); it also achieves 70.0\% success on unseen-color OOD grasping and 66.3\% on deformable object folding. We envision HybridFlow as a practical low-latency method for improving the real-world interaction capabilities of robotic manipulation policies.
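The three stages named above can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything here is an assumption: the linear path z_t = (1 - t) x + t eps, the single network u(z, r, t) used in both modes, the intermediate time `t_mid`, and the names `ToyVelocityModel` and `sample_action` are all hypothetical. The toy model stands in for a trained network and only serves to show where the two function evaluations occur.

```python
import random

# Hypothetical "true" action for the toy model (not from the paper).
TARGET = [0.5, -0.2, 0.1]


class ToyVelocityModel:
    """Stand-in for a learned network u(z, r, t); counts function evaluations."""

    def __init__(self):
        self.nfe = 0

    def __call__(self, z, r, t):
        # For a point-mass target on the assumed path z_t = (1 - t) * x + t * eps,
        # the velocity is constant along each trajectory, so the average velocity
        # over [r, t] and the instantaneous velocity at t are both (z - x) / t.
        self.nfe += 1
        return [(zi - xi) / t for zi, xi in zip(z, TARGET)]


def sample_action(model, dim, t_mid=0.3, rng=None):
    """Assumed 3-stage, 2-NFE sampler: jump, re-noise, refine."""
    if rng is None:
        rng = random.Random(0)

    # Stage 1, Global Jump (MeanFlow mode): one evaluation over the whole
    # interval [0, 1] maps pure noise to a coarse action.
    eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    u = model(eps, 0.0, 1.0)
    coarse = [e - ui for e, ui in zip(eps, u)]

    # Stage 2, ReNoise: mix the coarse action with fresh noise so it sits at
    # time t_mid on the assumed path. No network call is needed here.
    fresh = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    z_mid = [(1.0 - t_mid) * c + t_mid * f for c, f in zip(coarse, fresh)]

    # Stage 3, Local Refine (ReFlow mode): one Euler step from t_mid to 0
    # using the instantaneous velocity (r == t).
    v = model(z_mid, t_mid, t_mid)
    return [z - t_mid * vi for z, vi in zip(z_mid, v)]


model = ToyVelocityModel()
action = sample_action(model, dim=len(TARGET))
print(model.nfe)  # 2
```

With the toy model both stages land on `TARGET`; with a learned network, the refine step would correct errors left by the coarse jump. The cost stays at two evaluations regardless, because the re-noising stage is a closed-form interpolation.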