🤖 AI Summary
This work addresses the high computational overhead and limited real-time performance of existing flow-based planners for robotic manipulation, which typically rely on modular pipelines that stack multiple submodels. The authors propose RoboFlow4D, a lightweight, end-to-end flow world model that takes visual observations and textual instructions and directly predicts multi-frame 3D flows. The predicted flows serve as an explicit plan that guides action generation, so the model can be combined with general action policies and unifies perception and planning in a single framework. A slow-fast collaboration between flow prediction and action control enables real-time, resource-efficient manipulation. Experiments in both simulation and real-world settings show consistent improvements in manipulation success rates and computational efficiency.
📝 Abstract
Planning and acting in 3D environments is a fundamental capability for robotic manipulation in the real world. Although prior work has explored predictive flow planners to guide 3D manipulation, existing approaches often rely on modular pipelines stacking multiple submodels, resulting in high computational overhead and limited real-time performance. To address these challenges, we introduce RoboFlow4D, a lightweight flow world model that unifies perception and planning by estimating temporal motion in physical 3D space. As an end-to-end framework, RoboFlow4D directly predicts multi-frame 3D flows from visual observations and textual instructions, providing explicit flow-based planning to guide action generation. This design allows seamless integration with general action policies, forming an efficient observation-planning-execution closed loop. Through slow-fast collaboration between flow prediction and action control, RoboFlow4D enables real-time and resource-efficient manipulation. Extensive experiments in both simulation and real-world settings demonstrate that RoboFlow4D consistently improves manipulation success rates and computational efficiency, advancing flow-guided planning for embodied intelligence.
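The abstract does not say how the slow-fast collaboration is scheduled. As a minimal sketch, assuming the flow model replans at a fixed interval while the action policy runs on every control step, the closed loop might look like the following. All names here (`run_slow_fast_loop`, `replan_every`, the toy planner and policy) are hypothetical illustrations and are not taken from RoboFlow4D.

```python
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]
Flow = List[Vec3]  # one 3D displacement per tracked point, for one future frame


def run_slow_fast_loop(
    observe: Callable[[int], float],
    predict_flows: Callable[[float, str, int], List[Flow]],
    act: Callable[[float, Flow], Vec3],
    instruction: str,
    total_steps: int,
    replan_every: int,
) -> Tuple[List[Vec3], int]:
    """Run a fast policy every step and a slow flow planner every `replan_every` steps.

    Assumption: the planner returns at least `replan_every` future flow frames,
    and the policy consumes one frame per control step until the next replan.
    """
    if replan_every < 1:
        raise ValueError("replan_every must be >= 1")

    actions: List[Vec3] = []
    plan: List[Flow] = []
    planner_calls = 0

    for step in range(total_steps):
        obs = observe(step)
        offset = step % replan_every
        if offset == 0:
            # Slow path: predict a multi-frame flow plan from observation + text.
            plan = predict_flows(obs, instruction, replan_every)
            planner_calls += 1
        # Fast path: condition the action on the flow frame for this step.
        actions.append(act(obs, plan[offset]))

    return actions, planner_calls


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_observe(step: int) -> float:
    return float(step)


def toy_predict_flows(obs: float, instruction: str, horizon: int) -> List[Flow]:
    # Every tracked point moves 0.1 units along x in each future frame.
    return [[(0.1, 0.0, 0.0)] for _ in range(horizon)]


def toy_act(obs: float, flow: Flow) -> Vec3:
    # Use the mean point displacement as the end-effector motion command.
    n = len(flow)
    return tuple(sum(p[i] for p in flow) / n for i in range(3))


if __name__ == "__main__":
    actions, calls = run_slow_fast_loop(
        toy_observe, toy_predict_flows, toy_act,
        instruction="pick up the cup", total_steps=10, replan_every=4,
    )
    print(len(actions), calls)
```

With `total_steps=10` and `replan_every=4`, the policy runs 10 times while the planner is invoked only 3 times (at steps 0, 4, and 8), which is where the efficiency gain in this kind of scheme would come from.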