🤖 AI Summary
To address the low inference efficiency of diffusion-based policies for high-precision robotic manipulation, and their limited exploitation of 3D environmental information, this paper proposes FlowRAM, a region-aware flow matching control framework. The method combines state space models (SSMs) with conditional flow matching to obtain multimodal modeling capability at linear computational complexity. Key contributions include: (1) a Dynamic Radius Schedule that enables adaptive perception, moving from global scene layout to fine-grained geometric details; (2) a region-aware Mamba architecture that enhances 3D spatial feature extraction and long-range dependency modeling; and (3) an end-to-end control pipeline optimized for precision tasks. Evaluated on RLBench, the approach achieves state-of-the-art (SOTA) performance; on high-precision tasks it outperforms previous methods by 12.0% in average success rate. It generates accurate actions in fewer than 4 steps, which significantly speeds up inference.
📝 Abstract
Robotic manipulation in high-precision tasks is essential for numerous industrial and real-world applications where accuracy and speed are required. Yet current diffusion-based policy learning methods generally suffer from low computational efficiency due to the iterative denoising process during inference. Moreover, these methods do not fully explore the potential of generative models for enhancing information exploration in 3D environments. In response, we propose FlowRAM, a novel framework that leverages generative models to achieve region-aware perception, enabling efficient multimodal information processing. Specifically, we devise a Dynamic Radius Schedule, which allows adaptive perception, facilitating transitions from global scene comprehension to fine-grained geometric details. Furthermore, we use state space models to fuse multimodal information while preserving linear computational complexity. In addition, we employ conditional flow matching to learn action poses by regressing deterministic vector fields, simplifying the learning process while maintaining performance. We verify the effectiveness of FlowRAM on RLBench, an established manipulation benchmark, and achieve state-of-the-art performance. The results demonstrate that FlowRAM achieves a remarkable improvement, particularly in high-precision tasks, where it outperforms previous methods by 12.0% in average success rate. Additionally, FlowRAM is able to generate physically plausible actions for a variety of real-world tasks in fewer than 4 time steps, significantly increasing inference speed.
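To make the conditional flow matching step concrete, the following is a minimal sketch and not FlowRAM's implementation. It assumes the commonly used linear probability path between a noise sample and a target action (the abstract does not state which path the paper uses), and it replaces the learned, observation-conditioned network with a closed-form "oracle" velocity field for a single known target. The names `cfm_training_pair`, `euler_sample`, `oracle_velocity`, and `target_action` are hypothetical and chosen only for illustration.

```python
import numpy as np


def cfm_training_pair(x0, x1, t):
    """Build one flow matching training example on a linear path (assumed).

    x0: noise sample, x1: target action, t: scalar in [0, 1].
    Returns the interpolated point x_t and the velocity target u_t that a
    network v(x_t, t) would be regressed onto with a squared-error loss.
    """
    x_t = (1.0 - t) * x0 + t * x1
    u_t = x1 - x0  # constant along the linear path
    return x_t, u_t


def euler_sample(velocity_fn, x0, num_steps=3):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with a few Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x


# Hypothetical 3D end-effector position standing in for an action pose.
target_action = np.array([0.3, -0.1, 0.5])
rng = np.random.default_rng(0)
noise = rng.standard_normal(3)


def oracle_velocity(x, t):
    """Stand-in for the learned network: the exact field for one known target.

    A real policy would predict this from observations instead.
    """
    return (target_action - x) / (1.0 - t)


x_t, u_t = cfm_training_pair(noise, target_action, t=0.25)
action = euler_sample(oracle_velocity, noise, num_steps=3)
print(action)
```

Because the velocity along a linear path is constant, a handful of Euler steps is enough to reach the target in this idealized setting. This illustrates why regressing a deterministic vector field can require far fewer inference steps than iterative denoising; a learned field is only approximate, so the step count and accuracy in practice depend on the trained model.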