FlowRAM: Grounding Flow Matching Policy with Region-Aware Mamba Framework for Robotic Manipulation

📅 2025-06-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low inference efficiency and insufficient exploration of 3D environmental information in diffusion-based policies for robotic high-precision manipulation, this paper proposes a region-aware flow matching control framework. Our method integrates state space models (SSMs) with conditional flow matching to jointly achieve multimodal modeling capability and linear-time complexity. Key contributions include: (1) a novel dynamic radius scheduling mechanism that enables adaptive perception—from global scene layout to fine-grained geometric details; (2) a region-aware Mamba architecture that enhances 3D spatial feature extraction and long-range dependency modeling; and (3) an end-to-end control pipeline optimized for precision tasks. Evaluated on RLBench, our approach achieves a 12.0% average improvement in task success rate—setting a new state-of-the-art (SOTA). Notably, it excels in high-precision manipulation, generating accurate actions in ≤4 steps with significantly accelerated inference speed.

📝 Abstract
Robotic manipulation in high-precision tasks is essential for numerous industrial and real-world applications where accuracy and speed are required. Yet current diffusion-based policy learning methods generally suffer from low computational efficiency due to the iterative denoising process during inference. Moreover, these methods do not fully explore the potential of generative models for enhancing information exploration in 3D environments. In response, we propose FlowRAM, a novel framework that leverages generative models to achieve region-aware perception, enabling efficient multimodal information processing. Specifically, we devise a Dynamic Radius Schedule, which allows adaptive perception, facilitating transitions from global scene comprehension to fine-grained geometric details. Furthermore, we incorporate state space models to integrate multimodal information while preserving linear computational complexity. In addition, we employ conditional flow matching to learn action poses by regressing deterministic vector fields, simplifying the learning process while maintaining performance. We verify the effectiveness of FlowRAM on RLBench, an established manipulation benchmark, and achieve state-of-the-art performance. The results demonstrate that FlowRAM achieves a remarkable improvement, particularly in high-precision tasks, where it outperforms previous methods by 12.0% in average success rate. Additionally, FlowRAM is able to generate physically plausible actions for a variety of real-world tasks in less than 4 time steps, significantly increasing inference speed.
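The abstract's key training idea is conditional flow matching: rather than iteratively denoising, the network regresses a deterministic vector field that transports noise samples to action poses. A minimal sketch of such an objective, assuming a straight-line probability path and illustrative names (`velocity_net`, `cfm_loss`) that are not from the paper:

```python
import torch

def cfm_loss(velocity_net, actions, cond):
    """Conditional flow matching loss on a linear interpolation path.

    actions: (B, action_dim) target action poses
    cond:    (B, cond_dim)  conditioning features (e.g. a scene encoding)
    """
    noise = torch.randn_like(actions)           # x0 ~ N(0, I)
    t = torch.rand(actions.shape[0], 1)         # time uniform in [0, 1]
    x_t = (1 - t) * noise + t * actions         # straight-line path x_t
    target_v = actions - noise                  # constant velocity of that path
    pred_v = velocity_net(x_t, t, cond)         # network predicts the field
    return ((pred_v - target_v) ** 2).mean()    # simple MSE regression
```

Because the regression target is a deterministic field rather than per-step noise, inference can integrate the learned ODE in very few steps, which is consistent with the paper's claim of accurate actions in under 4 time steps.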
Problem

Research questions and friction points this paper is trying to address.

Improving computational efficiency in robotic manipulation policies
Enhancing 3D perception with generative models
Achieving high-precision task performance with faster inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Region-aware Mamba framework for robotic manipulation
Dynamic Radius Schedule for adaptive perception
Conditional flow matching for action poses
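The Dynamic Radius Schedule listed above shrinks the perception region as generation progresses, moving from global scene layout toward fine geometric detail. A toy sketch of one plausible schedule; the exponential form and the `r_max`/`r_min` parameters are assumptions for illustration, not the paper's published formula:

```python
def dynamic_radius(t, r_max=1.0, r_min=0.05):
    """Perception radius at normalized generation time t in [0, 1].

    Decays geometrically from r_max (global scene context at t=0)
    to r_min (fine-grained local geometry at t=1).
    """
    return r_max * (r_min / r_max) ** t
```

Any monotonically decreasing schedule would realize the same coarse-to-fine idea; the point is that the radius is tied to the generative time step rather than fixed.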
Sen Wang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Le Wang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Sanpin Zhou
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Jingyi Tian
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Jiayi Li
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Haowen Sun
Department of Automation, Tsinghua University
Wei Tang
University of Illinois at Chicago