3D FlowMatch Actor: Unified 3D Policy for Single- and Dual-Arm Manipulation

📅 2025-08-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the lack of a unified, efficient policy architecture for both single-arm and bimanual robotic manipulation, this paper proposes 3D FlowMatch Actor (3DFA), a 3D policy that uses flow matching for trajectory prediction. It combines 3D pretrained visual scene representations with 3D relative attention between action and visual tokens during action denoising, and handles single-arm and bimanual control with one architecture. Together with system-level and architectural optimizations, flow matching makes 3DFA over 30x faster in training and inference than previous 3D diffusion-based policies, without sacrificing performance. On the bimanual PerAct2 benchmark, it outperforms the next-best method by an absolute margin of 41.4%; on 74 single-arm RLBench tasks, it sets a new state of the art by directly predicting dense end-effector trajectories. In real-world evaluations, it surpasses strong baselines that have up to 1000x more parameters and significantly more pretraining.

📝 Abstract

We present 3D FlowMatch Actor (3DFA), a 3D policy architecture for robot manipulation that combines flow matching for trajectory prediction with 3D pretrained visual scene representations for learning from demonstration. 3DFA leverages 3D relative attention between action and visual tokens during action denoising, building on prior work in 3D diffusion-based single-arm policy learning. Through a combination of flow matching and targeted system-level and architectural optimizations, 3DFA achieves over 30x faster training and inference than previous 3D diffusion-based policies, without sacrificing performance. On the bimanual PerAct2 benchmark, it establishes a new state of the art, outperforming the next-best method by an absolute margin of 41.4%. In extensive real-world evaluations, it surpasses strong baselines with up to 1000x more parameters and significantly more pretraining. In unimanual settings, it sets a new state of the art on 74 RLBench tasks by directly predicting dense end-effector trajectories, eliminating the need for motion planning. Comprehensive ablation studies underscore the importance of our design choices for both policy effectiveness and efficiency.
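The abstract does not spell out the flow-matching formulation, so the following is only a minimal sketch of the common linear-path (rectified-flow style) variant, not the authors' implementation. It shows how a training target is built for a noisy trajectory and how a learned velocity field could be integrated with a few Euler steps at inference. The function names, the trajectory shape, and the step count are all hypothetical.

```python
import numpy as np

def flow_matching_pair(x1, rng):
    """Build one training example for a linear-path flow-matching loss.

    x1 is a demonstrated trajectory, e.g. shape (horizon, action_dim).
    Returns the interpolated point, its time, and the velocity target.
    """
    x0 = rng.standard_normal(x1.shape)   # noise sample at t = 0
    t = rng.uniform()                    # time drawn from [0, 1)
    xt = (1.0 - t) * x0 + t * x1         # point on the straight path
    target_v = x1 - x0                   # constant velocity along that path
    return xt, t, target_v

def euler_sample(velocity_fn, x0, num_steps=5):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x
```

In training, a network would regress `target_v` from `xt`, `t`, and the scene features with a squared-error loss. Because the assumed path is a straight line, only a handful of integration steps may be needed at inference, which is one plausible reason such a formulation can be faster than many-step diffusion denoising.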

Problem

Research questions and friction points this paper is trying to address.

Unified 3D policy for single- and dual-arm robot manipulation

Fast training and inference for 3D diffusion-based policies

State-of-the-art performance on bimanual and unimanual tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines flow matching with 3D visual representations

Leverages 3D relative attention for action denoising

Achieves over 30x faster training and inference than previous 3D diffusion-based policies
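The page does not say how the 3D relative attention is computed. One common way to make attention depend only on relative 3D positions is rotary position encoding applied per coordinate axis; the sketch below illustrates that idea under this assumption and should not be read as the paper's exact mechanism. The function names, feature layout, and frequency base are hypothetical.

```python
import numpy as np

def rotary_3d(x, pos, base=10000.0):
    """Rotate feature pairs by angles proportional to each token's 3D position.

    x:   (n, d) float features, with d divisible by 6
    pos: (n, 3) token positions (x, y, z)
    """
    n, d = x.shape
    assert d % 6 == 0, "need an even number of channels per axis"
    d_axis = d // 3
    freqs = base ** (-np.arange(0, d_axis, 2) / d_axis)  # (d_axis / 2,)
    out = np.empty_like(x)
    for axis in range(3):
        chunk = x[:, axis * d_axis:(axis + 1) * d_axis]
        angle = pos[:, axis:axis + 1] * freqs[None, :]    # (n, d_axis / 2)
        cos, sin = np.cos(angle), np.sin(angle)
        even, odd = chunk[:, 0::2], chunk[:, 1::2]
        rotated = np.empty_like(chunk)
        rotated[:, 0::2] = even * cos - odd * sin
        rotated[:, 1::2] = even * sin + odd * cos
        out[:, axis * d_axis:(axis + 1) * d_axis] = rotated
    return out

def relative_attention_logits(q, k, q_pos, k_pos):
    """Scaled dot-product logits between position-rotated queries and keys."""
    d = q.shape[1]
    return rotary_3d(q, q_pos) @ rotary_3d(k, k_pos).T / np.sqrt(d)
```

With this construction, shifting every action-token and visual-token position by the same offset leaves the logits unchanged, so attention depends on where tokens sit relative to one another rather than on absolute workspace coordinates.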

🔎 Similar Papers

Omnigrasp: Grasping Diverse Objects with Simulated Humanoids

2024-07-16 | Neural Information Processing Systems | Citations: 16