🤖 AI Summary
Existing vision-language-action (VLA) models rely primarily on 2D visual representations, which limits their fine-grained 3D spatial understanding and manipulation precision. This work proposes PointACT, a dual-system 3D-aware VLA policy that integrates hierarchical 3D point cloud representations directly into action decoding. Through multi-scale point-action interaction and efficient bottleneck window self-attention, action tokens attend to both fine-grained local geometry and global scene structure. On the LIBERO and RLBench benchmarks, PointACT consistently outperforms monolithic and dual-system VLA baselines, raising success rates by 10% on RLBench-10Tasks over state-of-the-art pretrained VLAs, with even larger gains when the vision-language backbone is frozen and the action expert is trained from scratch.
📝 Abstract
Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation by leveraging large pretrained vision-language backbones. However, most existing VLAs rely primarily on 2D visual representations, which limit their ability to reason about fine-grained geometry and spatial grounding, capabilities that are essential for precise and robust manipulation in 3D environments. In this paper, we propose PointACT, a dual-system 3D-aware VLA policy that integrates hierarchical 3D point cloud representations directly into the action decoding process. PointACT employs a multi-scale point-action interaction mechanism with efficient bottleneck window self-attention, enabling evolving action tokens to densely attend to both local geometric detail and global scene structure. We evaluate PointACT on the LIBERO and RLBench benchmarks and systematically compare it against monolithic and dual-system VLA baselines, including variants augmented with point cloud inputs. PointACT achieves consistent improvements across both benchmarks, increasing success rates by 10% on the challenging RLBench-10Tasks suite over state-of-the-art pretrained VLAs, with even larger gains when the vision-language backbone is frozen and the action expert is trained from scratch. Extensive ablation studies demonstrate that tightly coupling hierarchical 3D geometry with pretrained 2D semantic representations is critical for robust and spatially grounded robot control. Our results also highlight the promise of pretrained 3D representations for 3D-aware VLA policies.