🤖 AI Summary
Current vision-driven robotic manipulation methods suffer from inaccurate action inference in complex dynamic scenes, primarily because the prevailing vision-to-action (V-A) and vision-to-3D-to-action (V-3D-A) paradigms fail to jointly model scene evolution and action generation. To address this, we propose the V-4D-A framework and the Gaussian Action Field (GAF). Our approach introduces the first motion-aware 4D Gaussian field, embedding learnable motion attributes into 3D Gaussian Splatting to enable unified scene reconstruction, future-frame prediction, and initial action estimation. Furthermore, we design a GAF-guided diffusion model for fine-grained action refinement. Experiments demonstrate substantial improvements: a +11.54 dB gain in PSNR and a 0.56 reduction in LPIPS for reconstruction quality, and a +10.33% average success rate gain across robotic manipulation tasks, significantly outperforming state-of-the-art methods.
📝 Abstract
Accurate action inference is critical for vision-based robotic manipulation. Existing approaches typically follow either a Vision-to-Action (V-A) paradigm, predicting actions directly from visual inputs, or a Vision-to-3D-to-Action (V-3D-A) paradigm, leveraging intermediate 3D representations. However, these methods often produce inaccurate actions due to the complexity and dynamic nature of manipulation scenes. In this paper, we propose a V-4D-A framework that enables direct action reasoning from motion-aware 4D representations via a Gaussian Action Field (GAF). GAF extends 3D Gaussian Splatting (3DGS) by incorporating learnable motion attributes, allowing simultaneous modeling of dynamic scenes and manipulation actions. To learn time-varying scene geometry and action-aware robot motion, GAF supports three key query types: reconstruction of the current scene, prediction of future frames, and estimation of the initial action from robot motion. Furthermore, the high-quality current and future frames generated by GAF facilitate manipulation action refinement through a GAF-guided diffusion model. Extensive experiments demonstrate significant improvements, with GAF achieving a +11.5385 dB gain in PSNR and a 0.5574 reduction in LPIPS for reconstruction quality, while boosting the average success rate on robotic manipulation tasks by 10.33% over state-of-the-art methods. Project page: http://chaiying1.github.io/GAF.github.io/project_page/
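The core idea of attaching learnable motion attributes to Gaussians, and answering the three query types (reconstruction, prediction, initial action) from one representation, can be illustrated with a minimal sketch. This is a hypothetical simplification with invented names and shapes, not the paper's actual model: each Gaussian center carries a per-Gaussian velocity, querying at a time offset advects the centers, and a coarse initial action is read off as the mean displacement of the Gaussians tagged as belonging to the gripper.

```python
import numpy as np

class MotionAwareGaussians:
    """Toy motion-aware Gaussian field (illustrative only).

    Stores Gaussian centers plus a learnable per-Gaussian velocity
    attribute; covariances, opacities, and rasterization are omitted.
    """

    def __init__(self, positions, velocities):
        self.positions = np.asarray(positions, dtype=float)    # (N, 3) centers
        self.velocities = np.asarray(velocities, dtype=float)  # (N, 3) motion attrs

    def query(self, dt):
        """Advect centers forward by dt.

        dt == 0 -> current-scene reconstruction
        dt > 0  -> future-frame prediction
        """
        return self.positions + dt * self.velocities

    def initial_action(self, gripper_mask, dt):
        """Coarse action estimate: mean displacement over horizon dt
        of the Gaussians attached to the robot gripper."""
        disp = dt * self.velocities[gripper_mask]
        return disp.mean(axis=0)

# Two gripper Gaussians moving along +x plus one static scene Gaussian.
field = MotionAwareGaussians(
    positions=[[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [1.0, 1.0, 1.0]],
    velocities=[[0.2, 0.0, 0.0], [0.2, 0.0, 0.0], [0.0, 0.0, 0.0]],
)
current = field.query(0.0)   # reconstruction: centers unchanged
future = field.query(0.5)    # prediction: centers advected 0.5 s
action = field.initial_action(np.array([True, True, False]), 0.5)
```

In the actual method, such an initial estimate would only seed the GAF-guided diffusion model, which refines the action using the rendered current and future frames.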