🤖 AI Summary
Current vision-driven robotic manipulation methods suffer from inaccurate action inference in complex dynamic scenes, primarily because the prevailing vision-to-action (V-A) and vision-to-3D-to-action (V-3D-A) paradigms fail to jointly model scene evolution and action generation. To address this, we propose the V-4D-A framework and the Gaussian Action Field (GAF). Our approach introduces the first motion-aware 4D Gaussian field, embedding learnable motion attributes into 3D Gaussian Splatting to enable unified scene reconstruction, future-frame prediction, and initial action estimation. Furthermore, we design a GAF-guided diffusion model for fine-grained action refinement. Experiments demonstrate substantial improvements: a +11.54 dB gain in PSNR and a 0.56 reduction in LPIPS for reconstruction quality, and a +10.33% average success rate gain across robotic manipulation tasks, significantly outperforming state-of-the-art methods.
📝 Abstract
Accurate action inference is critical for vision-based robotic manipulation. Existing approaches typically follow either a Vision-to-Action (V-A) paradigm, predicting actions directly from visual inputs, or a Vision-to-3D-to-Action (V-3D-A) paradigm, leveraging intermediate 3D representations. However, these methods often produce inaccurate actions due to the complexity and dynamic nature of manipulation scenes. In this paper, we propose a V-4D-A framework that enables direct action reasoning from motion-aware 4D representations via a Gaussian Action Field (GAF). GAF extends 3D Gaussian Splatting (3DGS) by incorporating learnable motion attributes, allowing simultaneous modeling of dynamic scenes and manipulation actions. To learn time-varying scene geometry and action-aware robot motion, GAF supports three key query types: reconstruction of the current scene, prediction of future frames, and estimation of the initial action from robot motion. Furthermore, the high-quality current and future frames generated by GAF facilitate manipulation action refinement through a GAF-guided diffusion model. Extensive experiments demonstrate significant improvements, with GAF achieving a +11.5385 dB gain in PSNR and a 0.5574 reduction in LPIPS for reconstruction quality, while boosting the average success rate on robotic manipulation tasks by 10.33% over state-of-the-art methods. Project page: http://chaiying1.github.io/GAF.github.io/project_page/
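The core idea of attaching learnable motion attributes to Gaussians, and answering the three query types (reconstruction, prediction, initial action) from one representation, can be illustrated with a minimal sketch. This is a hypothetical simplification with invented names and shapes, not the paper's actual model: each Gaussian center carries a per-Gaussian velocity, querying at a time offset advects the centers, and a coarse initial action is read off as the mean displacement of the Gaussians tagged as belonging to the gripper.

```python
import numpy as np

class MotionAwareGaussians:
    """Toy motion-aware Gaussian field (illustrative only).

    Stores Gaussian centers plus a learnable per-Gaussian velocity
    attribute; covariances, opacities, and rasterization are omitted.
    """

    def __init__(self, positions, velocities):
        self.positions = np.asarray(positions, dtype=float)    # (N, 3) centers
        self.velocities = np.asarray(velocities, dtype=float)  # (N, 3) motion attrs

    def query(self, dt):
        """Advect centers forward by dt.

        dt == 0 -> current-scene reconstruction
        dt > 0  -> future-frame prediction
        """
        return self.positions + dt * self.velocities

    def initial_action(self, gripper_mask, dt):
        """Coarse action estimate: mean displacement over horizon dt
        of the Gaussians attached to the robot gripper."""
        disp = dt * self.velocities[gripper_mask]
        return disp.mean(axis=0)

# Two gripper Gaussians moving along +x plus one static scene Gaussian.
field = MotionAwareGaussians(
    positions=[[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [1.0, 1.0, 1.0]],
    velocities=[[0.2, 0.0, 0.0], [0.2, 0.0, 0.0], [0.0, 0.0, 0.0]],
)
current = field.query(0.0)   # reconstruction: centers unchanged
future = field.query(0.5)    # prediction: centers advected 0.5 s
action = field.initial_action(np.array([True, True, False]), 0.5)
```

In the actual method, such an initial estimate would only seed the GAF-guided diffusion model, which refines the action using the rendered current and future frames.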