GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation

📅 2025-12-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Vision-language-action (VLA) models are largely reactive and 2D-centric, which limits their robustness in fine-grained 3D robotic manipulation. GeoPredict addresses this with a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. It introduces (1) a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of the robot arms, and (2) a predictive 3D Gaussian geometry module that forecasts workspace geometry, with track-guided refinement along the predicted keypoint trajectories. Both modules act only as training-time supervision through depth-based rendering; inference needs only lightweight additional query tokens and no 3D decoding. On RoboCasa Human-50, LIBERO, and real-robot tasks, GeoPredict consistently outperforms strong VLA baselines, particularly in geometry-intensive and spatially demanding scenarios.

📝 Abstract

Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.
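The split between training-time supervision and lightweight inference described in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the function names, the loss forms (mean squared error for actions and keypoints, L1 for depth), and the weights `w_traj` and `w_depth` are hypothetical and are not taken from the paper.

```python
import numpy as np

def training_loss(pred_actions, target_actions,
                  pred_keypoints, target_keypoints,
                  rendered_depth, target_depth,
                  w_traj=1.0, w_depth=1.0):
    """Action loss plus two auxiliary losses that exist only during training.

    Loss forms and weights are illustrative assumptions, not the paper's.
    """
    # Main objective: match the demonstrated continuous actions.
    action_loss = np.mean((pred_actions - target_actions) ** 2)
    # Auxiliary 1: future 3D keypoint trajectory of the arm.
    traj_loss = np.mean((pred_keypoints - target_keypoints) ** 2)
    # Auxiliary 2: depth rendered from predicted geometry vs. observed depth.
    depth_loss = np.mean(np.abs(rendered_depth - target_depth))
    return action_loss + w_traj * traj_loss + w_depth * depth_loss

def inference_step(policy_fn, observation_tokens, query_tokens):
    """At inference only extra query tokens are appended; no 3D decoder runs."""
    tokens = np.concatenate([observation_tokens, query_tokens], axis=0)
    return policy_fn(tokens)
```

In this sketch the trajectory and depth terms only shape the learned representation; `inference_step` never calls a geometry decoder, which mirrors the claim that inference adds only query tokens.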

Problem

Research questions and friction points this paper is trying to address.

VLA policies are largely reactive and 2D-centric, with no forward-looking kinematic reasoning

Precision is unreliable in geometry-intensive manipulation tasks that require 3D reasoning

Predictive 3D supervision should not add 3D decoding cost at inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

Predicts multi-step 3D keypoint trajectories for robot arms

Forecasts workspace geometry using 3D Gaussian models

Uses predictive modules only during training for supervision
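Depth-based rendering supervision over 3D Gaussians is commonly implemented with front-to-back alpha compositing along each camera ray. The sketch below shows that standard compositing rule for a single ray. It is an assumption that GeoPredict's renderer follows this form, since the abstract does not specify it, and the function name `composite_depth` is hypothetical.

```python
import numpy as np

def composite_depth(alphas, depths):
    """Alpha-composite per-Gaussian depths along one ray.

    alphas, depths: shape (N,), sorted front to back.
    Returns the rendered depth and the per-Gaussian blending weights.
    """
    alphas = np.asarray(alphas, dtype=float)
    depths = np.asarray(depths, dtype=float)
    # Transmittance before each Gaussian: T_i = prod_{j<i} (1 - alpha_j)
    transmittance = np.concatenate([[1.0], np.cumprod(1.0 - alphas[:-1])])
    # Each Gaussian contributes its opacity scaled by the light that reaches it.
    weights = transmittance * alphas
    return float(np.sum(weights * depths)), weights
```

Because every operation is a product or a sum, the rendered depth is differentiable with respect to the opacities and depths, so a depth loss can propagate gradients back to the predicted Gaussians during training.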

🔎 Similar Papers

Stable Object Placement Under Geometric Uncertainty via Differentiable Contact Dynamics