Improving Robotic Manipulation with Efficient Geometry-Aware Vision Encoder

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing RGB-based imitation learning methods rely on generic vision encoders (e.g., ResNet, ViT) that lack explicit 3D geometric modeling capabilities, limiting robotic manipulation performance. To address this, we propose eVGGT, a lightweight geometry-aware visual encoder distilled from VGGT that retains strong 3D reasoning while achieving an 8.7× inference speedup and an 80% parameter reduction. eVGGT integrates seamlessly into mainstream imitation learning frameworks (e.g., ACT, DP), jointly processing RGB inputs and explicit geometric representations for policy learning. Evaluated in both simulation and real-robot settings, eVGGT improves single- and dual-arm manipulation success rates by up to 6.5% over baseline methods. It thus bridges high-fidelity 3D understanding with real-time deployability, advancing geometrically grounded visuomotor control.

📝 Abstract
Existing RGB-based imitation learning approaches typically employ traditional vision encoders such as ResNet or ViT, which lack explicit 3D reasoning capabilities. Recent geometry-grounded vision models, such as VGGT (Wang et al., 2025), provide robust spatial understanding and are promising candidates to address this limitation. This work investigates the integration of geometry-aware visual representations into robotic manipulation. Our results suggest that incorporating a geometry-aware vision encoder into imitation learning frameworks, including ACT and DP, yields up to a 6.5% improvement in success rate over standard vision encoders across single- and bi-manual manipulation tasks in both simulation and real-world settings. Despite these benefits, most geometry-grounded models incur high computational cost, limiting their deployment in practical robotic systems. To address this challenge, we propose eVGGT, an efficient geometry-aware encoder distilled from VGGT. eVGGT is nearly 9 times faster and 5 times smaller than VGGT, while preserving strong 3D reasoning capabilities. Code and pretrained models will be released to facilitate further research in geometry-aware robotics.
Problem

Research questions and friction points this paper is trying to address.

Improving robotic manipulation via geometry-aware vision
Addressing computational cost in geometry-grounded vision models
Enhancing imitation learning with efficient 3D reasoning capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometry-aware vision encoder integration
Efficient encoder distillation from VGGT
Improved imitation learning success rates
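The distillation idea behind eVGGT can be illustrated with a minimal feature-matching sketch: a compact student is trained to reproduce the features of a larger frozen teacher by minimizing a mean-squared error between their outputs. Everything below is an illustrative assumption, not the paper's implementation; the "teacher" is a fixed random linear map standing in for VGGT, and the student is a single linear layer trained with plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for the large frozen teacher (e.g., VGGT):
# a fixed linear projection from a 32-d input to 8-d "geometric" features.
def teacher_features(x):
    W_teacher = np.linspace(-1.0, 1.0, 32 * 8).reshape(32, 8)
    return x @ W_teacher

# The student has the same interface but its weights are learned
# (in practice the student would be a much smaller network, not linear).
W_student = rng.normal(scale=0.1, size=(32, 8))

lr = 0.05
for step in range(500):
    x = rng.normal(size=(16, 32))        # batch of encoder inputs
    t = teacher_features(x)              # frozen teacher targets
    s = x @ W_student                    # student prediction
    grad = x.T @ (s - t) / x.shape[0]    # gradient of 0.5 * MSE
    W_student -= lr * grad               # distillation update

# After training, the student closely matches the teacher on held-out data.
x_test = rng.normal(size=(4, 32))
err = np.mean((x_test @ W_student - teacher_features(x_test)) ** 2)
print(f"distillation MSE: {err:.6f}")
```

The same feature-matching loss applies unchanged when the teacher and student are deep encoders; the point of distillation here is that the student ends up much smaller and faster while approximating the teacher's representation.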