Learning Geometrically-Grounded 3D Visual Representations for View-Generalizable Robotic Manipulation

📅 2026-01-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two challenges for robotic systems operating under single-view constraints: weak spatial understanding and poor cross-view generalization. To this end, the authors propose a unified representation–policy learning framework built on a novel single-view 3D pretraining paradigm that integrates point cloud reconstruction with feed-forward Gaussian splatting. A multi-step knowledge distillation process then transfers the geometry-aware representations to downstream manipulation policies. Evaluated on 12 RLBench tasks, the method surpasses the state of the art by 12.7% in average success rate. Moreover, under large viewpoint shifts its zero-shot success rate drops by only 29.7%, compared with a 51.5% drop for the prior state of the art, demonstrating strong viewpoint generalization.

📝 Abstract
Real-world robotic manipulation demands visuomotor policies with robust spatial scene understanding and strong generalization across diverse camera viewpoints. While recent advances in 3D-aware visual representations have shown promise, they still suffer from several key limitations: reliance on multi-view observations during inference, which is impractical in single-view-restricted scenarios; incomplete scene modeling that fails to capture the holistic and fine-grained geometric structures essential for precise manipulation; and a lack of effective policy training strategies to retain and exploit the acquired 3D knowledge. To address these challenges, we present MethodName, a unified representation–policy learning framework for view-generalizable robotic manipulation. MethodName introduces a single-view 3D pretraining paradigm that leverages point cloud reconstruction and feed-forward Gaussian splatting under multi-view supervision to learn holistic geometric representations. During policy learning, MethodName performs multi-step distillation to preserve the pretrained geometric understanding and effectively transfer it to manipulation skills. We conduct experiments on 12 RLBench tasks, where our approach outperforms the previous state-of-the-art method by 12.7% in average success rate. Further evaluation on six representative tasks demonstrates strong zero-shot view generalization, with success-rate drops of only 22.0% and 29.7% under moderate and large viewpoint shifts, respectively, whereas the state-of-the-art method suffers larger decreases of 41.6% and 51.5%.
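The paper does not publish its training code, but the distillation idea described in the abstract — a policy whose visual features are pulled toward those of a frozen, 3D-pretrained encoder while also imitating expert actions — can be sketched as a combined loss. Everything below is a hypothetical illustration, not the authors' implementation: the function name, the L2 behavior-cloning term, the cosine-distance distillation term, and the `alpha` weighting are all assumptions.

```python
import numpy as np

def combined_loss(student_feat, teacher_feat, action_pred, action_gt, alpha=0.5):
    """Hypothetical sketch of policy learning with feature distillation.

    student_feat, teacher_feat: (N, D) visual embeddings; the teacher stands
    in for the frozen 3D-pretrained encoder described in the abstract.
    action_pred, action_gt: (N, A) predicted and expert actions.
    """
    # Behavior-cloning term: mean squared error on actions (an assumption;
    # the paper does not specify its imitation objective).
    bc = np.mean((action_pred - action_gt) ** 2)

    # Distillation term: cosine distance between L2-normalized features,
    # encouraging the student to retain the pretrained geometric representation.
    s = student_feat / np.linalg.norm(student_feat, axis=1, keepdims=True)
    t = teacher_feat / np.linalg.norm(teacher_feat, axis=1, keepdims=True)
    distill = np.mean(1.0 - np.sum(s * t, axis=1))

    return bc + alpha * distill
```

When student and teacher features coincide and actions match the expert, both terms vanish; the `alpha` knob trades off imitation fidelity against retaining the pretrained geometry.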
Problem

Research questions and friction points this paper is trying to address.

view-generalizable manipulation
3D visual representations
single-view 3D reconstruction
geometric scene understanding
visuomotor policy
Innovation

Methods, ideas, or system contributions that make the work stand out.

single-view 3D pretraining
geometric representation learning
view-generalizable manipulation
multi-step distillation
Gaussian splatting