View-Invariant Policy Learning via Zero-Shot Novel View Synthesis

📅 2024-09-05
🏛️ arXiv.org
📈 Citations: 5
Influential: 0
🤖 AI Summary
To address the poor generalization of visuomotor policies across camera viewpoints, this paper proposes View Synthesis Augmentation (VISTA), a framework that directly uses zero-shot, single-image novel view synthesis models, which encode implicit 3D scene priors, as a data augmentation technique for policy learning, without requiring real multi-view demonstrations. By synthesizing virtual observations from varied camera poses, VISTA trains end-to-end viewpoint-robust visuomotor policies. Unlike approaches that rely on multi-view annotations or explicit geometric priors, VISTA removes these dependencies and substantially improves out-of-distribution viewpoint generalization in both simulated and real-world robotic manipulation tasks, outperforming multi-view supervised baselines and standard data augmentation methods. The result is a lightweight path toward deployable, general-purpose manipulation policies grounded in synthetic view diversity rather than explicit geometric modeling or costly multi-view supervision.
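
As a rough illustration of how such view-synthesis augmentation can be wired into a policy-learning data pipeline, the sketch below replaces a random subset of demonstration frames with renderings from perturbed virtual viewpoints. The `synthesize_view` callable, the perturbation ranges, and the augmentation probability are illustrative stand-ins for the pretrained zero-shot novel view synthesis model and hyperparameters used in the paper, not its actual implementation.

```python
import math
import random

def sample_camera_perturbation(max_yaw_deg=30.0, max_pitch_deg=15.0, max_radius_delta=0.1):
    """Sample a random relative camera pose offset (yaw, pitch in radians; radius offset)."""
    yaw = math.radians(random.uniform(-max_yaw_deg, max_yaw_deg))
    pitch = math.radians(random.uniform(-max_pitch_deg, max_pitch_deg))
    radius = random.uniform(-max_radius_delta, max_radius_delta)
    return yaw, pitch, radius

def augment_demonstrations(demos, synthesize_view, aug_prob=0.5):
    """Replace a random subset of demonstration frames with synthesized novel views.

    `demos` is a list of trajectories, each a list of (image, action) pairs.
    `synthesize_view(image, relative_pose)` is assumed to wrap a pretrained zero-shot,
    single-image novel view synthesis model (a placeholder, not the paper's API).
    """
    augmented = []
    for trajectory in demos:
        new_trajectory = []
        for image, action in trajectory:
            if random.random() < aug_prob:
                relative_pose = sample_camera_perturbation()
                # Render the same scene from a virtual camera; the action label is unchanged.
                image = synthesize_view(image, relative_pose)
            new_trajectory.append((image, action))
        augmented.append(new_trajectory)
    return augmented
```

The key design choice is that only the observation is perturbed: the demonstrated action stays attached to the synthesized frame, so the policy learns to produce the same behavior regardless of where the camera sits.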

📝 Abstract
Large-scale visuomotor policy learning is a promising approach toward developing generalizable manipulation systems. Yet, policies that can be deployed on diverse embodiments, environments, and observational modalities remain elusive. In this work, we investigate how knowledge from large-scale visual data of the world may be used to address one axis of variation for generalizable manipulation: observational viewpoint. Specifically, we study single-image novel view synthesis models, which learn 3D-aware scene-level priors by rendering images of the same scene from alternate camera viewpoints given a single input image. For practical application to diverse robotic data, these models must operate zero-shot, performing view synthesis on unseen tasks and environments. We empirically analyze view synthesis models within a simple data-augmentation scheme that we call View Synthesis Augmentation (VISTA) to understand their capabilities for learning viewpoint-invariant policies from single-viewpoint demonstration data. Upon evaluating the robustness of policies trained with our method to out-of-distribution camera viewpoints, we find that they outperform baselines in both simulated and real-world manipulation tasks. Videos and additional visualizations are available at https://s-tian.github.io/projects/vista.
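
To make the training side concrete, below is a minimal behavior-cloning loop, assuming a PyTorch setup, that applies novel-view augmentation on the fly: observations are occasionally swapped for synthesized views while the supervised actions are left untouched, which is what encourages viewpoint invariance. The `policy`, `demos`, and `synthesize_view` objects are hypothetical placeholders, not components released with the paper.

```python
import random
import torch
import torch.nn as nn

def train_viewpoint_robust_policy(policy, demos, synthesize_view, epochs=10, aug_prob=0.5, lr=1e-4):
    """Behavior cloning on single-viewpoint demos with on-the-fly novel-view augmentation.

    `policy` maps an image tensor of shape (B, C, H, W) to an action tensor; `demos`
    yields (image, action) tensor pairs with image shape (C, H, W); `synthesize_view`
    wraps a pretrained zero-shot novel view synthesis model. All three are stand-ins.
    """
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for image, action in demos:
            if random.random() < aug_prob:
                # Swap the observation for a rendering from a random virtual viewpoint;
                # the supervised action stays the same.
                image = synthesize_view(image)
            pred = policy(image.unsqueeze(0))
            loss = nn.functional.mse_loss(pred, action.unsqueeze(0))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```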
Problem

Research questions and friction points this paper is trying to address.

View-invariant policy learning
Zero-shot novel view synthesis
Generalizable manipulation systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot novel view synthesis
Single-image 3D scene rendering
Viewpoint-invariant policy learning