🤖 AI Summary
This work addresses the performance degradation that end-to-end robotic manipulation models suffer when the camera viewpoint shifts at test time. It proposes VistaBot, a viewpoint-robust closed-loop manipulation framework that generalizes across views without requiring camera calibration at test time. The approach integrates a feed-forward geometric model with a video diffusion model and combines three components: 4D geometry estimation, geometry-aware view synthesis, and a latent action planner. The framework is integrated into both the ACT and π₀ policy architectures. A further contribution is the View Generalization Score (VGS), a new metric for evaluating cross-view generalization. Across simulation and real-world tasks, the method improves VGS by factors of 2.79 and 2.63 over ACT and π₀, respectively, while also generating high-quality novel-view images.
📝 Abstract
Recently, end-to-end robotic manipulation models have gained significant attention for their generalizability and scalability. However, when trained with a fixed camera, they often suffer from limited robustness to camera viewpoint changes. In this paper, we propose VistaBot, a novel framework that integrates feed-forward geometric models with video diffusion models to achieve view-robust closed-loop manipulation without requiring camera calibration at test time. Our approach consists of three key components: 4D geometry estimation, view synthesis latent extraction, and latent action learning. VistaBot is integrated into both action-chunking (ACT) and diffusion-based ($\pi_0$) policies and evaluated across simulation and real-world tasks. We further introduce the View Generalization Score (VGS) as a new metric for comprehensive evaluation of cross-view generalization. Results show that VistaBot improves VGS by 2.79$\times$ and 2.63$\times$ over ACT and $\pi_0$, respectively, while also achieving high-quality novel view synthesis. Our contributions include a geometry-aware synthesis model, a latent action planner, a new benchmark metric, and extensive validation across diverse environments. The code and models will be made publicly available.