VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

📅 2026-04-23
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the performance degradation that end-to-end robotic manipulation models suffer when the camera viewpoint at test time differs from the one used in training. It proposes a view-robust closed-loop manipulation framework that generalizes across views without requiring camera calibration at test time. The approach integrates a feed-forward geometric model with a video diffusion model and comprises three components: 4D geometry estimation, geometry-aware view synthesis, and a latent action planner. It is integrated into the ACT and π₀ policy architectures. The work also introduces the View Generalization Score (VGS), a new metric for evaluating cross-view generalization. Evaluated across simulation and real-world tasks, the method improves VGS by 2.79× over ACT and 2.63× over π₀, while also generating high-quality novel-view images.

📝 Abstract
Recently, end-to-end robotic manipulation models have gained significant attention for their generalizability and scalability. However, they often suffer from limited robustness to camera viewpoint changes when trained with a fixed camera. In this paper, we propose VistaBot, a novel framework that integrates feed-forward geometric models with video diffusion models to achieve view-robust closed-loop manipulation without requiring camera calibration at test time. Our approach consists of three key components: 4D geometry estimation, view synthesis latent extraction, and latent action learning. VistaBot is integrated into both action-chunking (ACT) and diffusion-based ($\pi_0$) policies and evaluated across simulation and real-world tasks. We further introduce the View Generalization Score (VGS) as a new metric for comprehensive evaluation of cross-view generalization. Results show that VistaBot improves VGS by 2.79$\times$ and 2.63$\times$ over ACT and $\pi_0$, respectively, while also achieving high-quality novel view synthesis. Our contributions include a geometry-aware synthesis model, a latent action planner, a new benchmark metric, and extensive validation across diverse environments. The code and models will be made publicly available.
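
The three components named in the abstract can be read as a chain from camera frames to actions. The sketch below is only an illustration of that data flow, not the authors' implementation: the class `ViewRobustPolicy`, the stage functions `toy_geometry`, `toy_latent`, and `toy_planner`, and all data types are hypothetical placeholders, and each stage would be a learned model in a real system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Frame = List[List[float]]   # toy stand-in for one camera image
Latent = List[float]        # toy stand-in for a view-synthesis latent
Action = List[float]        # toy stand-in for one robot command


@dataclass
class ViewRobustPolicy:
    """Hypothetical wrapper chaining the three stages named in the abstract."""
    estimate_geometry: Callable[[Sequence[Frame]], Dict[str, float]]
    extract_latent: Callable[[Dict[str, float]], Latent]
    plan_actions: Callable[[Latent], List[Action]]

    def act(self, frames: Sequence[Frame]) -> List[Action]:
        geometry = self.estimate_geometry(frames)  # stage 1: 4D geometry estimation
        latent = self.extract_latent(geometry)     # stage 2: view synthesis latent extraction
        return self.plan_actions(latent)           # stage 3: actions predicted from the latent


# Toy placeholder stages (assumptions for illustration only).
def toy_geometry(frames: Sequence[Frame]) -> Dict[str, float]:
    pixels = [p for frame in frames for row in frame for p in row]
    return {"num_frames": float(len(frames)), "mean": sum(pixels) / len(pixels)}


def toy_latent(geometry: Dict[str, float]) -> Latent:
    return [geometry["mean"], geometry["num_frames"]]


def toy_planner(latent: Latent, chunk_size: int = 3) -> List[Action]:
    # Emit a short chunk of actions, loosely mirroring action-chunking policies.
    return [[latent[0] * (k + 1)] for k in range(chunk_size)]


policy = ViewRobustPolicy(toy_geometry, toy_latent, toy_planner)
frames = [[[0.0, 1.0], [1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
actions = policy.act(frames)
print(actions)
```

In a closed-loop setting, `act` would be called again on each new batch of frames, so the policy never needs explicit camera calibration as an input; whether the real system is structured this way is an assumption of this sketch.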

Problem

Research questions and friction points this paper is trying to address.

view robustness
robot manipulation
camera viewpoint changes
cross-view generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

view-robust manipulation
spatiotemporal-aware view synthesis
4D geometry estimation
latent action learning
View Generalization Score