🤖 AI Summary
This work addresses the challenge of novel view synthesis in autonomous driving, where trajectories beyond the original camera paths lack ground-truth supervision. To tackle this, the authors propose a self-supervised inpainting framework that reformulates view extrapolation as an inpainting task via a "Virtual-Shift" strategy: monocular depth proxies simulate the occlusion patterns of shifted viewpoints, so the original images can serve as pixel-level supervision. The method also explicitly models photometric discrepancies and calibration errors across multiple cameras through a Pseudo-3D Seam Synthesis strategy. By integrating monocular depth priors, self-supervised inpainting, and multi-view consistency constraints, all without requiring LiDAR supervision, the approach significantly improves geometric fidelity and visual quality, enabling scalable, high-fidelity driving simulation.
📝 Abstract
A fundamental bottleneck in Novel View Synthesis (NVS) for autonomous driving is the inherent supervision gap on novel trajectories: models must synthesize unseen views at inference, yet lack ground-truth images for these shifted poses during training. In this paper, we propose VisionNVS, a camera-only framework that fundamentally reformulates view synthesis from an ill-posed extrapolation problem into a self-supervised inpainting task. By introducing a "Virtual-Shift" strategy, we use monocular depth proxies to simulate the occlusion patterns of shifted viewpoints and map them onto the original view. This paradigm shift allows the raw recorded images to serve as pixel-perfect supervision, effectively eliminating the domain gap inherent in previous approaches. Furthermore, we address spatial consistency through a Pseudo-3D Seam Synthesis strategy, which integrates visual data from adjacent cameras during training to explicitly model real-world photometric discrepancies and calibration errors. Experiments demonstrate that VisionNVS achieves superior geometric fidelity and visual quality compared to LiDAR-dependent baselines, offering a robust solution for scalable driving simulation.
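The core "Virtual-Shift" idea can be illustrated with a toy depth-based forward warp: pixels of the original view are reprojected to a laterally shifted virtual camera, and the pixels that receive no source form the occlusion (hole) mask that an inpainting network would be trained to fill, with the original image as supervision. The function name, the planar-shift disparity model, and the toy depth map below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def virtual_shift_mask(depth, fx, baseline):
    """Forward-warp each pixel to a laterally shifted virtual camera
    (disparity = fx * baseline / depth) and return the hole mask:
    pixels in the shifted view that no source pixel lands on.
    Illustrative sketch only -- not the authors' implementation."""
    h, w = depth.shape
    disparity = fx * baseline / depth            # per-pixel horizontal shift
    hit = np.zeros((h, w), dtype=bool)
    xs = np.arange(w)
    for y in range(h):
        tx = np.round(xs - disparity[y]).astype(int)  # target columns
        valid = (tx >= 0) & (tx < w)
        hit[y, tx[valid]] = True
    return ~hit                                  # holes: regions to inpaint

# Toy scene: a near slab (depth 2) in front of a far background (depth 10).
depth = np.full((4, 8), 10.0)
depth[:, 2:4] = 2.0                              # near object -> large disparity
holes = virtual_shift_mask(depth, fx=8.0, baseline=0.5)
# The near slab shifts left, disoccluding the background it covered,
# so the hole mask appears exactly over columns 2-3.
```

Mapping these hole masks back onto the recorded frames is what turns extrapolation into an interpolation-style inpainting problem with pixel-level ground truth.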