🤖 AI Summary
Controllable multi-view image generation for autonomous driving is hindered by the scarcity of real-world images from extrapolated (novel) camera viewpoints.
Method: We propose a diffusion-based approach built on Stable Diffusion that requires no ground-truth supervision for extrapolated views. The method combines a hierarchical camera-pose matching strategy, an improved feature-matching algorithm, a feature-aware adaptive view-stitching mechanism, and a cross-view consistency self-supervised objective. Clustering analysis identifies high-confidence matched regions to guide alignment, and geometric and photometric consistency are jointly enforced through a self-supervised reconstruction loss.
Contribution/Results: To our knowledge, this is the first method enabling high-fidelity, viewpoint-controllable synthesis of arbitrary virtual-camera images across diverse vehicle configurations. It significantly enhances data augmentation and simulation capabilities in complex driving scenarios, without relying on novel-view ground-truth annotations.
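The cross-view consistency idea above can be sketched numerically: reconstruct each source camera view from a stitched composite and score the reconstruction error, so supervision comes only from the captured views. Everything below (the averaging stitch, the linear per-view decoders, all variable names) is a hypothetical stand-in for the paper's Stable-Diffusion-based model, not its implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for three camera views (flattened 8x8 grayscale images).
views = [rng.random(64) for _ in range(3)]

def stitch(views):
    # Hypothetical surrogate for FAVS: average the source views into one
    # "virtual" stitched image (the real method aligns and blends only
    # high-confidence matched regions).
    return np.mean(views, axis=0)

def cvc_ssl_loss(decoders, views):
    # Cross-view consistency: reconstruct every original view from the
    # stitched image and penalize reconstruction error. No extrapolated
    # ground truth is needed -- the source views themselves supervise.
    s = stitch(views)
    errs = [np.mean((W @ s - v) ** 2) for W, v in zip(decoders, views)]
    return float(np.mean(errs))

# Sanity check with identity "decoders": each reconstruction is just the
# stitched mean, so the loss is the average per-view variance around it.
decoders = [np.eye(64) for _ in views]
loss = cvc_ssl_loss(decoders, views)
```

A trained model would replace the linear decoders with the conditional diffusion model and minimize this loss over its parameters; the point of the sketch is only the shape of the objective.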
📝 Abstract
Arbitrary-viewpoint image generation holds significant potential for autonomous driving, yet it remains challenging because ground-truth data for extrapolated views is unavailable, which hampers the training of high-fidelity generative models. In this work, we propose Arbiviewgen, a novel diffusion-based framework for controllable camera image generation from arbitrary viewpoints. To address the absence of ground-truth data for unseen views, we introduce two key components: Feature-Aware Adaptive View Stitching (FAVS) and Cross-View Consistency Self-Supervised Learning (CVC-SSL). FAVS employs a hierarchical matching strategy that first establishes coarse geometric correspondences using camera poses, then performs fine-grained alignment with an improved feature-matching algorithm, and finally identifies high-confidence matching regions via clustering analysis. Building on this, CVC-SSL adopts a self-supervised training paradigm in which a diffusion model reconstructs the original camera views from the synthesized stitched images, enforcing cross-view consistency without supervision from extrapolated data. Our framework requires only multi-camera images and their associated poses for training, eliminating the need for additional sensors or depth maps. To our knowledge, Arbiviewgen is the first method capable of controllable arbitrary-view camera image generation across multiple vehicle configurations.
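The three FAVS stages the abstract describes (coarse correspondence from camera poses, fine-grained feature matching, clustering to keep high-confidence regions) can be illustrated with a minimal sketch. The yaw-overlap gate, the mutual nearest-neighbour matcher, and the median/MAD displacement filter below are simple stand-ins chosen for illustration, not the paper's actual algorithms.

```python
import numpy as np

rng = np.random.default_rng(1)

def coarse_overlap(yaw_a, yaw_b, fov=1.9):
    # Stage 1 (coarse): camera poses alone decide whether two views can
    # share content -- reduced here to a yaw-difference test against a
    # hypothetical horizontal field of view (radians).
    return abs(yaw_a - yaw_b) < fov

def fine_matches(feats_a, feats_b, thresh=0.8):
    # Stage 2 (fine): mutual nearest-neighbour matching on L2-normalized
    # descriptors, a simple surrogate for an improved feature matcher.
    sim = feats_a @ feats_b.T
    ab = sim.argmax(axis=1)                  # best B-match per A-keypoint
    ba = sim.argmax(axis=0)                  # best A-match per B-keypoint
    return [(i, int(ab[i])) for i in range(len(feats_a))
            if ba[ab[i]] == i and sim[i, ab[i]] >= thresh]

def high_confidence(matches, pts_a, pts_b, k=2.5):
    # Stage 3: keep matches whose displacement agrees with the dominant
    # cluster (median +/- k * MAD), standing in for clustering analysis.
    d = np.array([pts_b[j] - pts_a[i] for i, j in matches])
    med = np.median(d, axis=0)
    mad = np.median(np.abs(d - med), axis=0) + 1e-6
    keep = np.all(np.abs(d - med) <= k * mad, axis=1)
    return [m for m, ok in zip(matches, keep) if ok]

# Synthetic demo: view B sees the same 20 keypoints shifted by (5, 0),
# except keypoint 0, whose displacement is a gross outlier.
feats = rng.normal(size=(20, 16))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
pts_a = rng.uniform(0, 100, size=(20, 2))
pts_b = pts_a + np.array([5.0, 0.0])
pts_b[0] += np.array([40.0, -30.0])

good = []
if coarse_overlap(0.0, 0.6):                 # adjacent cameras, 0.6 rad apart
    m = fine_matches(feats, feats)           # identical descriptors match 1:1
    good = high_confidence(m, pts_a, pts_b)  # outlier keypoint 0 is dropped
```

Only the surviving high-confidence matches would feed the view-stitching step; everything else (camera model, descriptor extraction, the actual clustering) is abstracted away here.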