🤖 AI Summary
To address the lack of explicit 3D-consistent modeling in multi-view image generation for autonomous driving, this paper proposes BEV-VAE, a framework that unifies multi-view image generation within a bird's-eye-view (BEV) latent space, yielding a compact and structured 3D scene representation. The method pairs a multi-view variational autoencoder with a latent diffusion Transformer to jointly model geometric constraints and semantic distributions in the BEV latent space, supporting controllable synthesis of arbitrary views conditioned on camera parameters and, optionally, 3D layouts. Evaluated on nuScenes and Argoverse 2, BEV-VAE significantly improves cross-view 3D reconstruction consistency and generation fidelity, demonstrating strong generalization and structural controllability in complex urban scenes. This work establishes an interpretable and editable paradigm for multi-view generation, advancing both perception and simulation in autonomous driving systems.
📝 Abstract
Multi-view image generation in autonomous driving demands consistent 3D scene understanding across camera views. Most existing methods treat this problem as a 2D image-set generation task, lacking explicit 3D modeling. We argue, however, that a structured representation is crucial for scene generation, especially in autonomous driving applications. This paper proposes BEV-VAE for consistent and controllable view synthesis. BEV-VAE first trains a multi-view image variational autoencoder to learn a compact, unified BEV latent space, and then generates scenes with a latent diffusion Transformer. BEV-VAE supports arbitrary view generation given camera configurations and, optionally, 3D layouts. Experiments on nuScenes and Argoverse 2 (AV2) show strong performance in both 3D-consistent reconstruction and generation. The code is available at: https://github.com/Czm369/bev-vae.
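The two-stage pipeline described above (fuse multi-view images into a shared BEV latent, generate in that latent space, then decode arbitrary views) can be sketched in a toy NumPy form. Everything here is an illustrative assumption rather than the paper's implementation: the dimensions are arbitrary, the fixed random linear projections stand in for learned camera-aware encoder/decoder networks, and a simple noise perturbation stands in for latent diffusion sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical, not the paper's): six surround cameras,
# tiny images, and a small BEV latent grid.
N_VIEWS, H, W, C = 6, 8, 8, 3
BEV_H, BEV_W, BEV_C = 4, 4, 16

def encode(views, cam_projs):
    """Stage 1 (sketch): fuse all camera views into one BEV latent.
    Fusion here is a fixed random projection per camera; the real model
    lifts image features into 3D using camera intrinsics/extrinsics."""
    flat = views.reshape(N_VIEWS, -1)                 # (6, H*W*C)
    bev = sum(f @ p for f, p in zip(flat, cam_projs)) # (BEV_H*BEV_W*BEV_C,)
    return bev.reshape(BEV_H, BEV_W, BEV_C)

def decode(bev, cam_proj):
    """Render one view from the shared BEV latent for a given camera."""
    return (bev.reshape(-1) @ cam_proj).reshape(H, W, C)

# Hypothetical per-camera projections standing in for learned networks.
enc_projs = [rng.normal(size=(H * W * C, BEV_H * BEV_W * BEV_C))
             for _ in range(N_VIEWS)]
dec_projs = [rng.normal(size=(BEV_H * BEV_W * BEV_C, H * W * C))
             for _ in range(N_VIEWS)]

views = rng.normal(size=(N_VIEWS, H, W, C))
bev = encode(views, enc_projs)        # one latent for the whole scene

# Stage 2 (sketch): a latent diffusion Transformer would sample a new
# BEV latent; a small perturbation stands in for that sampling step.
bev_gen = bev + 0.1 * rng.normal(size=bev.shape)

# Because all views decode from the same BEV latent, cross-view
# 3D consistency is built in by construction.
gen_views = np.stack([decode(bev_gen, p) for p in dec_projs])
print(bev.shape, gen_views.shape)     # (4, 4, 16) (6, 8, 8, 3)
```

The key design point this illustrates is that consistency is enforced by the representation itself: every camera view is a function of one shared BEV latent, so editing or resampling that latent changes all views coherently.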