🤖 AI Summary
To address the lack of explicit 3D-consistent modeling in multi-view image generation for autonomous driving, this paper proposes BEV-VAE, a framework that unifies multi-view image generation within a bird's-eye-view (BEV) latent space, yielding a compact and structured 3D scene representation. The method pairs a multi-view variational autoencoder with a latent diffusion Transformer to jointly model geometric constraints and semantic distributions in the BEV latent space, supporting controllable synthesis of arbitrary views conditioned on camera parameters and, optionally, 3D layouts. Evaluated on nuScenes and Argoverse 2, BEV-VAE significantly improves cross-view 3D reconstruction consistency and generation fidelity, demonstrating strong generalization and structural controllability in complex urban scenes. This work establishes an interpretable and editable paradigm for multi-view generation, advancing both perception and simulation in autonomous driving systems.
📝 Abstract
Multi-view image generation in autonomous driving demands consistent 3D scene understanding across camera views. Most existing methods treat this problem as a 2D image-set generation task, lacking explicit 3D modeling. We argue, however, that a structured representation is crucial for scene generation, especially in autonomous driving applications. This paper proposes BEV-VAE for consistent and controllable view synthesis. BEV-VAE first trains a multi-view image variational autoencoder to learn a compact, unified BEV latent space, and then generates scenes with a latent diffusion Transformer. BEV-VAE supports arbitrary view generation given camera configurations and, optionally, 3D layouts. Experiments on nuScenes and Argoverse 2 (AV2) show strong performance in both 3D-consistent reconstruction and generation. The code is available at: https://github.com/Czm369/bev-vae.
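The two-stage pipeline described above (fuse multi-view images into a shared BEV latent, generate in that latent space, then decode arbitrary views) can be sketched in a toy NumPy form. Everything here is an illustrative assumption rather than the paper's implementation: the dimensions are arbitrary, the fixed random linear projections stand in for learned camera-aware encoder/decoder networks, and a simple noise perturbation stands in for latent diffusion sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical, not the paper's): six surround cameras,
# tiny images, and a small BEV latent grid.
N_VIEWS, H, W, C = 6, 8, 8, 3
BEV_H, BEV_W, BEV_C = 4, 4, 16

def encode(views, cam_projs):
    """Stage 1 (sketch): fuse all camera views into one BEV latent.
    Fusion here is a fixed random projection per camera; the real model
    lifts image features into 3D using camera intrinsics/extrinsics."""
    flat = views.reshape(N_VIEWS, -1)                 # (6, H*W*C)
    bev = sum(f @ p for f, p in zip(flat, cam_projs)) # (BEV_H*BEV_W*BEV_C,)
    return bev.reshape(BEV_H, BEV_W, BEV_C)

def decode(bev, cam_proj):
    """Render one view from the shared BEV latent for a given camera."""
    return (bev.reshape(-1) @ cam_proj).reshape(H, W, C)

# Hypothetical per-camera projections standing in for learned networks.
enc_projs = [rng.normal(size=(H * W * C, BEV_H * BEV_W * BEV_C))
             for _ in range(N_VIEWS)]
dec_projs = [rng.normal(size=(BEV_H * BEV_W * BEV_C, H * W * C))
             for _ in range(N_VIEWS)]

views = rng.normal(size=(N_VIEWS, H, W, C))
bev = encode(views, enc_projs)        # one latent for the whole scene

# Stage 2 (sketch): a latent diffusion Transformer would sample a new
# BEV latent; a small perturbation stands in for that sampling step.
bev_gen = bev + 0.1 * rng.normal(size=bev.shape)

# Because all views decode from the same BEV latent, cross-view
# 3D consistency is built in by construction.
gen_views = np.stack([decode(bev_gen, p) for p in dec_projs])
print(bev.shape, gen_views.shape)     # (4, 4, 16) (6, 8, 8, 3)
```

The key design point this illustrates is that consistency is enforced by the representation itself: every camera view is a function of one shared BEV latent, so editing or resampling that latent changes all views coherently.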