SceneCrafter: Controllable Multi-View Driving Scene Editing

📅 2025-06-24

📈 Citations: 0

✨ Influential: 0

career value

198K/year

🤖 AI Summary

To address the core challenges in autonomous driving simulation—namely, the lack of realism in synthetically generated scenes and the difficulty of ensuring multi-view 3D consistency during editing of real-world log data—this paper proposes the first controllable, geometrically consistent 3D editing framework tailored for multi-camera driving scenarios. Methodologically, it integrates Prompt-to-Prompt paired generation, alpha-blending-based local editing, and street-layout priors, synergistically leveraging multi-view diffusion models, mask-aware training, multi-view inpainting, and differentiable rendering to achieve geometry-consistent edits under diverse multimodal conditions (e.g., weather, time-of-day, traffic participants). Quantitatively and qualitatively, the framework achieves state-of-the-art performance in realism, controllability, and 3D consistency—significantly outperforming existing approaches. This work establishes a new paradigm for high-fidelity, verifiable autonomous driving simulation.

Technology Category

Application Category

📝 Abstract

Simulation is crucial for developing and evaluating autonomous vehicle (AV) systems. Recent literature builds on a new generation of generative models to synthesize highly realistic images for full-stack simulation. However, purely synthetically generated scenes are not grounded in reality and have difficulty in inspiring confidence in the relevance of its outcomes. Editing models, on the other hand, leverage source scenes from real driving logs, and enable the simulation of different traffic layouts, behaviors, and operating conditions such as weather and time of day. While image editing is an established topic in computer vision, it presents fresh sets of challenges in driving simulation: (1) the need for cross-camera 3D consistency, (2) learning ``empty street" priors from driving data with foreground occlusions, and (3) obtaining paired image tuples of varied editing conditions while preserving consistent layout and geometry. To address these challenges, we propose SceneCrafter, a versatile editor for realistic 3D-consistent manipulation of driving scenes captured from multiple cameras. We build on recent advancements in multi-view diffusion models, using a fully controllable framework that scales seamlessly to multi-modality conditions like weather, time of day, agent boxes and high-definition maps. To generate paired data for supervising the editing model, we propose a novel framework on top of Prompt-to-Prompt to generate geometrically consistent synthetic paired data with global edits. We also introduce an alpha-blending framework to synthesize data with local edits, leveraging a model trained on empty street priors through novel masked training and multi-view repaint paradigm. SceneCrafter demonstrates powerful editing capabilities and achieves state-of-the-art realism, controllability, 3D consistency, and scene editing quality compared to existing baselines.

Problem

Research questions and friction points this paper is trying to address.

Ensuring cross-camera 3D consistency in driving scene editing

Learning empty street priors from occluded driving data

Generating paired image tuples with consistent layout and geometry

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-view diffusion models for 3D consistency

Prompt-to-Prompt for synthetic paired data

Alpha-blending with masked training for local edits

🔎 Similar Papers

DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes