🤖 AI Summary
To address geometric inconsistency and poor temporal coherence in long-horizon autonomous-driving video generation, this paper proposes a two-stage RGB-D diffusion framework. In the first stage, RGB and depth are jointly modeled in a shared latent space, an explicit point-cloud representation enforces scene geometry, and a warp-consistent guidance mechanism steers sampling toward sparse, geometrically consistent keyframes. In the second stage, these keyframes serve as anchors that drive a video diffusion model for dense frame interpolation. This is the first method to enable end-to-end, geometrically consistent synthesis of long (>20-second) driving videos. It sets a new state of the art, outperforming prior methods by 48.6% on long-horizon FID and 43.0% on FVD, and markedly improves visual realism and structural stability.
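As a rough illustration of the two-stage design, here is a minimal sketch of the generation loop. The names `keyframe_model`, `interp_model`, and `render_point_cloud` are hypothetical placeholders standing in for the paper's components, not its actual API:

```python
import torch

def generate_long_video(keyframe_model, interp_model, render_point_cloud,
                        poses, frames_per_gap=8):
    """Two-stage generation: sparse RGB-D keyframes, then dense interpolation.

    keyframe_model(geometry_cond, pose) -> (rgb, depth): RGB-D diffusion sampler.
    interp_model(a, b, n) -> list of n frames interpolated between keyframes a, b.
    render_point_cloud(keyframes, pose) -> geometry conditioning for the new view.
    (All three are assumed interfaces for illustration only.)
    """
    keyframes = []
    for pose in poses:
        # Stage 1: condition each new keyframe on the scene geometry
        # accumulated so far (point cloud rendered into the target view).
        geom = render_point_cloud(keyframes, pose) if keyframes else None
        rgb, depth = keyframe_model(geometry_cond=geom, pose=pose)
        keyframes.append((rgb, depth, pose))

    # Stage 2: a video diffusion model densifies between adjacent anchors.
    video = []
    for a, b in zip(keyframes[:-1], keyframes[1:]):
        video.append(a[0])
        video.extend(interp_model(a[0], b[0], n=frames_per_gap))
    video.append(keyframes[-1][0])
    return torch.stack(video)
```

The point of the structure is that geometric consistency is enforced only at the sparse anchors, which is cheap, while the interpolation stage inherits that consistency over the dense frames.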
📝 Abstract
This paper proposes AutoScape, a long-horizon driving scene generation framework. At its core is a novel RGB-D diffusion model that iteratively generates sparse, geometrically consistent keyframes, which serve as reliable anchors for the scene's appearance and geometry. To maintain long-range geometric consistency, the model 1) jointly handles image and depth in a shared latent space, 2) explicitly conditions on the existing scene geometry (i.e., rendered point clouds) from previously generated keyframes, and 3) steers the sampling process with warp-consistent guidance. Given high-quality RGB-D keyframes, a video diffusion model then interpolates between them to produce dense and coherent video frames. AutoScape generates realistic and geometrically consistent driving videos of over 20 seconds, improving the long-horizon FID and FVD scores over the prior state of the art by 48.6% and 43.0%, respectively.
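The abstract does not spell out how the warp-consistent guidance is computed, but a plausible reading is a classifier-guidance-style update: warp a previous keyframe into the current view and penalize disagreement with the current denoised estimate. The sketch below assumes exactly that; `warp_to_view` and the update rule are my assumptions, not the paper's stated method:

```python
import torch

def warp_consistency_guidance(x0_pred, depth_pred, prev_rgbd, rel_pose,
                              warp_to_view, scale=1.0):
    """One guidance step: penalize disagreement between the current denoised
    RGB estimate and a previous keyframe warped into this view.

    x0_pred:    (B,3,H,W) denoised RGB estimate at the current diffusion step
    depth_pred: (B,1,H,W) denoised depth estimate
    prev_rgbd:  previous keyframe's (rgb, depth) pair
    warp_to_view(prev_rgbd, rel_pose, depth_pred) -> (warped_rgb, valid_mask)
                (assumed helper: backprojects RGB-D and reprojects into the
                 current camera, returning a visibility mask)
    Returns a gradient to subtract from the sample, classifier-guidance style.
    """
    x0 = x0_pred.detach().requires_grad_(True)
    warped_rgb, valid = warp_to_view(prev_rgbd, rel_pose, depth_pred)
    # L1 consistency loss, restricted to pixels where the warp is visible.
    loss = (valid * (x0 - warped_rgb).abs()).sum() / valid.sum().clamp(min=1)
    loss.backward()
    return scale * x0.grad
```

In a full sampler this gradient would be applied at every denoising step (e.g., nudging the intermediate sample along the negative gradient, as in standard classifier guidance), so that each new keyframe stays pixel-consistent with the geometry already committed to the scene.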