🤖 AI Summary
Existing generative models struggle to jointly perform temporal extrapolation and novel-view synthesis for dynamic 4D driving scenes without scene-specific fine-tuning. This paper proposes DiST-4D, a disentangled spatiotemporal diffusion framework that employs metric depth as a unified geometric representation. It decomposes the task into two branches: DiST-T predicts future metric depth and multi-view RGB frames from past observations, while DiST-S achieves zero-shot novel-view synthesis via forward-backward rendering cycle consistency, despite being trained only on existing viewpoints. Key contributions include: (i) the first disentangled spatiotemporal diffusion architecture for 4D driving scenes; (ii) metric depth as a generalizable geometric prior; and (iii) a cycle-consistency constraint that bridges observed and unobserved viewpoints. Experiments demonstrate state-of-the-art performance on both temporal prediction and novel-view synthesis, competitive planning-related metrics, and generalization across views and time, all without per-scene optimization.
📝 Abstract
Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. A key challenge lies in finding an efficient and generalizable geometric representation that seamlessly connects temporal and spatial synthesis. To address this, we propose DiST-4D, the first disentangled spatiotemporal diffusion framework for 4D driving scene generation, which leverages metric depth as the core geometric representation. DiST-4D decomposes the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency. This cycle-consistency mechanism introduces a forward-backward rendering constraint, reducing the generalization gap between observed and unseen viewpoints. Metric depth is essential for both reliable future forecasting and accurate spatial NVS, as it provides a view-consistent geometric representation that generalizes well to unseen perspectives. Experiments demonstrate that DiST-4D achieves state-of-the-art performance in both temporal prediction and NVS tasks, while also delivering competitive performance in planning-related evaluations.
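The forward-backward rendering idea behind the cycle-consistency constraint can be illustrated with a minimal geometric sketch: pixels are lifted to 3D using metric depth, mapped into a hypothetical novel view, mapped back, and re-projected; the residual between the original and round-trip pixel coordinates is the quantity such a constraint penalizes. This is only an assumed NumPy illustration of the general reprojection cycle, not the paper's actual DiST-S implementation, and all function names (`unproject`, `project`, `cycle_reprojection_error`) are hypothetical.

```python
import numpy as np

def unproject(depth, K):
    """Lift each pixel to a 3D camera-frame point using metric depth."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # 3 x N homogeneous pixels
    rays = np.linalg.inv(K) @ pix          # 3 x N viewing rays
    return rays * depth.reshape(1, -1)     # scale rays by metric depth

def project(points, K):
    """Project 3D camera-frame points to pixel coordinates."""
    uvw = K @ points
    return uvw[:2] / uvw[2:3]

def cycle_reprojection_error(depth, K, T_novel):
    """Forward-backward cycle: unproject with metric depth, transform into a
    novel view, transform back, and re-project. Consistent geometry and pose
    give (numerically) zero residual; in training, the backward path would run
    through generated novel-view outputs, so the residual supervises them."""
    pts = unproject(depth, K)                                # 3 x N
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])     # 4 x N homogeneous
    fwd = T_novel @ pts_h                                    # forward: into novel view
    back = np.linalg.inv(T_novel) @ fwd                      # backward: into source view
    return np.abs(project(pts, K) - project(back[:3], K)).mean()
```

Under this sketch, a perfect round trip through any rigid transform returns essentially zero error; replacing the exact backward mapping with depth and RGB generated at the novel viewpoint turns the residual into a trainable consistency loss between observed and unobserved views.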