🤖 AI Summary
Existing generative models struggle to jointly perform temporal extrapolation and novel-view synthesis for dynamic 4D driving scenes without scene-specific fine-tuning. This paper proposes DiST-4D, a disentangled spatiotemporal diffusion framework that employs metric depth as a unified geometric representation. It decomposes the task into two branches: DiST-T predicts future metric depth and multi-view RGB frames from past observations, while DiST-S achieves zero-shot novel-view synthesis via forward-backward rendering cycle consistency, despite being trained only on existing viewpoints. Key contributions include: (i) the first disentangled spatiotemporal diffusion architecture for 4D driving scenes; (ii) metric depth as a generalizable geometric prior; and (iii) a cycle-consistency constraint that bridges observed and unobserved viewpoints. Experiments demonstrate state-of-the-art performance on both temporal prediction and novel-view synthesis, competitive planning-related metrics, and generalization across views and time, all without per-scene optimization.
📝 Abstract
Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. A key challenge lies in finding an efficient and generalizable geometric representation that seamlessly connects temporal and spatial synthesis. To address this, we propose DiST-4D, the first disentangled spatiotemporal diffusion framework for 4D driving scene generation, which leverages metric depth as the core geometric representation. DiST-4D decomposes the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency. This cycle-consistency mechanism introduces a forward-backward rendering constraint, reducing the generalization gap between observed and unseen viewpoints. Metric depth is essential for both reliable future forecasting and accurate spatial NVS, as it provides a view-consistent geometric representation that generalizes well to unseen perspectives. Experiments demonstrate that DiST-4D achieves state-of-the-art performance in both temporal prediction and NVS tasks, while also delivering competitive performance in planning-related evaluations.
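The forward-backward rendering idea behind the cycle-consistency constraint can be illustrated with a minimal geometric sketch: pixels are lifted to 3D using metric depth, mapped into a hypothetical novel view, mapped back, and re-projected; the residual between the original and round-trip pixel coordinates is the quantity such a constraint penalizes. This is only an assumed NumPy illustration of the general reprojection cycle, not the paper's actual DiST-S implementation, and all function names (`unproject`, `project`, `cycle_reprojection_error`) are hypothetical.

```python
import numpy as np

def unproject(depth, K):
    """Lift each pixel to a 3D camera-frame point using metric depth."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # 3 x N homogeneous pixels
    rays = np.linalg.inv(K) @ pix          # 3 x N viewing rays
    return rays * depth.reshape(1, -1)     # scale rays by metric depth

def project(points, K):
    """Project 3D camera-frame points to pixel coordinates."""
    uvw = K @ points
    return uvw[:2] / uvw[2:3]

def cycle_reprojection_error(depth, K, T_novel):
    """Forward-backward cycle: unproject with metric depth, transform into a
    novel view, transform back, and re-project. Consistent geometry and pose
    give (numerically) zero residual; in training, the backward path would run
    through generated novel-view outputs, so the residual supervises them."""
    pts = unproject(depth, K)                                # 3 x N
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])     # 4 x N homogeneous
    fwd = T_novel @ pts_h                                    # forward: into novel view
    back = np.linalg.inv(T_novel) @ fwd                      # backward: into source view
    return np.abs(project(pts, K) - project(back[:3], K)).mean()
```

Under this sketch, a perfect round trip through any rigid transform returns essentially zero error; replacing the exact backward mapping with depth and RGB generated at the novel viewpoint turns the residual into a trainable consistency loss between observed and unobserved views.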