DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 4D reconstruction methods for dynamic driving scenes suffer from low efficiency and poor generalization because they depend on per-scene optimization, pre-estimated camera poses, and short temporal windows. Method: We propose the first end-to-end, pose-free, feed-forward 4D reconstruction framework. Our approach predicts camera poses as network outputs, introduces lightweight dynamic and lifespan heads that decouple motion modeling from temporal consistency enforcement, and integrates 3D Gaussian representations, Transformer-based spatiotemporal modeling, and diffusion-based rendering refinement, enabling arbitrary multi-view, long-sequence inputs without camera calibration. Contributions/Results: Our method achieves state-of-the-art performance on Waymo, nuScenes, and Argoverse2, with efficient inference, zero-shot cross-dataset generalization, and linear scalability in the number of input frames.
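
The summary describes a single network that maps raw frames to poses, Gaussian maps, motion, and lifespans. Below is a minimal PyTorch sketch of what such a pose-free, feed-forward pass could look like; all module names, head dimensions, and the per-frame pooling are illustrative assumptions, not the paper's actual architecture (positional and temporal embeddings are omitted for brevity).

```python
# Hypothetical sketch in the spirit of DGGT: a shared transformer encodes an
# unposed clip jointly, and lightweight heads decode pixel-aligned Gaussian
# parameters, per-frame camera poses, motion, and lifespans. Assumed, not the
# paper's code.
import torch
import torch.nn as nn

class FeedForward4D(nn.Module):
    def __init__(self, dim=768, patch=16, n_layers=12, n_heads=12):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, n_heads, 4 * dim, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.gaussian_head = nn.Linear(dim, 14)  # xyz(3)+scale(3)+quat(4)+opacity(1)+rgb(3), assumed
        self.pose_head = nn.Linear(dim, 7)       # per-frame quaternion(4)+translation(3)
        self.dynamic_head = nn.Linear(dim, 3)    # per-Gaussian motion offset
        self.lifespan_head = nn.Linear(dim, 2)   # temporal window (center, duration)

    def forward(self, frames):
        # frames: (B, T, 3, H, W) -- an unposed, uncalibrated clip
        B, T, _, H, W = frames.shape
        tok = self.patchify(frames.flatten(0, 1))      # (B*T, dim, h, w)
        tok = tok.flatten(2).transpose(1, 2)           # (B*T, N, dim) patch tokens
        N = tok.shape[1]
        feat = self.backbone(tok.reshape(B, T * N, -1))  # joint spatiotemporal attention
        per_frame = feat.reshape(B, T, N, -1).mean(2)    # pooled per-frame feature
        return {
            "gaussians": self.gaussian_head(feat),     # pixel-aligned Gaussian maps
            "motion":    self.dynamic_head(feat),
            "lifespan":  self.lifespan_head(feat),
            "poses":     self.pose_head(per_frame),    # pose as an output, not an input
        }
```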

📝 Abstract
Autonomous driving needs fast, scalable 4D reconstruction and re-simulation for training and evaluation, yet most methods for dynamic driving scenes still rely on per-scene optimization, known camera calibration, or short frame windows, making them slow and impractical. We revisit this problem from a feedforward perspective and introduce the Driving Gaussian Grounded Transformer (DGGT), a unified framework for pose-free dynamic scene reconstruction. We note that existing formulations, by treating camera pose as a required input, limit flexibility and scalability. Instead, we reformulate pose as an output of the model, enabling reconstruction directly from sparse, unposed images and supporting an arbitrary number of views over long sequences. Our approach jointly predicts per-frame 3D Gaussian maps and camera parameters, disentangles dynamics with a lightweight dynamic head, and preserves temporal consistency with a lifespan head that modulates visibility over time. A diffusion-based rendering refinement further reduces motion and interpolation artifacts and improves novel-view quality under sparse inputs. The result is a single-pass, pose-free algorithm that achieves state-of-the-art performance and speed. Trained and evaluated on large-scale driving benchmarks (Waymo, nuScenes, Argoverse2), our method outperforms prior work both when trained on each dataset and in zero-shot transfer across datasets, and it scales well as the number of input frames increases.
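
To make "a lifespan head that modulates visibility over time" concrete, here is a minimal sketch of one plausible parameterization: each Gaussian carries a temporal center and duration, and its opacity is attenuated by a soft window around that center. The Gaussian-window form and the names `lifespan_opacity`, `center`, and `duration` (and the 10 s clip in the usage example) are assumptions, not the paper's definition.

```python
# Assumed lifespan gating: attenuate each Gaussian's opacity by a soft
# temporal window so transient objects fade in and out instead of leaving
# ghosting trails at other timestamps.
import torch

def lifespan_opacity(base_opacity, center, duration, t, eps=1e-6):
    """base_opacity: (N,) opacity from the Gaussian head, in [0, 1]
    center:        (N,) time at which each Gaussian is most visible
    duration:      (N,) positive width of the visibility window
    t:             scalar render time
    """
    w = torch.exp(-((t - center) / duration.clamp_min(eps)) ** 2)
    return base_opacity * w

# Usage: Gaussians whose window excludes time t contribute ~0 opacity.
opacity_t = lifespan_opacity(
    base_opacity=torch.rand(1024),
    center=torch.rand(1024) * 10.0,   # lifetimes spread over a 10 s clip (assumed)
    duration=torch.rand(1024) + 0.5,
    t=3.2,
)
```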
Problem

Research questions and friction points this paper is trying to address.

How to reconstruct 4D dynamic driving scenes directly from unposed images
How to remove the reliance on known camera poses that limits scalability
How to handle long sequences with an arbitrary number of sparse input views
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jointly predicts per-frame 3D Gaussian maps and camera poses from unposed images
Uses lightweight dynamic and lifespan heads to decouple motion from temporal consistency (see the sketch after this list)
Applies diffusion-based rendering refinement to reduce artifacts under sparse views
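
As a rough illustration of how the first two bullets could interact at render time, the sketch below advects each Gaussian by a predicted motion offset and gates its opacity by its lifespan before splatting. The linear motion model, the field names in `g`, and `temporal_gate` are all hypothetical; the paper's dynamic head may use a different motion parameterization.

```python
# Hypothetical time-conditioned composition of one forward pass's outputs.
import torch

def temporal_gate(opacity, center, duration, t):
    # Soft visibility window (assumed form; see the lifespan sketch above).
    return opacity * torch.exp(-((t - center) / duration.clamp_min(1e-6)) ** 2)

def compose_at_time(g, t, t_ref=0.0):
    """g: dict of per-Gaussian tensors from one feed-forward pass.

    Keys (all assumed): xyz (N,3), velocity (N,3), opacity (N,),
    center (N,), duration (N,).
    """
    xyz_t = g["xyz"] + g["velocity"] * (t - t_ref)  # linear motion model (assumed)
    alpha_t = temporal_gate(g["opacity"], g["center"], g["duration"], t)
    # Hand xyz_t and alpha_t, plus the predicted scales, rotations, and
    # colors, to any 3D Gaussian splatting rasterizer for the final image.
    return xyz_t, alpha_t
```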
👥 Authors
Xiaoxue Chen (AIR, Tsinghua University)
Ziyi Xiong (AIR, Tsinghua University)
Yuantao Chen (The Chinese University of Hong Kong, Shenzhen)
Gen Li (AIR, Tsinghua University)
Nan Wang (AIR, Tsinghua University)
Hongcheng Luo (Xiaomi EV)
Long Chen (Xiaomi EV)
Haiyang Sun (Xiaomi EV)
Bing Wang (Xiaomi EV)
Guang Chen (Xiaomi EV)
Hangjun Ye (Xiaomi EV)
Hongyang Li (The University of Hong Kong)
Ya-Qin Zhang (AIR, Tsinghua University)
Hao Zhao (AIR, Tsinghua University)