DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 4D reconstruction methods for dynamic driving scenes suffer from low efficiency and poor generalization because they depend on per-scene optimization, pre-estimated camera poses, and short temporal windows. Method: We propose the first end-to-end, pose-free, feed-forward 4D reconstruction framework. Our approach predicts camera poses as network outputs, introduces lightweight dynamic and lifespan heads that decouple motion modeling from temporal consistency enforcement, and integrates 3D Gaussian representations, Transformer-based spatiotemporal modeling, and diffusion-based rendering refinement, enabling arbitrary multi-view, long-sequence inputs without camera calibration. Contributions/Results: Our method achieves state-of-the-art performance on Waymo, nuScenes, and Argoverse2, with efficient inference, zero-shot cross-dataset generalization, and linear scalability in the number of input frames.
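
The summary describes a single network that maps raw frames to poses, Gaussian maps, motion, and lifespans. Below is a minimal PyTorch sketch of what such a pose-free, feed-forward pass could look like; all module names, head dimensions, and the per-frame pooling are illustrative assumptions, not the paper's actual architecture (positional and temporal embeddings are omitted for brevity).

```python
# Hypothetical sketch in the spirit of DGGT: a shared transformer encodes an
# unposed clip jointly, and lightweight heads decode pixel-aligned Gaussian
# parameters, per-frame camera poses, motion, and lifespans. Assumed, not the
# paper's code.
import torch
import torch.nn as nn

class FeedForward4D(nn.Module):
    def __init__(self, dim=768, patch=16, n_layers=12, n_heads=12):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, n_heads, 4 * dim, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.gaussian_head = nn.Linear(dim, 14)  # xyz(3)+scale(3)+quat(4)+opacity(1)+rgb(3), assumed
        self.pose_head = nn.Linear(dim, 7)       # per-frame quaternion(4)+translation(3)
        self.dynamic_head = nn.Linear(dim, 3)    # per-Gaussian motion offset
        self.lifespan_head = nn.Linear(dim, 2)   # temporal window (center, duration)

    def forward(self, frames):
        # frames: (B, T, 3, H, W) -- an unposed, uncalibrated clip
        B, T, _, H, W = frames.shape
        tok = self.patchify(frames.flatten(0, 1))      # (B*T, dim, h, w)
        tok = tok.flatten(2).transpose(1, 2)           # (B*T, N, dim) patch tokens
        N = tok.shape[1]
        feat = self.backbone(tok.reshape(B, T * N, -1))  # joint spatiotemporal attention
        per_frame = feat.reshape(B, T, N, -1).mean(2)    # pooled per-frame feature
        return {
            "gaussians": self.gaussian_head(feat),     # pixel-aligned Gaussian maps
            "motion":    self.dynamic_head(feat),
            "lifespan":  self.lifespan_head(feat),
            "poses":     self.pose_head(per_frame),    # pose as an output, not an input
        }
```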

📝 Abstract
Autonomous driving needs fast, scalable 4D reconstruction and re-simulation for training and evaluation, yet most methods for dynamic driving scenes still rely on per-scene optimization, known camera calibration, or short frame windows, making them slow and impractical. We revisit this problem from a feedforward perspective and introduce the Driving Gaussian Grounded Transformer (DGGT), a unified framework for pose-free dynamic scene reconstruction. We note that existing formulations, by treating camera pose as a required input, limit flexibility and scalability. Instead, we reformulate pose as an output of the model, enabling reconstruction directly from sparse, unposed images and supporting an arbitrary number of views over long sequences. Our approach jointly predicts per-frame 3D Gaussian maps and camera parameters, disentangles dynamics with a lightweight dynamic head, and preserves temporal consistency with a lifespan head that modulates visibility over time. A diffusion-based rendering refinement further reduces motion and interpolation artifacts and improves novel-view quality under sparse inputs. The result is a single-pass, pose-free algorithm that achieves state-of-the-art performance and speed. Trained and evaluated on large-scale driving benchmarks (Waymo, nuScenes, Argoverse2), our method outperforms prior work both when trained on each dataset and in zero-shot transfer across datasets, and it scales well as the number of input frames increases.
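
To make "a lifespan head that modulates visibility over time" concrete, here is a minimal sketch of one plausible parameterization: each Gaussian carries a temporal center and duration, and its opacity is attenuated by a soft window around that center. The Gaussian-window form and the names `lifespan_opacity`, `center`, and `duration` (and the 10 s clip in the usage example) are assumptions, not the paper's definition.

```python
# Assumed lifespan gating: attenuate each Gaussian's opacity by a soft
# temporal window so transient objects fade in and out instead of leaving
# ghosting trails at other timestamps.
import torch

def lifespan_opacity(base_opacity, center, duration, t, eps=1e-6):
    """base_opacity: (N,) opacity from the Gaussian head, in [0, 1]
    center:        (N,) time at which each Gaussian is most visible
    duration:      (N,) positive width of the visibility window
    t:             scalar render time
    """
    w = torch.exp(-((t - center) / duration.clamp_min(eps)) ** 2)
    return base_opacity * w

# Usage: Gaussians whose window excludes time t contribute ~0 opacity.
opacity_t = lifespan_opacity(
    base_opacity=torch.rand(1024),
    center=torch.rand(1024) * 10.0,   # lifetimes spread over a 10 s clip (assumed)
    duration=torch.rand(1024) + 0.5,
    t=3.2,
)
```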
Problem

Research questions and friction points this paper is trying to address.

How to reconstruct 4D dynamic driving scenes directly from unposed images
How to remove the reliance on known camera poses that limits scalability
How to handle long sequences with an arbitrary number of sparse input views
Innovation

Methods, ideas, or system contributions that make the work stand out.

Jointly predicts per-frame 3D Gaussian maps and camera poses from unposed images
Uses lightweight dynamic and lifespan heads to decouple motion from temporal consistency (see the sketch after this list)
Applies diffusion-based rendering refinement to reduce artifacts under sparse views
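
As a rough illustration of how the first two bullets could interact at render time, the sketch below advects each Gaussian by a predicted motion offset and gates its opacity by its lifespan before splatting. The linear motion model, the field names in `g`, and `temporal_gate` are all hypothetical; the paper's dynamic head may use a different motion parameterization.

```python
# Hypothetical time-conditioned composition of one forward pass's outputs.
import torch

def temporal_gate(opacity, center, duration, t):
    # Soft visibility window (assumed form; see the lifespan sketch above).
    return opacity * torch.exp(-((t - center) / duration.clamp_min(1e-6)) ** 2)

def compose_at_time(g, t, t_ref=0.0):
    """g: dict of per-Gaussian tensors from one feed-forward pass.

    Keys (all assumed): xyz (N,3), velocity (N,3), opacity (N,),
    center (N,), duration (N,).
    """
    xyz_t = g["xyz"] + g["velocity"] * (t - t_ref)  # linear motion model (assumed)
    alpha_t = temporal_gate(g["opacity"], g["center"], g["duration"], t)
    # Hand xyz_t and alpha_t, plus the predicted scales, rotations, and
    # colors, to any 3D Gaussian splatting rasterizer for the final image.
    return xyz_t, alpha_t
```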
👥 Authors
Xiaoxue Chen (AIR, Tsinghua University)
Ziyi Xiong (AIR, Tsinghua University)
Yuantao Chen (The Chinese University of Hong Kong, Shenzhen)
Gen Li (AIR, Tsinghua University)
Nan Wang (AIR, Tsinghua University)
Hongcheng Luo (Xiaomi EV)
Long Chen (Xiaomi EV)
Haiyang Sun (Xiaomi EV)
Bing Wang (Xiaomi EV)
Guang Chen (Xiaomi EV)
Hangjun Ye (Xiaomi EV)
Hongyang Li (The University of Hong Kong)
Ya-Qin Zhang (AIR, Tsinghua University)
Hao Zhao (AIR, Tsinghua University)