Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

📅 2025-12-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
4D reconstruction of dynamic scenes must jointly model geometry and motion, typically at high computational cost. To address this, the paper proposes D4RT: a unified feedforward framework that, from a single input video, estimates depth, spatiotemporal correspondences, and full camera parameters end to end. Methodologically, D4RT introduces a lightweight, differentiable spatiotemporal coordinate query mechanism that replaces conventional dense per-frame decoding and task-specific decoders, allowing the 3D position of any spatiotemporal point to be inferred independently. A shared Transformer architecture jointly optimizes depth, optical flow, and camera pose. On multiple 4D reconstruction benchmarks, D4RT achieves state-of-the-art performance while significantly accelerating training and inference, and it reduces model parameters by over 40%, offering superior efficiency, compactness, and scalability.
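The query-based decoding idea can be sketched as follows: the video is encoded into tokens once, and each spatiotemporal coordinate query then reads from those tokens independently. All names, shapes, the sinusoidal embedding, and the single-head attention below are illustrative assumptions, not D4RT's actual interface.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # token / query dimension (assumed)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def embed_query(u, v, t, dim=D):
    # Toy sinusoidal embedding of a (u, v, t) coordinate query (assumption).
    freqs = np.arange(dim // 6 + 1)
    parts = []
    for c in (u, v, t):
        parts.append(np.sin(c * (2.0 ** freqs)))
        parts.append(np.cos(c * (2.0 ** freqs)))
    return np.concatenate(parts)[:dim]

def query_point(tokens, u, v, t, w_out):
    # One cross-attention read over the frozen token set, then a linear
    # head mapping the attended feature to a 3D position.
    q = embed_query(u, v, t)
    attn = softmax(tokens @ q / np.sqrt(D))  # (N_tokens,)
    feat = attn @ tokens                      # (D,)
    return feat @ w_out                       # (3,): x, y, z

tokens = rng.standard_normal((256, D))    # encoded video, computed once
w_out = rng.standard_normal((D, 3)) * 0.01
p = query_point(tokens, u=0.3, v=0.7, t=0.5, w_out=w_out)
print(p.shape)  # (3,)
```

Because each query is decoded independently against the same encoded tokens, probing K points costs roughly K attention reads instead of a full dense decode of every frame.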

📝 Abstract
Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks. We refer to the project webpage for animated results: https://d4rt-paper.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Reconstructing the geometry and motion of dynamic scenes from video efficiently
Jointly inferring depth, spatio-temporal correspondence, and full camera parameters
Avoiding the cost of dense per-frame decoding and the complexity of multiple task-specific decoders
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified transformer architecture for joint inference
Novel querying mechanism avoiding dense per-frame decoding
Lightweight, scalable method enabling efficient training and inference
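A back-of-the-envelope comparison shows why query-based decoding is lighter than dense per-frame decoding. The frame count, resolution, and query count below are assumed for illustration and do not come from the paper.

```python
# Dense decoding emits a prediction for every pixel of every frame,
# while query-based decoding only pays for the points actually needed.
T, H, W = 64, 378, 518       # frames and resolution (assumed)
dense_outputs = T * H * W    # one prediction per pixel per frame
K = 10_000                   # sparse queries for a downstream task (assumed)
print(dense_outputs)         # 12531456 decoded points
print(dense_outputs / K)     # over 1000x the sparse-query count
```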