🤖 AI Summary
This work addresses dynamic 4D scene reconstruction, where moving objects disrupt camera pose estimation; existing methods that compensate for this are computationally expensive and unsuitable for real-time applications. To this end, we propose MoRe, a feedforward 4D reconstruction network built upon a static reconstruction backbone. MoRe decouples dynamic and static scene components through an attention-forcing mechanism and introduces grouped causal attention to model temporal dependencies while accommodating variable sequence lengths. To our knowledge, MoRe is the first method to achieve efficient, end-to-end dynamic 4D reconstruction, significantly outperforming optimization-based approaches across multiple benchmarks while delivering high-quality reconstructions with strong temporal consistency and real-time inference.
📝 Abstract
Reconstructing dynamic 4D scenes remains challenging because moving objects corrupt camera pose estimation. Existing optimization methods alleviate this issue with additional supervision, but they are typically computationally expensive and impractical for real-time applications. To address these limitations, we propose MoRe, a feedforward 4D reconstruction network that efficiently recovers dynamic 3D scenes from monocular videos. Built upon a strong static reconstruction backbone, MoRe employs an attention-forcing strategy to disentangle dynamic motion from static structure. To further enhance robustness, we fine-tune the model on large-scale, diverse datasets encompassing both dynamic and static scenes. Moreover, our grouped causal attention captures temporal dependencies and adapts to varying token lengths across frames, ensuring temporally coherent geometry reconstruction. Extensive experiments on multiple benchmarks demonstrate that MoRe achieves high-quality dynamic reconstructions with exceptional efficiency.
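To give intuition for the grouped causal attention described above, here is a minimal NumPy sketch (not the paper's implementation; function names and the exact masking scheme are assumptions). Tokens within a frame attend to one another bidirectionally, while attention across frames is causal, and the per-frame token count may vary:

```python
import numpy as np

def grouped_causal_mask(tokens_per_frame):
    """Block-causal mask: a token attends to all tokens in its own frame
    and in earlier frames, never to later frames. `tokens_per_frame`
    may differ per frame (variable sequence lengths). Hypothetical sketch."""
    total = sum(tokens_per_frame)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for n in tokens_per_frame:
        end = start + n
        # tokens of this frame see everything up to (and including) the frame
        mask[start:end, :end] = True
        start = end
    return mask

def grouped_causal_attention(q, k, v, tokens_per_frame):
    """Scaled dot-product attention with the grouped causal mask applied."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(grouped_causal_mask(tokens_per_frame), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

For example, with two frames of 2 and 3 tokens, the first frame's tokens cannot attend to the second frame's, while the reverse is allowed.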