🤖 AI Summary
This paper addresses dynamic 3D human reconstruction from uncalibrated, sparse multi-view videos. We propose a streaming 4D reconstruction framework that jointly leverages 3D Gaussian splatting and dense motion prediction. To enforce temporal coherence, we introduce learnable state tokens; to handle occlusions and enable end-to-end training without ground-truth motion supervision, we design occlusion-aware Gaussian fusion and a self-supervised re-projection loss regularized by optical flow. Our key contributions are: (i) the first method enabling continuous temporal interpolation and novel-view synthesis at arbitrary timestamps; and (ii) significantly improved inter-frame and inter-view geometric consistency via state-token modeling and self-supervised motion matching. Extensive evaluation demonstrates state-of-the-art performance on both in-domain and cross-domain benchmarks for novel-view rendering and temporal interpolation, achieving high fidelity while maintaining real-time inference speed.
📝 Abstract
Instant reconstruction of dynamic 3D humans from uncalibrated sparse-view videos is critical for numerous downstream applications. Existing methods, however, are either limited by slow reconstruction speeds or incapable of generating novel-time representations. To address these challenges, we propose Forge4D, a feed-forward 4D human reconstruction and interpolation model that efficiently reconstructs temporally aligned representations from uncalibrated sparse-view videos, enabling both novel-view and novel-time synthesis. Our model casts the 4D reconstruction and interpolation problem as a joint task of streaming 3D Gaussian reconstruction and dense motion prediction. For streaming 3D Gaussian reconstruction, we first reconstruct static 3D Gaussians from uncalibrated sparse-view images and then introduce learnable state tokens to enforce temporal consistency in a memory-friendly manner by interactively updating shared information across different timestamps. For novel-time synthesis, we design a novel motion prediction module that predicts dense motions for each 3D Gaussian between two adjacent frames, coupled with an occlusion-aware Gaussian fusion process to interpolate 3D Gaussians at arbitrary timestamps. To overcome the lack of ground truth for dense motion supervision, we formulate dense motion prediction as a dense point matching task and introduce a self-supervised retargeting loss to optimize this module. An additional occlusion-aware optical flow loss is introduced to ensure motion consistency with plausible human movement, providing stronger regularization. Extensive experiments demonstrate the effectiveness of our model on both in-domain and out-of-domain datasets. Project page and code at: https://zhenliuzju.github.io/huyingdong/Forge4D.
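The interpolation step described above can be illustrated with a minimal sketch: warp the Gaussians of two adjacent frames toward an intermediate timestamp using predicted per-Gaussian motion, then fuse the two warped sets with occlusion-aware weights. All function and variable names below are illustrative assumptions, not the paper's actual implementation; the fusion rule (union of warped sets with opacity re-weighting) is one simple plausible choice.

```python
import numpy as np

def fuse_gaussians(mu_a, motion_ab, mu_b, motion_ba,
                   occ_a, occ_b, opa_a, opa_b, t):
    """Sketch of occlusion-aware Gaussian interpolation at normalized
    time t in [0, 1] between frame A (t=0) and frame B (t=1).

    mu_*     : (N, 3) Gaussian centers of each frame
    motion_* : (N, 3) predicted dense motion (A->B and B->A)
    occ_*    : (N,)   predicted occlusion scores in [0, 1]
    opa_*    : (N,)   Gaussian opacities
    """
    # Warp frame-A Gaussians forward by a fraction t of their motion.
    warped_a = mu_a + t * motion_ab
    # Warp frame-B Gaussians backward by the remaining fraction (1 - t).
    warped_b = mu_b + (1.0 - t) * motion_ba
    # Occlusion-aware weights: occluded Gaussians (occ near 1) contribute
    # less, and the temporally farther frame is down-weighted.
    w_a = (1.0 - t) * (1.0 - occ_a)
    w_b = t * (1.0 - occ_b)
    # Fuse by taking the union of both warped sets and scaling opacities.
    mu = np.concatenate([warped_a, warped_b], axis=0)
    opacity = np.concatenate([w_a * opa_a, w_b * opa_b], axis=0)
    return mu, opacity
```

For example, a Gaussian at the origin moving to (2, 0, 0) between frames, queried at t = 0.5 with no occlusion, yields two coincident warped copies at (1, 0, 0), each carrying half of the original opacity.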