Forge4D: Feed-Forward 4D Human Reconstruction and Interpolation from Uncalibrated Sparse-view Videos

📅 2025-09-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses dynamic 3D human reconstruction from uncalibrated, sparse multi-view videos. The authors propose a streaming 4D reconstruction framework that jointly leverages 3D Gaussian splatting and dense motion prediction. To enforce temporal coherence, they introduce learnable state tokens; to handle occlusions and enable end-to-end training without ground-truth motion supervision, they design an occlusion-aware Gaussian fusion process and a self-supervised re-projection loss regularized by optical flow. The key contributions are: (i) a feed-forward method enabling continuous temporal interpolation and novel-view synthesis at arbitrary timestamps; and (ii) improved inter-frame and inter-view geometric consistency via state-token modeling and self-supervised motion matching. Extensive evaluation demonstrates state-of-the-art performance on both in-domain and cross-domain benchmarks for novel-view rendering and temporal interpolation, achieving high fidelity while maintaining real-time inference speed.

📝 Abstract
Instant reconstruction of dynamic 3D humans from uncalibrated sparse-view videos is critical for numerous downstream applications. Existing methods, however, are either limited by slow reconstruction speed or incapable of generating novel-time representations. To address these challenges, we propose Forge4D, a feed-forward 4D human reconstruction and interpolation model that efficiently reconstructs temporally aligned representations from uncalibrated sparse-view videos, enabling both novel-view and novel-time synthesis. Our model reduces the 4D reconstruction and interpolation problem to a joint task of streaming 3D Gaussian reconstruction and dense motion prediction. For streaming 3D Gaussian reconstruction, we first reconstruct static 3D Gaussians from uncalibrated sparse-view images and then introduce learnable state tokens to enforce temporal consistency in a memory-friendly manner by interactively updating shared information across different timestamps. For novel-time synthesis, we design a novel motion prediction module that predicts dense motion for each 3D Gaussian between two adjacent frames, coupled with an occlusion-aware Gaussian fusion process to interpolate 3D Gaussians at arbitrary timestamps. To overcome the lack of ground truth for dense motion supervision, we formulate dense motion prediction as a dense point matching task and introduce a self-supervised retargeting loss to optimize this module. An additional occlusion-aware optical flow loss ensures motion consistency with plausible human movement, providing stronger regularization. Extensive experiments demonstrate the effectiveness of our model on both in-domain and out-of-domain datasets. Project page and code at: https://zhenliuzju.github.io/huyingdong/Forge4D.
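The interpolation step described above, warping Gaussians from two adjacent frames to an intermediate timestamp and fusing them with occlusion-aware weights, can be sketched roughly as follows. This is an illustrative sketch only, not the paper's implementation: the function and variable names (`fuse_gaussians`, `occ_t0`, etc.) are hypothetical, per-Gaussian correspondence between frames is assumed, and only Gaussian centers are interpolated here.

```python
import numpy as np

def fuse_gaussians(mu_t0, mu_t1, motion_fwd, motion_bwd, occ_t0, occ_t1, alpha):
    """Illustrative occlusion-aware fusion at an intermediate timestamp.

    mu_t0, mu_t1:         (N, 3) Gaussian centers at adjacent frames
    motion_fwd:           (N, 3) predicted displacement from t0 to t1
    motion_bwd:           (N, 3) predicted displacement from t0 to t1,
                          estimated from frame t1's side
    occ_t0, occ_t1:       (N,) visibility confidence in [0, 1]
                          (low when a Gaussian is occluded in that frame)
    alpha:                interpolation fraction in [0, 1]
    """
    # Warp frame-t0 Gaussians forward and frame-t1 Gaussians backward
    # to the intermediate timestamp alpha.
    mu_fwd = mu_t0 + alpha * motion_fwd
    mu_bwd = mu_t1 - (1.0 - alpha) * motion_bwd
    # Occlusion-aware blend: trust each warped estimate in proportion to
    # its visibility confidence and its temporal proximity to alpha.
    w_fwd = (1.0 - alpha) * occ_t0
    w_bwd = alpha * occ_t1
    w = w_fwd / np.clip(w_fwd + w_bwd, 1e-8, None)
    return w[:, None] * mu_fwd + (1.0 - w[:, None]) * mu_bwd
```

With `alpha = 0` (and full visibility) the fused centers reduce to `mu_t0`, and with `alpha = 1` to `mu_t1`, so the interpolation is consistent at the endpoints.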
Problem

Research questions and friction points this paper is trying to address.

Reconstructing dynamic 3D humans from uncalibrated sparse-view videos efficiently
Enabling novel view and novel time synthesis for 4D human representations
Predicting dense per-Gaussian motion between adjacent frames without ground-truth motion supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming 3D Gaussian reconstruction with learnable state tokens
Motion prediction module for dense Gaussian interpolation
Self-supervised retargeting loss for motion optimization
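The last bullet, formulating dense motion prediction as point matching trained without ground-truth motion, can be illustrated with a minimal soft-matching loss. This is a sketch under stated assumptions, not the paper's actual loss (which per the abstract also involves retargeting and an occlusion-aware optical flow term); all names (`retargeting_loss`, `tau`) are hypothetical.

```python
import numpy as np

def retargeting_loss(pts_t0, pts_t1, motion_pred, tau=0.1):
    """Illustrative self-supervised motion loss via soft point matching.

    pts_t0:      (N, 3) points at frame t0
    pts_t1:      (M, 3) points at frame t1
    motion_pred: (N, 3) predicted displacement for each t0 point
    tau:         softmax temperature for the matching weights
    """
    warped = pts_t0 + motion_pred
    # Pairwise squared distances between warped points and frame-t1 points.
    d2 = ((warped[:, None, :] - pts_t1[None, :, :]) ** 2).sum(-1)  # (N, M)
    # Soft correspondence: each warped point matches a distance-weighted
    # average of frame-t1 points, so no ground-truth motion is needed.
    w = np.exp(-d2 / tau)
    w /= w.sum(axis=1, keepdims=True)
    target = w @ pts_t1
    # Penalize the gap between the warped point and its soft match.
    return float(((warped - target) ** 2).sum(-1).mean())
```

A motion field that maps `pts_t0` exactly onto `pts_t1` drives this loss to (near) zero, while a zero-motion prediction is penalized, which is the supervision signal the self-supervised formulation relies on.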