🤖 AI Summary
Existing feed-forward 3D reconstruction methods struggle to balance efficiency and quality under sparse-view conditions, and incorporating generative priors often compromises inference speed. This work proposes GIFSplat, a purely feed-forward iterative refinement framework that progressively enhances a 3D Gaussian Splatting representation through a small number of forward-only residual updates. Generative priors distilled from a frozen diffusion model are injected as per-Gaussian cues, and the method requires neither test-time gradient optimization nor camera poses. By moving beyond single-pass prediction, it improves PSNR by up to 2.1 dB over feed-forward baselines on DL3DV, RealEstate10K, and DTU while maintaining second-scale inference time.
📝 Abstract
Feed-forward 3D reconstruction offers substantial runtime advantages over per-scene optimization, which remains slow at inference and is often fragile under sparse views. However, existing feed-forward methods leave room for further performance gains, especially on out-of-domain data, and struggle to retain second-level inference time once a generative prior is introduced. These limitations stem from the one-shot prediction paradigm of existing feed-forward pipelines: models are strictly bounded by capacity, lack inference-time refinement, and are ill-suited to continuously injecting generative priors. We introduce GIFSplat, a purely feed-forward iterative refinement framework for 3D Gaussian Splatting from sparse unposed views. A small number of forward-only residual updates progressively refine the current 3D scene using rendering evidence, achieving a favorable balance between efficiency and quality. Furthermore, we distill a frozen diffusion prior into Gaussian-level cues from enhanced novel renderings, without gradient backpropagation or ever-increasing view-set expansion, thereby enabling per-scene adaptation with a generative prior while preserving feed-forward efficiency. Across DL3DV, RealEstate10K, and DTU, GIFSplat consistently outperforms state-of-the-art feed-forward baselines, improving PSNR by up to +2.1 dB, and it maintains second-scale inference time without requiring camera poses or any test-time gradient optimization.
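
The idea of forward-only residual updates driven by rendering evidence can be illustrated with a toy sketch. This is not the GIFSplat implementation: the linear `render` function, the `make_refiner` stand-in for a learned update network, and all other names are assumptions made for illustration. The scene is reduced to a small parameter vector, the renderer to a matrix product, and the "network" to a fixed operator computed once up front, so the refinement loop itself performs no gradient computation. PSNR is computed with the standard definition, 10 log10(MAX² / MSE).

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """Standard PSNR in dB: 10 * log10(max_val^2 / MSE)."""
    mse = np.mean((pred - target) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)


def render(A, g):
    """Toy renderer (assumption): a linear map from scene parameters to pixels."""
    return A @ g


def make_refiner(A, gain=0.5):
    """Hypothetical stand-in for a learned update network.

    The operator W is fixed before refinement starts, playing the role of
    frozen network weights. A real model would be a trained neural network.
    """
    W = gain * np.linalg.pinv(A)

    def refiner(g, residual_image):
        # Forward pass only: map rendering evidence to a parameter residual.
        return W @ residual_image

    return refiner


def iterative_refine(g0, A, target, refiner, num_steps=3):
    """Apply a small number of forward-only residual updates."""
    g = g0.copy()
    history = [psnr(render(A, g), target)]
    for _ in range(num_steps):
        evidence = target - render(A, g)   # rendering evidence
        g = g + refiner(g, evidence)       # residual update, no backprop
        history.append(psnr(render(A, g), target))
    return g, history


rng = np.random.default_rng(0)
A = rng.normal(size=(64, 8))       # toy renderer weights
g_true = rng.normal(size=8)        # toy ground-truth scene parameters
target = render(A, g_true)         # toy observed views
g0 = np.zeros(8)                   # stand-in for the one-shot initial prediction

refiner = make_refiner(A)
g, history = iterative_refine(g0, A, target, refiner)
print([round(h, 2) for h in history])
```

In this toy setting each step halves the parameter error, so the printed PSNR rises by a constant amount per iteration. The real method replaces the linear renderer with Gaussian splatting and the fixed operator with a learned network, and its per-step gains are an empirical matter that this sketch does not model.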