🤖 AI Summary
Vision foundation models (e.g., ViT-based architectures such as VGGT) yield features lacking explicit 3D geometric consistency, hindering their effectiveness in uncalibrated novel view synthesis (NVS) and camera pose estimation.
Method: We propose a self-improving 3D reconstruction framework featuring a lightweight feature adapter and a self-supervised feature alignment mechanism. Leveraging auto-generated pseudo-ground-truth depth and poses, it enforces geometric consistency via a reprojection consistency loss, enabling end-to-end, 3D-annotation-free distillation of geometry-aware representations.
Contribution/Results: Our method maps VGGT features into a geometrically consistent 3D feature space without requiring real 3D supervision. It establishes new state-of-the-art performance on both NVS and pose estimation benchmarks, significantly improving spatial fidelity and cross-view feature consistency.
📝 Abstract
Novel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases, combined with camera parameters obtained beforehand via Structure-from-Motion (SfM). Recent vision foundation models like VGGT take an orthogonal approach: 3D knowledge is acquired implicitly through training data and loss objectives, enabling feed-forward prediction of both camera parameters and 3D representations directly from a set of uncalibrated images. While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation. We introduce Selfi, a self-improving 3D reconstruction pipeline based on feature alignment, which transforms a VGGT backbone into a high-fidelity 3D reconstruction engine by leveraging its own outputs as pseudo-ground-truth. Specifically, we train a lightweight feature adapter with a reprojection-based consistency loss, distilling VGGT outputs into a new geometrically aligned feature space that captures spatial proximity in 3D. This enables state-of-the-art performance in both NVS and camera pose estimation, demonstrating that feature alignment is a highly beneficial step for downstream 3D reasoning.
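The core idea of a reprojection-based consistency loss can be sketched as follows: each pixel in a source view is lifted to 3D using its (pseudo-ground-truth) depth, transformed into a second view with the (pseudo-ground-truth) relative pose, projected back to pixels, and the adapted features at corresponding pixels are penalized for disagreeing. This is a minimal NumPy illustration under simplifying assumptions (shared pinhole intrinsics `K`, relative pose `T_ab`, nearest-neighbour feature sampling, squared-L2 feature distance), not the paper's actual implementation.

```python
import numpy as np

def reproject(depth_a, K, T_ab):
    """Lift each pixel of view A to 3D using its depth, apply the rigid
    transform T_ab (A -> B), and project into view B's image plane."""
    H, W = depth_a.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # 3 x HW homogeneous pixels
    pts_a = np.linalg.inv(K) @ pix * depth_a.reshape(1, -1)             # back-project to 3D (camera A)
    pts_b = T_ab[:3, :3] @ pts_a + T_ab[:3, 3:4]                        # transform into camera B
    proj = K @ pts_b                                                    # project with intrinsics
    uv_b = proj[:2] / np.clip(proj[2:], 1e-6, None)                     # perspective divide
    return uv_b.T.reshape(H, W, 2), pts_b[2].reshape(H, W)

def reprojection_consistency_loss(feat_a, feat_b, depth_a, K, T_ab):
    """Mean squared distance between features of view-A pixels and the
    features at the view-B pixels they reproject onto (nearest neighbour)."""
    H, W, _ = feat_a.shape
    uv_b, z_b = reproject(depth_a, K, T_ab)
    u = np.round(uv_b[..., 0]).astype(int)
    v = np.round(uv_b[..., 1]).astype(int)
    # Keep only pixels that land inside view B with positive depth.
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (z_b > 0)
    diff = feat_a[valid] - feat_b[v[valid], u[valid]]
    return np.mean(np.sum(diff ** 2, axis=-1))
```

With an identity pose and identical feature maps, every pixel reprojects onto itself and the loss is zero; during training, gradients through such a loss push the adapter to assign similar features to pixels that are close in 3D across views.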