Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment

📅 2025-12-09
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Vision foundation models (e.g., ViT-based architectures such as VGGT) yield features that lack explicit 3D geometric consistency, hindering their effectiveness in uncalibrated novel view synthesis (NVS) and camera pose estimation. Method: We propose a self-improving 3D reconstruction framework featuring a lightweight feature adapter and a self-supervised feature alignment mechanism. Leveraging auto-generated pseudo-ground-truth depth and poses, it enforces geometric consistency via a reprojection consistency loss, enabling end-to-end, 3D-annotation-free distillation of geometry-aware representations. Contribution/Results: Our method maps VGGT features into a geometrically consistent 3D feature space without requiring real 3D supervision. It establishes new state-of-the-art performance on both NVS and pose estimation benchmarks, significantly improving spatial fidelity and cross-view feature consistency.
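The reprojection consistency idea described above can be sketched concretely: pixels in one view are lifted to 3D using the model's own pseudo-ground-truth depth, transformed by the pseudo-ground-truth relative pose into a second view, and projected back to pixels, giving cross-view correspondences whose adapter features are pushed to agree. The following NumPy sketch assumes a simple pinhole camera; all function names and the exact loss form (cosine distance) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def reproject(pix_a, depth_a, K, T_ab):
    """Map pixels from view A into view B via pseudo-GT depth and pose.

    pix_a:   (N, 2) pixel coordinates in view A
    depth_a: (N,)   pseudo-ground-truth depths for those pixels
    K:       (3, 3) shared pinhole intrinsics
    T_ab:    (4, 4) rigid transform from A's camera frame to B's
    Returns (N, 2) projected pixel coordinates in view B.
    """
    n = pix_a.shape[0]
    ones = np.ones((n, 1))
    homo = np.concatenate([pix_a, ones], axis=1)        # (N, 3) homogeneous pixels
    rays = (np.linalg.inv(K) @ homo.T).T                # back-projected rays
    pts_a = rays * depth_a[:, None]                     # 3D points in A's frame
    pts_b = (T_ab @ np.concatenate([pts_a, ones], 1).T).T[:, :3]
    proj = (K @ pts_b.T).T                              # project into view B
    return proj[:, :2] / proj[:, 2:3]

def alignment_loss(feat_a, feat_b):
    """Cosine-distance consistency loss between corresponding features
    (features assumed already sampled at matched pixel locations)."""
    fa = feat_a / np.linalg.norm(feat_a, axis=-1, keepdims=True)
    fb = feat_b / np.linalg.norm(feat_b, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(fa * fb, axis=-1)))
```

With an identity relative pose, pixels reproject onto themselves, and identical features give zero loss; in training, minimizing this loss over many view pairs is what would distill geometric consistency into the adapter's feature space.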

๐Ÿ“ Abstract
Novel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases combined with camera parameters obtained beforehand via Structure-from-Motion (SfM). Recent vision foundation models like VGGT take an orthogonal approach -- 3D knowledge is gained implicitly through training data and loss objectives, enabling feed-forward prediction of both camera parameters and 3D representations directly from a set of uncalibrated images. While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation tasks. We introduce Selfi, a self-improving 3D reconstruction pipeline via feature alignment, transforming a VGGT backbone into a high-fidelity 3D reconstruction engine by leveraging its own outputs as pseudo-ground-truth. Specifically, we train a lightweight feature adapter using a reprojection-based consistency loss, which distills VGGT outputs into a new geometrically-aligned feature space that captures spatial proximity in 3D. This enables state-of-the-art performance in both NVS and camera pose estimation, demonstrating that feature alignment is a highly beneficial step for downstream 3D reasoning.
Problem

Research questions and friction points this paper is trying to address.

Improves 3D feature consistency for novel view synthesis
Enhances camera pose estimation from uncalibrated images
Transforms vision foundation models into precise 3D reconstruction engines
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-improving pipeline via feature alignment
Lightweight adapter with reprojection-based consistency loss
Geometrically-aligned feature space from VGGT outputs
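The "lightweight adapter" in the list above is described only at a high level; a plausible minimal form is a small MLP applied on top of frozen backbone features, so that only the adapter's parameters are trained against the consistency loss. The sketch below is a hypothetical stand-in (layer sizes, initialization, and the two-layer design are assumptions, not details from the paper).

```python
import numpy as np

class FeatureAdapter:
    """Hypothetical lightweight adapter: a two-layer MLP mapping frozen
    backbone features (dim d_in) into a geometry-aligned space (d_out)."""

    def __init__(self, d_in, d_hidden, d_out, seed=0):
        rng = np.random.default_rng(seed)
        # Small random init for weights, zeros for biases.
        self.w1 = rng.normal(0.0, 0.02, (d_in, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.w2 = rng.normal(0.0, 0.02, (d_hidden, d_out))
        self.b2 = np.zeros(d_out)

    def __call__(self, x):
        """x: (..., d_in) backbone features -> (..., d_out) adapted features."""
        h = np.maximum(x @ self.w1 + self.b1, 0.0)  # ReLU hidden layer
        return h @ self.w2 + self.b2
```

Because the backbone stays frozen, such an adapter keeps training cheap and annotation-free: only the MLP is updated to satisfy the reprojection-based consistency objective.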