🤖 AI Summary
Vision foundation models (e.g., ViT-based architectures such as VGGT) yield features lacking explicit 3D geometric consistency, hindering their effectiveness in uncalibrated novel view synthesis (NVS) and camera pose estimation.
Method: We propose a self-improving 3D reconstruction framework featuring a lightweight feature adapter and a self-supervised feature alignment mechanism. Leveraging auto-generated pseudo-ground-truth depth and poses, it enforces geometric consistency via a reprojection consistency loss, enabling end-to-end, 3D-annotation-free distillation of geometry-aware representations.
Contribution/Results: Our method maps VGGT features into a geometrically consistent 3D feature space without requiring real 3D supervision. It establishes new state-of-the-art performance on both NVS and pose estimation benchmarks, significantly improving spatial fidelity and cross-view feature consistency.
📝 Abstract
Novel View Synthesis (NVS) has traditionally relied on models with explicit 3D inductive biases, combined with camera parameters obtained beforehand via Structure-from-Motion (SfM). Recent vision foundation models like VGGT take an orthogonal approach: 3D knowledge is acquired implicitly through training data and loss objectives, enabling feed-forward prediction of both camera parameters and 3D representations directly from a set of uncalibrated images. While flexible, VGGT features lack explicit multi-view geometric consistency, and we find that improving such 3D feature consistency benefits both NVS and pose estimation. We introduce Selfi, a self-improving 3D reconstruction pipeline based on feature alignment, which transforms a VGGT backbone into a high-fidelity 3D reconstruction engine by leveraging its own outputs as pseudo-ground-truth. Specifically, we train a lightweight feature adapter with a reprojection-based consistency loss, distilling VGGT outputs into a new geometrically aligned feature space that captures spatial proximity in 3D. This enables state-of-the-art performance in both NVS and camera pose estimation, demonstrating that feature alignment is a highly beneficial step for downstream 3D reasoning.
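The core idea of a reprojection-based consistency loss can be sketched as follows: each pixel in a source view is lifted to 3D using its (pseudo-ground-truth) depth, transformed into a second view with the (pseudo-ground-truth) relative pose, projected back to pixels, and the adapted features at corresponding pixels are penalized for disagreeing. This is a minimal NumPy illustration under simplifying assumptions (shared pinhole intrinsics `K`, relative pose `T_ab`, nearest-neighbour feature sampling, squared-L2 feature distance), not the paper's actual implementation.

```python
import numpy as np

def reproject(depth_a, K, T_ab):
    """Lift each pixel of view A to 3D using its depth, apply the rigid
    transform T_ab (A -> B), and project into view B's image plane."""
    H, W = depth_a.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # 3 x HW homogeneous pixels
    pts_a = np.linalg.inv(K) @ pix * depth_a.reshape(1, -1)             # back-project to 3D (camera A)
    pts_b = T_ab[:3, :3] @ pts_a + T_ab[:3, 3:4]                        # transform into camera B
    proj = K @ pts_b                                                    # project with intrinsics
    uv_b = proj[:2] / np.clip(proj[2:], 1e-6, None)                     # perspective divide
    return uv_b.T.reshape(H, W, 2), pts_b[2].reshape(H, W)

def reprojection_consistency_loss(feat_a, feat_b, depth_a, K, T_ab):
    """Mean squared distance between features of view-A pixels and the
    features at the view-B pixels they reproject onto (nearest neighbour)."""
    H, W, _ = feat_a.shape
    uv_b, z_b = reproject(depth_a, K, T_ab)
    u = np.round(uv_b[..., 0]).astype(int)
    v = np.round(uv_b[..., 1]).astype(int)
    # Keep only pixels that land inside view B with positive depth.
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (z_b > 0)
    diff = feat_a[valid] - feat_b[v[valid], u[valid]]
    return np.mean(np.sum(diff ** 2, axis=-1))
```

With an identity pose and identical feature maps, every pixel reprojects onto itself and the loss is zero; during training, gradients through such a loss push the adapter to assign similar features to pixels that are close in 3D across views.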