VFM-Recon: Unlocking Cross-Domain Scene-Level Neural Reconstruction with Scale-Aligned Foundation Priors

📅 2026-03-13
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses the challenges of scene-level neural volume reconstruction from monocular videos in cross-domain settings—such as indoor-to-outdoor transitions—where inconsistent scale estimation and limited generalization often degrade performance. To mitigate these issues, we propose a lightweight scale alignment module that recovers multi-view geometric consistency and, for the first time, integrate transferable vision foundation model (VFM) priors into the neural reconstruction pipeline via task-specific adapters. This design preserves robustness across domains while significantly enhancing reconstruction accuracy. Our method achieves state-of-the-art results on multiple benchmarks, including ScanNet, TUM RGB-D, and Tanks and Temples, with an F1 score of 70.1 on the outdoor Tanks and Temples dataset, substantially outperforming existing approaches.
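The page does not detail how the scale alignment module works. As a minimal sketch of one standard way to restore multi-view scale coherence from a VFM's scale-ambiguous depth maps, a per-frame scale and shift can be fit in the least-squares sense against sparse metric anchors (e.g., SfM or SLAM track depths). The function name and arguments below are hypothetical, not the paper's API:

```python
# Hypothetical sketch: per-frame scale/shift alignment of a scale-ambiguous
# monocular depth map to sparse metric reference depths. This illustrates the
# general technique, not the paper's exact alignment procedure.
import numpy as np

def align_depth_scale(d_pred: np.ndarray, d_ref: np.ndarray, mask: np.ndarray):
    """Solve min_{s,t} || s * d_pred + t - d_ref ||^2 over valid pixels."""
    x = d_pred[mask].ravel()                      # predicted depths at anchors
    y = d_ref[mask].ravel()                       # metric reference depths
    A = np.stack([x, np.ones_like(x)], axis=1)    # [N, 2] design matrix
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * d_pred + t, s, t                   # aligned depth, scale, shift
```

Applying the fitted scale and shift per frame makes depths from different views metrically comparable, which is the consistency that volumetric fusion requires.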

📝 Abstract
Scene-level neural volumetric reconstruction from monocular videos remains challenging, especially under severe domain shifts. Although recent advances in vision foundation models (VFMs) provide transferable generalized priors learned from large-scale data, their scale-ambiguous predictions are incompatible with the scale consistency required by volumetric fusion. To address this gap, we present VFM-Recon, the first attempt to bridge transferable VFM priors with the scale-consistency requirements of scene-level neural reconstruction. Specifically, we first introduce a lightweight scale alignment stage that restores multi-view scale coherence. We then integrate pretrained VFM features into the neural volumetric reconstruction pipeline via lightweight task-specific adapters, which are trained for reconstruction while preserving the cross-domain robustness of the pretrained representations. We train our model on the ScanNet train split and evaluate on the in-distribution ScanNet test split as well as the out-of-distribution TUM RGB-D and Tanks and Temples datasets. The results demonstrate that our model achieves state-of-the-art performance across all datasets. In particular, on the challenging outdoor Tanks and Temples dataset, our model achieves an F1 score of 70.1 in reconstructed-mesh evaluation, substantially outperforming the closest competitor, VGGT, which attains only 51.8.
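The adapter architecture is likewise not specified on this page. A common lightweight design consistent with "trained for reconstruction while preserving the cross-domain robustness of the pretrained representations" is a zero-initialized residual bottleneck adapter over frozen VFM features. The PyTorch class below is a hypothetical illustration of that pattern, not the paper's exact module:

```python
# Hypothetical sketch of a task-specific adapter over frozen VFM features.
# Zero-initializing the up-projection makes the adapter an identity mapping
# at the start of training, so the pretrained representation is preserved.
import torch
import torch.nn as nn

class TaskAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # project to low-rank space
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)    # project back to feature dim
        nn.init.zeros_(self.up.weight)          # identity residual at init
        nn.init.zeros_(self.up.bias)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Residual update: frozen VFM features plus a small learned correction.
        return feats + self.up(self.act(self.down(feats)))

# Usage sketch: freeze the VFM backbone and train only the adapter
# (vfm_feature_dim is a placeholder for the backbone's channel width).
# vfm.requires_grad_(False)
# adapter = TaskAdapter(dim=vfm_feature_dim)
```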
Problem

Research questions and friction points this paper is trying to address.

neural reconstruction
domain shift
scale consistency
vision foundation models
monocular video
Innovation

Methods, ideas, or system contributions that make the work stand out.

scale alignment
vision foundation models
neural reconstruction
cross-domain generalization
volumetric fusion
👥 Authors
Yuhang Ming
Lecturer at Hangzhou Dianzi University
SLAM · VPR · Computer Vision · Robotics · Spatial AI
Tingkang Xi
Hangzhou Dianzi University, Hangzhou, China
Xingrui Yang
CARDC
3D Vision
Lixin Yang
Shanghai Jiao Tong University, Shanghai, China
Yong Peng
Hangzhou Dianzi University, Hangzhou, China
Cewu Lu
Shanghai Jiao Tong University, Shanghai, China
Wanzeng Kong
Hangzhou Dianzi University, Hangzhou, China