Fin3R: Fine-tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation

๐Ÿ“… 2025-11-27
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing feed-forward 3D reconstruction models suffer from blurred geometric details and poor robustness due to the absence of high-fidelity depth/pose supervision and geometric misalignment in multi-view pointmap regression. To address this, we propose Fin3R, a lightweight fine-tuning framework that enhances the image encoder via monocular knowledge distillation: the pre-trained decoder is frozen, while Low-Rank Adaptation (LoRA) adapters are injected solely into the encoder to distill fine-grained geometric priors from a monocular teacher model, leveraging large-scale unlabeled monocular images. This approach requires no multi-view annotations and effectively mitigates inter-view geometric inconsistency. Evaluated on DUSt3R, MASt3R, CUT3R, and VGGT, Fin3R significantly improves boundary sharpness and recovery of complex structures, yielding higher geometric accuracy while leaving inference memory footprint and latency virtually unchanged.
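The distillation recipe in the summary can be sketched as a single training step: a frozen monocular teacher provides geometric targets on unlabeled images, and only the (LoRA-augmented) encoder is updated. This is a minimal illustrative sketch, not the authors' code; the module interfaces and the L1 distillation loss are assumptions.

```python
# Hedged sketch: one distillation step on unlabeled monocular images.
# Only the encoder (whose trainable parameters would be LoRA adapters)
# receives gradients; the teacher is frozen. Names are hypothetical.
import torch
import torch.nn.functional as F

def distill_step(encoder, teacher, images, optimizer):
    teacher.eval()
    with torch.no_grad():
        target = teacher(images)        # monocular geometric prior (frozen teacher)
    pred = encoder(images)              # student features; only LoRA params train
    loss = F.l1_loss(pred, target)     # placeholder distillation objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that no multi-view supervision appears anywhere in the step, which is what lets the method scale to large unlabeled monocular datasets.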

๐Ÿ“ Abstract
We present Fin3R, a simple, effective, and general fine-tuning method for feed-forward 3D reconstruction models. This family of models regresses pointmaps of all input images into a reference-frame coordinate system, along with other auxiliary outputs, in a single forward pass. However, we find that current models struggle with fine geometry and robustness due to (i) the scarcity of high-fidelity depth and pose supervision and (ii) the inherent geometric misalignment from multi-view pointmap regression. Fin3R jointly tackles both issues with an extra lightweight fine-tuning step. We freeze the decoder, which handles view matching, and fine-tune only the image encoder, the component dedicated to feature extraction. The encoder is enriched with fine geometric details distilled from a strong monocular teacher model on large, unlabeled datasets, using a custom, lightweight LoRA adapter. We validate our method on a wide range of models, including DUSt3R, MASt3R, CUT3R, and VGGT. The fine-tuned models consistently deliver sharper boundaries, recover complex structures, and achieve higher geometric accuracy in both single- and multi-view settings, while adding only the tiny LoRA weights, which leave test-time memory and latency virtually unchanged. Project page: https://visual-ai.github.io/fin3r
Problem

Research questions and friction points this paper is trying to address.

Improves fine geometry details in feed-forward 3D reconstruction models
Addresses scarcity of high-fidelity depth and pose supervision
Resolves geometric misalignment from multi-view pointmap regression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes encoder with monocular teacher knowledge distillation
Uses lightweight LoRA adapter for efficiency
Freezes decoder to preserve view matching functionality
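The LoRA mechanism named in the bullets above can be illustrated with a minimal adapter around a frozen linear layer: the pre-trained weight stays fixed and only a low-rank update is learned, which is why the added weights (and the inference overhead) are tiny. This sketch follows the standard LoRA formulation; its exact placement inside Fin3R's encoder is an assumption.

```python
# Hedged sketch of a LoRA adapter: output = base(x) + (alpha/r) * x A^T B^T,
# with the base layer frozen and only the low-rank factors A, B trainable.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # pre-trained weights stay frozen
        # Standard LoRA init: A small random, B zero, so training starts
        # exactly at the pre-trained model's behavior.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```

In an encoder-only fine-tuning setup like the one described here, such adapters would wrap selected projections in the encoder's attention blocks, while the decoder's parameters are never touched.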
๐Ÿ”Ž Similar Papers
No similar papers found.
Weining Ren
ETH Zurich
3D Vision · NeRF · SLAM
Hongjun Wang
Visual AI Lab, The University of Hong Kong
Xiao Tan
Department of Computer Vision Technology (VIS), Baidu Inc.
Kai Han
Visual AI Lab, The University of Hong Kong