Fin3R: Fine-tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation

2025-11-27

AI Summary
Existing feed-forward 3D reconstruction models suffer from blurred geometric details and poor robustness, owing to the scarcity of high-fidelity depth and pose supervision and to geometric misalignment in multi-view pointmap regression. Fin3R addresses this with a lightweight fine-tuning framework that enhances the image encoder via monocular knowledge distillation: the pre-trained decoder is frozen, while Low-Rank Adaptation (LoRA) adapters are injected solely into the encoder to distill fine-grained geometric priors from a monocular teacher model on large-scale unlabeled images. This approach requires no multi-view annotations and mitigates inter-view geometric inconsistency. Evaluated on DUSt3R, MASt3R, CUT3R, and VGGT, Fin3R delivers sharper boundaries, better recovery of complex structures, and higher geometric accuracy, while leaving inference memory footprint and latency virtually unchanged.

๐Ÿ“ Abstract
We present Fin3R, a simple, effective, and general fine-tuning method for feed-forward 3D reconstruction models. This family of models regresses the pointmaps of all input images into the coordinate system of a reference frame, along with other auxiliary outputs, in a single forward pass. However, we find that current models struggle with fine geometry and robustness due to (i) the scarcity of high-fidelity depth and pose supervision and (ii) the inherent geometric misalignment from multi-view pointmap regression. Fin3R jointly tackles these two issues with an extra lightweight fine-tuning step. We freeze the decoder, which handles view matching, and fine-tune only the image encoder, the component dedicated to feature extraction. The encoder is enriched with fine geometric details distilled from a strong monocular teacher model on large, unlabeled datasets, using a custom, lightweight LoRA adapter. We validate our method on a wide range of models, including DUSt3R, MASt3R, CUT3R, and VGGT. The fine-tuned models consistently deliver sharper boundaries, recover complex structures, and achieve higher geometric accuracy in both single- and multi-view settings, while adding only the tiny LoRA weights, which leave test-time memory and latency virtually unchanged. Project page: https://visual-ai.github.io/fin3r
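
To make the "tiny LoRA weights" claim concrete, here is a minimal NumPy sketch of a standard LoRA-adapted linear layer. It is not the custom adapter the abstract mentions, whose details are not given here. The class name `LoRALinear`, the 768-wide layer, the rank of 4, and the initialization scale are all illustrative assumptions.

```python
import numpy as np


class LoRALinear:
    """Linear layer y = x @ (W + scale * B @ A).T, with W frozen and only A, B trainable.

    Hypothetical illustration of standard LoRA, not the paper's custom adapter.
    """

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                           # frozen pre-trained weight (d_out, d_in)
        self.A = rng.normal(0.0, 0.02, (rank, d_in))   # trainable down-projection
        self.B = np.zeros((d_out, rank))               # trainable up-projection, zero-initialized
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus low-rank update, without forming the full weight delta.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def merged_weight(self):
        # Fold the adapter into one matrix so inference needs a single matmul.
        return self.weight + self.scale * self.B @ self.A

    def num_adapter_params(self):
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
W = rng.normal(size=(768, 768))
layer = LoRALinear(W, rank=4)
x = rng.normal(size=(2, 768))

# With B at zero, the adapted layer starts out identical to the frozen one.
assert np.allclose(layer.forward(x), x @ W.T)

# Pretend training moved B away from zero, then check the merged form agrees.
layer.B = rng.normal(0.0, 0.02, layer.B.shape)
assert np.allclose(layer.forward(x), x @ layer.merged_weight().T)

print(layer.num_adapter_params(), W.size)  # 6144 589824
```

In this toy setting the adapter adds about 1% of the layer's parameters. Because the low-rank update can be folded into the frozen weight, a merged layer costs the same at inference as the original one. Whether Fin3R merges its adapters this way is not stated in the abstract.
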
Problem

Research questions and friction points this paper is trying to address.

Improves fine geometric detail in feed-forward 3D reconstruction models
Addresses scarcity of high-fidelity depth and pose supervision
Resolves geometric misalignment from multi-view pointmap regression
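
For readers unfamiliar with the term, a pointmap assigns a 3D point to every pixel, expressed in the coordinate system of a chosen reference view. The sketch below builds one from a depth map using the standard pinhole model. The function name `depth_to_pointmap`, the toy intrinsics, and the pose are illustrative assumptions, not values from the paper.

```python
import numpy as np


def depth_to_pointmap(depth, K, cam_to_ref):
    """Unproject an (H, W) depth map to an (H, W, 3) pointmap in the reference frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))        # pixel coordinates
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous pixels, (H, W, 3)
    rays = pixels @ np.linalg.inv(K).T                    # camera-frame rays with z = 1
    points_cam = rays * depth[..., None]                  # scale each ray by its depth
    R, t = cam_to_ref[:3, :3], cam_to_ref[:3, 3]
    return points_cam @ R.T + t                           # move into the reference frame


# Toy 3x3 image: focal length 2, principal point (1, 1), constant depth 2.
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0]])
depth = np.full((3, 3), 2.0)

# Reference view: camera frame and reference frame coincide.
ref_pose = np.eye(4)
pm_ref = depth_to_pointmap(depth, K, ref_pose)

# Second view: same depth, but the camera sits 1 unit along x in the reference frame.
other_pose = np.eye(4)
other_pose[0, 3] = 1.0
pm_other = depth_to_pointmap(depth, K, other_pose)

print(pm_ref[1, 1], pm_other[1, 1])  # [0. 0. 2.] [1. 0. 2.]
```

The second view's pointmap depends on both its depth and its pose relative to the reference view, so errors in either one displace the regressed points. A monocular depth teacher, by contrast, only has to describe geometry in each image's own camera frame.
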
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes encoder with monocular teacher knowledge distillation
Uses lightweight LoRA adapter for efficiency
Freezes decoder to preserve view matching functionality
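
The three points above can be sketched as a toy training loop: a frozen "decoder", a frozen "encoder" weight with trainable low-rank factors, and targets produced by a teacher on unlabeled inputs. Everything here is an assumption for illustration: the linear stand-ins for the networks, the squared-error loss, and the synthetic teacher. The paper's actual architectures, teacher model, and distillation loss are not specified in this summary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_feat, d_out, rank, n = 8, 8, 4, 2, 64

# Frozen pre-trained pieces (toy linear stand-ins for encoder and decoder).
W_enc = rng.normal(size=(d_feat, d_in)) / np.sqrt(d_in)
W_dec = rng.normal(size=(d_out, d_feat)) / np.sqrt(d_feat)

# Trainable LoRA factors on the encoder only; B starts at zero.
A = rng.normal(size=(rank, d_in)) / np.sqrt(d_in)
B = np.zeros((d_feat, rank))

# Unlabeled inputs and pseudo-labels from a hypothetical teacher.
X = rng.normal(size=(n, d_in))
W_teacher = W_enc + 0.1 * rng.normal(size=(d_feat, rank)) @ rng.normal(size=(rank, d_in))
targets = X @ W_teacher.T @ W_dec.T


def student(X, A, B):
    feats = X @ (W_enc + B @ A).T   # encoder with low-rank adaptation
    return feats @ W_dec.T          # frozen decoder


def loss_fn(pred):
    return 0.5 * np.mean(np.sum((pred - targets) ** 2, axis=1))


W_enc_before, W_dec_before = W_enc.copy(), W_dec.copy()
loss_start = loss_fn(student(X, A, B))

lr = 0.01
for _ in range(300):
    err = student(X, A, B) - targets        # (n, d_out)
    # Gradient of the loss with respect to the encoder weight delta B @ A.
    G = (err @ W_dec).T @ X / n             # (d_feat, d_in)
    grad_A, grad_B = B.T @ G, G @ A.T       # chain rule through B @ A
    A -= lr * grad_A                        # only the adapter factors are updated
    B -= lr * grad_B

loss_end = loss_fn(student(X, A, B))
print(loss_end < loss_start)
```

Gradients still pass through the frozen decoder to reach the adapter, but only `A` and `B` are updated, so the decoder's behavior on any given encoder feature is unchanged.
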